It turns out that the algorithm is closely related to the continuous greedy algorithm used in submodular optimization.
We also provide the first convergence-rate analysis I'm aware of for stochastic unconstrained Frank-Wolfe (i.e., without weight decay), which directly covers the Muon optimizer (and much more)!
This is joint work with the exceptionally talented team of Thomas Pethick, @wanyunxie.bsky.social, Kimon Antonakopoulos, and Zhenyu Zhu at LIONS@EPFL, and @tonysf.bsky.social from CentraleSupélec; I'm very grateful to have worked with them.
We provide a complete cookbook for choosing the right LMO for your architecture:
- Input layers (1-hot vs image)
- Hidden layers (spectral norms)
- Output layers (flexible norm choices)
All with explicit formulas and guidance for when to use each one (a minimal illustration is sketched below).
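To make the recipe concrete, here is a minimal PyTorch-style sketch of the two LMOs that come up most often; the function names and radii are illustrative, not the released implementation, and the exact layer-to-norm mapping is the one discussed in the paper.

```python
import torch

def lmo_sign(grad, radius=1.0):
    # LMO for the entrywise l_inf ball:
    # argmin_{||S||_inf <= radius} <grad, S> = -radius * sign(grad).
    return -radius * torch.sign(grad)

def lmo_spectral(grad, radius=1.0):
    # LMO for the spectral-norm (l2 -> l2 induced operator norm) ball:
    # with grad = U diag(s) Vh (reduced SVD), the minimizer is -radius * U @ Vh.
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -radius * (U @ Vh)
```

Hidden weight matrices would use the spectral LMO; which LMO the input and output layers get depends on their structure (1-hot vs. image inputs, choice of output norm), per the guidance in the paper.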
It turns out many popular optimizers (SignSGD, Muon, etc.) are special cases of our framework, just with different norm choices.
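For instance, here is a hedged sketch (standard facts about these norm balls, not the authors' code): SignSGD's step direction is exactly the LMO over the entrywise l_inf ball, while a Muon-style step direction is the LMO over the spectral-norm ball.

```python
import torch

def signsgd_direction(grad):
    # LMO over the entrywise l_inf unit ball: -sign(grad).
    return -torch.sign(grad)

def muon_like_direction(grad):
    # LMO over the spectral-norm unit ball: -U @ Vh from the reduced SVD.
    # (Muon itself approximates this orthogonalization with Newton-Schulz
    # iterations on a momentum buffer rather than an exact SVD.)
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -(U @ Vh)

# Either way the update is w <- w + lr * direction(grad); only the norm differs.
```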
Our unified analysis reveals deep connections between seemingly different approaches and provides new insights into why they work.
Check out the preprint: arxiv.org/abs/2502.07529
Worst-case convergence analysis with rates, guarantees for learning rate transfer, and practical advice on how to properly choose norms adapted to network geometry, backed by theory.
It's "just" stochastic conditional gradient. The secret sauce? Don't treat your weight matrices like they're flat vectors! SCION adapts to the geometry of matrices using LMOs with respect to the correct norm: the induced operator norm.
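Concretely, a single stochastic conditional-gradient step on one weight matrix might look like the sketch below (assuming PyTorch; the function name, momentum handling, and defaults are illustrative, not the released SCION code).

```python
import torch

@torch.no_grad()
def conditional_gradient_step(W, buf, grad, lr=0.01, momentum=0.9, radius=1.0):
    # 1) Average the stochastic gradient into a momentum buffer.
    buf.mul_(momentum).add_(grad, alpha=1.0 - momentum)
    # 2) LMO with respect to the induced operator (spectral) norm of the matrix:
    #    argmin_{||S||_op <= radius} <buf, S> = -radius * U @ Vh.
    U, _, Vh = torch.linalg.svd(buf, full_matrices=False)
    direction = -radius * (U @ Vh)
    # 3) Move along the LMO output (the unconstrained variant, i.e. no weight decay;
    #    a constrained variant would instead average W toward the LMO output).
    W.add_(direction, alpha=lr)
    return W
```

The same step is applied per layer, with the norm (and hence the LMO) chosen to match each layer's geometry.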
Hyper-parameter transfer on NanoGPT.
arxiv.org/abs/2502.07529
Key results:
- Based on conditional gradient method
- Beats Muon+Adam on NanoGPT (tested up to 3B params)
- Zero-shot learning rate transfer across model size
- Uses WAY less memory (just one set of params + half-precision grads)
- Provides explicit norm control
Want to train large neural networks WITHOUT Adam while using less memory and getting better results?
Check out SCION: a new optimizer that adapts to the geometry of your problem using norm-constrained linear minimization oracles (LMOs):
It was a fun panel. Quite informative.
Timeo professores machinae discendi et dona ferentes. (I fear machine learning professors, even bearing gifts.)
An illustrated guide to never learning anything
We'll present "SAMPa: Sharpness-Aware Minimization Parallelized" at #NeurIPS24 on Thursday! This is joint work with Thomas Pethick and Volkan Cevher.
Find us at Poster #5904 from 16:30 in the West Ballroom.
Stable model scaling with width-independent dynamics?
Thrilled to present 2 papers at #NeurIPS that study width-scaling in Sharpness-Aware Minimization (SAM) (Thu 16:30, #2104) and in Mamba (Fri 11, #7110). Our scaling rules stabilize training and transfer optimal hyperparams across scales.
1/10
This is joint work with wonderful collaborators @leenacvankadara.bsky.social, @cevherlions.bsky.social, and Jin Xu during our time at Amazon.
10/10
@iclr-conf.bsky.social: Please incorporate this ACL-style feedback process for reviewers:
aclrollingreview.org/authors#step...
Reviewers, take note:
57% of people rejected their own argument when they thought it was someone else's. So take it easy with the criticism.