It turns out that the algorithm is closely related to the continuous greedy algorithm used in submodular optimization.
We also provide the first convergence-rate analysis I'm aware of for stochastic unconstrained Frank-Wolfe (i.e., without weight decay), which directly covers the Muon optimizer (and much more)!
This is joint work with the exceptionally talented team of Thomas Pethick, @wanyunxie.bsky.social, Kimon Antonakopoulos, and Zhenyu Zhu at LIONS@EPFL, and @tonysf.bsky.social from CentraleSupélec; I'm very grateful to have worked with them.
We provide a complete cookbook for choosing the right LMO for your architecture:
- Input layers (1-hot vs image)
- Hidden layers (spectral norms)
- Output layers (flexible norm choices)
All with explicit formulas and guidance for when to use each one (a minimal illustration is sketched below).
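To make the recipe concrete, here is a minimal PyTorch-style sketch of the two LMOs that come up most often; the function names and radii are illustrative, not the released implementation, and the exact layer-to-norm mapping is the one discussed in the paper.

```python
import torch

def lmo_sign(grad, radius=1.0):
    # LMO for the entrywise l_inf ball:
    # argmin_{||S||_inf <= radius} <grad, S> = -radius * sign(grad).
    return -radius * torch.sign(grad)

def lmo_spectral(grad, radius=1.0):
    # LMO for the spectral-norm (l2 -> l2 induced operator norm) ball:
    # with grad = U diag(s) Vh (reduced SVD), the minimizer is -radius * U @ Vh.
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -radius * (U @ Vh)
```

Hidden weight matrices would use the spectral LMO; which LMO the input and output layers get depends on their structure (1-hot vs. image inputs, choice of output norm), per the guidance in the paper.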
It turns out many popular optimizers (SignSGD, Muon, etc.) are special cases of our framework, just with different norm choices.
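For instance, here is a hedged sketch (standard facts about these norm balls, not the authors' code): SignSGD's step direction is exactly the LMO over the entrywise l_inf ball, while a Muon-style step direction is the LMO over the spectral-norm ball.

```python
import torch

def signsgd_direction(grad):
    # LMO over the entrywise l_inf unit ball: -sign(grad).
    return -torch.sign(grad)

def muon_like_direction(grad):
    # LMO over the spectral-norm unit ball: -U @ Vh from the reduced SVD.
    # (Muon itself approximates this orthogonalization with Newton-Schulz
    # iterations on a momentum buffer rather than an exact SVD.)
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -(U @ Vh)

# Either way the update is w <- w + lr * direction(grad); only the norm differs.
```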
Our unified analysis reveals deep connections between seemingly different approaches and provides new insights into why they work.
Check out the preprint: arxiv.org/abs/2502.07529
Worst-case convergence analysis with rates, guarantees for learning rate transfer, and practical advice on how to properly choose norms adapted to network geometry, backed by theory.
It's "just" stochastic conditional gradient. The secret sauce? Don't treat your weight matrices like they're flat vectors! SCION adapts to the geometry of matrices using LMOs with respect to the correct norm: the induced operator norm.
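Concretely, a single stochastic conditional-gradient step on one weight matrix might look like the sketch below (assuming PyTorch; the function name, momentum handling, and defaults are illustrative, not the released SCION code).

```python
import torch

@torch.no_grad()
def conditional_gradient_step(W, buf, grad, lr=0.01, momentum=0.9, radius=1.0):
    # 1) Average the stochastic gradient into a momentum buffer.
    buf.mul_(momentum).add_(grad, alpha=1.0 - momentum)
    # 2) LMO with respect to the induced operator (spectral) norm of the matrix:
    #    argmin_{||S||_op <= radius} <buf, S> = -radius * U @ Vh.
    U, _, Vh = torch.linalg.svd(buf, full_matrices=False)
    direction = -radius * (U @ Vh)
    # 3) Move along the LMO output (the unconstrained variant, i.e. no weight decay;
    #    a constrained variant would instead average W toward the LMO output).
    W.add_(direction, alpha=lr)
    return W
```

The same step is applied per layer, with the norm (and hence the LMO) chosen to match each layer's geometry.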
Hyper-parameter transfer on NanoGPT.
arxiv.org/abs/2502.07529
Key results:
- Based on conditional gradient method
- Beats Muon+Adam on NanoGPT (tested up to 3B params)
- Zero-shot learning rate transfer across model size
- Uses WAY less memory (just one set of params + half-precision grads)
- Provides explicit norm control
Want to train large neural networks WITHOUT Adam while using less memory and getting better results?
Check out SCION: a new optimizer that adapts to the geometry of your problem using norm-constrained linear minimization oracles (LMOs):
It was a fun panel. Quite informative.
Timeo professores machinae discendi et dona ferentes. (I fear machine learning professors, even bearing gifts.)
An illustrated guide to never learning anything
We'll present "SAMPa: Sharpness-Aware Minimization Parallelized" at #NeurIPS24 on Thursday! This is joint work with Thomas Pethick and Volkan Cevher.
Find us at Poster #5904 from 16:30 in the West Ballroom.
Stable model scaling with width-independent dynamics?
Thrilled to present 2 papers at #NeurIPS that study width-scaling in Sharpness-Aware Minimization (SAM) (Thu 16:30, #2104) and in Mamba (Fri 11, #7110). Our scaling rules stabilize training and transfer optimal hyperparams across scales.
1/10
This is joint work with wonderful collaborators @leenacvankadara.bsky.social, @cevherlions.bsky.social, and Jin Xu during our time at Amazon.
10/10
@iclr-conf.bsky.social: Please incorporate this ACL-style feedback process for reviewers:
aclrollingreview.org/authors#step...
Reviewers, take note:
57% of people rejected their own argument when they thought it was someone else's. So take it easy with the criticism.