new extensive evaluation of different optimizers for LLM training
arxiv.org/abs/2509.01440
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
Using the 'right' data can hugely speed up LLM training, but how do you find the best training data in the vast sea of a whole web crawl?
We propose a simple classifier-based selection, enabling better multilingual LLMs 🧵
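A minimal sketch of the general classifier-based selection recipe (illustration only, not necessarily the exact pipeline from the paper): train a lightweight classifier to separate a trusted reference corpus from random crawl text, score every crawled document, and keep the top-scoring fraction, repeated per language. The corpora, the TF-IDF + logistic regression choice, and the keep fraction below are all placeholder assumptions.

```python
# Sketch of classifier-based pretraining-data selection (placeholder setup).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_classifier(reference_docs, random_crawl_docs):
    """Binary classifier: trusted reference corpus (1) vs. random crawl (0)."""
    texts = list(reference_docs) + list(random_crawl_docs)
    labels = np.array([1] * len(reference_docs) + [0] * len(random_crawl_docs))
    vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)
    return vec, clf

def select_top_fraction(vec, clf, crawl_docs, keep_fraction=0.1):
    """Keep the crawl documents the classifier scores as most reference-like."""
    scores = clf.predict_proba(vec.transform(crawl_docs))[:, 1]
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return [doc for doc, s in zip(crawl_docs, scores) if s >= cutoff]
```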
#ICLR #TrainBetterLM I am at ICLR; come to our posters for improved language model training!
Recycle gradients for faster neural net training with AdEMAmix iclr.cc/virtual/2025... (Fri Apr 25, 10 am).
1/3
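For context, a minimal sketch of an AdEMAmix-style update, written from memory (the paper has the exact rule plus schedulers for alpha and beta3, omitted here): Adam's fast gradient EMA is complemented by a much slower EMA, so gradients from many thousands of steps ago still contribute to the update direction.

```python
# Hedged sketch of one AdEMAmix-style parameter update (not the official code).
import torch

@torch.no_grad()
def ademamix_step(p, g, state, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    b1, b2, b3 = betas
    if not state:  # lazy state init on first call
        state.update(step=0, m1=torch.zeros_like(p),
                     m2=torch.zeros_like(p), v=torch.zeros_like(p))
    state["step"] += 1
    t = state["step"]
    state["m1"].mul_(b1).add_(g, alpha=1 - b1)        # fast EMA, as in Adam
    state["m2"].mul_(b3).add_(g, alpha=1 - b3)        # slow EMA: "recycled" old gradients
    state["v"].mul_(b2).addcmul_(g, g, value=1 - b2)  # second moment
    m1_hat = state["m1"] / (1 - b1 ** t)              # bias correction
    v_hat = state["v"] / (1 - b2 ** t)
    update = (m1_hat + alpha * state["m2"]) / (v_hat.sqrt() + eps)
    p.add_(update + weight_decay * p, alpha=-lr)
```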
I am excited to announce that I will join the University of Zurich as an assistant professor in August this year! I am looking for PhD students and postdocs starting from the fall.
My research interests include optimization, federated learning, machine learning, privacy, and unlearning.
The Swiss AI Initiative has launched open calls for disruptive ideas - Democratizing large-scale AI for the benefit of society.
Send your idea by the end of March and run it on one of the largest public AI clusters globally. Everyone is eligible to apply!
swiss-ai.org
Thanks a lot @haeggee.bsky.social and @mjaggi.bsky.social for having me in the MLO group at EPFL @icepfl.bsky.social to present "Large Language Models as Markov Chains".
Slides are available on my website (link in thread).
New experiments with Llama and Gemma models in the updated paper!
What is the true depth of an LLM?
Together with @danielepal.bsky.social, @matpagliardini.bsky.social, M. Jaggi and @francois.fleuret.org we show that LLMs have a smaller effective depth that can be exploited to increase inference speed in multi-GPU settings!
arxiv.org/abs/2502.02790
(1/N)
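A toy illustration of the kind of trick a small effective depth allows, as I understand it (hypothetical code, not the paper's exact method): if two consecutive blocks contribute roughly independent residual updates, both can read the same input, run on different GPUs, and have their updates summed, halving the sequential depth.

```python
# Toy comparison of sequential vs. "parallelized" residual blocks.
import torch
import torch.nn as nn

def sequential_pair(x, block_a, block_b):
    # Standard execution: block_b sees block_a's output.
    x = x + block_a(x)
    return x + block_b(x)

def parallel_pair(x, block_a, block_b):
    # Both blocks read the same input (so they could live on different GPUs);
    # their residual contributions are summed. How good this approximation is,
    # and for which layers, is exactly what an effective-depth analysis asks.
    return x + block_a(x) + block_b(x)

# Tiny check with random MLP "blocks".
blocks = [nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)) for _ in range(2)]
x = torch.randn(4, 64)
rel_err = ((sequential_pair(x, *blocks) - parallel_pair(x, *blocks)).norm()
           / sequential_pair(x, *blocks).norm())
print(f"relative difference between the two schedules: {rel_err:.3f}")
```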
Ok, so I can finally talk about this!
We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale.
The model has an internal latent space in which it can adaptively spend more compute to think longer.
I think the tech report ...
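For readers wondering what recurrent depth looks like mechanically, here is a very rough sketch under my own assumptions (placeholder architecture, not the report's exact design): a prelude embeds the tokens, a core block is iterated a variable number of times on a latent state, and a head decodes, so the number of iterations becomes a test-time compute knob.

```python
# Rough sketch of a recurrent-depth language model forward pass (illustrative only).
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    def __init__(self, d_model=512, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)             # "prelude"
        self.inject = nn.Linear(2 * d_model, d_model)         # mix latent with input
        self.core = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.head = nn.Linear(d_model, vocab)                  # decode the latent

    def forward(self, tokens, num_iters=4):
        e = self.embed(tokens)
        s = torch.randn_like(e)                                # random initial latent
        for _ in range(num_iters):                             # more iterations = more "thinking"
            s = self.core(self.inject(torch.cat([s, e], dim=-1)))
        return self.head(s)
```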
Congrats! How important is scale for it to work? In your previous maze work it was clear a recurrent algo could solve the task. The recurrent state could be used as a scratchpad, each iteration decreasing the loss further. Language feels different, with many local minima along the recurrent path.
Interesting loss curves. I'm not familiar enough with the task to know whether the spikes are expected, but would be curious to see the grad norm.
Which task?
Let's also call on the silent crowd, me included, to start sharing more. Let's be the change we want to see. You disagree with the political agenda of X? Protest by sharing your latest work/thoughts on Bsky.
Can we scale small, open LMs to o1 level? Using classical probabilistic inference methods, YES!
A particle filtering approach to improved inference without any training!
Check out probabilistic-inference-scaling.github.io
By Aisha Puri et al.
Joint MIT-CSAIL & RedHat
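The gist, as a bare-bones sketch (my paraphrase of the generic particle-filtering recipe, not the project's actual implementation; `generate_step` and `reward` are placeholder hooks for an LLM step and a reward model): keep a population of partial solutions, extend each, weight by reward, and resample so promising candidates get more compute.

```python
# Generic particle-filtering loop for inference-time scaling (placeholder hooks).
import math
import random

def particle_filter(prompt, generate_step, reward, num_particles=8, num_steps=10):
    particles = [prompt] * num_particles
    for _ in range(num_steps):
        # 1) Propagate: extend each partial solution by one step/chunk.
        particles = [generate_step(p) for p in particles]
        # 2) Weight: score each partial solution with the reward model.
        weights = [math.exp(reward(p)) for p in particles]
        # 3) Resample: promising particles are duplicated, weak ones dropped.
        particles = random.choices(particles, weights=weights, k=num_particles)
    return max(particles, key=reward)
```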
New open-weights 24B model with performance comparable to Llama 3.3 70B. Congrats, Mistral team!
mistral.ai/news/mistral...
1/ Could ChatGPT get an engineering degree? Spoiler: yes! In our new @pnas.org article, we explore how AI assistants like GPT-4 perform in STEM university courses, and on average they pass a staggering 91.7% of core courses. 🧵 #AI #HigherEd #STEM #LLMs #NLProc
In my quick test on a small (120M) model trained on 14B tokens, the difference in the end is not so significant. Maybe the gap widens when training on less data, closer to Chinchilla optimal, or for larger models... I'm team ReLU...
New blog post on flow matching: dl.heeere.com/cfm/
Contains some nice visuals too!
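For the impatient, the standard conditional flow matching objective with a linear interpolation path fits in a few lines (a generic sketch; the post covers the derivation and variants):

```python
# One conditional flow matching training loss on a straight noise-to-data path.
import torch

def cfm_loss(velocity_net, x1):
    """velocity_net(x_t, t) predicts a velocity field; x1 is a batch of data."""
    x0 = torch.randn_like(x1)                                   # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1                                 # point on the straight path
    target_v = x1 - x0                                          # velocity of that path
    return ((velocity_net(x_t, t) - target_v) ** 2).mean()
```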
Let o1 write a review and ask the non-expert human reviewer to verify its claims/refine the review.
A wise man once told me a paper should not have more than one table. Of course there can be exceptions, but minimizing the number of tables is something I always have in mind when writing. Isolate one or two key messages from the table and convey them with graphs.