Overtrained Language Models Are Harder to Fine-Tune
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this ...
The paper has many more interesting details that have entirely changed the way I think about pre-training!
And thanks to my collaborators!
Sachin Goyal
Kaiyue Wen
Tanishq Kumar
@xiangyue96.bsky.social
@sadhika.bsky.social
@gneubig.bsky.social
@adtraghunathan.bsky.social
10/10
26.03.2025 18:35
For the theorists in the room: we dive deeper into why this happens using a linear transfer learning setup, revealing that incremental learning leads to catastrophic overtraining.
9/10
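To make the linear-theory intuition concrete, here is a minimal runnable sketch. It is my own stylized toy, not the paper's exact setup: a diagonal linear network (prediction sum_i u_i*v_i*x_i) trained by gradient descent learns the target coordinates incrementally, largest first, and the expected loss under Gaussian parameter noise can be tracked in closed form. Learning a low-signal coordinate late in training adds more noise sensitivity than it removes loss, so the noisy loss is U-shaped even though the clean loss keeps improving.

```python
import numpy as np

# Toy "catastrophic overtraining" in a diagonal linear net (a stylized
# illustration I constructed, not the paper's exact theoretical setting).
# Model: y_hat = sum_i u_i * v_i * x_i, target y = sum_i w_i * x_i with
# x ~ N(0, I).  Population loss: L(u, v) = 0.5 * sum_i (u_i v_i - w_i)^2.
# Under i.i.d. N(0, sigma^2) noise on every parameter, the expected
# perturbed loss is L + 0.5 * sigma^2 * sum_i (u_i^2 + v_i^2) + const.

w = np.array([1.0, 0.3, 0.05])   # target coords: large signal is learned first
sigma2 = 0.2                     # variance of the parameter noise
lr, steps, ckpt_every = 0.1, 4000, 50

u = np.full_like(w, 0.01)        # small balanced init
v = u.copy()
clean, perturbed = [], []
for t in range(steps):
    if t % ckpt_every == 0:
        L = 0.5 * np.sum((u * v - w) ** 2)
        clean.append(L)
        perturbed.append(L + 0.5 * sigma2 * np.sum(u**2 + v**2))
    g = u * v - w                # dL/d(u_i v_i)
    u, v = u - lr * g * v, v - lr * g * u

# Clean loss decreases throughout, but once the small-signal coordinate
# is learned, the expected perturbed loss goes back UP: overtraining.
print(f"initial perturbed loss: {perturbed[0]:.4f}")
print(f"best perturbed loss   : {min(perturbed):.4f}")
print(f"final perturbed loss  : {perturbed[-1]:.4f}")
```

The key arithmetic: learning coordinate i removes 0.5*w_i^2 of clean loss but adds roughly sigma2*w_i of noise sensitivity, so coordinates with w_i below about 2*sigma2 are net harmful to learn.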
Fine-tuning behaves similarly: using a fixed learning rate across different pre-training checkpoints, we see eventual degradation in both task performance and web-data perplexity. This often holds even after hyperparameter tuning. Overtraining = worse fine-tuning outcomes!
8/10
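The evaluation protocol described above can be sketched in a few lines. This is a hypothetical quadratic toy of my own (far too simple to reproduce the degradation itself); it only shows the bookkeeping: identical, fixed fine-tuning hyperparameters applied to every pre-training checkpoint, tracking both the downstream loss and a stand-in for web-data loss after fine-tuning.

```python
import numpy as np

# Protocol sketch (toy, not the paper's models): pre-train on task A,
# save checkpoints, then fine-tune EVERY checkpoint with the same fixed
# learning rate and step budget on task B.  We record task-B loss
# ("downstream performance") and task-A loss after fine-tuning (a crude
# proxy for web-data perplexity).

rng = np.random.default_rng(0)
d = 8
w_A = rng.normal(size=d)               # "web" task optimum
w_B = 0.5 * w_A + rng.normal(size=d)   # correlated downstream optimum

def loss(w, w_star):
    return 0.5 * np.sum((w - w_star) ** 2)

def gd(w, w_star, lr, steps):
    for _ in range(steps):
        w = w - lr * (w - w_star)      # gradient step on the quadratic loss
    return w

# Pre-train on task A, saving checkpoints along the way.
ckpts, w = [], np.zeros(d)
for _ in range(6):
    ckpts.append(w)
    w = gd(w, w_A, lr=0.05, steps=20)
ckpts.append(w)

# Fine-tune every checkpoint with identical, fixed hyperparameters.
b_after, a_after = [], []
for i, c in enumerate(ckpts):
    ft = gd(c, w_B, lr=0.1, steps=10)
    b_after.append(loss(ft, w_B))
    a_after.append(loss(ft, w_A))
    print(f"ckpt {i}: downstream loss {b_after[-1]:.3f}, "
          f"web-proxy loss {a_after[-1]:.3f}")
```

Even this convex toy shows one ingredient of the story: fine-tuning pulls a fully pre-trained checkpoint away from the task-A optimum, so the web-proxy loss rises after fine-tuning for late checkpoints. The non-monotone degradation the thread reports needs the richer (non-convex) settings studied in the paper.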
Early in training: models have low sensitivity and the base model improves quickly; performance improves.
Late in training: models become highly sensitive and the base model improves slowly; performance degrades!
7/10
What's happening? Beyond Gaussian perturbations, extended pre-training increases model sensitivity to all types of parameter updates.
6/10
🔹 Early checkpoints: robust to parameter changes.
🔸 Later checkpoints: highly sensitive, leading to worse performance after perturbation! (Left plot: sensitivity increases over training; right plot: final performance eventually degrades.)
5/10
Let's step back and consider a simpler setting: we train our own 30M-parameter models and test how Gaussian noise added to the parameters affects performance at different pre-training stages.
4/10
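The probe itself is simple. A minimal sketch, with a toy quadratic loss standing in for the 30M-parameter language models (the function names and the toy loss are mine, not the paper's code):

```python
import numpy as np

# Perturbation probe: take a checkpoint's parameters, add i.i.d.
# N(0, sigma^2) noise, and measure the average loss increase over
# many noise draws.

rng = np.random.default_rng(0)

def sensitivity(params, loss_fn, sigma, n_draws=200):
    """Mean loss increase under N(0, sigma^2) parameter noise."""
    clean = loss_fn(params)
    noisy = [loss_fn(params + sigma * rng.normal(size=params.shape))
             for _ in range(n_draws)]
    return np.mean(noisy) - clean

# Toy stand-in for a checkpoint: parameters sitting at the minimum of a
# quadratic loss, where any noise draw can only increase the loss.
w_star = np.ones(10)
loss_fn = lambda w: 0.5 * np.sum((w - w_star) ** 2)

for sigma in (0.01, 0.1, 0.3):
    s = sensitivity(w_star, loss_fn, sigma)
    print(f"sigma={sigma}: sensitivity {s:.4f}")
```

In the thread's experiments, the interesting quantity is how this sensitivity evolves across pre-training checkpoints at a fixed noise scale; the toy here only shows the measurement mechanics.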
Example: OLMo-1B trained on 3T tokens performs over 2% *worse* after instruction tuning than its 2.3T-token version, even though it saw 30% more data! We see similar degradation in many other post-training setups.
Why does extended pre-training hurt fine-tuning performance? 🤔
3/10
The latest language models are pre-trained on more and more tokens while holding the number of model parameters fixed, and this trend isn't slowing down!
➡️ Better base models? Yes.
➡️ Better starting point for post-training? Let's check!
2/10
Training with more data = better LLMs, right? 🚨
False! Scaling language models by adding more pre-training data can decrease performance after post-training!
Introducing "catastrophic overtraining." 🧵👇
arxiv.org/abs/2503.19206
1/10