
Negar Foroutan

@negarforoutan

#NLProc PhD Student at EPFL

162 Followers · 139 Following · 24 Posts · Joined 28.11.2024

Latest posts by Negar Foroutan @negarforoutan


8/ πŸ“„ Read the full paper: πŸ‘‰
arxiv.org/abs/2510.25947

Huge thanks to my collaborators, Paul Teiletche, Ayush Kumar Tarun, @abosselut.bsky.social
#Multilingual #LLMs #NLP

15.12.2025 18:18 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

7/ πŸš€ In short: Multilinguality doesn’t have to be a zero-sum game. With well-balanced and clean data, LLMs can scale across hundreds of languages β€” without sacrificing English or high-resource performance.

15.12.2025 18:18 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

6/ 🧠 Takeaways for practitioners:
- Don’t fear English dominance: ensure each language has enough tokens instead.
- Don’t over-engineer curricula: simple mixtures work fine.
- Focus on data quality and scaling, not limiting language count.

15.12.2025 18:18 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

5/ πŸ“‰ Finding #4 β€” The "curse of multilinguality" isn’t what we thought Performance didn’t drop just because we added more languages (up to 400!). Degradation happens because model's capacity is limited. It’s a curse of capacity, not of multilinguality!

15.12.2025 18:18 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

4/ πŸ”€ Finding #3 β€” Curriculum learning doesn’t help
We tried introducing languages gradually (e.g., English β†’ pivots β†’ all languages).
Result: curriculum changes training dynamics but not final performance.
No reduction in "negative interference."

15.12.2025 18:18 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
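The staged curriculum tried above (English, then pivots, then everything) can be sketched as a schedule over training steps. The function and the equal-thirds phase boundaries below are hypothetical, for illustration only:

```python
def languages_for_step(step, total_steps, pivots, all_languages):
    """Staged curriculum: English only, then English plus pivot languages,
    then the full language set. Equal thirds are an illustrative choice,
    not the schedule from the paper."""
    frac = step / total_steps
    if frac < 1 / 3:
        return ["en"]
    if frac < 2 / 3:
        return ["en"] + pivots
    return all_languages
```

The finding above says such schedules change training dynamics but not final performance, so a flat mixture is the simpler default.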

3/ 🌐 Finding #2 β€” Pivot languages: English works best
We tested whether using a "family-specific" pivot (like Russian for Slavic) helps.
Surprisingly, English outperformed or matched intra-family pivots across the board.
Why? English data tends to be richer and more diverse.

15.12.2025 18:18 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

2/ πŸ’‘ Finding #1 β€” English β‰  the enemy
Adding more English data doesn’t hurt performance in other languages as long as each language still has enough tokens.
Similarly, adding more multilingual data doesn’t harm English.
➑️ Balance > Proportion.

15.12.2025 18:18 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
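The "balance over proportion" point above is often operationalized with temperature-scaled language sampling when building pretraining mixtures. A minimal sketch of that common practice (illustrative, not necessarily the mixture recipe used in the paper):

```python
def sampling_weights(token_counts, temperature=0.3):
    """Temperature-scaled sampling weights over per-language token counts.

    temperature=1.0 reproduces raw corpus proportions; lower values flatten
    the distribution so low-resource languages keep a meaningful token share.
    """
    total = sum(token_counts.values())
    scaled = {lang: (count / total) ** temperature
              for lang, count in token_counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}
```

With temperature=1.0 a 90/10 corpus is sampled 90/10; at temperature=0.3 the low-resource share rises to roughly a third, which is one concrete way to guarantee "enough tokens" per language.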

1/ 🌍 How does mixing data from hundreds of languages affect LLM training?
In our new paper "Revisiting Multilingual Data Mixtures in Language Model Pretraining" we revisit core assumptions about multilinguality using 1.1B-3B models trained on up to 400 languages.
πŸ§΅πŸ‘‡

15.12.2025 18:18 πŸ‘ 9 πŸ” 6 πŸ’¬ 1 πŸ“Œ 0

🌍Introducing BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data!

LLMs learn from vastly more data than humans ever experience. BabyLM challenges this paradigm by focusing on developmentally plausible data

We extend this effort to 45 new languages!

15.10.2025 10:53 πŸ‘ 44 πŸ” 16 πŸ’¬ 1 πŸ“Œ 4

1/🚨 New preprint

How do #LLMs’ inner features change as they train? Using #crosscoders + a new causal metric, we map when features appear, strengthen, or fade across checkpointsβ€”opening a new lens on training dynamics beyond loss curves & benchmarks.

#interpretability

25.09.2025 14:02 πŸ‘ 15 πŸ” 6 πŸ’¬ 2 πŸ“Œ 0

Paper: arxiv.org/pdf/2508.04796
Code: github.com/swiss-ai/par...

11.08.2025 12:28 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

In short, Parity-aware BPE = minimal overhead + clear fairness gains. If you care about multilingual robustness, tokenization is low-hanging fruit.
Joint work with Clara Meister, @debjit-paul.bsky.social @joelniklaus.bsky.social @sinaahmadi.bsky.social @abosselut.bsky.social @ricosennrich.bsky.social

11.08.2025 12:28 πŸ‘ 3 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

What’s even more exciting: low- and medium-resource languages benefit the most. We see better vocabulary utilization and compression rates for these languages, highlighting the effectiveness of our approach in providing fairer language allocation.

11.08.2025 12:28 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Empirical results: the Gini coefficient of tokenizer disparity (0 means a tokenizer's compression rates are equal across languages) improves by ~83%, while global compression stays very similar. On downstream task accuracy, improvements outnumber declines across configurations.

11.08.2025 12:28 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
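The disparity number above is a Gini coefficient computed over per-language compression rates. A minimal sketch of the metric itself (the exact measurement protocol may differ from the paper's):

```python
def gini(values):
    """Gini coefficient of positive values: 0 means perfectly equal,
    values near 1 mean the distribution is highly skewed."""
    vals = sorted(values)
    n = len(vals)
    total = sum(vals)
    # Rank-weighted formulation of the Gini coefficient.
    weighted = sum((i + 1) * v for i, v in enumerate(vals))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

Equal per-language compression rates give a Gini of 0; an ~83% improvement means the rates moved much closer together.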

It’s a drop-in replacement in existing systems that introduces minimal training-time overhead: if you already use a BPE tokenizer, formats and tokenization/detokenization at inference are unchanged. You just need language-labeled multilingual corpora and a multi-parallel dev set.

11.08.2025 12:28 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

What changes from classical BPE? Only a small part of training. We compute frequency stats per language β†’ when choosing the next merge, we pick it from the stats of the language with the worst compression rate, rather than from global stats. Everything else stays the same!

11.08.2025 12:28 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
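The merge rule described above can be sketched end to end on toy data. This is a simplified reconstruction from the post's description, not the paper's reference implementation:

```python
from collections import Counter

def pair_stats(corpus):
    """Count adjacent symbol pairs in a corpus (list of symbol lists)."""
    stats = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            stats[(a, b)] += 1
    return stats

def compression_rate(corpus):
    """Characters per token: higher means better compression."""
    chars = sum(len(sym) for word in corpus for sym in word)
    toks = sum(len(word) for word in corpus)
    return chars / toks

def apply_merge(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = pair[0] + pair[1]
    out = []
    for word in corpus:
        new, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(word[i])
                i += 1
        out.append(new)
    return out

def parity_aware_bpe(corpora, num_merges):
    """corpora: {language: list of character-split words}.
    Returns the learned merge list."""
    merges = []
    for _ in range(num_merges):
        # Pick the language that is currently compressed worst.
        worst = min(corpora, key=lambda lg: compression_rate(corpora[lg]))
        stats = pair_stats(corpora[worst])
        if not stats:
            break
        pair = stats.most_common(1)[0][0]
        merges.append(pair)
        # Apply the merge to every language (shared vocabulary).
        corpora = {lg: apply_merge(c, pair) for lg, c in corpora.items()}
    return merges
```

Classical BPE would pick the globally most frequent pair; the only change here is choosing it from the stats of the currently worst-compressed language, exactly as the post describes.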

🚨New Preprint!

In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs. Our Parity-aware BPE algorithm is a step toward addressing this issue: 🧡

11.08.2025 12:28 πŸ‘ 28 πŸ” 7 πŸ’¬ 3 πŸ“Œ 0

Stop by our poster presentation at @iclr-conf.bsky.social and discuss real multilingual evaluation!
Feel free to reach out anytime during the conference! We’d love to connect!

23.04.2025 05:31 πŸ‘ 3 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

NEW PAPER ALERT: Generating visual narratives to illustrate textual stories remains an open challenge, due to the lack of knowledge needed to constrain faithful and self-consistent generation. Our #CVPR2025 paper proposes a new benchmark, VinaBench, to address this challenge.

01.04.2025 09:08 πŸ‘ 6 πŸ” 5 πŸ’¬ 1 πŸ“Œ 1

Lots of great news out of the EPFL NLP lab these last few weeks. We'll be at @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April / May to present some of our work in training dynamics, model representations, reasoning, and AI democratization. Come chat with us during the conference!

25.02.2025 09:18 πŸ‘ 25 πŸ” 12 πŸ’¬ 1 πŸ“Œ 0

πŸš€ Introducing PICLe: a framework for in-context named-entity detection (NED) using pseudo-annotated demonstrations.
🎯 No human labeling neededβ€”yet it outperforms few-shot learning with human annotations!
#AI #NLProc #LLMs #ICL #NER

17.12.2024 14:51 πŸ‘ 12 πŸ” 8 πŸ’¬ 1 πŸ“Œ 1
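A pseudo-annotated demonstration set plugs into an in-context prompt roughly as in the sketch below. The template is hypothetical (PICLe's actual prompt format is in the paper):

```python
def build_ned_prompt(demos, query):
    """Assemble a few-shot named-entity detection prompt.

    demos: list of (sentence, {entity: type}) pairs; with PICLe these
    labels would come from pseudo-annotation rather than humans.
    """
    blocks = []
    for sentence, entities in demos:
        tagged = ", ".join(f"{e} ({t})" for e, t in entities.items())
        blocks.append(f"Sentence: {sentence}\nEntities: {tagged}")
    # The query sentence ends the prompt; the model completes the labels.
    blocks.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(blocks)
```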
https://www.pnas.org/doi/full/10.1073/pnas.2414955121

What’s your take on integrating AI into education while maintaining rigor? πŸ€”
Check out the paper for the key findings and join the discussion on AI’s place in higher education: t.co/tJ8Gg1FRCy

05.12.2024 10:20 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

AI is reshaping #education, but are we ready? 🚨
Our new @pnas.org article explores how #LLMs challenge traditional assessments in higher education.
Instead of banning #AI, we argue for redesigning assessments to emphasize real-world problem-solving and ethical AI use.

05.12.2024 10:20 πŸ‘ 2 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

INCLUDE evaluates how well LLMs grasp regional knowledgeβ€”local customs, culture, and info users actually need.
With ~200K questions from 52 countries, it's time to build AI that truly includes πŸ€—

πŸ“„Check out our paper for more details:
arxiv.org/abs/2411.19799

02.12.2024 21:25 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Excited to share our work on INCLUDE! πŸš€
INCLUDE sets a new standard for #LLM benchmarksβ€”spanning 44 languages with a focus on regional knowledge and cultural context 🌍
Time for LLMs to meet the world where it is, not where it’s translated to!
#Multilingual #AI #NLProc

02.12.2024 21:25 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

.@icepfl.bsky.social is hiring for multiple positions in CS (including one open call): www.epfl.ch/about/workin...

Apply to come join us in beautiful Lausanne!

26.11.2024 11:40 πŸ‘ 13 πŸ” 9 πŸ’¬ 0 πŸ“Œ 0

EPFL's new AI Center has a Call for applications for postdoc fellowships in all AI-related areas. Come join if you're interested in working with me and fantastic AI colleagues!

Extra perk: we actually do have lots of GPUs!

Deadline: November 29th

More info at:
www.epfl.ch/research/fun...

26.11.2024 23:45 πŸ‘ 20 πŸ” 9 πŸ’¬ 0 πŸ“Œ 0

βœ‹πŸ»

28.11.2024 11:08 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

βœ‹πŸ»

28.11.2024 11:08 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

βœ‹πŸ»

28.11.2024 11:08 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0