new extensive evaluation of different optimizers for LLM training
arxiv.org/abs/2509.01440
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
Using the 'right' data can hugely speed up LLM training, but how do you find the best training data in the vast sea of a whole web crawl?
We propose a simple classifier-based selection, enabling better multilingual LLMs 🧵
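A minimal sketch of the general classifier-based selection recipe (illustration only, not necessarily the exact pipeline from the paper): train a lightweight classifier to separate a trusted reference corpus from random crawl text, score every crawled document, and keep the top-scoring fraction, repeated per language. The corpora, the TF-IDF + logistic regression choice, and the keep fraction below are all placeholder assumptions.

```python
# Sketch of classifier-based pretraining-data selection (placeholder setup).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_classifier(reference_docs, random_crawl_docs):
    """Binary classifier: trusted reference corpus (1) vs. random crawl (0)."""
    texts = list(reference_docs) + list(random_crawl_docs)
    labels = np.array([1] * len(reference_docs) + [0] * len(random_crawl_docs))
    vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)
    return vec, clf

def select_top_fraction(vec, clf, crawl_docs, keep_fraction=0.1):
    """Keep the crawl documents the classifier scores as most reference-like."""
    scores = clf.predict_proba(vec.transform(crawl_docs))[:, 1]
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return [doc for doc, s in zip(crawl_docs, scores) if s >= cutoff]
```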
#ICLR #TrainBetterLM I am at ICLR; come to our posters for improved language model training!
Recycle gradients for faster neural net training with AdEMAmix iclr.cc/virtual/2025... (Fri Apr 25, 10 am).
1/3
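For context, a minimal sketch of an AdEMAmix-style update, written from memory (the paper has the exact rule plus schedulers for alpha and beta3, omitted here): Adam's fast gradient EMA is complemented by a much slower EMA, so gradients from many thousands of steps ago still contribute to the update direction.

```python
# Hedged sketch of one AdEMAmix-style parameter update (not the official code).
import torch

@torch.no_grad()
def ademamix_step(p, g, state, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    b1, b2, b3 = betas
    if not state:  # lazy state init on first call
        state.update(step=0, m1=torch.zeros_like(p),
                     m2=torch.zeros_like(p), v=torch.zeros_like(p))
    state["step"] += 1
    t = state["step"]
    state["m1"].mul_(b1).add_(g, alpha=1 - b1)        # fast EMA, as in Adam
    state["m2"].mul_(b3).add_(g, alpha=1 - b3)        # slow EMA: "recycled" old gradients
    state["v"].mul_(b2).addcmul_(g, g, value=1 - b2)  # second moment
    m1_hat = state["m1"] / (1 - b1 ** t)              # bias correction
    v_hat = state["v"] / (1 - b2 ** t)
    update = (m1_hat + alpha * state["m2"]) / (v_hat.sqrt() + eps)
    p.add_(update + weight_decay * p, alpha=-lr)
```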
I am excited to announce that I will join the University of Zurich as an assistant professor in August this year! I am looking for PhD students and postdocs starting from the fall.
My research interests include optimization, federated learning, machine learning, privacy, and unlearning.
The Swiss AI Initiative has launched open calls for disruptive ideas - Democratizing large-scale AI for the benefit of society.
Send your idea by the end of March and run it on one of the largest public AI clusters globally. Everyone is eligible to apply!
swiss-ai.org
Thanks a lot @haeggee.bsky.social and @mjaggi.bsky.social for having me in the MLO group at EPFL @icepfl.bsky.social to present "Large Language Models as Markov Chains".
Slides are available on my website (link in thread).
New experiments with Llama and Gemma models in the updated paper!
What is the true depth of an LLM?
Together with @danielepal.bsky.social, @matpagliardini.bsky.social, M. Jaggi and @francois.fleuret.org we show that LLMs have a smaller effective depth that can be exploited to increase inference speed in multi-GPU settings!
arxiv.org/abs/2502.02790
(1/N)
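A toy illustration of the kind of trick a small effective depth allows, as I understand it (hypothetical code, not the paper's exact method): if two consecutive blocks contribute roughly independent residual updates, both can read the same input, run on different GPUs, and have their updates summed, halving the sequential depth.

```python
# Toy comparison of sequential vs. "parallelized" residual blocks.
import torch
import torch.nn as nn

def sequential_pair(x, block_a, block_b):
    # Standard execution: block_b sees block_a's output.
    x = x + block_a(x)
    return x + block_b(x)

def parallel_pair(x, block_a, block_b):
    # Both blocks read the same input (so they could live on different GPUs);
    # their residual contributions are summed. How good this approximation is,
    # and for which layers, is exactly what an effective-depth analysis asks.
    return x + block_a(x) + block_b(x)

# Tiny check with random MLP "blocks".
blocks = [nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)) for _ in range(2)]
x = torch.randn(4, 64)
rel_err = ((sequential_pair(x, *blocks) - parallel_pair(x, *blocks)).norm()
           / sequential_pair(x, *blocks).norm())
print(f"relative difference between the two schedules: {rel_err:.3f}")
```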
Ok, so I can finally talk about this!
We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale.
The model has an internal latent space in which it can adaptively spend more compute to think longer.
I think the tech report ...
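For readers wondering what recurrent depth looks like mechanically, here is a very rough sketch under my own assumptions (placeholder architecture, not the report's exact design): a prelude embeds the tokens, a core block is iterated a variable number of times on a latent state, and a head decodes, so the number of iterations becomes a test-time compute knob.

```python
# Rough sketch of a recurrent-depth language model forward pass (illustrative only).
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    def __init__(self, d_model=512, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)             # "prelude"
        self.inject = nn.Linear(2 * d_model, d_model)         # mix latent with input
        self.core = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.head = nn.Linear(d_model, vocab)                  # decode the latent

    def forward(self, tokens, num_iters=4):
        e = self.embed(tokens)
        s = torch.randn_like(e)                                # random initial latent
        for _ in range(num_iters):                             # more iterations = more "thinking"
            s = self.core(self.inject(torch.cat([s, e], dim=-1)))
        return self.head(s)
```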
Congrats! How important is scale for it to work? In your previous maze work it was clear a recurrent algo could solve the task. The recurrent state could be used as a scratchpad, each iteration decreasing the loss further. Language feels different, with many local minima along the recurrent path.
Interesting loss curves. I'm not familiar enough with the task to know whether the spikes are expected, but would be curious to see the grad norm.
Which task?
Let's also call on the silent crowd, me included, to start sharing more. Let's be the change we want to see. You disagree with the political agenda of X? Protest by sharing your latest work/thoughts on Bsky.
Can we scale small, open LMs to o1 level? Using classical probabilistic inference methods, YES!
A particle filtering approach to improved inference without any training!
Check out probabilistic-inference-scaling.github.io
By Aisha Puri et al.
Joint MIT-CSAIL & RedHat
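The gist, as a bare-bones sketch (my paraphrase of the generic particle-filtering recipe, not the project's actual implementation; `generate_step` and `reward` are placeholder hooks for an LLM step and a reward model): keep a population of partial solutions, extend each, weight by reward, and resample so promising candidates get more compute.

```python
# Generic particle-filtering loop for inference-time scaling (placeholder hooks).
import math
import random

def particle_filter(prompt, generate_step, reward, num_particles=8, num_steps=10):
    particles = [prompt] * num_particles
    for _ in range(num_steps):
        # 1) Propagate: extend each partial solution by one step/chunk.
        particles = [generate_step(p) for p in particles]
        # 2) Weight: score each partial solution with the reward model.
        weights = [math.exp(reward(p)) for p in particles]
        # 3) Resample: promising particles are duplicated, weak ones dropped.
        particles = random.choices(particles, weights=weights, k=num_particles)
    return max(particles, key=reward)
```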
New open-weights 24B model with performance comparable to Llama 3.3 70B. Congrats, Mistral team!
mistral.ai/news/mistral...
1/ Could ChatGPT get an engineering degree? Spoiler: yes! In our new @pnas.org article, we explore how AI assistants like GPT-4 perform in STEM university courses, and on average they pass a staggering 91.7% of core courses. 🧵 #AI #HigherEd #STEM #LLMs #NLProc
In my quick test on a small (120M) model trained on 14B tokens, the difference in the end is not so significant. Maybe the gap widens when training on less data, closer to Chinchilla optimal, or for larger models... I'm team ReLU...
New blog post on flow matching: dl.heeere.com/cfm/
Contains some nice visuals too!
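For the impatient, the standard conditional flow matching objective with a linear interpolation path fits in a few lines (a generic sketch; the post covers the derivation and variants):

```python
# One conditional flow matching training loss on a straight noise-to-data path.
import torch

def cfm_loss(velocity_net, x1):
    """velocity_net(x_t, t) predicts a velocity field; x1 is a batch of data."""
    x0 = torch.randn_like(x1)                                   # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1                                 # point on the straight path
    target_v = x1 - x0                                          # velocity of that path
    return ((velocity_net(x_t, t) - target_v) ** 2).mean()
```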
Let o1 write a review and ask the non-expert human reviewer to verify its claims/refine the review.
A wise man once told me a paper should not have more than one table. Of course there can be exceptions, but minimizing the number of tables is something I always have in mind when writing. Isolate one or two key messages from the table and convey them with graphs.