
Valérie Castin

@vcastin

PhD student in machine learning at École Normale Supérieure, Paris. My webpage: https://vcastin.github.io/

96
Followers
54
Following
1
Posts
25.11.2024
Joined

Latest posts by Valérie Castin @vcastin

I asked "on the other platform" what the most important improvements to the original 2017 transformer were.

That post was quite popular, and here is a synthesis of the responses:

28.04.2025 06:47 👍 204 🔁 43 💬 4 📌 3

Excited to share Soup-of-Experts, a new neural network architecture that, for any given task, can instantiate in a flash a small model that performs very well on it.

Made with ❤️ at Apple

Thanks to my co-authors David Grangier, Angelos Katharopoulos, and Skyler Seto!

arxiv.org/abs/2502.01804

05.02.2025 09:32 👍 12 🔁 4 💬 0 📌 0

A cute result from Valérie's work is that Gaussian distributions remain closed under evolution by attention layers, allowing one to study an ODE in (mean, covariance) space. In particular, this enables the analysis of the "clustering of tokens" toward low-rank covariances.

01.02.2025 09:54 👍 5 🔁 2 💬 0 📌 0

How do tokens evolve as they are processed by a deep Transformer?

With José A. Carrillo, @gabrielpeyre.bsky.social and @pierreablin.bsky.social, we tackle this in our new preprint: A Unified Perspective on the Dynamics of Deep Transformers arxiv.org/abs/2501.18322

ML and PDE lovers, check it out!

31.01.2025 16:56 👍 95 🔁 16 💬 2 📌 0
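The "tokens evolving through depth" picture can be illustrated with a toy Euler discretization of self-attention dynamics. This is a generic sketch of the clustering phenomenon, not the preprint's exact model; `attention_step`, `beta`, and `dt` are illustrative names and values.

```python
import numpy as np

def attention_step(X, dt=0.1, beta=1.0):
    """One Euler step of the toy self-attention ODE x_i' = sum_j A_ij (x_j - x_i).

    A is a row-softmax of pairwise scores, so the update is a convex
    combination of the current tokens: the cloud stays in its convex hull.
    """
    scores = beta * X @ X.T
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return X + dt * (A @ X - X)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))        # 8 tokens in dimension 2
spread0 = np.ptp(X, axis=0).max()      # initial extent of the token cloud
for _ in range(300):                   # depth, in the continuous-limit view
    X = attention_step(X)
# the token cloud contracts over depth (a simple form of clustering)
```

Because each step is an averaging map, the token cloud shrinks monotonically toward a consensus point in this toy setting; the preprint studies such dynamics in far greater generality.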
Link preview: Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as...

Excited to see Sigmoid Attention accepted at ICLR 2025!!

Make attention ~18% faster with a drop-in replacement 🚀

Code:
github.com/apple/ml-sig...

Paper:
arxiv.org/abs/2409.04431

24.01.2025 18:46 👍 28 🔁 5 💬 1 📌 0
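The "drop-in replacement" idea can be sketched in a few lines: swap the row-wise softmax for an elementwise sigmoid on the attention scores. A minimal numpy sketch, assuming the bias-shifted variant with a −log(n) offset to keep row mass comparable to softmax; see the paper and repo for the actual recommended setup.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # standard attention: each row of weights sums to 1
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def sigmoid_attention(Q, K, V):
    # drop-in replacement: elementwise sigmoid, no row normalization;
    # the -log(n) bias keeps total row mass near 1 at initialization
    n = K.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) - np.log(n)
    w = 1.0 / (1.0 + np.exp(-scores))
    return w @ V
```

Removing the row-wise normalization is what makes the sigmoid variant cheaper to fuse into fast attention kernels, which is where the reported speedup comes from.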

The Mathematics of Artificial Intelligence: In this introductory and highly subjective survey, aimed at a general mathematical audience, I showcase some key theoretical concepts underlying recent advancements in machine learning. arxiv.org/abs/2501.10465

22.01.2025 09:11 👍 147 🔁 43 💬 2 📌 1

Machine learning has made incredible breakthroughs, but our theoretical understanding lags behind.

We take a step towards unravelling its mystery by explaining why the phenomenon of disentanglement arises in generative latent variable models.

Blog post: carl-allen.github.io/theory/2024/...

18.12.2024 16:57 👍 18 🔁 4 💬 1 📌 1
Link preview: OpenAI explores advertising as it steps up revenue drive
ChatGPT maker hires advertising talent from big tech rivals

It's like when Google decided to fund itself through ads, but worse, because chatbots are already much more misleading and anthropomorphic than search engines. #AIEthics www.ft.com/content/9350...

08.12.2024 20:47 👍 48 🔁 15 💬 4 📌 5