@julien-siems
19 Followers · 215 Following · 11 Posts · Joined 22.11.2024
Latest posts by @julien-siems

⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank, which sheds light on why DeltaProduct extrapolates to longer sequences better than DeltaNet
- Improved scaling analysis
And more!
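One common way to quantify how "full" a recurrent hidden state is uses the entropy-based effective rank of its singular-value spectrum (Roy & Vetterli, 2007). The sketch below is illustrative only — the exact metric used in the paper's analysis may differ:

```python
import numpy as np

def effective_rank(S, eps=1e-12):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution. Equals 1 for a rank-1
    matrix and grows toward min(S.shape) as the spectrum flattens."""
    sv = np.linalg.svd(S, compute_uv=False)
    p = sv / (sv.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
low = np.outer(rng.standard_normal(8), rng.standard_normal(8))  # rank 1
full = rng.standard_normal((8, 8))                              # generic

print(effective_rank(low))   # ~1.0
print(effective_rank(full))  # noticeably larger than 1
```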

14.06.2025 08:02 👍 0 🔁 0 💬 0 📌 0

DeltaProduct is here! Achieve better state tracking through highly parallel execution. Explore more! 🚀

09.04.2025 10:11 👍 5 🔁 1 💬 0 📌 0
flash-linear-attention/fla/layers/gated_deltaproduct.py at main · fla-org/flash-linear-attention 🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton - fla-org/flash-linear-attention

DeltaProduct is now available in the flash-linear-attention library
github.com/fla-org/flas...

08.04.2025 06:09 👍 0 🔁 0 💬 0 📌 0
State Tracking in Scalable Linear RNNs - Riccardo Grazzi & Julien Siems | ASAP Seminar #04 (YouTube video by ASAP Seminar Series)

9/9 We also discussed state tracking in Linear RNNs at the ASAP Seminar—watch our full talk: www.youtube.com/watch?v=R_0v...
Also take a look at these excellent blog posts:
leloykun.github.io/ponder/block... (by @leloy.bsky.social )
jyopari.github.io/posts/househ... (by Jyothish Pari)

28.03.2025 14:39 👍 1 🔁 0 💬 2 📌 0
DeltaProduct: Improving State-Tracking in Linear RNNs via... Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However...

8/9 This was a great project with @timurcarstensen.bsky.social , @arberz.bsky.social , Frank Hutter, Massimiliano Pontil, and @riccardograzzi.bsky.social
Check out our Oral at the FM-Wild Workshop at @ICLR:
openreview.net/forum?id=nvb...

28.03.2025 14:39 👍 0 🔁 0 💬 1 📌 0
Post image

7/9 In language modeling tasks, DeltaProduct surpasses DeltaNet across lm-eval-harness benchmarks, with notable gains in length extrapolation performance as we increase nₕ.

28.03.2025 14:39 👍 0 🔁 0 💬 1 📌 0
Post image

6/9 On modular arithmetic with brackets, a context-free-grammar task, performance likewise improves as nₕ increases.

28.03.2025 14:39 👍 0 🔁 0 💬 1 📌 0
Post image

5/9 To improve state tracking, increasing the number of Householders nₕ is more effective than increasing the number of layers l: l=1, nₕ=2 (top row) yields much better performance than l=2, nₕ=1 (bottom row) on S₃, S₄, and A₅, and nₕ=4 achieves good performance on S₅. (nₕ=1 recovers DeltaNet.)
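The Sₙ word problem used in these benchmarks asks a model to track the running composition of a sequence of permutations; a model can only solve it if its state can represent the current group element. A minimal sketch of the task itself (not of any model):

```python
import random
from itertools import permutations

def compose(p, q):
    """Apply q after p; both are tuples encoding permutations of range(n),
    so the result at position j is q[p[j]]."""
    return tuple(q[i] for i in p)

elements = list(permutations(range(3)))   # the 6 elements of S_3
random.seed(0)
seq = [random.choice(elements) for _ in range(10)]

state = tuple(range(3))                   # identity permutation
targets = []                              # per-token prediction targets
for g in seq:
    state = compose(state, g)
    targets.append(state)

print(targets[-1])                        # final group element, still in S_3
```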

28.03.2025 14:39 👍 0 🔁 0 💬 1 📌 0
Post image

4/9 Building on this insight, DeltaProduct performs nₕ gradient steps per token (with different per-step keys and values), yielding a state-transition matrix A(xᵢ) as a product of nₕ generalized Householder transforms—interpolating between a rank-1 update and a dense matrix.
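This construction can be sketched in a few lines of NumPy. Names and shapes here are illustrative, not the flash-linear-attention API; the point is that a product of nₕ generalized Householder factors perturbs the identity on at most an nₕ-dimensional subspace:

```python
import numpy as np

def deltaproduct_transition(keys, betas):
    """State-transition matrix A(x_i) as a product of n_h generalized
    Householder transforms (I - beta_j * k_j k_j^T).
    keys: (n_h, d) unit-norm keys; betas: (n_h,) step sizes in [0, 2]."""
    d = keys.shape[1]
    A = np.eye(d)
    for k, beta in zip(keys, betas):
        A = A @ (np.eye(d) - beta * np.outer(k, k))
    return A

rng = np.random.default_rng(1)
d, n_h = 6, 3
keys = rng.standard_normal((n_h, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
betas = rng.uniform(0.0, 2.0, size=n_h)

A = deltaproduct_transition(keys, betas)
# I - A has rank at most n_h: the update interpolates between a rank-1
# change (n_h = 1, DeltaNet) and a dense matrix (n_h = d).
print(np.linalg.matrix_rank(np.eye(d) - A))  # <= 3
```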

28.03.2025 14:39 👍 0 🔁 0 💬 1 📌 0
Post image

3/9 Following @sontaiscute.bsky.social et al. (2024), DeltaNet can be seen as performing one gradient descent step per token on an associative recall loss, resulting in a rank-1 state-transition matrix.
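The gradient-descent view is easy to verify numerically. A minimal sketch (variable names are illustrative): for the loss L(S) = ½‖S k − v‖², the gradient is (S k − v) kᵀ, so one step with rate β gives exactly the rank-1 DeltaNet transition S(I − β k kᵀ) + β v kᵀ:

```python
import numpy as np

def deltanet_step(S, k, v, beta):
    """One DeltaNet recurrence step as one gradient-descent step on the
    associative-recall loss L(S) = 0.5 * ||S @ k - v||^2:
      S <- S - beta * (S @ k - v) @ k^T
         = S @ (I - beta * outer(k, k)) + beta * outer(v, k)."""
    return S - beta * np.outer(S @ k - v, k)

d = 4
rng = np.random.default_rng(0)
S = np.zeros((d, d))
k = rng.standard_normal(d)
k /= np.linalg.norm(k)          # unit-norm key
v = rng.standard_normal(d)

# With beta = 1 and a unit key, one step stores the association exactly:
S = deltanet_step(S, k, v, beta=1.0)
print(np.allclose(S @ k, v))    # True
```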

28.03.2025 14:39 👍 0 🔁 0 💬 1 📌 0

2/9 Linear RNNs’ expressivity depends on the state-transition matrix structure. Diagonal linear RNNs (Mamba, GLA, mLSTM) only allow token mixing. DeltaNet and RWKV-7 use a rank-1 update enabling token+channel mixing. DeltaProduct enables adjustable higher-rank updates—but how?
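The structural difference is easy to see side by side. A small illustrative sketch, assuming a single state vector h (real models apply this per head and per channel dimension):

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
h = rng.standard_normal(d)

# Diagonal transition (Mamba/GLA/mLSTM style): each channel evolves
# independently, so no information moves between channels.
A_diag = np.diag(rng.uniform(0.0, 1.0, d))

# Rank-1 transition (DeltaNet/RWKV-7 style): the outer product k k^T
# couples channels, enabling token + channel mixing.
k = rng.standard_normal(d)
k /= np.linalg.norm(k)
A_r1 = np.eye(d) - np.outer(k, k)

print(A_diag @ h)  # output channel i depends only on h[i]
print(A_r1 @ h)    # each output channel mixes several input channels
```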

28.03.2025 14:39 👍 0 🔁 0 💬 1 📌 0
Post image

1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!

28.03.2025 14:39 👍 8 🔁 2 💬 1 📌 2