⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet.
- Improved scaling analysis
And more!
14.06.2025 08:02
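The "effective rank" mentioned above can be illustrated with a minimal NumPy sketch. This uses the standard entropy-based definition of effective rank (Roy & Vetterli, 2007) as an assumption; the paper's exact measure may differ, and the function name here is ours:

```python
import numpy as np

def effective_rank(matrix: np.ndarray) -> float:
    """Effective rank: exp of the entropy of the normalized
    singular-value distribution (Roy & Vetterli, 2007)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()                          # normalize singular values
    entropy = -np.sum(p * np.log(p + 1e-12)) # Shannon entropy
    return float(np.exp(entropy))

# A rank-1 matrix has effective rank ~1; a well-conditioned
# matrix has effective rank close to its dimension.
print(effective_rank(np.outer(np.ones(8), np.ones(8))))  # ≈ 1.0
print(effective_rank(np.eye(8)))                         # ≈ 8.0
```

A hidden state whose effective rank stays well below the state dimension uses less of its capacity, which is one lens on the extrapolation gap between DeltaNet and DeltaProduct.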
DeltaProduct is here! Achieve better state tracking through highly parallel execution. Explore more!🚀
09.04.2025 10:11
State Tracking in Scalable Linear RNNs - Riccardo Grazzi & Julien Siems | ASAP Seminar #04
YouTube video by ASAP Seminar Series
9/9 We also discussed state tracking in Linear RNNs at the ASAP Seminar—watch our full talk: www.youtube.com/watch?v=R_0v...
Also take a look at these excellent blog posts:
leloykun.github.io/ponder/block... (by @leloy.bsky.social )
jyopari.github.io/posts/househ... (by Jyothish Pari)
28.03.2025 14:39
7/9 In language modeling tasks, DeltaProduct surpasses DeltaNet across lm-eval-harness benchmarks, with notable gains in length extrapolation performance as we increase nₕ.
28.03.2025 14:39
6/9 On modular arithmetic with brackets (a context-free grammar), performance also improves as nₕ increases.
28.03.2025 14:39
5/9 To improve state tracking, increasing the number of Householder matrices nₕ is more effective than increasing the number of layers l: l=1, nₕ=2 (top row) yields much better performance than l=2, nₕ=1 (bottom row) on S₃, S₄, and A₅, and nₕ=4 performs well on S₅. Note that nₕ=1 recovers DeltaNet.
28.03.2025 14:39
4/9 Building on this insight, DeltaProduct performs nₕ gradient steps per token (with different per-step keys and values), yielding a state-transition matrix A(xᵢ) as a product of nₕ generalized Householder transforms—interpolating between a rank-1 update and a dense matrix.
28.03.2025 14:39
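The product-of-Householders structure in 4/9 can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation; the function name and the choice of β values are ours:

```python
import numpy as np

def deltaproduct_transition(keys, betas):
    """State-transition matrix as a product of n_h generalized
    Householder transforms: A = Π_j (I - β_j k_j k_jᵀ), with
    unit-norm keys k_j and β_j ∈ [0, 2]. n_h = 1 recovers
    DeltaNet's rank-1 form; larger n_h moves toward a dense matrix."""
    d = keys.shape[1]
    A = np.eye(d)
    for k, beta in zip(keys, betas):
        A = (np.eye(d) - beta * np.outer(k, k)) @ A
    return A

rng = np.random.default_rng(0)
d, n_h = 4, 2
keys = rng.normal(size=(n_h, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
betas = np.array([2.0, 2.0])  # β = 2 gives exact reflections
A = deltaproduct_transition(keys, betas)
# A product of two reflections is a rotation: orthogonal with det = +1,
# which a single rank-1 update (n_h = 1) cannot represent.
print(np.allclose(A @ A.T, np.eye(d)))      # True
print(np.isclose(np.linalg.det(A), 1.0))    # True
```

This is why nₕ matters for state tracking: with nₕ ≥ 2, the transition matrices can realize rotations and richer group actions than a single generalized Householder step allows.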
3/9 Following @sontaiscute.bsky.social et al. (2024), DeltaNet can be seen as performing one gradient descent step per token on an associative recall loss, resulting in a rank-1 state-transition matrix.
28.03.2025 14:39
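The delta-rule view in 3/9 can be written out directly. A minimal sketch, assuming the quadratic associative-recall loss L(S) = ½‖S k − v‖² (the function name is ours):

```python
import numpy as np

def deltanet_step(S, k, v, beta):
    """One gradient-descent step on L(S) = 0.5 * ||S k - v||^2,
    i.e. the delta rule:
        S <- S - beta * (S k - v) kᵀ = S (I - beta k kᵀ) + beta v kᵀ.
    The state-transition matrix (I - beta k kᵀ) is a rank-1
    update of the identity."""
    return S - beta * np.outer(S @ k - v, k)

d = 4
S = np.zeros((d, d))
k = np.eye(d)[0]                    # unit-norm key
v = np.arange(d, dtype=float)       # value to associate with k
S = deltanet_step(S, k, v, beta=1.0)
# With beta = 1 and a unit key, the association is stored exactly:
print(np.allclose(S @ k, v))  # True
```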
2/9 Linear RNNs’ expressivity depends on the state-transition matrix structure. Diagonal linear RNNs (Mamba, GLA, mLSTM) only allow token mixing. DeltaNet and RWKV-7 use a rank-1 update enabling token+channel mixing. DeltaProduct enables adjustable higher-rank updates—but how?
28.03.2025 14:39
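The diagonal-vs-rank-1 distinction in 2/9 is easy to see numerically. A toy sketch (the specific matrices are ours, chosen only to make the mixing pattern visible):

```python
import numpy as np

d = 4
h = np.array([1.0, 0.0, 0.0, 0.0])  # hidden state, one active channel

# Diagonal transition (Mamba/GLA/mLSTM-style): each channel evolves
# independently, so channel 0 can never influence channels 1..3.
A_diag = np.diag([0.9, 0.8, 0.7, 0.6])
print(A_diag @ h)   # [0.9, 0, 0, 0] — no channel mixing

# Rank-1 generalized-Householder transition (DeltaNet/RWKV-7-style):
# the update couples channels through the key vector k.
k = np.ones(d) / np.sqrt(d)
A_rank1 = np.eye(d) - 1.0 * np.outer(k, k)
print(A_rank1 @ h)  # channel 0 now leaks into all channels
```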
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
28.03.2025 14:39