As we hope women will thrive in research more each year than the last, we encourage all of them to apply to our lab for internships, Master's or PhD degrees with
@sarath-chandar.bsky.social !
The Chandar Research Lab remains committed to supporting women and other underrepresented communities @mila-quebec.bsky.social and in ML, with initiatives such as the graduate application assistance program or a computer science summer school for high-school students going into undergrad. 👩‍🔬
This is only a sneak peek at the work they did last year, as much of their research is still under submission. Stay tuned for more interesting papers spanning ML for biology, model merging, continual learning, and more!
Generalization Can Emerge in Tabular Foundation Models From a Single Table by Nour Shaheen at the AI for Tabular Data workshop @euripsconf.bsky.social 2025!
arxiv.org/abs/2511.09665
The Expressive Limits of Diagonal SSMs for State-Tracking by Behnoush Khavari @iclr-conf.bsky.social 2026.
iclr.cc/virtual/2026...
NeoBERT: A Next Generation BERT by @lola-le-breton.bsky.social published @tmlr-pub.bsky.social and @iclr-conf.bsky.social in Rio this year.
arxiv.org/abs/2502.19587
Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models by Istabrak Abbes @collasconf.bsky.social
arxiv.org/abs/2508.01908
Small Encoders Can Rival Large Decoders in Detecting Groundedness by Istabrak Abbes, published
@aclmeeting.bsky.social 2025.
aclanthology.org/2025.finding...
Maryam Hashemzadeh, @lola-le-breton.bsky.social, Istabrak Abbes, Nour Shaheen, Behnoush Khavari, Anabel Tan and @katelobacheva.bsky.social. Give them a follow and look at this list of their publications with our lab in the past year! ⬇️
This week, as we celebrated International Women's Day for the 115th time on Sunday, the Chandar Lab wanted to pay tribute to all the amazing women doing research 👩‍🔬, and to highlight the cutting-edge work they do at our lab every day... 🧵
Work done by @nilaksh404.bsky.social, Antoine Clavaud, @mreymond.bsky.social, Francois Rivest, and @sarath-chandar.bsky.social
Check out the paper at: arxiv.org/abs/2602.09396
Code : github.com/chandar-lab/...
Look at the latents! A t-SNE analysis shows that our method (top) learns structured, temporally coherent representations faster than standard streaming RL.
Our method systematically outperforms existing baselines across Atari, MinAtar, and Octax. The best part? It remains efficient enough to train on just a few CPU cores.
Streaming data is highly correlated, which usually degrades training. To fix this, we introduce Orthogonal Gradient Updates: by projecting each gradient onto a subspace orthogonal to its history, we keep learning stable and effective.
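As a rough sketch of the projection idea, here is a minimal numpy version. The function name and the assumption that the stored history directions are unit-normalized and mutually orthogonal are mine, not the paper's exact algorithm:

```python
import numpy as np

def project_orthogonal(grad, history):
    """Project `grad` onto the subspace orthogonal to past gradient directions.

    `history` is a list of previous gradient directions, assumed here to be
    unit-normalized and mutually orthogonal (a toy simplification).
    """
    g = grad.astype(float).copy()
    for h in history:
        g -= np.dot(g, h) * h  # remove the component along each past direction
    return g

# Toy usage: the projected gradient has no component along the stored direction.
h1 = np.array([1.0, 0.0, 0.0])           # a past gradient direction (unit norm)
g = np.array([3.0, 4.0, 0.0])
g_perp = project_orthogonal(g, [h1])
print(np.dot(g_perp, h1))                # ~0: orthogonal to the history
```

Removing the component of each new gradient along recent update directions is what decorrelates consecutive updates on a highly correlated stream.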
We bring Self-Predictive Representations (SPR) to the streaming pipeline. By predicting future latent states, we force the encoder to learn much richer features from every observed frame, without the massive memory footprint of a replay buffer.
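A minimal numpy sketch of an SPR-style objective, with random linear maps standing in for the real learned encoder and transition model (every name here is an illustrative assumption, and real SPR also uses an EMA target encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned networks.
W_enc = rng.normal(size=(8, 16))   # encoder: observation -> latent
W_pred = rng.normal(size=(8, 8))   # transition model: latent_t -> latent_{t+1}

def cosine_loss(p, z):
    """SPR-style loss: negative cosine similarity between prediction and target."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

obs_t, obs_next = rng.normal(size=16), rng.normal(size=16)
z_t, z_next = W_enc @ obs_t, W_enc @ obs_next   # encode both frames
z_pred = W_pred @ z_t                            # predict the next latent
loss = cosine_loss(z_pred, z_next)               # in [-1, 1]; minimized at -1
print(loss)
```

Minimizing this prediction loss alongside the RL loss gives the encoder a dense learning signal from every transition, which is exactly what a bufferless streaming agent is otherwise missing.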
Without a replay buffer, streaming agents struggle to build meaningful representations: traditional value-based losses alone can't exploit the full informational content of transient data before it's gone.
Streaming Reinforcement Learning (RL) is a huge challenge: transitions are used once and discarded immediately. This makes agents extremely sample-inefficient. But what if we could "squeeze" more information out of every single frame?
Check out our latest paper!
Shoutout to the authors: Kamran Chitsaz, Milad Aghajohari, @a-kazemnejad.bsky.social. Supervised by @sarath-chandar.bsky.social, @murefil.bsky.social, Aaron Courville, and @sivareddyg.bsky.social.
Learn more at: arxiv.org/abs/2510.06557
Build with: github.com/McGill-NLP/the-markovian-thinker
🧩 Even state-of-the-art models show Markovian Thinking zero-shot: both GPT-oss-120B and Qwen3-30B-A3B recover LongCoT-level reasoning with no special prompting or training required, and provide plenty of in-distribution positives at initialization, so RL with Delethink is primed to scale!
🔥 Going further, we scaled DeepSeek R1-1.5B to a thinking budget of 96K tokens in 150 RL steps. Accuracy jumped, with mean trace lengths around 40K tokens.
Markovian Thinking is instantiated by Delethink, an RL environment. With it, we trained DeepSeek R1-1.5B and demonstrated:
1️⃣ The same scaling as LongCoT-RL, but at lower cost,
2️⃣ Better test-time scaling, improving past 24K tokens while LongCoT-RL plateaus,
3️⃣ All while keeping compute costs linear!
Markovian Thinking works by:
1️⃣ Making the LLM reason in 8K-token chunks.
2️⃣ Resetting the context at each chunk boundary, carrying over only a small textual state from the last chunk.
3️⃣ Continuing to reason from that carried-over state.
This decouples thinking length from context size, achieving linear compute and constant memory!
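The chunked loop described above can be sketched in a few lines of Python. The `generate_chunk` stub and the keep-the-tail state rule are illustrative assumptions, not Delethink's exact mechanics:

```python
# Sketch of Markovian, chunked reasoning with a bounded context.
CHUNK_BUDGET = 8      # stands in for the 8K-token chunk size
STATE_SIZE = 3        # tokens carried across each chunk boundary

def generate_chunk(context, budget):
    """Stub model: emits `budget` placeholder tokens; a real LLM would go here."""
    return [f"t{len(context) + i}" for i in range(budget)]

def markovian_think(question, n_chunks):
    context, trace = list(question), []
    for _ in range(n_chunks):
        chunk = generate_chunk(context, CHUNK_BUDGET)
        trace.extend(chunk)
        context = chunk[-STATE_SIZE:]       # reset: keep only a small textual state
        assert len(context) <= STATE_SIZE   # memory stays constant per chunk
    return trace

trace = markovian_think(["q0", "q1"], n_chunks=4)
print(len(trace))   # total thinking length grows linearly with chunk count
```

Because the context never exceeds one chunk plus a constant-size state, total compute grows linearly in the number of chunks while memory stays constant, independent of how long the full reasoning trace becomes.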
"The Markovian Thinker", developed by our lab, has been accepted at @iclr-conf.bsky.social ✨✨ This work achieves long reasoning without the quadratic attention tax by making LLMs reason in chunks with a bounded state, reaching linear compute and constant memory while scaling beyond its training limits! 🔥
openreview.net/forum?id=5bg...
Joint work of Mehran Shakerinava, Behnoush Khavari, Siamak Ravanbakhsh and @sarath-chandar.bsky.social @mila-quebec.bsky.social .
Takeaways for architecture design:
- Diagonal structure imposes a precise group-theoretic ceiling on expressivity
- Depth helps in a principled way (one layer per Abelian factor)
- But training algorithms need to catch up: expressivity alone isn't enough
Interestingly, initializing near the analytical solution does help: the model then learns and generalizes. This suggests the solutions sit in a basin of attraction that training can't reach from a random init.
A very different failure mode from what's been observed for Transformers on similar tasks.
But there is a catch: expressivity ≠ learnability.
In our experiments, multi-layer diagonal SSMs consistently fail to learn S₃ and A₄ with gradient-based optimization, even though solutions provably exist in the hypothesis class!
We give an explicit 2-layer diagonal SSM construction for S₃: the first layer tracks a C₂ parity automaton, the second a C₃ rotation conditioned on the first layer's state, mirroring the semidirect-product decomposition S₃ ≅ C₃ ⋊ C₂.
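Here is a toy Python check of that layered idea for S₃. The encoding into a parity bit plus a mod-3 counter is my own illustration of the semidirect-product structure, not the paper's actual SSM construction:

```python
# Track S3 with two "layers": layer 1 holds a C2 parity bit, layer 2 a C3
# rotation counter whose update is conditioned on layer 1's state.
R, F = [1, 2, 0], [1, 0, 2]          # rotation (0 1 2) and flip (0 1) in S3
GENS = {"r": R, "f": F}

def compose(p, q):
    """(p . q)(i) = p[q[i]]: apply q first, then p."""
    return [p[q[i]] for i in range(3)]

def track(stream):
    s, k = 0, 0                       # layer-1 parity, layer-2 rotation count
    for g in stream:
        if g == "f":
            s = (s + 1) % 2           # layer 1: C2 parity automaton
        else:                         # g == "r"
            k = (k + (1 if s == 0 else -1)) % 3  # layer 2: C3, conditioned on s
    # Reconstruct the group element f^s . r^k from the two layer states.
    perm = [0, 1, 2]
    for _ in range(k):
        perm = compose(R, perm)
    if s:
        perm = compose(F, perm)
    return perm

def reference(stream):
    """Ground truth: left-multiply the running permutation by each generator."""
    perm = [0, 1, 2]
    for g in stream:
        perm = compose(GENS[g], perm)
    return perm

stream = ["r", "f", "r", "r", "f", "r"]
print(track(stream) == reference(stream))   # the layered tracker matches
```

The conditioning in layer 2 is exactly the relation r·f = f·r⁻¹: once a flip has been absorbed into the parity bit, subsequent rotations must count in the opposite direction, which is all the second layer needs to know from the first.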
What this means concretely:
- Parity (C₂), modular counting (Cₙ): 1 layer suffices
- Permutations of 3 elements (S₃): exactly 2 layers needed
- S₄: 3 layers
- A₅ (non-solvable): no number of diagonal layers will ever work
- The Rubik's cube group: same
Theorem 2: a k-layer complex diagonal SSM can track a group G ⟺ G has a subnormal series of length ≤ k with Abelian factor groups.
This characterizes the expressivity of diagonal SSMs: depth lets you "peel off" one Abelian factor at a time.
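As a concrete illustration of the theorem's condition (a standard group-theory fact, not a result from the paper): S₄ admits a subnormal series of length 3 with Abelian factors, which is why three diagonal layers suffice for it:

```latex
\[
  \{e\} \trianglelefteq V_4 \trianglelefteq A_4 \trianglelefteq S_4,
  \qquad
  S_4/A_4 \cong C_2, \quad
  A_4/V_4 \cong C_3, \quad
  V_4/\{e\} \cong C_2 \times C_2,
\]
```

where V₄ is the Klein four-group. All three factor groups are Abelian, so k = 3 meets the theorem's bound; for A₅ no such series exists at any length, which is what makes it forever out of reach.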