Chandar Research Lab

@chandar-lab

Sarath Chandar's research group at @polymtl, @UMontreal, and @Mila_Quebec, focusing on Machine Learning!

10 Followers · 37 Following · 54 Posts · Joined 19.01.2026

Latest posts by Chandar Research Lab @chandar-lab

As we hope for more women to thrive in research each year than the last, we encourage them to apply to our lab for internships, Master’s, or PhD degrees with
@sarath-chandar.bsky.social !

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

The Chandar Research Lab remains committed to supporting women and other underrepresented communities at @mila-quebec.bsky.social and in ML, with initiatives such as the graduate application assistance program and a Computer Science summer school for high school students heading into undergrad. πŸ‘©β€πŸ”¬

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

This is only a sneak peek at the work they did last year, as much of their research is still under submission. Stay tuned for more interesting papers spanning ML for Biology, model merging, continual learning, and more...

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Generalization Can Emerge in Tabular Foundation Models From a Single Table Deep tabular modelling increasingly relies on in-context learning where, during inference, a model receives a set of $(x,y)$ pairs as context and predicts labels for new inputs without weight updates....

Generalization Can Emerge in Tabular Foundation Models From a Single Table by Nour Shaheen, at the AI for Tabular Data workshop at @euripsconf.bsky.social 2025!

arxiv.org/abs/2511.09665

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
ICLR Poster: The Expressive Limits of Diagonal SSMs for State-Tracking (ICLR 2026)

The Expressive Limits of Diagonal SSMs for State-Tracking by Behnoush Khavari @iclr-conf.bsky.social 2026.

iclr.cc/virtual/2026...

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
NeoBERT: A Next-Generation BERT Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and Deep...

NeoBERT: A Next-Generation BERT by @lola-le-breton.bsky.social, published in @tmlr-pub.bsky.social and presented at @iclr-conf.bsky.social in Rio this year.

arxiv.org/abs/2502.19587

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving...

Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models by Istabrak Abbes, at @collasconf.bsky.social.

arxiv.org/abs/2508.01908

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Small Encoders Can Rival Large Decoders in Detecting Groundedness Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

πŸ“œ Small Encoders Can Rival Large Decoders in Detecting Groundedness by Istabrak Abbes, published at @aclmeeting.bsky.social 2025.

aclanthology.org/2025.finding...

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Maryam Hashemzadeh, @lola-le-breton.bsky.social, Istabrak Abbes, Nour Shaheen, Behnoush Khavari, Anabel Tan, and @katelobacheva.bsky.social. Give them a follow, and check out this list of their publications with our lab over the past year! ⬇️

10.03.2026 15:34 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

This week, as we celebrated International Women’s Day for the 115th time on Sunday, the Chandar Lab wanted to pay tribute to all the amazing women doing researchπŸ‘©β€πŸŽ“, and to highlight the cutting-edge work they do at our lab every day...🧡

10.03.2026 15:34 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
GitHub - chandar-lab/stream-rep-rl: Streaming setup with representation learning for RL

Work done by @nilaksh404.bsky.social, Antoine Clavaud, @mreymond.bsky.social, Francois Rivest, and @sarath-chandar.bsky.social

Check out the paper: arxiv.org/abs/2602.09396
Code: github.com/chandar-lab/...

24.02.2026 15:22 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image

Look at the latents! t-SNE analysis shows that our method (top) learns structured, temporally coherent representations faster than standard streaming RL.

24.02.2026 15:22 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Our method systematically outperforms existing baselines across Atari, MinAtar, and Octax. The best part? It remains efficient enough to train on just a few CPU cores.

24.02.2026 15:22 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Streaming data is highly correlated, which usually destabilizes training. To fix this, we introduced Orthogonal Gradient Updates: by projecting each gradient onto a subspace orthogonal to the recent gradient history, we keep learning stable and effective.

24.02.2026 15:22 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
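The projection step can be sketched in plain Python. This is only an illustration of projecting a gradient against a running orthonormal history; the history size, Gram-Schmidt update, and forgetting rule are our assumptions, not the paper's exact algorithm:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(grad, basis):
    # Subtract from `grad` its component along each orthonormal basis vector.
    g = list(grad)
    for u in basis:
        c = dot(g, u)
        g = [gi - c * ui for gi, ui in zip(g, u)]
    return g

def update_history(basis, grad, max_vectors=4):
    # Gram-Schmidt step: keep only the novel direction of the latest gradient.
    residual = project_orthogonal(grad, basis)
    norm = dot(residual, residual) ** 0.5
    if norm > 1e-8:
        basis.append([x / norm for x in residual])
    if len(basis) > max_vectors:
        basis.pop(0)  # forget the oldest direction
    return basis

# Two highly correlated gradients: after projection, the second update
# contributes only its novel component.
basis = update_history([], [1.0, 0.0, 0.0])
print(project_orthogonal([0.9, 0.1, 0.0], basis))  # -> [0.0, 0.1, 0.0]
```

Dropping the oldest vector keeps the remaining history orthonormal, so the projection stays cheap and bounded as the stream goes on.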
Post image

We bring Self-Predictive Representations (SPR) to the streaming pipeline. By predicting future latent states, we force the encoder to learn much richer features from every observed frame, without the massive memory footprint of a replay buffer.

24.02.2026 15:22 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
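The self-predictive objective can be sketched with a toy linear encoder. This only illustrates the loss structure; the actual method uses deep networks and typically a target encoder, both omitted here:

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def spr_loss(encoder_W, predictor_W, obs_t, obs_tp1):
    # Encode both observations, predict the next latent from the current one,
    # and score the prediction. The next latent acts as a fixed target
    # (no gradient would flow through it in training).
    z_t = matvec(encoder_W, obs_t)
    z_tp1 = matvec(encoder_W, obs_tp1)
    z_pred = matvec(predictor_W, z_t)
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_tp1)) / len(z_pred)

# With an identity predictor and identical consecutive latents, the loss is 0.
I2 = [[1.0, 0.0], [0.0, 1.0]]
print(spr_loss(I2, I2, [0.5, -0.5], [0.5, -0.5]))  # -> 0.0
```

The point is that this auxiliary loss is computed from the current transition alone, so no replay buffer is needed to supervise the encoder.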
Post image

Without a replay buffer, streaming agents struggle to build meaningful representations. Traditional value-based losses alone can’t exploit the full informational content of transient data before it's gone.

24.02.2026 15:22 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Streaming Reinforcement Learning (RL) is a huge challenge: transitions are used once and discarded immediately. This makes agents extremely sample-inefficient. But what if we could "squeeze" more information out of every single frame?

Check out our latest paper!

24.02.2026 15:22 πŸ‘ 2 πŸ” 3 πŸ’¬ 1 πŸ“Œ 1
The Markovian Thinker: Architecture-Agnostic Linear Scaling of Reasoning Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state i...

Shoutout to the authors: Kamran Chitsaz, Milad Aghajohari, @a-kazemnejad.bsky.social. Supervised by @sarath-chandar.bsky.social, @murefil.bsky.social, Aaron Courville, and @sivareddyg.bsky.social.

πŸ”— Learn more at: arxiv.org/abs/2510.06557
πŸ”— Build with: github.com/McGill-NLP/the-markovian-thinker

17.02.2026 15:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image

🧩 Even state-of-the-art models show Markovian Thinking zero-shot: both GPT-oss-120B and Qwen3-30B-A3B recover LongCoT performance with no special prompting or training, and produce many in-distribution positives at initialization, so RL with Delethink is primed to scale!

17.02.2026 15:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

πŸ”₯ Further, we scaled DeepSeek R1-1.5B to a thinking budget of 96K in 150 RL steps. Accuracy jumped, with mean trace lengths at around 40K tokens.

17.02.2026 15:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Markovian Thinking is instantiated by Delethink, an RL environment. With it, we trained DeepSeek R1-1.5B and demonstrated:

1️⃣ The same scaling as LongCoT-RL, but at lower cost.
2️⃣ Better test-time scaling, improving past 24K tokens while LongCoT-RL plateaus.
3️⃣ All this while keeping costs linear!

17.02.2026 15:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Markovian Thinking works by:

1️⃣ Making the LLM reason in 8K-token chunks.
2️⃣ At each chunk boundary, resetting the context and carrying over a small textual state from the last chunk.
πŸ”ƒ The model then continues reasoning from that state.

βœ… This decouples thinking length from context size, achieving linear compute and constant memory!

17.02.2026 15:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
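The chunked loop above can be sketched as follows. Here `generate_chunk` is a hypothetical stand-in for one bounded LLM call, and the `FINAL:` stop marker and character-based carry are our assumptions for illustration, not the paper's interface:

```python
def markovian_generate(generate_chunk, problem, max_chunks=12, carry_chars=512):
    # `state` is the small textual carry-over between chunks; the prompt is
    # rebuilt from scratch each iteration, so context stays bounded.
    state = ""
    for _ in range(max_chunks):
        prompt = problem + "\n" + state   # context is reset every chunk
        chunk = generate_chunk(prompt)
        if "FINAL:" in chunk:             # assumed stop marker
            return chunk.split("FINAL:", 1)[1].strip()
        state = chunk[-carry_chars:]      # carry only a bounded tail
    return None

# Toy model: it needs the carried state from chunk 1 to finish in chunk 2.
def toy_model(prompt):
    return "FINAL: 42" if "partial-result" in prompt else "working... partial-result"

print(markovian_generate(toy_model, "What is 6 * 7?"))  # -> 42
```

Because each call sees only the problem plus a bounded state, total compute grows linearly in the number of chunks while memory stays constant.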
Video thumbnail

'The Markovian Thinker', developed by our lab, has been accepted at @iclr-conf.bsky.social!

This work achieves long reasoning without the quadratic attention tax: by making LLMs reason in chunks with a bounded state, it attains linear compute, constant memory, and scaling beyond its training limits! πŸ”₯

17.02.2026 15:54 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
The Expressive Limits of Diagonal SSMs for State-Tracking State-Space Models (SSMs) have recently been shown to achieve strong empirical performance on a variety of long-range sequence modeling tasks while remaining efficient and highly-parallelizable....

πŸ“ openreview.net/forum?id=5bg...
Joint work of Mehran Shakerinava, Behnoush Khavari, Siamak Ravanbakhsh and @sarath-chandar.bsky.social @mila-quebec.bsky.social .

10.02.2026 16:54 πŸ‘ 1 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

Takeaways for architecture design:
- Diagonal structure imposes a precise group-theoretic ceiling on expressivity
- Depth helps in a principled way (one layer per Abelian factor)
- But training algorithms need to catch up β€” expressivity alone isn't enough

10.02.2026 16:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Interestingly, initializing near the analytical solution does help: the model learns and generalizes. This suggests the solutions sit in a basin of attraction that training can't reach from random init.
A very different failure mode from what's been observed for Transformers on similar tasks.

10.02.2026 16:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

But there is a catch: expressivity β‰  learnability.

In our experiments, multi-layer diagonal SSMs consistently fail to learn S₃ and Aβ‚„ with gradient-based optimization, even though solutions provably exist in the hypothesis class!

10.02.2026 16:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

We give an explicit 2-layer diagonal SSM construction for S₃: the first layer tracks a Cβ‚‚ parity automaton, the second tracks a C₃ rotation conditioned on the first layer's state β€” mirroring the semi-direct product decomposition.

10.02.2026 16:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
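This construction is easy to check numerically. Writing S3 as the semi-direct product of C3 (rotations) and C2 (reflections), layer 1 accumulates flip parity, and layer 2 accumulates rotations with a sign conditioned on layer 1's state. A sketch, where the generator encoding `(r_g, s_g)` is our own convention, not the paper's notation:

```python
import random

RHO, FLIP, IDENT = (1, 2, 0), (0, 2, 1), (0, 1, 2)  # rotation, reflection, identity of S3

def compose(p, q):
    # apply q first, then p
    return tuple(p[q[i]] for i in range(3))

def perm_of(r, s):
    # the S3 element rho^r * flip^s, as a permutation of {0, 1, 2}
    p = FLIP if s else IDENT
    for _ in range(r):
        p = compose(RHO, p)
    return p

def two_layer_track(word):
    # word: sequence of generator tokens (r_g, s_g)
    s = 0  # layer 1: flip parity, a plain Abelian C2 recurrence
    r = 0  # layer 2: rotation count, conditioned on layer 1's state
    for r_g, s_g in word:
        r = (r + (-1) ** s * r_g) % 3  # sign flips when layer 1 says "reflected"
        s = (s + s_g) % 2
    return r, s

# Check against brute-force composition of the group elements.
random.seed(0)
word = [(random.randrange(3), random.randrange(2)) for _ in range(200)]
truth = IDENT
for r_g, s_g in word:
    truth = compose(truth, perm_of(r_g, s_g))
assert truth == perm_of(*two_layer_track(word))
```

Each layer's recurrence is Abelian on its own; the non-Abelian structure of S3 lives entirely in the conditioning of layer 2 on layer 1, mirroring the semi-direct product.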

What this means concretely:
- Parity (Cβ‚‚), modular counting (Cβ‚™): 1 layer suffices
- Permutations of 3 elements (S₃): exactly 2 layers needed
- Sβ‚„: 3 layers
- Aβ‚… (non-solvable): no number of diagonal layers will ever work
- Rubik’s cube: same

10.02.2026 16:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
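These layer counts can be checked by brute force: the minimal length of a subnormal series with Abelian factors equals the group's derived length, which is computable by iterating commutator subgroups. A small sketch over permutation groups, where the generator choices are ours:

```python
from itertools import product

def compose(p, q):
    # apply q first, then p
    return tuple(p[i] for i in q)

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def closure(gens):
    # subgroup generated by `gens` (permutations as tuples), via BFS
    identity = tuple(range(len(next(iter(gens)))))
    group, frontier = {identity}, [identity]
    while frontier:
        nxt = []
        for g in frontier:
            for s in gens:
                h = compose(g, s)
                if h not in group:
                    group.add(h)
                    nxt.append(h)
        frontier = nxt
    return group

def derived_subgroup(group):
    # subgroup generated by all commutators [a, b] = a^-1 b^-1 a b
    comms = {compose(compose(inverse(a), inverse(b)), compose(a, b))
             for a, b in product(group, repeat=2)}
    return closure(comms)

def min_abelian_layers(gens):
    # Derived length = minimal length of a subnormal series with Abelian
    # factors; returns None for non-solvable groups (no depth ever suffices).
    group = closure(gens)
    identity = tuple(range(len(next(iter(group)))))
    depth = 0
    while group != {identity}:
        nxt = derived_subgroup(group)
        if nxt == group:
            return None  # perfect subgroup: not solvable
        group, depth = nxt, depth + 1
    return depth

print(min_abelian_layers({(1, 0)}))                             # C2 -> 1
print(min_abelian_layers({(1, 0, 2), (1, 2, 0)}))               # S3 -> 2
print(min_abelian_layers({(1, 0, 2, 3), (1, 2, 3, 0)}))         # S4 -> 3
print(min_abelian_layers({(1, 2, 0, 3, 4), (1, 2, 3, 4, 0)}))   # A5 -> None
```

The outputs match the list above: parity needs 1 layer, S3 needs 2, S4 needs 3, and A5 bottoms out at a perfect subgroup, so no diagonal depth suffices.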
Post image

Theorem 2: A k-layer Complex Diagonal SSM can track a group G ⟺ G has a subnormal series of length ≀ k with Abelian factor groups.

This characterizes the expressivity of diagonal SSMs: depth lets you "peel off" one Abelian layer at a time.

10.02.2026 16:54 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0