As we hope women will thrive in research more each year than the last, we encourage all of them to apply to our lab for internships, Master's or PhD degrees with
@sarath-chandar.bsky.social !
The Chandar Research Lab remains committed to supporting women and other underrepresented communities @mila-quebec.bsky.social and in ML, with initiatives such as the graduate application assistance program or a computer science summer school for high-school students going into undergrad. 👩‍🔬
This is only a sneak peek at the work they did last year, as much of their research is still under submission. Stay tuned for more interesting papers spanning ML for biology, model merging, continual learning, and more!
Generalization Can Emerge in Tabular Foundation Models From a Single Table by Nour Shaheen at the AI for Tabular Data workshop @euripsconf.bsky.social 2025!
arxiv.org/abs/2511.09665
The Expressive Limits of Diagonal SSMs for State-Tracking by Behnoush Khavari @iclr-conf.bsky.social 2026.
iclr.cc/virtual/2026...
NeoBERT: A Next Generation BERT by @lola-le-breton.bsky.social published @tmlr-pub.bsky.social and @iclr-conf.bsky.social in Rio this year.
arxiv.org/abs/2502.19587
Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models by Istabrak Abbes @collasconf.bsky.social
arxiv.org/abs/2508.01908
Small Encoders Can Rival Large Decoders in Detecting Groundedness by Istabrak Abbes, published
@aclmeeting.bsky.social 2025.
aclanthology.org/2025.finding...
Maryam Hashemzadeh, @lola-le-breton.bsky.social, Istabrak Abbes, Nour Shaheen, Behnoush Khavari, Anabel Tan and @katelobacheva.bsky.social. Give them a follow and look at this list of their publications with our lab in the past year! ⬇️
This week, as we celebrated International Women's Day for the 115th time on Sunday, the Chandar Lab wanted to pay tribute to all the amazing women doing research 👩‍🔬, and to highlight the cutting-edge work they do at our lab every day... 🧵
Work done by @nilaksh404.bsky.social, Antoine Clavaud, @mreymond.bsky.social, Francois Rivest, and @sarath-chandar.bsky.social
Check out the paper at: arxiv.org/abs/2602.09396
Code : github.com/chandar-lab/...
Look at the latents! A t-SNE analysis shows that our method (top) learns structured, temporally coherent representations faster than standard streaming RL.
Our method systematically outperforms existing baselines across Atari, MinAtar, and Octax. The best part? It remains efficient enough to train on just a few CPU cores.
Streaming data is highly correlated, which usually degrades training. To fix this, we introduce Orthogonal Gradient Updates: by projecting each gradient onto a subspace orthogonal to its history, we keep learning stable and effective.
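As a rough sketch of the projection idea, here is a minimal numpy version. The function name and the assumption that the stored history directions are unit-normalized and mutually orthogonal are mine, not the paper's exact algorithm:

```python
import numpy as np

def project_orthogonal(grad, history):
    """Project `grad` onto the subspace orthogonal to past gradient directions.

    `history` is a list of previous gradient directions, assumed here to be
    unit-normalized and mutually orthogonal (a toy simplification).
    """
    g = grad.astype(float).copy()
    for h in history:
        g -= np.dot(g, h) * h  # remove the component along each past direction
    return g

# Toy usage: the projected gradient has no component along the stored direction.
h1 = np.array([1.0, 0.0, 0.0])           # a past gradient direction (unit norm)
g = np.array([3.0, 4.0, 0.0])
g_perp = project_orthogonal(g, [h1])
print(np.dot(g_perp, h1))                # ~0: orthogonal to the history
```

Removing the component of each new gradient along recent update directions is what decorrelates consecutive updates on a highly correlated stream.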
We bring Self-Predictive Representations (SPR) to the streaming pipeline. By predicting future latent states, we force the encoder to learn much richer features from every observed frame, without the massive memory footprint of a replay buffer.
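A minimal numpy sketch of an SPR-style objective, with random linear maps standing in for the real learned encoder and transition model (every name here is an illustrative assumption, and real SPR also uses an EMA target encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned networks.
W_enc = rng.normal(size=(8, 16))   # encoder: observation -> latent
W_pred = rng.normal(size=(8, 8))   # transition model: latent_t -> latent_{t+1}

def cosine_loss(p, z):
    """SPR-style loss: negative cosine similarity between prediction and target."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

obs_t, obs_next = rng.normal(size=16), rng.normal(size=16)
z_t, z_next = W_enc @ obs_t, W_enc @ obs_next   # encode both frames
z_pred = W_pred @ z_t                            # predict the next latent
loss = cosine_loss(z_pred, z_next)               # in [-1, 1]; minimized at -1
print(loss)
```

Minimizing this prediction loss alongside the RL loss gives the encoder a dense learning signal from every transition, which is exactly what a bufferless streaming agent is otherwise missing.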
Without a replay buffer, streaming agents struggle to build meaningful representations: traditional value-based losses alone can't exploit the full informational content of transient data before it's gone.
Streaming Reinforcement Learning (RL) is a huge challenge: transitions are used once and discarded immediately. This makes agents extremely sample-inefficient. But what if we could "squeeze" more information out of every single frame?
Check out our latest paper!
Shoutout to the authors: Kamran Chitsaz, Milad Aghajohari, @a-kazemnejad.bsky.social. Supervised by @sarath-chandar.bsky.social, @murefil.bsky.social, Aaron Courville, and @sivareddyg.bsky.social.
Learn more at: arxiv.org/abs/2510.06557
Build with: github.com/McGill-NLP/the-markovian-thinker
🧩 Even state-of-the-art models show Markovian Thinking zero-shot: both GPT-oss-120B and Qwen3-30B-A3B recover LongCoT-level reasoning with no special prompting or training required, and provide plenty of in-distribution positives at initialization, so RL with Delethink is primed to scale!
🔥 Going further, we scaled DeepSeek R1-1.5B to a thinking budget of 96K tokens in 150 RL steps. Accuracy jumped, with mean trace lengths around 40K tokens.
Markovian Thinking is instantiated by Delethink, an RL environment. With it, we trained DeepSeek R1-1.5B and demonstrated:
1️⃣ The same scaling as LongCoT-RL, but at lower cost,
2️⃣ Better test-time scaling, improving past 24K tokens while LongCoT-RL plateaus,
3️⃣ All while keeping compute costs linear!
Markovian Thinking works by:
1️⃣ Making the LLM reason in 8K-token chunks.
2️⃣ Resetting the context at each chunk boundary, carrying over only a small textual state from the last chunk.
3️⃣ Continuing to reason from that carried-over state.
This decouples thinking length from context size, achieving linear compute and constant memory!
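The chunked loop described above can be sketched in a few lines of Python. The `generate_chunk` stub and the keep-the-tail state rule are illustrative assumptions, not Delethink's exact mechanics:

```python
# Sketch of Markovian, chunked reasoning with a bounded context.
CHUNK_BUDGET = 8      # stands in for the 8K-token chunk size
STATE_SIZE = 3        # tokens carried across each chunk boundary

def generate_chunk(context, budget):
    """Stub model: emits `budget` placeholder tokens; a real LLM would go here."""
    return [f"t{len(context) + i}" for i in range(budget)]

def markovian_think(question, n_chunks):
    context, trace = list(question), []
    for _ in range(n_chunks):
        chunk = generate_chunk(context, CHUNK_BUDGET)
        trace.extend(chunk)
        context = chunk[-STATE_SIZE:]       # reset: keep only a small textual state
        assert len(context) <= STATE_SIZE   # memory stays constant per chunk
    return trace

trace = markovian_think(["q0", "q1"], n_chunks=4)
print(len(trace))   # total thinking length grows linearly with chunk count
```

Because the context never exceeds one chunk plus a constant-size state, total compute grows linearly in the number of chunks while memory stays constant, independent of how long the full reasoning trace becomes.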
"The Markovian Thinker", developed by our lab, has been accepted at @iclr-conf.bsky.social ✨✨ This work achieves long reasoning without the quadratic attention tax by making LLMs reason in chunks with a bounded state, reaching linear compute and constant memory while scaling beyond its training limits! 🔥
openreview.net/forum?id=5bg...
Joint work of Mehran Shakerinava, Behnoush Khavari, Siamak Ravanbakhsh and @sarath-chandar.bsky.social @mila-quebec.bsky.social .
Takeaways for architecture design:
- Diagonal structure imposes a precise group-theoretic ceiling on expressivity
- Depth helps in a principled way (one layer per Abelian factor)
- But training algorithms need to catch up: expressivity alone isn't enough
Interestingly, initializing near the analytical solution does help: the model then learns and generalizes. This suggests the solutions sit in a basin of attraction that training can't reach from a random init.
A very different failure mode from what's been observed for Transformers on similar tasks.
But there is a catch: expressivity ≠ learnability.
In our experiments, multi-layer diagonal SSMs consistently fail to learn S₃ and A₄ with gradient-based optimization, even though solutions provably exist in the hypothesis class!
We give an explicit 2-layer diagonal SSM construction for S₃: the first layer tracks a C₂ parity automaton, the second a C₃ rotation conditioned on the first layer's state, mirroring the semidirect-product decomposition S₃ ≅ C₃ ⋊ C₂.
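Here is a toy Python check of that layered idea for S₃. The encoding into a parity bit plus a mod-3 counter is my own illustration of the semidirect-product structure, not the paper's actual SSM construction:

```python
# Track S3 with two "layers": layer 1 holds a C2 parity bit, layer 2 a C3
# rotation counter whose update is conditioned on layer 1's state.
R, F = [1, 2, 0], [1, 0, 2]          # rotation (0 1 2) and flip (0 1) in S3
GENS = {"r": R, "f": F}

def compose(p, q):
    """(p . q)(i) = p[q[i]]: apply q first, then p."""
    return [p[q[i]] for i in range(3)]

def track(stream):
    s, k = 0, 0                       # layer-1 parity, layer-2 rotation count
    for g in stream:
        if g == "f":
            s = (s + 1) % 2           # layer 1: C2 parity automaton
        else:                         # g == "r"
            k = (k + (1 if s == 0 else -1)) % 3  # layer 2: C3, conditioned on s
    # Reconstruct the group element f^s . r^k from the two layer states.
    perm = [0, 1, 2]
    for _ in range(k):
        perm = compose(R, perm)
    if s:
        perm = compose(F, perm)
    return perm

def reference(stream):
    """Ground truth: left-multiply the running permutation by each generator."""
    perm = [0, 1, 2]
    for g in stream:
        perm = compose(GENS[g], perm)
    return perm

stream = ["r", "f", "r", "r", "f", "r"]
print(track(stream) == reference(stream))   # the layered tracker matches
```

The conditioning in layer 2 is exactly the relation r·f = f·r⁻¹: once a flip has been absorbed into the parity bit, subsequent rotations must count in the opposite direction, which is all the second layer needs to know from the first.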
What this means concretely:
- Parity (C₂), modular counting (Cₙ): 1 layer suffices
- Permutations of 3 elements (S₃): exactly 2 layers needed
- S₄: 3 layers
- A₅ (non-solvable): no number of diagonal layers will ever work
- The Rubik's cube group: same
Theorem 2: a k-layer complex diagonal SSM can track a group G ⟺ G has a subnormal series of length ≤ k with Abelian factor groups.
This characterizes the expressivity of diagonal SSMs: depth lets you "peel off" one Abelian factor at a time.
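As a concrete illustration of the theorem's condition (a standard group-theory fact, not a result from the paper): S₄ admits a subnormal series of length 3 with Abelian factors, which is why three diagonal layers suffice for it:

```latex
\[
  \{e\} \trianglelefteq V_4 \trianglelefteq A_4 \trianglelefteq S_4,
  \qquad
  S_4/A_4 \cong C_2, \quad
  A_4/V_4 \cong C_3, \quad
  V_4/\{e\} \cong C_2 \times C_2,
\]
```

where V₄ is the Klein four-group. All three factor groups are Abelian, so k = 3 meets the theorem's bound; for A₅ no such series exists at any length, which is what makes it forever out of reach.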