Check it out here: arxiv.org/abs/2508.16496
It's dedicated to the late Barry Sealey CBE and to Helen Sealey, whose funding of my earlier postgraduate studies opened the door to a PhD. I'm hugely indebted to them for their kindness and generosity.
My PhD thesis, *On Zero-Shot Reinforcement Learning*, is now on arXiv.
More detail in the paper, on the project page, or in the repo!
Paper: arxiv.org/abs/2506.15446
Project Page: enjeeneer.io/projects/bfm...
Code: github.com/enjeeneer/bf...
with Tom Bewley and Jon Cullen.
We explored different sequence models: Transformers, GRUs, LSTMs, S4D, S5.
To our surprise, we found GRUs to be far and away the most effective, and Transformers to be disappointingly ineffective.
Why? The combined F^T B representation seems unstable for all non-GRU methods.
We run experiments on amended ExORL environments with different types of partial observability. In particular, we explore partially observed states, and partially observed changes in dynamics.
In aggregate, we improve performance across all partially observed settings.
We solve both failure modes by replacing BFMs' standard MLPs with sequence models that condition on trajectories of observations and actions.
We call the resultant family of methods *Behaviour Foundation Models with Memory*.
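As a rough sketch of the idea of conditioning on trajectories rather than single states: a recurrent encoder can summarise a history of (observation, action) pairs into a belief state that stands in for the unobserved true state. Everything below (function name, shapes, parameter layout) is illustrative, not the paper's actual architecture.

```python
import numpy as np

def gru_belief(obs_seq, act_seq, params):
    """Minimal GRU over concatenated (obs, action) steps, returning the
    final hidden state as a 'belief' over the underlying state.
    Hypothetical sketch, not the paper's implementation."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    Wz, Wr, Wh = params["Wz"], params["Wr"], params["Wh"]
    h = np.zeros(Wz.shape[0])
    for o, a in zip(obs_seq, act_seq):
        x = np.concatenate([o, a, h])
        z = sigmoid(Wz @ x)            # update gate
        r = sigmoid(Wr @ x)            # reset gate
        x_r = np.concatenate([o, a, r * h])
        h_tilde = np.tanh(Wh @ x_r)    # candidate hidden state
        h = (1 - z) * h + z * h_tilde  # gated interpolation
    return h
```

The belief state `h` would then replace the raw state input to the BFM's forward and backward networks.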
When Behaviour Foundation Models are fed unreliable observations, rather than states, they fail in two predictable ways.
We call these failure modes *state* misidentification and *task* misidentification.
Each inhibits performance in isolation; together they kill the model.
BFMs are amazing.
Train them on expressive (s, a, s′) data and you'll get the optimal policy for *any* reward function in an env.
But, what if instead of states you have observations, as is almost always the case in practice?
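A toy sketch of the zero-shot step behind that claim: with forward-backward-style features, a task embedding can be inferred from reward-labelled samples and handed to a pre-trained policy. All shapes and names below are illustrative, not the paper's implementation.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
rng = np.random.default_rng(0)
d, n_states = 8, 128
B = rng.normal(size=(n_states, d))   # backward features B(s) for sampled states
reward = rng.normal(size=n_states)   # rewards labelled after data collection

# Task embedding: z = E[r(s) B(s)], a reward-weighted average of backward features
z = (reward[:, None] * B).mean(axis=0)
# A pre-trained policy pi(s, z) conditioned on z then acts for this reward.
```

The point is that no further training is needed at test time: a new reward function only changes `z`, not the policy's weights.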
Excited to share our new @rl-conference.bsky.social paper! 🧵
I turned 30 today. Here are some particularly important moments from the last decade.
enjeeneer.io/posts/2025/0...
I wrote down some of my memories and reflections after the passing of my PhD advisor, Horace Yuen: realizable.substack.com/p/horace-p-y...
time is a flat circle
It all feels a bit hacky though, yeh.
- It's probs not doing pure policy exploration in the classical RL sense. The prior provided by pre-training should reduce the effective search space hugely. I could imagine that small amounts of exploration on top of the reasoning traces provided by the base model could be enough to get signal.
I don't disagree, but a couple of possible explanations:
- Fig 3 could imply that it learns to solve questions that require shorter reasoning chains first, before moving to those that require longer reasoning chains.
A brilliant colleague and wonderful soul Felix Hill recently passed away. This was a shock and in an effort to sort some things out, I wrote them down. Maybe this will help someone else, but at the very least it helped me. Rest in peace, Felix, you will be missed. www.janexwang.com/blog/2025/1/...
Thank you for this Jane, it's beautiful and heart-wrenching. I didn't know Felix well, but my few interactions with him always left me awed by his all-round brilliance. My thoughts are with you and everyone who knew him more closely. ❤️
#NeurIPS2024 wrapped up last week. I put together a curated reading list for #DeepRL and #reinforcementlearning work (it reflects my interests).
Talks and workshops:
third-crowd-c77.notion.site/NeurIPS2024-...
Curated reading list:
fracturedplane.notion.site/NeurIPS2024-...
#Holidayreading
NeurIPS revolves around demonstration. This year's @rl-conference.bsky.social revolved around conversation. I much prefer the latter.
Here's this week's cartoon for @theguardian.com
www.theguardian.com/football/pic...
If you think @AIatMeta's Motivo looks cool in simulation, think how cool it'll be when we make it work in the real world! Stop by our poster today and I'll tell you how we do it.
Poster #6008
West Ballroom A-D
4:30-7:30pm
Demo: metamotivo.metademolab.com
#NeurIPS2024
Try DIAMOND's Counter-Strike world model directly in your browser!
next.journee.ai/xyz-diamond
How long can you stay in distribution? Can you beat @snguyen.bsky.social's 1000 frames?
@eloialonso.bsky.social and I are at NeurIPS! Poster #6306, Friday 11am-2pm, West Ballroom
My bad for messing up the photo!
First #runconference @neuripsconf.bsky.social #NeurIPS2024 was great! Will share tomorrow's deets later today, join us!
@zacharylipton.bsky.social @adamjelley.bsky.social @random-steve.bsky.social
So excited to share our Google DeepMind team's new Nature paper on GenCast, an ML-based probabilistic weather forecasting model: www.nature.com/articles/s41...
It represents a substantial step forward in how we predict weather and assess the risk of extreme events. 🌪️🧵
I'm in Whistler/Vancouver for #NeurIPS2024, and I'll be around all week to chat RL. Swing by our poster on Friday, or hit me up on here and we can find time for a coffee!
Poster #6008
West Ballroom A-D
Friday 13th Dec 4:30-7:30pm
More details: neurips.cc/virtual/2024...
These aren't books, but Michael Nielsen's "Principles of Effective Research" is great (michaelnielsen.org/blog/princip...), as is John Schulman's "Opinionated Guide to ML Research" (joschu.net/blog/opinion...).
I'd be interested to read your own version of this kinda blog, Eugene!