Fantastic work by @pardofab.bsky.social, @harrischan.bsky.social, @bonniesjli.bsky.social,
@vladmnih.bsky.social, and Tim Genewein!
All details and many more results in arxiv.org/abs/2412.01441
N/N
As a sanity check, we also evaluate how well frontier models can replay the actions from a single demonstration episode (i.e., teacher-forcing; by default we perform dynamic evaluation, where the model's own actions drive the episode).
Most models perform well, with the exception of o1-mini, which fails across most tasks.
5/N
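The difference between the two evaluation modes can be sketched as follows (a minimal illustration with hypothetical `model.predict_action` and `env` interfaces, not the paper's actual harness): under teacher-forcing the expert's actions are always fed back into the context, while under dynamic evaluation the model's own predictions drive the episode.

```python
def teacher_forced_accuracy(model, demo):
    """demo: list of (observation, expert_action) pairs from one episode.
    The model predicts each action, but the *expert* action is fed back."""
    correct, history = 0, []
    for obs, expert_action in demo:
        predicted = model.predict_action(history, obs)  # hypothetical API
        correct += predicted == expert_action
        history.append((obs, expert_action))  # feed back expert action
    return correct / len(demo)

def dynamic_return(model, env):
    """The model's own predicted actions drive the environment."""
    obs, history, total_reward, done = env.reset(), [], 0.0, False
    while not done:
        action = model.predict_action(history, obs)
        history.append((obs, action))  # feed back the model's own action
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```

Teacher-forcing only checks next-action prediction, so errors cannot compound; dynamic evaluation exposes the model to its own mistakes, which is why the two scores can diverge sharply.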
We pressure-test frontier models' in-context imitation learning, using up to 1M context size and up to 10k output ("reasoning") tokens.
For o1-mini/o1-preview, performance crucially depends on having many (at least 8192) output tokens, even in simple decision-making tasks.
4/N
We evaluate most tasks with different multimodal observation formats (e.g., ASCII, RGB images).
On some tasks, certain models show strong in-context imitation learning (e.g., Gemini 1.5 below). On others, performance is independent of the number of expert demonstration episodes.
3/N
We evaluate
- Phoenix (Atari)
- chess vs the weakest version of Stockfish
- crosswords
- cheetah run (DM Control)
- grid world navigation
- tic-tac-toe vs random actions
We compare against a random baseline and an expert policy and use up to 512 expert demonstration episodes:
2/N
Ever wonder how well frontier models (Claude 3.5 Sonnet, Gemini 1.5 Flash & Pro, GPT-4o, o1-mini & o1-preview) play Atari, chess, or tic-tac-toe?
We present LMAct, an in-context imitation learning benchmark with long multimodal demonstrations (arxiv.org/abs/2412.01441).
🧵 1/N