Agents interact with environments to get information. But exploration (tools, retrieval, user interaction) is costly.
Calibrate-Then-Act allows LLM agents to balance exploration and cost:
- Estimate uncertainty about the environment
- Reason about cost-uncertainty tradeoffs
- Act accordingly
23.02.2026 16:00
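The three steps above can be illustrated with a toy expected-utility rule. This is a minimal sketch and not the method from the Calibrate-Then-Act work: the function names (`estimate_confidence`, `decide`), the use of answer agreement as a confidence estimate, and the fixed `explore_cost` are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Decision:
    action: str           # "explore" or "commit"
    expected_gain: float  # estimated benefit of exploring, in reward units


def estimate_confidence(sampled_answers: Sequence[str]) -> float:
    """Step 1 (assumed proxy): confidence = share of samples agreeing with the top answer."""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)


def decide(p_now: float, p_after_explore: float, explore_cost: float,
           reward: float = 1.0) -> Decision:
    """Steps 2 and 3: explore only if the expected gain outweighs the cost."""
    gain = (p_after_explore - p_now) * reward
    action = "explore" if gain > explore_cost else "commit"
    return Decision(action, gain)


# Hypothetical usage: four sampled answers, three of which agree.
p_now = estimate_confidence(["a", "a", "b", "a"])
print(decide(p_now, p_after_explore=0.95, explore_cost=0.1))
```

Under these assumptions, a cheap tool call is worth making when the agent is unsure, and skipped when the agent is already confident or the call is expensive.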
A figure demonstrating the different aspects of the corpus described in the tweet. There is a main isometric 3D view of a level in the Portal 2 co-op game, with some portals, lasers, and the blue and orange players. Inset, there are first-person captures of the blue and orange player views. There is also a box containing the transcribed dialogue with timestamps and labels for the discursive acts. Finally, there is a box containing a task and a list of subtasks. Some subtasks are already crossed out, with the time that they have been completed. The last subtask ("Player 2 places portal 4 on wall 4") is marked incomplete.
The dialogue is as follows:
Blue: Can you put your other portal up here? (tagged as directive)
Orange: Where? (tagged as request for clarification)
Blue: On uh, on this wall. (tagged as directive)
Blue: So that it uh points at the circle. (tagged as directive)
Orange: Okay. (tagged as commit)
The full list of subtasks is:
Task: Redirect lasers
Subtask: Player 1 places portal 1 on wall 1. (completed)
Subtask: Player 1 places portal 2 on wall 2 or 3. (completed)
Subtask: Player 2 places portal 3 opposite of portal 2. (completed)
Subtask: Player 2 places portal 4 on wall 4. (incomplete)
A couple years (!) in the making: we're releasing a new corpus of embodied, collaborative problem solving dialogues. We paid 36 people to play Portal 2's co-op mode and collected their speech + game recordings.
Paper: arxiv.org/abs/2512.03381
Website: berkeley-nlp.github.io/portal-dialo...
1/n
05.12.2025 18:54
I'm recruiting my first group of students at TTIC! If you're interested, please apply by December 9th and mention my name in your application
24.11.2025 17:58
TTIC Faculty Opportunities at TTIC
Two brief advertisements!
TTIC is recruiting both tenure-track and research assistant professors: ttic.edu/faculty-hiri...
NYU is recruiting faculty fellows: apply.interfolio.com/174686
Happy to chat with anyone considering either of these options
23.10.2025 13:57
CRA changed their interface and it's much harder to browse now for some reason...
Last year, I ended up just making a list of schools/departments that I wanted to apply to and individually searching through each of their websites for job postings
12.10.2025 23:16
FYI that UChicago CS & Stats is hiring at all levels via the Data Science Institute:
Postdoc: uchicago.infoready4.com#freeformComp...
Assistant Professor: apply.interfolio.com/174766
Associate Professor: apply.interfolio.com/174768
07.10.2025 17:53
What does it take to build a human-like user simulator?
Jessy Lin and I wrote another blogpost on user simulators as a reward function for training interactive models, this time focused on methods + open questions:
jessylin.com/2025/09/25/u...
28.09.2025 15:32
Eugene Vinitsky
Was talking to a student who wasn't sure about why one would get a PhD. So I wrote up a list of reasons!
www.eugenevinitsky.com/posts/reason...
27.07.2025 19:30
User simulators bridge RL with real-world interaction
An excellent blog post about a still-huge gap, models of humans you can actually use to study human-AI interaction: jessylin.com/2025/07/10/u...
10.07.2025 22:15
We're proud to announce three new tenure-track assistant professors joining TTIC in Fall 2026: Yossi Gandelsman, Will Merrill, and Nick Tomlin (@nickatomlin.bsky.social). Meet them here: buff.ly/JH1DFtT
27.06.2025 16:29
π 7
π 2
π¬ 0
π 0
[emoji-only post]
29.05.2025 04:17
Haha main reason for using Gym was that we wanted a way to automatically evaluate models against trained RL agents. Doing the full arena-style evaluation on reasoning models gets really expensive
It also helps that current LLMs are really good at generating functional Gym code
14.05.2025 16:36
I think in the short term that's reasonable, e.g., current models can play chess but they definitely can't understand chess variants
In the long term, I suspect there's more risk of over-optimizing to those specific games, so the hope is that our approach is a bit more future-proof
14.05.2025 16:29
Results table. The best model (o1) wins about 36% of games against the RL baselines.
This is a difficult benchmark: the best non-reasoning LLMs score around 9%, while the best reasoning models score around 36%. In the future, as models get stronger, we anticipate that they'll also be able to generate harder games
13.05.2025 21:30
Main paper figure showing a three-step pipeline of game description generation, implementation generation, and self-play training of RL agents
We use o1 to generate natural language rulebooks for 1000 two-player games and then implement these games as Gym environments. For each game, we train baseline agents in self-play with RL and then evaluate whether LLMs can beat the RL baselines
13.05.2025 21:30
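To make the evaluation loop described above concrete, here is a minimal sketch of a two-player game exposed through a Gym-style `reset`/`step` interface, with a win-rate measurement against a baseline. It is not the paper's code: the game (`TakeawayGame`), the policies, and the `win_rate` helper are hypothetical, the class only mimics the Gym interface rather than using the library, and a seeded random player stands in for the self-play RL baseline.

```python
import random
from typing import Callable, Dict, Tuple

Policy = Callable[[int], int]  # maps stones remaining -> stones to take


class TakeawayGame:
    """Toy game: players alternately take 1-3 stones; taking the last stone wins."""

    def __init__(self, stones: int = 10):
        self.start = stones
        self.stones = stones
        self.player = 0

    def reset(self) -> Tuple[int, int]:
        self.stones, self.player = self.start, 0
        return self.stones, self.player

    def step(self, action: int) -> Tuple[Tuple[int, int], float, bool, Dict]:
        take = max(1, min(3, action, self.stones))  # clip illegal moves
        self.stones -= take
        done = self.stones == 0
        mover = self.player
        self.player = 1 - self.player
        # The reward belongs to the player who just moved.
        return (self.stones, self.player), float(done), done, {"mover": mover}


def make_random_policy(seed: int) -> Policy:
    rng = random.Random(seed)
    return lambda stones: rng.randint(1, min(3, stones))


def optimal_policy(stones: int) -> int:
    # Leave the opponent a multiple of 4 whenever possible.
    return stones % 4 or 1


def win_rate(candidate: Policy, baseline: Policy, games: int = 200) -> float:
    env, wins = TakeawayGame(), 0
    for g in range(games):
        # Alternate who moves first so neither side keeps the first-move advantage.
        seats = (candidate, baseline) if g % 2 == 0 else (baseline, candidate)
        stones, player = env.reset()
        done = False
        while not done:
            (stones, player), _, done, info = env.step(seats[player](stones))
        wins += seats[info["mover"]] is candidate
    return wins / games


print(win_rate(optimal_policy, make_random_policy(seed=0)))
```

In the setup the post describes, the candidate would be an LLM choosing moves from the rulebook and the baseline would be an RL agent trained in self-play; the win-rate bookkeeping would look much the same.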
Title and abstract of the paper, "Measuring General Intelligence with Generated Games"
I'm particularly fond of this new benchmark paper we wrote, which aims to scalably evaluate whether language models can generalize to arbitrary new tasks. The core idea is to use LLMs to generate new games, and then evaluate whether LLMs can play those games
Paper: arxiv.org/abs/2505.07215
13.05.2025 21:30
I might be able to hire a postdoc for this fall in computational linguistics at UT Austin. Topics in the general LLM + cognitive space (particularly reasoning, chain of thought, LLMs + code) and LLM + linguistic space. If this could be of interest, feel free to get in touch!
21.04.2025 15:56
Writing my first post here to announce that I've accepted an assistant professor job at TTIC! I'll be starting in Fall 2026, and recruiting students this upcoming cycle.
Until then, I'll be wrapping up the PhD at Berkeley, and this summer I'll join NYU as a CDS Faculty Fellow
15.04.2025 03:34