Nick Tomlin (@nickatomlin)

Agents interact with environments to get information. But exploration (tools, retrieval, user interaction) is costly.

Calibrate-Then-Act allows LLM agents to balance exploration and cost:
📐 Estimate uncertainty about the environment
💭 Reason about cost-uncertainty tradeoffs
⚙️ Act accordingly

23.02.2026 16:00 👍 17 🔁 6 💬 1 📌 1

A figure demonstrating the different aspects of the corpus described in the tweet. There is a main isomorphic 3D view of a level in the Portal 2 co-op game, with some portals, lasers, and the blue and orange players. Inset, there are first-person captures of the blue and orange player views. There is also a box containing the transcribed dialogue with timestamps and labels for the discursive acts. Finally, there is a box containing a task and a list of subtasks. Some subtasks are already crossed out, with the time that they have been completed. The last subtask ("Player 2 places portal 4 on wall 4") is marked incomplete. The dialogue is as follows: Blue: Can you put your other portal up here? (tagged as directive) Orange: Where? (tagged as request for clarification) Blue: On uh, on this wall. (tagged as directive) Blue: So that it uh points at the circle. (tagged as directive) Orange: Okay. (tagged as commit) The full list of subtasks is: Task: Redirect lasers Subtask: Player 1 places portal 1 on wall 1. (completed) Subtask: Player 1 polaces portal 2 on wall 2 or 3. (completed) Subtask: Player 2 places portal 3 opposite of portal 2. (completed) Subtask: Player 2 places portal 4 on wall 4. (incomplete)

A couple years (!) in the making: we’re releasing a new corpus of embodied, collaborative problem solving dialogues. We paid 36 people to play Portal 2’s co-op mode and collected their speech + game recordings.

Paper: arxiv.org/abs/2512.03381
Website: berkeley-nlp.github.io/portal-dialo...

1/n

05.12.2025 18:54 👍 102 🔁 30 💬 3 📌 8

I'm recruiting my first group of students at TTIC! If you're interested, please apply by December 9th and mention my name in your application

24.11.2025 17:58 👍 9 🔁 6 💬 0 📌 0

TTIC Faculty Opportunities at TTIC

Two brief advertisements!

TTIC is recruiting both tenure-track and research assistant professors: ttic.edu/faculty-hiri...
NYU is recruiting faculty fellows: apply.interfolio.com/174686

Happy to chat with anyone considering either of these options

23.10.2025 13:57 👍 8 🔁 6 💬 0 📌 0

CRA changed their interface and it's much harder to browse now for some reason...

Last year, I ended up just making a list of schools/departments that I wanted to apply to and individually searching through each of their websites for job postings

12.10.2025 23:16 👍 1 🔁 0 💬 1 📌 0

FYI that UChicago CS & Stats is hiring at all levels via the Data Science Institue:

Postdoc: uchicago.infoready4.com#freeformComp...
Assistant Professor: apply.interfolio.com/174766
Associate Professor: apply.interfolio.com/174768

07.10.2025 17:53 👍 8 🔁 3 💬 0 📌 0

What does it take to build a human-like user simulator?

What does it take to build a human-like user simulator? //

Jessy Lin and I wrote another blogpost on user simulators as a reward function for training interactive models, this time focused on methods + open questions:
jessylin.com/2025/09/25/u...

28.09.2025 15:32 👍 3 🔁 0 💬 0 📌 0

Eugene Vinitsky

Was talking to a student who wasn't sure about why one would get a PhD. So I wrote up a list of reasons!
www.eugenevinitsky.com/posts/reason...

27.07.2025 19:30 👍 51 🔁 11 💬 7 📌 0

User simulators bridge RL with real-world interaction

An excellent blog post about a still huge missing gap, models of humans you can actually use to study human-AI interaction: jessylin.com/2025/07/10/u...

10.07.2025 22:15 👍 12 🔁 2 💬 1 📌 0

We’re proud to announce three new tenure-track assistant professors joining TTIC in Fall 2026: Yossi Gandelsman, Will Merrill, and Nick Tomlin (@nickatomlin.bsky.social). Meet them here: buff.ly/JH1DFtT

27.06.2025 16:29 👍 7 🔁 2 💬 0 📌 0

🤠🤓🙂

29.05.2025 04:17 👍 4 🔁 0 💬 1 📌 0

Haha main reason for using Gym was that we wanted a way to automatically evaluate models against trained RL agents. Doing the full arena-style evaluation on reasoning models gets really expensive

It also helps that current LLMs are really good at generating functional Gym code

14.05.2025 16:36 👍 1 🔁 0 💬 1 📌 0

I think in the short term that’s reasonable, e.g., current models can play chess but they definitely can’t understand chess variants

In the long term, I suspect there’s more risk of over-optimizing to those specific games, so the hope is that our approach is a bit more future-proof

14.05.2025 16:29 👍 0 🔁 0 💬 0 📌 0

GitHub - vivek3141/gg-bench: Measuring General Intelligence With Generated Games (Preprint) Measuring General Intelligence With Generated Games (Preprint) - vivek3141/gg-bench

For anyone interested in evaluating or expanding on this benchmark, we have a nice code release here: github.com/vivek3141/gg...

13.05.2025 21:30 👍 4 🔁 0 💬 0 📌 0

Results table. The best model (o1) wins about 36% of games against the RL baselines.

This is a difficult benchmark: the best non-reasoning LLMs score around 9%, while the best reasoning models score around 36%. In the future, as models get stronger, we anticipate that they'll also be able to generate harder games

13.05.2025 21:30 👍 1 🔁 0 💬 1 📌 0

Main paper figure showing a three-step pipeline of game description generation, implementation generation, and self-play training of RL agents

We use o1 to generate natural language rulebooks for 1000 two-player games and then implement these games as Gym environments. For each game, we train baseline agents in self-play with RL and then evaluate whether LLMs can beat the RL baselines

13.05.2025 21:30 👍 4 🔁 0 💬 2 📌 0

Title and abstract of the paper, "Measuring General Intelligence with Generated Games"

I'm particularly fond of this new benchmark paper we wrote, which aims to scalably evaluate whether language models can generalize to arbitrary new tasks. The core idea is to use LLMs to generate new games, and then evaluate whether LLMs can play those games

📄: arxiv.org/abs/2505.07215

13.05.2025 21:30 👍 33 🔁 9 💬 3 📌 1

I might be able to hire a postdoc for this fall in computational linguistics at UT Austin. Topics in the general LLM + cognitive space (particularly reasoning, chain of thought, LLMs + code) and LLM + linguistic space. If this could be of interest, feel free to get in touch!

21.04.2025 15:56 👍 60 🔁 31 💬 0 📌 1

Writing my first post here to announce that I've accepted an assistant professor job at TTIC! I'll be starting in Fall 2026, and recruiting students this upcoming cycle.

Until then, I'll be wrapping up the PhD at Berkeley, and this summer I'll join NYU as a CDS Faculty Fellow 🏙️

15.04.2025 03:34 👍 41 🔁 2 💬 3 📌 2

Nick Tomlin

Latest posts by Nick Tomlin @nickatomlin