Check out the paper for more details:
Paper: arxiv.org/abs/2602.16699
Code: github.com/Wenwen-D/env...
Many thanks to the wonderful co-authors: @nickatomlin.bsky.social @gregdnlp.bsky.social
We also experiment with a coding setting. In this setting, we additionally compare the model to end-to-end RL.
Result: conditioning on estimated priors reinforces adaptive reasoning and induces more optimal behavior.
In an evaluation on PopQA, CTA's retrieval decisions form a clear boundary with respect to confidence and retrieval cost, closely matching the theoretical optimal policy (indicated by the background shading).
For example, in QA, calibrated uncertainty estimates induce more principled retrieval decisions.
Given information about whether direct answering or retrieval would succeed, an oracle reasoner can weigh the tradeoffs and retrieve only when the expected benefit exceeds the retrieval cost.
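A minimal sketch of that oracle decision rule, under assumptions that are not from the paper: a correct answer is worth 1 and a wrong one 0, the retrieval cost is expressed in the same units, and the function and parameter names are hypothetical.

```python
def should_retrieve(p_direct: float, p_retrieve: float, cost: float) -> bool:
    """Return True when retrieval is worth paying for.

    p_direct:   estimated probability that answering directly succeeds
    p_retrieve: estimated probability that answering after retrieval succeeds
    cost:       retrieval cost, in the same units as the reward
                (assumed here: 1 for a correct answer, 0 otherwise)
    """
    # Expected benefit of retrieving is the gain in success probability.
    expected_benefit = p_retrieve - p_direct
    return expected_benefit > cost


# Confident model: retrieval adds little, so it is skipped.
confident = should_retrieve(p_direct=0.9, p_retrieve=0.95, cost=0.1)

# Uncertain model: retrieval adds a lot, so the cost is justified.
uncertain = should_retrieve(p_direct=0.3, p_retrieve=0.8, cost=0.1)
```

With these illustrative numbers, the confident case answers directly and the uncertain case retrieves. The rule is only as good as the probability estimates fed into it, which is why calibration matters.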
Calibrate-Then-Act induces an LLM to reason about these tradeoffs. It presents information about the environment to a model explicitly in its prompt, which allows for better decision-making. This model can be tuned with RL for further improvement.
Existing approaches to calibrating these decisions include prompt engineering and end-to-end RL.
Cost may or may not be incorporated into the reward, and even when it is, it is unclear whether the agent effectively represents the cost-uncertainty tradeoff.
LLM agents vary in how long they interact with the environment before committing to a final solution.
In a coding setting, they may write and run tests during their operation.
In a retrieval setting, they may decide to find more information dynamically, as in Self-RAG.
Agents interact with environments to get information. But exploration (tools, retrieval, user interaction) is costly.
Calibrate-Then-Act allows LLM agents to balance exploration and cost:
Estimate uncertainty about the environment
Reason about cost-uncertainty tradeoffs
Act accordingly