Hello world! Transluce is excited to begin crossposting on bluesky. You can learn more about our work at transluce.org, and read a letter from co-founders Jacob Steinhardt and Sarah Schwettmann here: transluce.org/introducing-...
Why does GPT-5.1 Codex score 6.5% worse than GPT-5 Codex on Terminal-Bench, with the same scaffold? 🧵
GPT-5.1 times out at ~2x the rate of GPT-5. Excluding timeouts, GPT-5.1 wins by 7.2%. We analyzed 256M+ tokens of traces and found this in under an hour. Here's how 👇
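To make those aggregates concrete, here's a minimal plain-Python sketch of the two numbers above: timeout rate per model, and pass rate counting only runs that finished. The records and field names ("model", "timed_out", "passed") are illustrative assumptions, not Terminal-Bench's or Docent's actual schema.

    # Hypothetical per-run records; in practice these come from parsed traces.
    runs = [
        {"model": "gpt-5-codex",   "timed_out": False, "passed": True},
        {"model": "gpt-5-codex",   "timed_out": True,  "passed": False},
        {"model": "gpt-5.1-codex", "timed_out": True,  "passed": False},
        {"model": "gpt-5.1-codex", "timed_out": False, "passed": True},
    ]

    def summarize(model: str) -> dict:
        """Timeout rate overall, plus pass rate among runs that finished."""
        mine = [r for r in runs if r["model"] == model]
        finished = [r for r in mine if not r["timed_out"]]
        return {
            "timeout_rate": sum(r["timed_out"] for r in mine) / len(mine),
            "pass_rate_excl_timeouts": sum(r["passed"] for r in finished) / len(finished),
        }

    for model in ("gpt-5-codex", "gpt-5.1-codex"):
        print(model, summarize(model))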
Docent is our tool for debugging agents by analyzing traces at scale. Docent (1) compared each failed run to a successful run on the same task by a different model, (2) synthesized the failures by model, (3) quantified the timeout rates.
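For intuition, here's what step (1) looks like in plain Python: pair each failed run with a successful run on the same task by a different model, so the contrast isolates model behavior. This is a sketch of the idea, not Docent's API, and the record fields ("task", "model", "passed") are assumed for illustration.

    from collections import defaultdict

    # Illustrative trace records, one per (task, model) run.
    runs = [
        {"task": "crack-password", "model": "gpt-5.1-codex", "passed": False},
        {"task": "crack-password", "model": "gpt-5-codex",   "passed": True},
        {"task": "train-model",    "model": "gpt-5.1-codex", "passed": False},
    ]

    by_task = defaultdict(list)
    for r in runs:
        by_task[r["task"]].append(r)

    # Step (1): failed run vs. a successful run on the same task, different model.
    pairs = [
        (fail, ok)
        for task_runs in by_task.values()
        for fail in task_runs if not fail["passed"]
        for ok in task_runs if ok["passed"] and ok["model"] != fail["model"]
    ]

    for fail, ok in pairs:
        print(f"{fail['task']}: {fail['model']} failed where {ok['model']} succeeded")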
GPT-5.1 Codex starts more long-running jobs, like training and password cracking, that result in timeouts. But Terminal-Bench's system prompt *never mentions the timeout constraint*. GPT-5.1 Codex may be choosing viable long-horizon strategies!
Lower benchmark numbers don't always mean worse models. Docent exposes what actually drives bottom-line numbers: broken environments, reward hacking, or in this case, a constraint the agent isn't aware of.
You can replicate our full analysis with 5 min of setup. Clone our Terminal-Bench data & follow along: transluce.org/docent/blog/...
Use Docent to analyze your own traces: docs.transluce.org/quickstart
Read our blog: transluce.org/docent/blog/...