Hello world! Transluce is excited to begin crossposting on bluesky. You can learn more about our work at transluce.org, and read a letter from co-founders Jacob Steinhardt and Sarah Schwettmann here: transluce.org/introducing-...
Why does GPT-5.1 Codex score 6.5% worse than GPT-5 Codex on Terminal-Bench, with the same scaffold? 🧵
GPT-5.1 times out at ~2x the rate of GPT-5. Excluding timeouts, GPT-5.1 wins by 7.2%. We analyzed 256M+ tokens of traces and found this in under an hour. Here's how 👇
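To make those aggregates concrete, here's a minimal plain-Python sketch of the two numbers above: timeout rate per model, and pass rate counting only runs that finished. The records and field names ("model", "timed_out", "passed") are illustrative assumptions, not Terminal-Bench's or Docent's actual schema.

    # Hypothetical per-run records; in practice these come from parsed traces.
    runs = [
        {"model": "gpt-5-codex",   "timed_out": False, "passed": True},
        {"model": "gpt-5-codex",   "timed_out": True,  "passed": False},
        {"model": "gpt-5.1-codex", "timed_out": True,  "passed": False},
        {"model": "gpt-5.1-codex", "timed_out": False, "passed": True},
    ]

    def summarize(model: str) -> dict:
        """Timeout rate overall, plus pass rate among runs that finished."""
        mine = [r for r in runs if r["model"] == model]
        finished = [r for r in mine if not r["timed_out"]]
        return {
            "timeout_rate": sum(r["timed_out"] for r in mine) / len(mine),
            "pass_rate_excl_timeouts": sum(r["passed"] for r in finished) / len(finished),
        }

    for model in ("gpt-5-codex", "gpt-5.1-codex"):
        print(model, summarize(model))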
Docent is our tool for debugging agents by analyzing traces at scale. Docent (1) compared each failed run to a successful run on the same task by a different model, (2) synthesized the failures by model, (3) quantified the timeout rates.
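For intuition, here's what step (1) looks like in plain Python: pair each failed run with a successful run on the same task by a different model, so the contrast isolates model behavior. This is a sketch of the idea, not Docent's API, and the record fields ("task", "model", "passed") are assumed for illustration.

    from collections import defaultdict

    # Illustrative trace records, one per (task, model) run.
    runs = [
        {"task": "crack-password", "model": "gpt-5.1-codex", "passed": False},
        {"task": "crack-password", "model": "gpt-5-codex",   "passed": True},
        {"task": "train-model",    "model": "gpt-5.1-codex", "passed": False},
    ]

    by_task = defaultdict(list)
    for r in runs:
        by_task[r["task"]].append(r)

    # Step (1): failed run vs. a successful run on the same task, different model.
    pairs = [
        (fail, ok)
        for task_runs in by_task.values()
        for fail in task_runs if not fail["passed"]
        for ok in task_runs if ok["passed"] and ok["model"] != fail["model"]
    ]

    for fail, ok in pairs:
        print(f"{fail['task']}: {fail['model']} failed where {ok['model']} succeeded")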
GPT-5.1 Codex starts more long-running jobs, like training and password cracking, that result in timeouts. But Terminal-Bench's system prompt *never mentions the timeout constraint*. GPT-5.1 Codex may be choosing viable long-horizon strategies!
Lower benchmark numbers don't always mean worse models. Docent exposes what actually drives bottom-line numbers: broken environments, reward hacking, or in this case, a constraint the agent isn't aware of.
You can replicate our full analysis with 5 min of setup. Clone our Terminal-Bench data & follow along: transluce.org/docent/blog/...
Use Docent to analyze your own traces: docs.transluce.org/quickstart
Read our blog: transluce.org/docent/blog/...