Tom Adamczewski (@tadamcz)

Remove JSON indentation inside `.eval` archives for a 10-50% size reduction by tadamcz · Pull Request #3445 · UKGovernmentBEIS/inspect_ai Summary Write compact JSON (no whitespace) into .eval ZIP archives instead of indent=2 formatted JSON. Impact I stumbled upon this while investigating the size of some large eval files. After diggi...

github.com/UKGovernmen...

09.03.2026 22:41 👍 0 🔁 0 💬 0 📌 0

best ratio of diff size to impact

I noticed Inspect was storing pretty-printed JSON. Probably an innocuous decision at the time. But nowadays on complex evals with deeply nested JSON, this can easily lead to >1GB of indentation whitespace!

09.03.2026 22:41 👍 0 🔁 0 💬 1 📌 0

I've been loving the discourse on Anthropic's C compiler!

Some say the task is cherry-picked to be easy for LLMs.

What are programs that you think Opus 4.6 armed with a test suite can NOT replicate? (For concreteness, budget is 1bn input tokens, 10M output)

22.02.2026 11:00 👍 0 🔁 0 💬 0 📌 0

On Claude's C compiler: the tests aren't actually in the repo shared by Anthropic!

Is this on purpose? Or I really the first to notice this?!?

21.02.2026 00:45 👍 0 🔁 0 💬 0 📌 0

Cagnotte pour Sonia 🚨TOUJOURS OUVERTE🚨 Aider Sonia, héroïne oubliée du 13-novembre

Sonia, l'héroïne du 13 Novembre 2015 ayant dénoncé l'un des terroristes, est contrainte de vivre cachée sous protection policière depuis 10 ans. J'ai fait un don à sa cagnotte. Voici le lien si vous souhaitez contribuer

www.ulule.com/cagnotte-po...

14.02.2026 23:20 👍 1 🔁 0 💬 0 📌 0

GitHub - tadamcz/opus-4-6-frontiermath-public: Vibe mathematics Vibe mathematics. Contribute to tadamcz/opus-4-6-frontiermath-public development by creating an account on GitHub.

Here are the files: github.com/tadamcz/opu...

12.02.2026 22:54 👍 0 🔁 0 💬 0 📌 0

TIK2: Both models get the right polynomial based on domain knowledge, without proving uniqueness. I'm not entirely sure whether the author would consider this to be unintended?

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

RSG1 - clean solve from Opus 4.6

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

RAP1: both models solve by intended method

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

PLD1: another interesting one: Opus 4.6 just guesses the right formula based on empirical checks rather than deriving it.

I suppose the formula is pretty simple?

P_i = (n - i - 1)/(2(n - i + 1))

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

OVE2: Opus 4.6 gets the right answer, but shortcuts the intended mathematical difficulty!

> ...Instead of Banach space theory, it treated the problem as combinatorial tree optimization...

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

LTI1: Opus 4.6 solves as intended

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

FMT1: old model makes a wrong guess based on numerics, new model solves as intended

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

CWD31 solved as intended

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

CWA2 is solved as intended

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

Both models here just guess wrong based on numerical estimates

Opus 4.6's full transcript is here: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

I'd guess this technique probably works pretty well; but if real mathematicians want to check they agree with these assessments, that would be awesome!

Next 10 tweets are the details for each problem. Link to all the analysis files at the end.

12.02.2026 22:54 👍 0 🔁 0 💬 1 📌 0

On the 10 public FrontierMath questions, Opus 4.5 scored 4/10, but Opus 4.6 jumped to 9/10.

I don't understand the math, but I used reference solutions from mathematicians to ask an LLM: did the AIs solve the problem as likely intended by the author, or find shortcuts?

12.02.2026 22:54 👍 2 🔁 0 💬 1 📌 0

Beware of this insidious failure mode when using subagents

02.02.2026 13:13 👍 0 🔁 0 💬 0 📌 0

it uses the term for tasks that are definitely not "small exercises"

22.01.2026 17:18 👍 0 🔁 0 💬 0 📌 0

gpt-5.2-codex LOVES to call certain programming tasks a “kata” (it's a martial arts term for an exercise).

Codewars post-training detected

22.01.2026 17:18 👍 0 🔁 0 💬 1 📌 0

bops.fyi – Explore Your Complete Spotify History Upload your Spotify data and explore your entire listening history. See your top artists and tracks over any time period, discover forgotten favorites, and visualize how your taste has evolved.

Here's the link. Give it a try, and maybe share it with your friends?

bops.fyi

16.01.2026 14:33 👍 0 🔁 1 💬 0 📌 0

Special thanks go to Claude 4.5 Opus. Claude is still not great at reasoning about task queues / race conditions, but boyyyy can it churn out CRUD code.

16.01.2026 14:33 👍 0 🔁 0 💬 1 📌 0

This data isn't available anywhere in the Spotify UI or API. The only way to get it is to request your GDPR data export from Spotify. They'll send it to you in about a day. Then you upload it to bops.fyi (I'll never share or sell your data; details in FAQ).

16.01.2026 14:33 👍 0 🔁 0 💬 1 📌 0

Rediscover old favorites

16.01.2026 14:33 👍 0 🔁 0 💬 1 📌 0

See your Biggest Obsessions: the music you binged the hardest in a short period.

I now know that on January 19, 2022, I spent 1 hour 48 minutes (11% of my waking hours) listening to a cheesy Elton John / Dua Lipa remix.

16.01.2026 14:33 👍 1 🔁 0 💬 1 📌 0

🎉 New music nerd tool: bops․fyi. Import your *entire* Spotify stream history (9 years of data for me!). See fun/terrifying facts about your music (that Spotify Wrapped won't show you).

e.g. I have given 34 full hours of my life to Taylor Swift's "evermore"

16.01.2026 14:33 👍 0 🔁 1 💬 1 📌 0

Link is broken

13.01.2026 21:53 👍 0 🔁 0 💬 1 📌 0

The top 1% owned 70% of all the wealth in Britain in 1900

12.01.2026 20:12 👍 2 🔁 1 💬 0 📌 0

Should countries like Germany or Spain consider acquiring a nuclear deterrent, like France and the UK have? I feel like a lunatic saying this out loud, and it's probably still a bad idea on balance. But countries should plan for the potential of radically more dangerous futures.

06.01.2026 22:57 👍 0 🔁 0 💬 0 📌 0

Tom Adamczewski

Latest posts by Tom Adamczewski @tadamcz