Tom Adamczewski's Avatar

Tom Adamczewski

@tadamcz

senior technology brother @epochai.bsky.social tadamcz.com πŸ“London

167
Followers
184
Following
238
Posts
04.07.2023
Joined
Posts Following

Latest posts by Tom Adamczewski @tadamcz

Preview
Remove JSON indentation inside `.eval` archives for a 10-50% size reduction by tadamcz Β· Pull Request #3445 Β· UKGovernmentBEIS/inspect_ai Summary Write compact JSON (no whitespace) into .eval ZIP archives instead of indent=2 formatted JSON. Impact I stumbled upon this while investigating the size of some large eval files. After diggi...

github.com/UKGovernmen...

09.03.2026 22:41 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image

best ratio of diff size to impact

I noticed Inspect was storing pretty-printed JSON. Probably an innocuous decision at the time. But nowadays on complex evals with deeply nested JSON, this can easily lead to >1GB of indentation whitespace!

09.03.2026 22:41 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I've been loving the discourse on Anthropic's C compiler!

Some say the task is cherry-picked to be easy for LLMs.

What are programs that you think Opus 4.6 armed with a test suite can NOT replicate? (For concreteness, budget is 1bn input tokens, 10M output)

22.02.2026 11:00 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image Post image

On Claude's C compiler: the tests aren't actually in the repo shared by Anthropic!

Is this on purpose? Or I really the first to notice this?!?

21.02.2026 00:45 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Preview
Cagnotte pour Sonia 🚨TOUJOURS OUVERTE🚨 Aider Sonia, héroïne oubliée du 13-novembre

Sonia, l'héroïne du 13 Novembre 2015 ayant dénoncé l'un des terroristes, est contrainte de vivre cachée sous protection policière depuis 10 ans. J'ai fait un don à sa cagnotte. Voici le lien si vous souhaitez contribuer

www.ulule.com/cagnotte-po...

14.02.2026 23:20 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Preview
GitHub - tadamcz/opus-4-6-frontiermath-public: Vibe mathematics Vibe mathematics. Contribute to tadamcz/opus-4-6-frontiermath-public development by creating an account on GitHub.

Here are the files: github.com/tadamcz/opu...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image

TIK2: Both models get the right polynomial based on domain knowledge, without proving uniqueness. I'm not entirely sure whether the author would consider this to be unintended?

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

RSG1 - clean solve from Opus 4.6

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

RAP1: both models solve by intended method


Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

PLD1: another interesting one: Opus 4.6 just guesses the right formula based on empirical checks rather than deriving it.

I suppose the formula is pretty simple?

P_i = (n - i - 1)/(2(n - i + 1))


Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

OVE2: Opus 4.6 gets the right answer, but shortcuts the intended mathematical difficulty!

> ...Instead of Banach space theory, it treated the problem as combinatorial tree optimization...

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

LTI1: Opus 4.6 solves as intended

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

FMT1: old model makes a wrong guess based on numerics, new model solves as intended

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

CWD31 solved as intended


Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

CWA2 is solved as intended

Opus 4.6 transcript: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Both models here just guess wrong based on numerical estimates

Opus 4.6's full transcript is here: logs.epoch.ai/inspect-vie...

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I'd guess this technique probably works pretty well; but if real mathematicians want to check they agree with these assessments, that would be awesome!

Next 10 tweets are the details for each problem. Link to all the analysis files at the end.

12.02.2026 22:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

On the 10 public FrontierMath questions, Opus 4.5 scored 4/10, but Opus 4.6 jumped to 9/10.

I don't understand the math, but I used reference solutions from mathematicians to ask an LLM: did the AIs solve the problem as likely intended by the author, or find shortcuts?

12.02.2026 22:54 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Beware of this insidious failure mode when using subagents

02.02.2026 13:13 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

it uses the term for tasks that are definitely not "small exercises"

22.01.2026 17:18 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Post image

gpt-5.2-codex LOVES to call certain programming tasks a β€œkata” (it's a martial arts term for an exercise).

Codewars post-training detected

22.01.2026 17:18 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Preview
bops.fyi – Explore Your Complete Spotify History Upload your Spotify data and explore your entire listening history. See your top artists and tracks over any time period, discover forgotten favorites, and visualize how your taste has evolved.

Here's the link. Give it a try, and maybe share it with your friends?

bops.fyi

16.01.2026 14:33 πŸ‘ 0 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

Special thanks go to Claude 4.5 Opus. Claude is still not great at reasoning about task queues / race conditions, but boyyyy can it churn out CRUD code.

16.01.2026 14:33 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

This data isn't available anywhere in the Spotify UI or API. The only way to get it is to request your GDPR data export from Spotify. They'll send it to you in about a day. Then you upload it to bops.fyi (I'll never share or sell your data; details in FAQ).

16.01.2026 14:33 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

Rediscover old favorites

16.01.2026 14:33 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

See your Biggest Obsessions: the music you binged the hardest in a short period.

I now know that on January 19, 2022, I spent 1 hour 48 minutes (11% of my waking hours) listening to a cheesy Elton John / Dua Lipa remix.

16.01.2026 14:33 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

πŸŽ‰ New music nerd tool: bopsβ€€fyi. Import your *entire* Spotify stream history (9 years of data for me!). See fun/terrifying facts about your music (that Spotify Wrapped won't show you).

e.g. I have given 34 full hours of my life to Taylor Swift's "evermore"

16.01.2026 14:33 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0

Link is broken

13.01.2026 21:53 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Post image

The top 1% owned 70% of all the wealth in Britain in 1900

12.01.2026 20:12 πŸ‘ 2 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

Should countries like Germany or Spain consider acquiring a nuclear deterrent, like France and the UK have? I feel like a lunatic saying this out loud, and it's probably still a bad idea on balance. But countries should plan for the potential of radically more dangerous futures.

06.01.2026 22:57 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0