best ratio of diff size to impact
I noticed Inspect was storing pretty-printed JSON. Probably an innocuous decision at the time. But nowadays on complex evals with deeply nested JSON, this can easily lead to >1GB of indentation whitespace!
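To get a feel for how much indentation can cost, here is a minimal sketch that serializes the same nested object both ways and compares sizes. The `make_nested` helper and its data are hypothetical stand-ins for an eval log, not Inspect's actual format or code.

```python
import json

def make_nested(depth, width=3):
    """Build a hypothetical deeply nested record (not a real Inspect log)."""
    if depth == 0:
        return {"score": 1.0, "tokens": [1, 2, 3]}
    return {"children": [make_nested(depth - 1, width) for _ in range(width)]}

def json_sizes(obj, indent=2):
    """Return (pretty_size, compact_size) in characters for the same object."""
    pretty = json.dumps(obj, indent=indent)
    # separators=(",", ":") drops the spaces json.dumps adds by default
    compact = json.dumps(obj, separators=(",", ":"))
    return len(pretty), len(compact)

sample = make_nested(depth=6)
pretty_size, compact_size = json_sizes(sample)
print(f"pretty:  {pretty_size} chars")
print(f"compact: {compact_size} chars")
print(f"whitespace overhead: {1 - compact_size / pretty_size:.0%}")
```

The deeper the nesting, the more leading spaces every line carries, so the overhead grows with depth.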
I've been loving the discourse on Anthropic's C compiler!
Some say the task is cherry-picked to be easy for LLMs.
What are programs that you think Opus 4.6 armed with a test suite can NOT replicate? (For concreteness, budget is 1bn input tokens, 10M output)
On Claude's C compiler: the tests aren't actually in the repo shared by Anthropic!
Is this on purpose? Or am I really the first to notice this?!?
Sonia, the heroine of November 13, 2015 who reported one of the terrorists, has been forced to live in hiding under police protection for 10 years. I donated to her fundraiser. Here is the link if you'd like to contribute
www.ulule.com/cagnotte-po...
TIK2: Both models get the right polynomial based on domain knowledge, without proving uniqueness. I'm not entirely sure whether the author would consider this to be unintended?
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
RSG1 - clean solve from Opus 4.6
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
RAP1: both models solve by intended method
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
PLD1: another interesting one: Opus 4.6 just guesses the right formula based on empirical checks rather than deriving it.
I suppose the formula is pretty simple?
P_i = (n - i - 1)/(2(n - i + 1))
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
OVE2: Opus 4.6 gets the right answer, but shortcuts the intended mathematical difficulty!
> ...Instead of Banach space theory, it treated the problem as combinatorial tree optimization...
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
LTI1: Opus 4.6 solves as intended
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
FMT1: old model makes a wrong guess based on numerics, new model solves as intended
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
CWD31 solved as intended
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
CWA2 is solved as intended
Opus 4.6 transcript: logs.epoch.ai/inspect-vie...
Both models here just guess wrong based on numerical estimates
Opus 4.6's full transcript is here: logs.epoch.ai/inspect-vie...
I'd guess this technique probably works pretty well; but if real mathematicians want to check they agree with these assessments, that would be awesome!
Next 10 tweets are the details for each problem. Link to all the analysis files at the end.
On the 10 public FrontierMath questions, Opus 4.5 scored 4/10, but Opus 4.6 jumped to 9/10.
I don't understand the math, but I used reference solutions from mathematicians to ask an LLM: did the AIs solve the problem as likely intended by the author, or find shortcuts?
Beware of this insidious failure mode when using subagents
it uses the term for tasks that are definitely not "small exercises"
gpt-5.2-codex LOVES to call certain programming tasks a "kata" (it's a martial arts term for an exercise).
Codewars post-training detected
Here's the link. Give it a try, and maybe share it with your friends?
bops.fyi
Special thanks go to Claude 4.5 Opus. Claude is still not great at reasoning about task queues / race conditions, but boyyyy can it churn out CRUD code.
This data isn't available anywhere in the Spotify UI or API. The only way to get it is to request your GDPR data export from Spotify. They'll send it to you in about a day. Then you upload it to bops.fyi (I'll never share or sell your data; details in FAQ).
Rediscover old favorites
See your Biggest Obsessions: the music you binged the hardest in a short period.
I now know that on January 19, 2022, I spent 1 hour 48 minutes (11% of my waking hours) listening to a cheesy Elton John / Dua Lipa remix.
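A stat like this can be sketched from the export roughly as follows. This is only an illustration, not how bops.fyi computes it: it assumes each stream record has `ts`, `ms_played`, and `master_metadata_track_name` fields, groups by UTC date, and assumes 16 waking hours per day.

```python
from collections import defaultdict
from datetime import datetime

def biggest_daily_obsession(streams, waking_hours=16):
    """Find the (day, track) pair with the most listening time.

    Assumes records shaped like {"ts": ISO timestamp, "ms_played": int,
    "master_metadata_track_name": str or None}. Days are UTC dates.
    """
    totals = defaultdict(int)
    for s in streams:
        track = s["master_metadata_track_name"]
        if track is None:  # skip records with no track name
            continue
        day = datetime.fromisoformat(s["ts"].replace("Z", "+00:00")).date()
        totals[(day, track)] += s["ms_played"]
    (day, track), ms = max(totals.items(), key=lambda kv: kv[1])
    minutes = ms / 60_000
    share = minutes / (waking_hours * 60)  # fraction of assumed waking time
    return day, track, minutes, share

# Hypothetical sample data: two listening sessions of the same track.
streams = [
    {"ts": "2022-01-19T09:00:00Z", "ms_played": 60 * 60_000,
     "master_metadata_track_name": "Example Remix"},
    {"ts": "2022-01-19T18:00:00Z", "ms_played": 48 * 60_000,
     "master_metadata_track_name": "Example Remix"},
    {"ts": "2022-01-20T10:00:00Z", "ms_played": 3 * 60_000,
     "master_metadata_track_name": "Other Song"},
]
day, track, minutes, share = biggest_daily_obsession(streams)
print(day, track, f"{minutes:.0f} min", f"{share:.0%} of waking hours")
```

With 16 waking hours, 108 minutes works out to 108 / 960, or about 11%.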
New music nerd tool: bops.fyi. Import your *entire* Spotify stream history (9 years of data for me!). See fun/terrifying facts about your music (that Spotify Wrapped won't show you).
e.g. I have given 34 full hours of my life to Taylor Swift's "evermore"
Link is broken
The top 1% owned 70% of all the wealth in Britain in 1900
Should countries like Germany or Spain consider acquiring a nuclear deterrent, like France and the UK have? I feel like a lunatic saying this out loud, and it's probably still a bad idea on balance. But countries should plan for the potential of radically more dangerous futures.