www.nytimes.com/2025/04/21/n...
Willy Ley was a pioneer of rocketry who fled the Nazis. Last year they found his ashes in a basement in Manhattan, and there's some beautiful talk of scattering them on the moon
little test of Gemini 3.1 Pro:
"Output SVG as XML of a tiger riding a bicycle"
"Output SVG as XML of a pelican riding a tiger"
"Output SVG as XML of a pelican riding a bicycle"
At FHI I used to have a running debate with Nick Bostrom where he suggested I should read more history and I suggested he should read more SF. This essay by @adapalmer.bsky.social makes a great argument for both.
strangehorizons.com/wordpress/no...
you do you, but unless your aesthetics are at risk, I think the social stigma of releasing pre-alpha is over now; no one has to read it
The grand aim of this research programme is to decompose benchmark gains / apparent AI progress into 5 estimates:
1. benchmaxxing (memorising exact duplicates, rephrasing, etc)
2. usemaxxing (RLing narrow capabilities)
3. hidden interpolation / local generalisation
4. OOD generalisation
5. cheating
This is preliminary work on a shoestring - we didn't get at the big questions yet ("what share of benchmark gains come from interpolation over a hidden training corpus?", "does this even matter?")
And local generalisation across very different strings is anyway pretty miraculous
So: semantic duplicates are at least a moderately big deal, and this probably transfers to frontier models to some degree.
The above are probably underestimates too (since our detection pipeline was cheapo).
Fourthly, we guess that 4 in 10,000 training datapoints are a strong semantic duplicate of a given benchmark datapoint (where strong just means "obvious to Gemini")
Thirdly, we generated 10k synthetic duplicates of MuSR, ZebraLogic, and MBPP problems and finetuned on them.
* MuSR +22pp. Semantic duplicates as strong as exact
* ZebraLogic +12pp. Exact much stronger
* MBPP +17pp. Exact stronger
Secondly, every single MBPP test example and 78% of CodeForces test examples have semantic duplicates (that is, some training data equivalent to items of the test set)
Firstly: we were surprised to find exact duplicates of test data for one reported benchmark. 70% of harder tasks had an exact match. But the spurious performance gain wasn't so large, at most +4pp and this was genuinely just an honest implementation error from AllenAI.
We experiment on OLMo 3, one of the only really good models with open training data.
Since we have its entire training corpus, we can exhaustively check for real "natural" duplicates and finetune it to estimate their impact. We embed the entire Dolma Instruct corpus.
How much does this process catch? How many semantic duplicates of test data slip through? And what's the impact on final benchmark scores?
We don't know. This (finally) is where our paper comes in:
So you do what you can - maybe you
* categorise the entire corpus & do intense search inside relevant partitions (e.g. maths > number theory > ...)
* embed the whole corpus & look for things really close to test data
* train a wee 300M filter model & do what you can with that
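The embedding route above can be sketched like this. A toy bag-of-words count vector stands in for a real neural embedding, and the 0.9 threshold and helper names are illustrative, not the paper's pipeline:

```python
# Toy sketch of embedding-based decontamination: flag corpus items that
# sit suspiciously close to any test item in embedding space.
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words count vector (a real pipeline
    would use a neural encoder over the whole corpus)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def near_test_items(corpus, test_set, threshold=0.9):
    """Return corpus items within `threshold` cosine similarity of any test item."""
    test_vecs = [embed(t) for t in test_set]
    return [doc for doc in corpus
            if any(cosine(embed(doc), tv) >= threshold for tv in test_vecs)]

corpus = ["solve x + y = 10 please", "the cat sat on the mat"]
tests_ = ["solve x + y = 10"]
print(near_test_items(corpus, tests_))  # only the first item is flagged
```

At corpus scale this becomes approximate nearest-neighbour search over precomputed embeddings rather than pairwise loops, but the flagging logic is the same.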
The cutting-edge tech for detecting these "semantic" duplicates is... an LLM. But you simply can't do 100T x 1M calls. There's not enough compute in the world (yet).
But! every piece of test data has an arbitrary number of logical equivalents and neighbours (like how `x + y = 10` is the same problem as `2x + 2y = 20`). And LLMs are amazing at semantic search, so maybe this inflates benchmark scores.
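One way to see "arbitrary number of logical equivalents": a single test equation can be rewritten into as many surface forms as you like by scaling it. The helper name and parameters here are illustrative:

```python
# Each scaled rewrite of x + y = 10 is the same problem in a new surface form.
def scaled_variants(a=1, b=1, c=10, ks=range(2, 6)):
    """Rewrite a*x + b*y = c as (k*a)x + (k*b)y = k*c for each scale factor k."""
    return [f"{k*a}x + {k*b}y = {k*c}" for k in ks]

print(scaled_variants())
# → ['2x + 2y = 20', '3x + 3y = 30', '4x + 4y = 40', '5x + 5y = 50']
```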
The industry standard for this is just one level above string matching ("n-gram matching" - if sentences overlap in (say) a 13-token window, remove them from the training corpus).
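A minimal sketch of that n-gram filter (whitespace tokenisation and the function names are my assumptions; real pipelines use proper tokenisers). An exact copy gets caught, while a rephrased semantic duplicate slips straight through:

```python
# Industry-standard-style decontamination: drop training docs that share
# any 13-token window with a test item.
def ngrams(tokens, n=13):
    """All contiguous n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc, test_doc, n=13):
    """True if the training doc shares at least one n-token window with the test doc."""
    return bool(ngrams(train_doc.split(), n) & ngrams(test_doc.split(), n))

test = "find integers x and y such that x + y = 10 and x - y = 2 ."
exact_copy = "here is a problem : find integers x and y such that x + y = 10 and x - y = 2 . solve it"
rephrased = "pick whole numbers a and b with a + b equal to ten and a minus b equal to two"

print(is_contaminated(exact_copy, test))  # True: a shared 13-gram triggers removal
print(is_contaminated(rephrased, test))   # False: the semantic duplicate survives
```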
But you're actually trying, so you also translate the test sets and delete translations of test from train.
What can you do? Well, obviously you take every benchmark you're going to test on and try to "decontaminate" your training corpus (remove test data from the training data).
Imagine you're head of training at OpenAI, and you want your benchmark scores to be meaningful (: to estimate OOD performance)
You have a hard task ahead of you! Your models have seen so much, memorisation is so easy - as is *shallow generalisation* (impressive approximate pattern-matching).
tl;dr
* the OLMo training corpus contains exact duplicates of 50% of the ZebraLogic test set.
* We embed the corpus to find semantic duplicates of test data in the wild. 78% of the CodeForces test set had >=1 semantic duplicate. Not just that:
* The semantic duplicate rate is maybe >4 in 10000
New paper on a long-shot I've been obsessed with for a year:
How much are AI reasoning gains confounded by expanding the training corpus 10000x? How much LLM performance is down to "shallow" generalisation (approximate pattern-matching to highly-related training data)?
t.co/CH2vP0Y7OF
www.gleech.org/enhance
www.gleech.org/ai2025