The image is a multi-panel bar chart comparing performance of different large language models across several benchmarks. It is divided into four categories: General Domains, Agentic Tool Use, Code, and Instruction Following. Each panel has bars representing model results, with scores on the y-axis.
Top row – General Domains:
• ArenaHard-V2: Kimi K2 leads with 88.2, followed by LongGPT-Flash (86.5), Qwen3.5 MoE-2507 (85.7), DeepSeek V3.1 (84.1), Gemini 2.5 Flash (77.0), GPT-4.1 (62.1), and Claude Sonnet (61.5).
• MMLU-Pro: Kimi K2 and DeepSeek V3.1 share the best score (84.5), with Claude Sonnet (83.7), LongGPT-Flash (82.7), Qwen3.5 MoE-2507 (82.1), Gemini 2.5 Flash (82.0), and GPT-4.1 (81.7).
Top row – Agentic Tool Use:
• τ²-Bench (average): LongGPT-Flash leads (67.7), Kimi K2 (64.2), Claude Sonnet (62.1), GPT-4.1 (55.1), DeepSeek V3.1 (49.8), Qwen3.5 MoE-2507 (43.0), Gemini 2.5 Flash (40.9).
• VitaBench: LongGPT-Flash 24.3, Claude Sonnet 23.0, DeepSeek V3.1 20.3, GPT-4.1 19.0, Kimi K2 18.2, Qwen3.5 MoE-2507 8.5, Gemini 2.5 Flash 8.0.
Bottom row – Code:
• SWE-Bench-Verified: Claude Sonnet leads with 68.0, DeepSeek V3.1 66.0, Kimi K2 64.6, LongGPT-Flash 60.4, GPT-4.1 48.6, Qwen3.5 MoE-2507 42.0, Gemini 2.5 Flash 40.6.
• TerminalBench: Claude Sonnet 40.7, LongGPT-Flash 39.5, DeepSeek V3.1 31.3, GPT-4.1 28.4, Kimi K2 25.9, Qwen3.5 MoE-2507 17.3, Gemini 2.5 Flash 12.4.
Bottom row – Instruction Following:
• COLLIE: LongGPT-Flash 57.1, Kimi K2 56.3, Claude Sonnet 51.2, GPT-4.1 50.0, DeepSeek V3.1 49.7, Gemini 2.5 Flash 48.6, Qwen3.5 MoE-2507 43.8.
• Meeseeks (ZH): LongGPT-Flash 43.0, Kimi K2 42.8, Claude Sonnet 41.5, DeepSeek V3.1 35.3, GPT-4.1 35.1, Gemini 2.5 Flash 34.8, Qwen3.5 MoE-2507 33.8.
Longcat-Flash-Chat (560B)
uh, holy shit, this one is intriguing. at bare minimum they compare themselves to all the (actual) top models and do okay
but inside... damn, this one has some cool ideas
huggingface.co/meituan-long...
31.08.2025 11:20
In 2012 when I had to clean data it seemed natural to look for rules I could use to clean it.
Now it seems natural to model the noise, find new clean data it can destroy, and then train a model to reverse the process.
Machine learning makes you a sicko.
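That shift can be made concrete with a toy sketch (the signal, noise model, and names here are all mine, not from any particular paper): pick an explicit noise model, destroy fresh clean data with it, and fit a model to reverse the corruption.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_signal(n=512):
    # Clean data: a smooth 1-D signal.
    t = np.linspace(0, 4 * np.pi, n)
    return np.sin(t) + 0.5 * np.sin(3 * t)

def corrupt(x, scale=0.3):
    # Step 1: an explicit noise model (additive Gaussian here).
    return x + rng.normal(0, scale, size=x.shape)

def windows(x, k=7):
    # Sliding windows of length 2k+1 around each interior sample.
    return np.stack([x[i - k:i + k + 1] for i in range(k, len(x) - k)])

k = 7
# Step 2: take fresh clean data and destroy it with the noise model...
clean = make_signal()
noisy = corrupt(clean)

# ...then train a model (plain least squares here) to reverse the
# corruption: predict the clean center sample from a noisy window.
X, y = windows(noisy, k), clean[k:-k]
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Apply the learned reverse process to new noisy data.
test_clean = make_signal()
test_noisy = corrupt(test_clean)
denoised = windows(test_noisy, k) @ w

err_before = np.mean((test_noisy[k:-k] - test_clean[k:-k]) ** 2)
err_after = np.mean((denoised - test_clean[k:-k]) ** 2)
```

The 2012-style approach would hand-write cleaning rules; here the only hand-written part is the corruption process, and the cleaner is learned from (noisy, clean) pairs.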
27.07.2025 11:16
Three things to note about this:
1) AI has obvious utility to many; this is a tremendous amount of use already
2) There is room for multiple frontier model providers, at least for now
3) Any losses from subsidizing cost of AI use (and it is not clear this is happening) are now relatively small
26.07.2025 19:33
The above is intuitive when you think about it long enough (or so it feels, at least), but I missed it entirely during a couple of years working on diffusion, so I figured it was worth emphasizing, and the authors did too :)
26.07.2025 22:19
Worth a deep read in general (I'm not completely done with it myself); I hope it ages well. Closing with a nice insight wrt diffusion models: they don't open up for serial awareness, since the model iterates on _the same_ solution; there's no state space or carry-over. _Less_ powerful than autoregressive
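A caricature in code of that serial-awareness point (pseudo-models, entirely my own toy, not the paper's formalism):

```python
def diffusion_generate(refine, x0, steps):
    # Diffusion caricature: every step rewrites THE SAME solution;
    # nothing is appended, so no growing state carries serial results.
    x = x0
    for t in range(steps):
        x = refine(x, t)
    return x

def autoregressive_generate(next_token, prompt, steps):
    # Autoregressive caricature: each step appends to a growing
    # sequence, so intermediate results persist -- a scratchpad
    # that serial computation can build on.
    seq = list(prompt)
    for _ in range(steps):
        seq.append(next_token(seq))
    return seq

refined = diffusion_generate(lambda x, t: x + 1, 0, 5)
trace = autoregressive_generate(lambda s: s[-1] + 1, [0], 5)
```

The structural difference is just `x = refine(x)` versus `seq.append(...)`: only the latter accumulates intermediate state.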
26.07.2025 22:19
The paper cannot prove its point completely, since models are really good approximators and are used as such (hence a formal disproof is not enough). Pretty good hints still; makes me confident we're far from peak efficiency in most use cases (we approximate serial awareness by adding tons of compute)
26.07.2025 22:16
I think the hardware recommendations are a little naive/premature; as much as I like CPUs, nothing will happen before needs and solutions are put on the table. Lowering is expensive and risky in general and will happen last, but at least this shows there's kryptonite to GPU dominance
26.07.2025 22:10
The paper is very pedagogical, and some takeaways ring pretty reasonable. The intuition behind LLMs being just-OK-to-not-great chess players (they miss the MCTS-like mechanism of specialized models), or failing at effective multi-step reasoning prior to test-time compute / CoT, is interesting
26.07.2025 22:08
It then feels like the dichotomy proposed by the paper (inherently parallel, TC0 models will fail on serial problems) is excessive, or at least that the frontier is a bit fuzzy. One line is great though, paraphrasing: "only with test-time compute did we factor in some serial compute power"
26.07.2025 22:05
There are caveats in the definition of "inherently serial" problems:
- not all solutions will require serial computation, even for something outside of TC0
- approximations can fall pretty close, and oftentimes we don't expect anything much better than an approximation
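To make "inherently serial" concrete, here's a toy contrast (my example, not from the paper): a hash chain, where step n needs step n-1's output, versus an associative reduction that parallelizes.

```python
import hashlib

def hash_chain(seed: bytes, n: int) -> bytes:
    # Each step consumes the previous output, so reaching step n
    # (conjecturally) requires n sequential hash evaluations; adding
    # parallel workers does not help. That's the flavor of a problem
    # that is inherently serial rather than parallelizable.
    h = seed
    for _ in range(n):
        h = hashlib.sha256(h).digest()
    return h

# Contrast: summing n numbers also takes n operations, but
# associativity lets you split the work into a log-depth tree of
# partial sums, so it parallelizes -- the chain above does not.
out = hash_chain(b"seed", 1000)
```

The first caveat above applies: it is only conjectured (not proven) that no shortcut around the sequential evaluations exists.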
26.07.2025 22:03
"The Serial Scaling Hypothesis" (arxiv.org/abs/2507.125..., Liu et al) is interesting I think, not as new as it completely looks (autoregressive models are used serially, models have depth,..) but feels like a good formalization and intuition as of where current GPT based LLMs will typically fail
26.07.2025 21:57
1/ Can open-data models beat DINOv2? Today we release Franca, a fully open-sourced vision foundation model. Franca with ViT-G backbone matches (and often beats) proprietary models like SigLIPv2, CLIP, DINOv2 on various benchmarks setting a new standard for open-source research.
21.07.2025 14:47
Claude Code is really good for some narrowly defined tasks (add unit tests, for instance), and in that case it's clearly an agent. The "vibe coding" middle ground (with somebody in the loop who doesn't completely get it) is the part on shaky ground, I believe
18.07.2025 20:44
Something the LLMs have not seen before (a new model architecture, for instance). In my experience that's where all the current tools break, for relatable reasons. I guess it's the same for somebody developing a SOTA DB engine or a compute shader
18.07.2025 20:41
For things LLMs are not great at (typically new, frontier work) you're better off doing it yourself instead of inheriting a broken plate of spaghetti. Vibe coding your way to oblivion is not a great proposition in either case. I don't think there's that much of a middle ground
18.07.2025 09:55
In the coming age of agents, I think vibe coding will die out; same lasting power as prompt engineering. For things LLMs excel at, you might as well stick to higher-level directives and let the model own the work; Claude Code is a good example. 1/2
18.07.2025 09:52
this is probably why Meta was able to poach OpenAI ppl
aside from the absolute piles of cash, Sama is very SV-minded and can't imagine building apart from a product
a lot of accelerationists see things differently, more broadly, and it's dissatisfying to be forced into a product box
13.07.2025 22:27
Qualitatively the chunking is real and meaningful
14.07.2025 16:33
I was a bit short on the results in this thread re: HNets; they are pretty convincing, even if taking over transformers will take more validation. Of note, the models become naturally robust to typos, which is a great omen
14.07.2025 16:31
Well, you can read my thread; otherwise the link is in the first post :) the model weights are open
14.07.2025 16:25
HNets do the chunking dynamically, that's why it's a big deal for me! Otherwise, byte latent transformers were doing that already, so not exactly nothing, but not entirely mature, yes
14.07.2025 15:50
comparisons with diffusion models are not a complete hit, because the comparison is with undistilled, 1000-step models, which nobody uses in their right mind (fast samplers & distilled models mean images are clean in 4-8 steps, 30 tops). The fact that EBT is usable as-is is already great
13.07.2025 07:40
Similarly to HNets, I think the proof will be in the scaling, but there are good omens where the technique works as you would expect it to. For instance, thinking more has a bigger impact on out-of-distribution data than on in-distribution data (assuming the model was big enough to capture the training set)
13.07.2025 07:34
the big result is in the thinking: opening up the compute valves for the more complicated cases has a meaningful effect.
Note that there's an interesting operating mode attached to being able to self-assess: generate multiple options, then pick the better one (self-Monte-Carlo?)
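That self-assessment mode can be sketched in a few lines (toy energy function and all names are mine, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(candidate, target):
    # Stand-in for the model's self-assessment: lower energy = better.
    # In a real energy-based model this is learned; no known target.
    return float(np.sum((candidate - target) ** 2))

def best_of_n(target, n=16):
    # Generate n candidate "predictions", score each with the energy,
    # and keep the minimizer: the self-Monte-Carlo mode from the post.
    candidates = [rng.normal(size=target.shape) for _ in range(n)]
    best = min(candidates, key=lambda c: energy(c, target))
    return best, candidates

target = np.zeros(4)
pick, cands = best_of_n(target, n=64)
```

The point is that a model which can score its own outputs gets best-of-n selection essentially for free, trading extra samples for quality.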
13.07.2025 07:32
the paper also feels meaningful in connection to something like Transfusion (arxiv.org/abs/2408.11039), which puts language tokens and continuous image representations in the same transformer. Not the case here (no mixed models), but the EBT framing does work for both representations
13.07.2025 07:28
there are connections with diffusion/scoring all around, beyond the steps in the right direction, among which the use of noise / Langevin dynamics for exploration / thinking
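For intuition, an (unadjusted) Langevin step on a toy quadratic energy, entirely my own sketch, shows how the injected noise turns descent into exploration:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_energy(x):
    # E(x) = ||x||^2 / 2, so grad E = x; low-energy samples should
    # concentrate near 0, roughly like a standard Gaussian.
    return x

def langevin_step(x, step=0.1):
    # Gradient descent on the energy plus scaled noise: the noise is
    # what turns plain descent (converging to one answer) into
    # exploration of the whole low-energy region.
    return x - step * grad_energy(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)

x = 5.0 * np.ones(2)      # start far from the mode
for _ in range(2000):
    x = langevin_step(x)
```

Drop the noise term and this is plain gradient descent to the single minimum; keep it and the chain wanders over all low-energy configurations.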
13.07.2025 07:26
Forgot in the above, but assuming you can trust the model, it also gives you 3: how truthful the prediction is (assuming 1 and 2 don't team up effectively)
The paper runs pretty deep, beyond the initial handwave, which is nice and intuitive (the model essentially predicts a step, not the final distribution)
13.07.2025 07:23
Looks like this, and now the even more interesting bit is that it doesn't have to be about language tokens, works across modalities
13.07.2025 06:57
What this gives is twofold:
1 - whether you're done: the next-token prediction is precise enough and you can move on
2 - if not 1, where to go, by gradient-descending the energy levels (see the similarity with scoring models?)
1 is just like NTP models. 2 gives you per-token extra thinking cycles
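A minimal sketch of those two outcomes, with a toy quadratic energy standing in for the learned one (all names mine, not the paper's code):

```python
import numpy as np

def refine(pred, grad_energy, tol=1e-3, step=0.2, max_iters=100):
    # Outcome 1: gradient small -> the prediction is precise enough.
    # Outcome 2: otherwise take a step down the energy landscape;
    # each iteration is one extra per-token "thinking cycle".
    for used in range(max_iters):
        g = grad_energy(pred)
        if np.linalg.norm(g) < tol:   # outcome 1: done, move on
            return pred, used
        pred = pred - step * g        # outcome 2: keep thinking
    return pred, max_iters

# Toy quadratic energy with its minimum at [1, -1]; the gradient
# plays the role of the model's learned verifier signal.
target = np.array([1.0, -1.0])
refined, cycles = refine(np.zeros(2), lambda p: p - target)
```

Easy tokens stop after few cycles; harder ones (farther from the energy minimum, or with a tighter tolerance) automatically get more, which is the per-token adaptive compute described above.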
13.07.2025 06:52