Inspired by
@bennokrojer.bsky.social, we included a Behind the Scenes section
The goal is to make science more transparent, share lessons learned, and provide a more realistic lens on the research journey
8/
bsky.app/profile/benn...
10.03.2026 17:43
🚨New Paper!🚨 How do reasoning LLMs handle inferences that have no deterministic answer? We find that they diverge from humans in some significant ways, and fail to reflect human uncertainty… 🧵(1/10)
04.03.2026 16:13
*study in isolation
27.02.2026 14:39
True, but it's also more loopy! Not just one clean forward pass you can study
27.02.2026 14:39
Another way to put it:
With AI systems, we'll never have as privileged access to latents as we have with brains (albeit just our own, so it depends how much you think we're all roughly the same)
27.02.2026 14:37
The flip side:
With AI systems it's still very much unclear which of our human intuitions apply (anthropomorphizing) and when they're a completely different beast that requires fully new theories of cognition
27.02.2026 14:37
Sure, there are lots of fallacies and biases in that process of introspection, but I wouldn't discard subjective experience as a very strong source of insight
27.02.2026 14:30
People often say (myself too):
Interpretability on AI is so much easier than neuroscience! We can inspect everything and even retrain (vs carefully poke a little into the brain)!
One big advantage in neuroscience I often forget: We're quite literally *inside* the thing we're studying
27.02.2026 14:30
This is a more accurate example, since our method explicitly goes beyond single-token interpretations --> words in the context of a sentence/paragraph
23.02.2026 19:41
You can now "pip install latentlens"!
It comes with:
* pre-computed embeddings for several popular LLMs and VLMs
* a txt file with sentences describing WordNet concepts, which we recommend as a standard corpus to get embeddings from
* ...
Try it out and let us know what we can improve!
23.02.2026 17:01
Is interpretability at the random fact-gathering stage or beyond?
23.02.2026 03:34
Finally getting into this classic
Let's see if by the end I'll have a clearer idea what type of science some fields of AI are, like interpretability
What are our paradigms?
23.02.2026 03:34
Google decided to show this as the first sentence from my website (and not any of the sentences actually at the top of the website)
20.02.2026 16:27
Keep me posted and feel free to ping me anytime something is confusing!
16.02.2026 23:21
Re 2) this was a typo and should be "i" for token position, consistent with later uses in 3.2 and with how we use "i" in 3.1
16.02.2026 17:26
Maybe we can formulate it as: a description d is text with optional meta-data (token position, layer) that is mapped to a vector r
The general formalism is tricky, but I think the intuition is hopefully clear :)
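That formulation can be sketched in a few lines of Python. All names here (`Description`, `embed`) are illustrative stand-ins, not the LatentLens API, and the embedding function is a toy placeholder for real model hidden states:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass(frozen=True)
class Description:
    """A description d: text plus optional meta-data, mapped to a vector r.

    The same text with a different highlighted token position counts as a
    different description (e.g. "a brown *dog*" vs. "a *brown* dog").
    Illustrative only -- not the LatentLens API.
    """
    text: str
    token_position: Optional[int] = None  # which token is highlighted, if any
    layer: Optional[int] = None           # which model layer r is taken from

def embed(d: Description, dim: int = 8) -> np.ndarray:
    """Toy stand-in for d -> r; in practice r comes from model hidden states."""
    seed = hash((d.text, d.token_position, d.layer)) % 2**32
    return np.random.default_rng(seed).standard_normal(dim)

d1 = Description("a brown dog", token_position=2)  # highlights "dog"
d2 = Description("a brown dog", token_position=1)  # highlights "brown"
assert d1 != d2          # different meta-data => a different description
assert embed(d1).shape == (8,)
```

Nothing in this sketch forbids a one-to-many match, i.e. several vectors r corresponding to one description d.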
16.02.2026 17:26
[Image: LLaMA3-8B + ViT-L/14-336]
So in our case (LatentLens) I would say:
a description here is something like "a brown *dog*" and not "a brown dog", so the token position makes it a different description (this is also how we highlight it in our demo: bennokrojer.com/vlm_interp_d...)
16.02.2026 17:26
So I got a chance to look closely and you are right in both cases! Thank you for spotting this. I will upload a new version to arXiv soon with fixes
To clarify things here also:
1) in 3.1 we described things generally but missed that e.g. LatentLens would match several vectors r with a description d
16.02.2026 17:26
Thank you! Let me get back to you later today on this when I'm on my laptop
14.02.2026 21:33
What does it mean for visual tokens to be "interpretable" to an LLM? And how do we measure it?
These, and many more pressing questions, are addressed!
Introducing LatentLens -- a new, more faithful tool for interpretability! Honoured to have collaborated with
@bennokrojer.bsky.social on this!
11.02.2026 17:11
Finally on a personal note, this will be the final paper of my PhD... what a journey it has been
11.02.2026 15:10
Pivoting to interpretability this year was great, and I also wrote a blog post specifically on this:
bennokrojer.com/interp.html
11.02.2026 15:10
This is a major lesson I will keep in mind for any future project:
Test your assumptions; do not assume the field has already settled them
11.02.2026 15:10
This project was definitely accelerated and shaped by Claude Code/Cursor. Building intuitive demos in interp is now much easier
11.02.2026 15:10
Finally, we do test it empirically, finding some models where the embedding matrix of the LLM already provides decently interpretable nearest neighbors
But this was not the full story yet...
@mariusmosbach.bsky.social and @elinorpd.bsky.social nudged me to use contextual embeddings
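The empirical check described above can be sketched as a cosine-similarity nearest-neighbor lookup over an embedding matrix; the vocabulary and vectors below are toy values, not from the paper:

```python
import numpy as np

def nearest_tokens(query: np.ndarray, emb_matrix: np.ndarray,
                   vocab: list[str], k: int = 3) -> list[str]:
    """Return the k vocabulary tokens whose embeddings are most
    cosine-similar to `query` (e.g. a visual token entering the LLM)."""
    q = query / np.linalg.norm(query)
    E = emb_matrix / np.linalg.norm(emb_matrix, axis=1, keepdims=True)
    sims = E @ q                       # cosine similarity to every row
    top = np.argsort(-sims)[:k]        # indices of the k highest similarities
    return [vocab[i] for i in top]

# Toy example: 4 "token" embeddings and a query closest to "dog".
vocab = ["dog", "cat", "car", "tree"]
E = np.eye(4)
query = np.array([0.9, 0.1, 0.0, 0.0])
print(nearest_tokens(query, E, vocab, k=2))  # -> ['dog', 'cat']
```

Swapping in contextual embeddings, as suggested below, changes only where `emb_matrix` comes from; the lookup itself stays the same.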
11.02.2026 15:10
Then the project went "off-track" for a while, partially because we didn't question our assumptions enough:
We just assumed visual tokens going into an LLM would not be that interpretable (based on the literature and our intuition)
But we never fully tested it for many weeks!
11.02.2026 15:10
The initial ideation phase:
Pivoting to a new direction, wondering what kind of interp work would be meaningful, getting feedback from my lab, ...
11.02.2026 15:10
For every one of my papers, I try to include a "Behind the Scenes" section.
I think this paper in particular has a lot going on behind the scenes, from lessons learned to personal reflections.
Let me share some:
11.02.2026 15:10
@delliott.bsky.social joined the project mid-way and somehow still had so much positive influence, ideas, and energy. Good research is done with real care for detail, and you can sense Des cares about the details
11.02.2026 15:06