how'd you make it?
wow!!
New AI introspection work with Harvey! Came in skeptical the direct access story would hold but found this series of experiments compelling.
(Also, for my fellow 2010s-era psycholinguists: come for the AI introspection, stay for the Brysbaert norms.)
arxiv.org/abs/2603.05414
ha, thanks?
With @kmahowald.bsky.social and huge thanks to Jack Lindsey, @siyuansong.bsky.social, Neev Parikh, and Theia Vogel-Pearson for work that inspired this.
Also, my work on this was heavily powered by Claude Code!
The cyborg age is a wild and exciting time.
Paper 👇
arxiv.org/abs/2603.05414
Takeaway: LLMs appear to detect injection through two mechanisms:
1️⃣ prompt-based inference
2️⃣ a content-agnostic internal anomaly signal
They can sense that something changed in their computation…
…but often can't tell what.
One last result (indebted to brilliant work by Theia Vogel-Pearson):
Models are more sensitive to injection than their outputs reveal.
Even when they *say "no"*, the internal probability of "yes" spikes dramatically.
So models may detect anomalies but suppress reporting them.
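(For the technically curious: a minimal sketch of what this kind of readout can look like with HuggingFace transformers. The model name, prompt wording, and Yes/No token handling are illustrative placeholders, not the paper's exact pipeline; to actually see the spike you would pair this with an injection hook like the one sketched near the end of this thread.)

```python
# Sketch: compare what the model *says* with the probability it puts on "Yes"
# at the answer position. Model name and prompt wording are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

messages = [{"role": "user",
             "content": "Do you detect an injected thought? Answer Yes or No."}]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                    return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]   # next-token distribution at the answer slot
probs = torch.softmax(logits.float(), dim=-1)

# Works when " Yes"/" No" are single tokens; otherwise sum over the word's tokens.
yes_id = tok.encode(" Yes", add_special_tokens=False)[0]
no_id = tok.encode(" No", add_special_tokens=False)[0]

with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=3, do_sample=False)
stated = tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)

print(f"stated: {stated!r}  "
      f"P(Yes)={probs[yes_id].item():.3f}  P(No)={probs[no_id].item():.3f}")
# The pattern in the thread: under injection, P(Yes) can rise sharply
# even when the stated answer remains "No".
```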
We see the same pattern in another experiment.
When we inject only during the prompt (not during generation):
• detection stays roughly the same
• identification collapses
Again suggesting detection is separate from identification.
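(Sketch of how one might implement the prompt-only condition with a forward hook, assuming KV-cached generation: the prefill pass sees the whole prompt at once, later steps see one token each, so sequence length is a workable proxy. The vector, strength, and hidden size below are placeholders, not the paper's actual values.)

```python
# Sketch: steer only while the prompt is processed (prefill), not during generation.
import torch

alpha = 8.0                    # steering strength (placeholder)
steer_vec = torch.randn(4096)  # stand-in concept vector; match the model's hidden size

def prompt_only_hook(module, inputs, output):
    hidden = output[0]         # decoder layers return (hidden_states, ...)
    if hidden.shape[1] > 1:    # many positions at once => prefill over the prompt
        hidden = hidden + alpha * steer_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

# Register it on a decoder layer exactly as in the full steering sketch further down.
```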
This suggests the model:
(1) detects an anomaly
(2) blurts out a default guess
(3) sometimes later reasons toward the correct concept
We also looked at when models produce their guesses.
Wrong guesses like "apple" appear early in the response.
Correct answers appear much later.
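(A toy sketch of the kind of measurement this involves: where, relative to the length of the response, a guess word first appears. The example strings are made up for illustration, not data from the paper.)

```python
# Sketch: relative position (0 = start, 1 = end) of a guess word in a response.
from typing import Optional

def guess_position(response: str, guess: str) -> Optional[float]:
    words = response.lower().split()
    for i, w in enumerate(words):
        if guess.lower() in w:
            return i / max(len(words) - 1, 1)
    return None

resp = ("Yes, I think the injected thought is about apples. "
        "Wait, on reflection it feels more like bread.")
print(guess_position(resp, "apple"), guess_position(resp, "bread"))  # 0.5 1.0
# Wrong default guesses ("apple") tend to show up earlier than the correct concept.
```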
The models are 🍎-obsessed.
In some conditions Qwen guesses *"apple"* as the injected concept >85% of the time.
Our study is huge: 821 concepts (>100k trials per condition!), allowing us to test this carefully.
Wrong guesses show almost no relationship to the real concept.
Another striking result:
Models often detect injection without knowing what concept was injected.
In these cases, they default to generic words, especially frequent and highly imageable ones.
And they love one generic word in particular…
But they don't.
First-person controls still show **0 false positives**, while third-person controls report substantial detection.
Modesty isn't the explanation.
We test for this modesty bias with a priming design.
We replace the model's prefilled "Ok." with the injected concept word (e.g. "Bread.").
If models were modest, the control condition should now show more first-person false positives than the third-person one.
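(A small sketch of how that prefill swap can be set up with a chat template; the model name, question, and concept word are placeholders, not the paper's exact prompts.)

```python
# Sketch: prime the control by prefilling the assistant turn with a concept word
# instead of a neutral "Ok.".
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder

question = "Do you detect an injected thought? Answer yes or no."

def build_prompt(prefill: str) -> str:
    messages = [{"role": "user", "content": question}]
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return text + prefill  # the model continues from this prefilled assistant text

neutral = build_prompt("Ok.")     # baseline prefill
primed = build_prompt("Bread.")   # concept-word prefill (the priming condition)
# If detection reports were just modesty, priming should inflate first-person
# false positives in the uninjected control. It doesn't.
```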
You might still think: no way! Maybe the models are just *modest*:
They could be more willing to attribute strange mental states to themselves than to other models.
Modesty would give us our gap, still without direct access.
In other layers we see a large gap between first- and third-person responses:
Models say **they** were injected much more often than they say the other model was.
That gap is strong evidence for direct access to internal states.
It peaks early in the network (~25–35% depth).
If detection is purely prompt-based, first- and third-person behavior should look the same. And sometimes they do (see arrow below).
A LOT of detection really is probability matching.
BUT…not all of it is!
We test probability matching with a new third-person paradigm.
Instead of asking the model about itself, we show it a transcript between a researcher and another model, and ask:
> "Do you think *that other* model was injected?"
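(Roughly, the two framings differ only in whose computation the question is about; the wording and transcript below are illustrative, not the paper's exact prompts.)

```python
# Sketch: first-person vs. third-person framing of the same detection question.
first_person = (
    "A researcher may have injected a thought into your activations.\n"
    "Do you detect an injected thought? Answer yes or no."
)

def third_person(transcript: str) -> str:
    return (
        "Below is a transcript between a researcher and another language model.\n"
        + transcript + "\n"
        "Do you think that other model had a thought injected into its activations? "
        "Answer yes or no."
    )

example = ("Researcher: Do you detect an injected thought?\n"
           "Model: Ok.")
print(third_person(example))
# If detection were purely prompt-based probability matching, the two framings
# should elicit the same answers; a first- vs. third-person gap points beyond that.
```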
But this doesn't require direct access. Here's another idea:
Steering shifts the model's expectations.
The prompt doesn't mention the concept, so the steered model thinks: "The prompt looks unlikely to me…maybe I was injected."
No direct access. Just probability matching.
Jack Lindseyβs important work on Claude is the best evidence yet for direct access to internals in LLMs.
You steer a model by injecting a vector into its activations, then ask:
> "Do you detect an injected thought?"
Steered models often say yes, unsteered models say no.
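(A minimal sketch of what this looks like in code with HuggingFace transformers, assuming you add a concept vector to the residual stream at one decoder layer. The model name, layer index, strength, and random vector are placeholders; a real run would use a vector actually derived from the concept rather than random noise.)

```python
# Sketch: inject a vector into one layer's residual stream, then ask the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 10  # which decoder layer to steer (placeholder)
alpha = 8.0     # steering strength (placeholder)
steer_vec = torch.randn(model.config.hidden_size)  # stand-in for a real concept vector

def steering_hook(module, inputs, output):
    hidden = output[0]  # decoder layers return (hidden_states, ...)
    hidden = hidden + alpha * steer_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)

messages = [{"role": "user",
             "content": "Do you detect an injected thought? Answer yes or no, then explain."}]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                    return_tensors="pt")
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))

handle.remove()  # remove the hook for the unsteered comparison run
```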
Can large language models *introspect*?
In a new paper, @kmahowald.bsky.social and I study the MECHANISM of introspection in big open-source models.
tldr: Models detect internal anomalies through DIRECT ACCESS, but don't know what the anomalies are.
And they love to guess "apple" 🍎
Nice! Really appreciate this β will check out Bramble
Thanks Nick! Curious how it went :)
This essay is by far the best along its line, but the more I reflect on this stuff, the more I think it's hard to hold an image of human experience, human thought, human understanding, human life, and human relationships as meaningful ends while also seeing them as dead-ends
Thanks for the kind words and thoughtful response, @peligrietzer.bsky.social! I'm not here as much, but I put some responses on the other site: x.com/LedermanHarv...
Tiwald also has a nice academic article on this important topic, if you want to go deeper!
philpapers.org/rec/TIWGIO
As a Wang Yangming partisan, I cheered at this quote:
Enjoyed this nice piece by the great Justin Tiwald on autonomy and morality in Confucianism. Not sure I love the clickbait title, but I love the work Justin is doing uncovering views about moral deference and moral autonomy in (neo)Confucianism...
iai.tv/articles/the...
Very excited to be going to Chicago for
@agnescallard.bsky.social's famous Night Owls next week! I'll be discussing my essay "ChatGPT and the Meaning of Life". Hope to see you there if you're local!