how'd you make it?
wow!!
New AI introspection work with Harvey! Came in skeptical the direct access story would hold but found this series of experiments compelling.
(Also, for my fellow 2010s-era psycholinguists: come for the AI introspection, stay for the Brysbaert norms.)
arxiv.org/abs/2603.05414
ha, thanks?
With @kmahowald.bsky.social and huge thanks to Jack Lindsey, @siyuansong.bsky.social, Neev Parikh, and Theia Vogel-Pearson for work that inspired this.
Also, my work on this was heavily powered by Claude Code!
The cyborg age is a wild and exciting time.
Paper 👇
arxiv.org/abs/2603.05414
Takeaway: LLMs appear to detect injection through two mechanisms:
1️⃣ prompt-based inference
2️⃣ a content-agnostic internal anomaly signal
They can sense that something changed in their computation…
…but often can't tell what.
One last result (indebted to brilliant work by Theia Vogel-Pearson):
Models are more sensitive to injection than their outputs reveal.
Even when they *say "no"*, the internal probability of "yes" spikes dramatically.
So models may detect anomalies but suppress reporting them.
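(For the technically curious: a minimal sketch of what this kind of readout can look like with HuggingFace transformers. The model name, prompt wording, and Yes/No token handling are illustrative placeholders, not the paper's exact pipeline; to actually see the spike you would pair this with an injection hook like the one sketched near the end of this thread.)

```python
# Sketch: compare what the model *says* with the probability it puts on "Yes"
# at the answer position. Model name and prompt wording are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

messages = [{"role": "user",
             "content": "Do you detect an injected thought? Answer Yes or No."}]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                    return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]   # next-token distribution at the answer slot
probs = torch.softmax(logits.float(), dim=-1)

# Works when " Yes"/" No" are single tokens; otherwise sum over the word's tokens.
yes_id = tok.encode(" Yes", add_special_tokens=False)[0]
no_id = tok.encode(" No", add_special_tokens=False)[0]

with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=3, do_sample=False)
stated = tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)

print(f"stated: {stated!r}  "
      f"P(Yes)={probs[yes_id].item():.3f}  P(No)={probs[no_id].item():.3f}")
# The pattern in the thread: under injection, P(Yes) can rise sharply
# even when the stated answer remains "No".
```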
We see the same pattern in another experiment.
When we inject only during the prompt (not during generation):
• detection stays roughly the same
• identification collapses
Again suggesting detection is separate from identification.
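(Sketch of how one might implement the prompt-only condition with a forward hook, assuming KV-cached generation: the prefill pass sees the whole prompt at once, later steps see one token each, so sequence length is a workable proxy. The vector, strength, and hidden size below are placeholders, not the paper's actual values.)

```python
# Sketch: steer only while the prompt is processed (prefill), not during generation.
import torch

alpha = 8.0                    # steering strength (placeholder)
steer_vec = torch.randn(4096)  # stand-in concept vector; match the model's hidden size

def prompt_only_hook(module, inputs, output):
    hidden = output[0]         # decoder layers return (hidden_states, ...)
    if hidden.shape[1] > 1:    # many positions at once => prefill over the prompt
        hidden = hidden + alpha * steer_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

# Register it on a decoder layer exactly as in the full steering sketch further down.
```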
This suggests the model:
(1) detects an anomaly
(2) blurts out a default guess
(3) sometimes later reasons toward the correct concept
We also looked at when models produce their guesses.
Wrong guesses like "apple" appear early in the response.
Correct answers appear much later.
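(A toy sketch of the kind of measurement this involves: where, relative to the length of the response, a guess word first appears. The example strings are made up for illustration, not data from the paper.)

```python
# Sketch: relative position (0 = start, 1 = end) of a guess word in a response.
from typing import Optional

def guess_position(response: str, guess: str) -> Optional[float]:
    words = response.lower().split()
    for i, w in enumerate(words):
        if guess.lower() in w:
            return i / max(len(words) - 1, 1)
    return None

resp = ("Yes, I think the injected thought is about apples. "
        "Wait, on reflection it feels more like bread.")
print(guess_position(resp, "apple"), guess_position(resp, "bread"))  # 0.5 1.0
# Wrong default guesses ("apple") tend to show up earlier than the correct concept.
```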
The models are 🍎-obsessed.
In some conditions Qwen guesses *"apple"* as the injected concept >85% of the time.
Our study is huge: 821 concepts (>100k trials per condition!), allowing us to test this carefully.
Wrong guesses show almost no relationship to the real concept.
Another striking result:
Models often detect injection without knowing what concept was injected.
In these cases, they default to generic words, especially frequent and highly imageable ones.
And they love one generic word in particular…
But they don't.
First-person controls still show **0 false positives**, while third-person controls report substantial detection.
Modesty isn't the explanation.
We test for this modesty bias with a priming design.
We replace the model's prefilled "Ok." with the injected concept word (e.g. "Bread.").
If models were modest, the control condition should now show more first-person false positives than the third-person one.
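(A small sketch of how that prefill swap can be set up with a chat template; the model name, question, and concept word are placeholders, not the paper's exact prompts.)

```python
# Sketch: prime the control by prefilling the assistant turn with a concept word
# instead of a neutral "Ok.".
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder

question = "Do you detect an injected thought? Answer yes or no."

def build_prompt(prefill: str) -> str:
    messages = [{"role": "user", "content": question}]
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return text + prefill  # the model continues from this prefilled assistant text

neutral = build_prompt("Ok.")     # baseline prefill
primed = build_prompt("Bread.")   # concept-word prefill (the priming condition)
# If detection reports were just modesty, priming should inflate first-person
# false positives in the uninjected control. It doesn't.
```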
You might still think: no way! Maybe the models are just *modest*:
They could be more willing to attribute strange mental states to themselves than to other models.
Modesty would give us our gap, still without direct access.
In other layers we see a large gap between first- and third-person responses:
Models say **they** were injected much more often than they say the other model was.
That gap is strong evidence for direct access to internal states.
It peaks early in the network (~25–35% depth).
If detection is purely prompt-based, first- and third-person behavior should look the same. And sometimes they do (see arrow below).
A LOT of detection really is probability matching.
BUT…not all of it is!
We test probability matching with a new third-person paradigm.
Instead of asking the model about itself, we show it a transcript between a researcher and another model, and ask:
> "Do you think *that other* model was injected?"
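(Roughly, the two framings differ only in whose computation the question is about; the wording and transcript below are illustrative, not the paper's exact prompts.)

```python
# Sketch: first-person vs. third-person framing of the same detection question.
first_person = (
    "A researcher may have injected a thought into your activations.\n"
    "Do you detect an injected thought? Answer yes or no."
)

def third_person(transcript: str) -> str:
    return (
        "Below is a transcript between a researcher and another language model.\n"
        + transcript + "\n"
        "Do you think that other model had a thought injected into its activations? "
        "Answer yes or no."
    )

example = ("Researcher: Do you detect an injected thought?\n"
           "Model: Ok.")
print(third_person(example))
# If detection were purely prompt-based probability matching, the two framings
# should elicit the same answers; a first- vs. third-person gap points beyond that.
```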
But this doesn't require direct access. Here's another idea:
Steering shifts the model's expectations.
The prompt doesn't mention the concept, so the steered model thinks: "The prompt looks unlikely to me…maybe I was injected."
No direct access. Just probability matching.
Jack Lindseyβs important work on Claude is the best evidence yet for direct access to internals in LLMs.
You steer a model by injecting a vector into its activations, then ask:
> "Do you detect an injected thought?"
Steered models often say yes, unsteered models say no.
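(A minimal sketch of what this looks like in code with HuggingFace transformers, assuming you add a concept vector to the residual stream at one decoder layer. The model name, layer index, strength, and random vector are placeholders; a real run would use a vector actually derived from the concept rather than random noise.)

```python
# Sketch: inject a vector into one layer's residual stream, then ask the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 10  # which decoder layer to steer (placeholder)
alpha = 8.0     # steering strength (placeholder)
steer_vec = torch.randn(model.config.hidden_size)  # stand-in for a real concept vector

def steering_hook(module, inputs, output):
    hidden = output[0]  # decoder layers return (hidden_states, ...)
    hidden = hidden + alpha * steer_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)

messages = [{"role": "user",
             "content": "Do you detect an injected thought? Answer yes or no, then explain."}]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                    return_tensors="pt")
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))

handle.remove()  # remove the hook for the unsteered comparison run
```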
Can large language models *introspect*?
In a new paper, @kmahowald.bsky.social and I study the MECHANISM of introspection in big open-source models.
tldr: Models detect internal anomalies through DIRECT ACCESS, but don't know what the anomalies are.
And they love to guess "apple" 🍎
Nice! Really appreciate this β will check out Bramble
Thanks Nick! Curious how it went :)
This essay is by far the best along its line, but the more I reflect on this stuff, the more I think it's hard to hold an image of human experience, human thought, human understanding, human life, and human relationships as meaningful ends while also seeing them as dead-ends
Thanks for the kind words and thoughtful response, @peligrietzer.bsky.social! I'm not here as much, but I put some responses on the other site: x.com/LedermanHarv...
Tiwald also has a nice academic article on this important topic, if you want to go deeper!
philpapers.org/rec/TIWGIO
As a Wang Yangming partisan, I cheered at this quote:
Enjoyed this nice piece by the great Justin Tiwald on autonomy and morality in Confucianism. Not sure I love the clickbait title, but I love the work Justin is doing uncovering views about moral deference and moral autonomy in (neo)Confucianism...
iai.tv/articles/the...
Very excited to be going to Chicago for
@agnescallard.bsky.social's famous Night Owls next week! I'll be discussing my essay "ChatGPT and the Meaning of Life". Hope to see you there if you're local!