8/8 Read the full paper here: arxiv.org/abs/2602.20273
Joint work with @shauli.bsky.social, Niko Kriegeskorte, and @peterbhase.bsky.social
7/8 Final takeaways: the spectrum structure matters! Train on more domains to get domain-general directions for monitoring, but use domain-specific ones for intervention. Probe geometry reliably predicts how probes will transfer and is reshaped by post-training.
6/8 Surprising causal experiments: domain-specific directions steer better than domain-general ones!
Takeaway: Domain-general probes may be great for monitoring, but intervention seems to need domain-specific representations.
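A minimal sketch of this kind of steering intervention, assuming a HuggingFace causal LM: add a truth direction to the residual stream at one layer via a forward hook. The model name, layer, scale, and (random stand-in) direction are illustrative assumptions, not the paper's setup.

```python
# Hedged activation-steering sketch (assumed setup, not the paper's exact code):
# add a unit-norm "truth" direction to one layer's hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; the paper's models may differ
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer, scale = 6, 4.0                     # illustrative choices
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()  # stand-in for a learned probe direction

def steer(module, inputs, output):
    # Transformer blocks return a tuple; hidden states are output[0].
    hidden = output[0] + scale * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=8)[0]))
handle.remove()
```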
5/8 Erasing concepts from single domains further reveals directions of intermediate generality, suggesting that different truth types share partially overlapping but distinct sets of truth dimensions.
4/8 Beyond just observing the spectrum, we propose Stratified INLP: an iterative erasure procedure that first extracts highly domain-general directions, then removes them to reveal highly domain-specific directions.
This lets us constructively identify both ends of the spectrum.
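A hedged sketch of how such a stratified, INLP-style procedure could look (our reading of the post, not the paper's reference implementation): first erase pooled, domain-general probe directions, then fit per-domain probes on the residual to surface domain-specific ones. Iteration counts and data shapes are assumptions.

```python
# Hedged sketch of a stratified INLP-style procedure: pass 1 finds and erases
# directions of probes trained on all domains pooled ("domain-general"); pass 2
# fits probes per domain on the projected data ("domain-specific").
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_project(X, w):
    """Remove the component of each row of X along unit vector w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)

def stratified_inlp(X, y, domains, n_general=5, n_specific=3):
    # X: (n, dim) activations; y: (n,) truth labels; domains: (n,) domain tags.
    X = X.copy()
    general = []
    for _ in range(n_general):                      # pass 1: pooled probes
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        w = clf.coef_[0]
        general.append(w / np.linalg.norm(w))
        X = nullspace_project(X, w)                 # erase the shared direction
    specific = {}
    for d in np.unique(domains):                    # pass 2: per-domain probes
        Xd, yd = X[domains == d].copy(), y[domains == d]
        dirs = []
        for _ in range(n_specific):
            clf = LogisticRegression(max_iter=1000).fit(Xd, yd)
            w = clf.coef_[0]
            dirs.append(w / np.linalg.norm(w))
            Xd = nullspace_project(Xd, w)
        specific[d] = dirs
    return general, specific
```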
3/8 Post-training reorganizes truth geometry.
In base models, the sycophantic-lying direction is closely aligned with other lying directions; post-training pushes them apart!
This gives a representational account of why chat models are more sycophantic than base models.
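One way to quantify that claim, as a sketch: compare how aligned a sycophantic-lying probe direction is with the other lying directions in a base vs. a chat model. All directions below are random stand-ins; in practice they would come from probes trained per lying type in each model.

```python
# Hedged sketch: mean alignment of the sycophantic direction with other lying
# directions, computed separately for a base and a chat model. Stand-in data.
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def mean_alignment(dirs, target="sycophantic"):
    others = [v for k, v in dirs.items() if k != target]
    return float(np.mean([dirs[target] @ v for v in others]))

rng = np.random.default_rng(0)
kinds = ["sycophantic", "empirical", "logical"]     # illustrative lying types
base_dirs = {k: unit(rng.normal(size=64)) for k in kinds}
chat_dirs = {k: unit(rng.normal(size=64)) for k in kinds}
# The thread's claim would predict base > chat here:
print(mean_alignment(base_dirs), mean_alignment(chat_dirs))
```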
2/8 Why do some probes transfer and others don't? Geometry tells you!
Mahalanobis cosine similarity between probe directions, which reweights by the data covariance to focus on the directions that matter, almost perfectly predicts OOD generalization (R² = 0.98). Standard cosine similarity? Only R² = 0.56.
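A small sketch of a covariance-reweighted cosine similarity in this spirit; this is our reading of the metric named above, the paper's exact definition may differ, and the data here are stand-ins.

```python
# Hedged sketch: cosine between probe directions u and v, reweighted by the
# activation covariance Sigma so high-variance directions count more.
import numpy as np

def mahalanobis_cosine(u, v, sigma):
    num = u @ sigma @ v
    return num / np.sqrt((u @ sigma @ u) * (v @ sigma @ v))

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))          # stand-in activations
sigma = np.cov(X, rowvar=False)          # data covariance, (64, 64)
u, v = rng.normal(size=64), rng.normal(size=64)
print(cosine(u, v), mahalanobis_cosine(u, v, sigma))
```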
1/8 We build FLEED (fictional, logical, empirical, ethical, definitional truth) plus new sycophantic-lying and expectation-inverted datasets. Both prior probes and ours completely fail on sycophantic lying!
Yet training on all domains works everywhere!
Takeaway: train on more diverse data!
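A minimal sketch of the train-on-everything recipe, under assumed data shapes (activations X, truth labels y, per-example domain tags): fit a probe on all other domains and test on the held-out one. Domain names and data are illustrative, not the actual datasets.

```python
# Hedged sketch of leave-one-domain-out probe evaluation with stand-in data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ood_transfer(X, y, domains, held_out):
    train = domains != held_out
    probe = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    return probe.score(X[~train], y[~train])   # accuracy on the unseen domain

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
y = rng.integers(0, 2, size=500)
domains = rng.choice(["logical", "empirical", "sycophantic"], size=500)
# Train on all other domains, test on sycophantic lying:
print(ood_transfer(X, y, domains, held_out="sycophantic"))
```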
Truthfulness probes and their causal effects vary widely: some generalize, others are domain-dependent. Why?
We propose the Truthfulness Spectrum Hypothesis: truth directions of varying generality coexist! Probe geometry predicts generalization, and post-training reshapes it!
🧵⬇️