8/8 Read the full paper here: arxiv.org/abs/2602.20273
Joint work with @shauli.bsky.social, Niko Kriegeskorte, and @peterbhase.bsky.social
7/8 Final takeaways: the spectrum structure matters! Train on more domains to get domain-general directions for monitoring, but use domain-specific ones for intervention. Probe geometry reliably predicts how probes will transfer and is reshaped by post-training.
6/8 Surprising causal experiments: domain-specific directions steer better than domain-general ones!
Takeaway: Domain-general probes may be great for monitoring, but intervention seems to need domain-specific representations.
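A minimal sketch of this kind of steering intervention, assuming a HuggingFace causal LM: add a truth direction to the residual stream at one layer via a forward hook. The model name, layer, scale, and (random stand-in) direction are illustrative assumptions, not the paper's setup.

```python
# Hedged activation-steering sketch (assumed setup, not the paper's exact code):
# add a unit-norm "truth" direction to one layer's hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; the paper's models may differ
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer, scale = 6, 4.0                     # illustrative choices
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()  # stand-in for a learned probe direction

def steer(module, inputs, output):
    # Transformer blocks return a tuple; hidden states are output[0].
    hidden = output[0] + scale * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=8)[0]))
handle.remove()
```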
5/8 Erasing concepts from single domains further reveals directions of intermediate generality, suggesting that different truth types share partially overlapping but distinct sets of truth dimensions.
4/8 Beyond just observing the spectrum, we propose Stratified INLP: an iterative erasure procedure that first extracts highly domain-general directions, then removes them to reveal highly domain-specific directions.
This lets us constructively identify both ends of the spectrum.
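A hedged sketch of how such a stratified, INLP-style procedure could look (our reading of the post, not the paper's reference implementation): first erase pooled, domain-general probe directions, then fit per-domain probes on the residual to surface domain-specific ones. Iteration counts and data shapes are assumptions.

```python
# Hedged sketch of a stratified INLP-style procedure: pass 1 finds and erases
# directions of probes trained on all domains pooled ("domain-general"); pass 2
# fits probes per domain on the projected data ("domain-specific").
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_project(X, w):
    """Remove the component of each row of X along unit vector w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)

def stratified_inlp(X, y, domains, n_general=5, n_specific=3):
    # X: (n, dim) activations; y: (n,) truth labels; domains: (n,) domain tags.
    X = X.copy()
    general = []
    for _ in range(n_general):                      # pass 1: pooled probes
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        w = clf.coef_[0]
        general.append(w / np.linalg.norm(w))
        X = nullspace_project(X, w)                 # erase the shared direction
    specific = {}
    for d in np.unique(domains):                    # pass 2: per-domain probes
        Xd, yd = X[domains == d].copy(), y[domains == d]
        dirs = []
        for _ in range(n_specific):
            clf = LogisticRegression(max_iter=1000).fit(Xd, yd)
            w = clf.coef_[0]
            dirs.append(w / np.linalg.norm(w))
            Xd = nullspace_project(Xd, w)
        specific[d] = dirs
    return general, specific
```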
3/8 Post-training reorganizes truth geometry.
In base models, the sycophantic-lying direction is closely aligned with other lying directions; post-training pushes them apart!
This gives a representational account of why chat models are more sycophantic than base models.
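One way to quantify that claim, as a sketch: compare how aligned a sycophantic-lying probe direction is with the other lying directions in a base vs. a chat model. All directions below are random stand-ins; in practice they would come from probes trained per lying type in each model.

```python
# Hedged sketch: mean alignment of the sycophantic direction with other lying
# directions, computed separately for a base and a chat model. Stand-in data.
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def mean_alignment(dirs, target="sycophantic"):
    others = [v for k, v in dirs.items() if k != target]
    return float(np.mean([dirs[target] @ v for v in others]))

rng = np.random.default_rng(0)
kinds = ["sycophantic", "empirical", "logical"]     # illustrative lying types
base_dirs = {k: unit(rng.normal(size=64)) for k in kinds}
chat_dirs = {k: unit(rng.normal(size=64)) for k in kinds}
# The thread's claim would predict base > chat here:
print(mean_alignment(base_dirs), mean_alignment(chat_dirs))
```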
2/8 Why do some probes transfer and others don't? Geometry tells you!
Mahalanobis cosine similarity between probe directions, which reweights by the data covariance to focus on the directions that matter, almost perfectly predicts OOD generalization (R² = 0.98). Standard cosine similarity? Only R² = 0.56.
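A small sketch of a covariance-reweighted cosine similarity in this spirit; this is our reading of the metric named above, the paper's exact definition may differ, and the data here are stand-ins.

```python
# Hedged sketch: cosine between probe directions u and v, reweighted by the
# activation covariance Sigma so high-variance directions count more.
import numpy as np

def mahalanobis_cosine(u, v, sigma):
    num = u @ sigma @ v
    return num / np.sqrt((u @ sigma @ u) * (v @ sigma @ v))

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))          # stand-in activations
sigma = np.cov(X, rowvar=False)          # data covariance, (64, 64)
u, v = rng.normal(size=64), rng.normal(size=64)
print(cosine(u, v), mahalanobis_cosine(u, v, sigma))
```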
1/8 We build FLEED (fictional, logical, empirical, ethical, definitional truth) plus new sycophantic-lying and expectation-inverted datasets. Both prior probes and ours completely fail on sycophantic lying!
Yet training on all domains works everywhere!
Takeaway: train on more diverse data!
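A minimal sketch of the train-on-everything recipe, under assumed data shapes (activations X, truth labels y, per-example domain tags): fit a probe on all other domains and test on the held-out one. Domain names and data are illustrative, not the actual datasets.

```python
# Hedged sketch of leave-one-domain-out probe evaluation with stand-in data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ood_transfer(X, y, domains, held_out):
    train = domains != held_out
    probe = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    return probe.score(X[~train], y[~train])   # accuracy on the unseen domain

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
y = rng.integers(0, 2, size=500)
domains = rng.choice(["logical", "empirical", "sycophantic"], size=500)
# Train on all other domains, test on sycophantic lying:
print(ood_transfer(X, y, domains, held_out="sycophantic"))
```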
Truthfulness probes and their causal effects vary widely: some generalize, others are domain-dependent. Why?
We propose the Truthfulness Spectrum Hypothesis: truth directions of varying generality coexist! Probe geometry predicts generalization, and post-training reshapes it!
🧵⬇️