I’m really excited about Diffusion Steering Lens, an intuitive and elegant new “logit lens” technique for decoding the attention and MLP blocks of vision transformers!
Vision is much more expressive than language, so some new mech interp rules apply:
25.04.2025 13:36
👍 11
🔁 3
💬 0
📌 0
We also validated DSL’s reliability through two interventional studies (head importance correlation & overlay removal). Check out our paper for details!
(6/7)
25.04.2025 09:37
👍 0
🔁 0
💬 1
📌 0
Below are DSL visualizations for the top-10 heads ranked by similarity to the input; they are consistent with the residual-stream visualizations from Diffusion Lens.
(5/7)
25.04.2025 09:37
👍 0
🔁 0
💬 1
📌 0
To fix this, we propose Diffusion Steering Lens (DSL), a training-free method that steers a specific submodule’s output, patches its subsequent indirect contributions, and then decodes it with the diffusion model.
(4/7)
25.04.2025 09:37
👍 0
🔁 0
💬 1
📌 0
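A minimal numpy sketch of the steering-and-patching idea behind DSL, using a hypothetical toy model (linear blocks in place of real attention/MLP submodules): steer one block's output while patching all later blocks back to their clean values, so only the target's direct contribution to the final residual changes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Toy stand-in for a transformer: the final residual is the input
# plus each block's contribution (hypothetical linear blocks).
blocks = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]

def forward(x):
    """Clean run; return the final residual and each block's output."""
    outs = []
    for W in blocks:
        out = x @ W
        outs.append(out)
        x = x + out
    return x, outs

x0 = rng.standard_normal(d)
clean_final, clean_outs = forward(x0)

def steering_lens(target, new_out):
    """Steer block `target`'s output to `new_out`, but patch every
    later block's output back to its clean value, so only the direct
    path from `target` to the final residual changes."""
    x = x0
    for i, W in enumerate(blocks):
        if i == target:
            out = new_out                 # steered output
        elif i > target:
            out = clean_outs[i]           # patch indirect contributions
        else:
            out = x @ W                   # upstream blocks run normally
        x = x + out
    return x

# Shifting block 1's output shifts the final residual by exactly that
# shift, since all indirect effects are patched away.
delta = rng.standard_normal(d)
steered = steering_lens(1, clean_outs[1] + delta)
print(np.allclose(steered, clean_final + delta))
```

In the actual method the steered-and-patched final representation is then decoded with the diffusion model rather than inspected directly.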
We first adapted Diffusion Lens (Toker et al., 2024) to decode residual streams in the Kandinsky 2.2 image encoder (CLIP ViT-bigG/14) via the diffusion model.
We can visualize how the predictions evolve through layers, but individual head contributions stay largely hidden.
(3/7)
25.04.2025 09:37
👍 0
🔁 0
💬 1
📌 0
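The layer-by-layer decoding idea can be sketched with a toy numpy encoder (hypothetical linear layers standing in for the CLIP ViT-bigG/14 blocks): capture the residual stream after every layer, and each captured residual is what Diffusion Lens would hand to the diffusion decoder to render that layer's current "prediction".

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical toy encoder layers (the real pipeline hooks the
# Kandinsky 2.2 CLIP image encoder instead).
layers = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]

def encode_with_lens(x):
    """Run the toy encoder, keeping every layer's residual stream.
    Diffusion Lens decodes each of these with the diffusion model to
    visualize how the representation evolves through the layers."""
    residuals = []
    for W in layers:
        x = x + x @ W          # residual update
        residuals.append(x.copy())
    return residuals

residuals = encode_with_lens(rng.standard_normal(d))
print(len(residuals))  # one residual snapshot per layer
```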
Classic Logit Lens projects residual streams to the output space. It works surprisingly well on ViTs, but visual representations are far richer than class labels.
www.lesswrong.com/posts/kobJym...
(2/7)
25.04.2025 09:37
👍 0
🔁 0
💬 1
📌 0
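The classic Logit Lens can be sketched in a few lines of numpy (toy unembedding matrix and residual vector, hypothetical values): project an intermediate residual stream straight to the output space, skipping the remaining layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 10

# Toy stand-ins for a transformer's unembedding matrix and an
# intermediate residual-stream vector.
W_U = rng.standard_normal((d_model, vocab))
residual = rng.standard_normal(d_model)

def logit_lens(residual, W_U):
    """Decode an intermediate residual stream directly to logits,
    as if the model stopped at this layer (the classic Logit Lens)."""
    # Apply a final-LayerNorm-style normalization (no learned
    # scale/bias, for simplicity) before unembedding.
    x = (residual - residual.mean()) / residual.std()
    return x @ W_U

logits = logit_lens(residual, W_U)
print(logits.shape)  # one logit per output class/token
```

For a ViT the "vocab" direction would be class labels, which is exactly the limitation the thread points out: visual representations carry far more than a class distribution.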
🔍Logit Lens tracks what transformer LMs “believe” at each layer. How can we effectively adapt this approach to Vision Transformers?
Happy to share that our paper “Decoding Vision Transformers: the Diffusion Steering Lens” was accepted at the CVPR 2025 Workshop on Mechanistic Interpretability for Vision!
(1/7)
25.04.2025 09:37
👍 5
🔁 0
💬 1
📌 1
hello world
24.04.2025 07:01
👍 2
🔁 0
💬 0
📌 0