Alessandro Stolfo

@alestolfo

PhD @ ETHZ - LLM Interpretability alestolfo.github.io

362 Followers · 65 Following · 2 Posts · Joined 17.11.2024

Latest posts by Alessandro Stolfo @alestolfo

1/6: Can we use an LLM’s hidden activations to predict and prevent wrong predictions? When it comes to arithmetic, yes!
I’m presenting new work w/ @alestolfo.bsky.social, “Probing for Arithmetic Errors in LMs”, @ #ICML2025 Act Interp WS
🧵 below

18.07.2025 17:22 👍 1 🔁 1 💬 5 📌 0
Logo for MIB: A Mechanistic Interpretability Benchmark

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!

23.04.2025 18:15 👍 51 🔁 15 💬 1 📌 6

@vidhishab.bsky.social Safoora Yousefi @erichorvitz.bsky.social @besmiranushi.bsky.social

15.04.2025 16:36 👍 0 🔁 0 💬 0 📌 0

Our paper “Improving Instruction-Following in Language Models through Activation Steering” has been accepted to #ICLR2025!

We're also excited to share that our public GitHub repo is now live.
Code: github.com/microsoft/ll...
Camera-ready: arxiv.org/abs/2410.12877

15.04.2025 16:35 👍 8 🔁 2 💬 1 📌 2