
Andreas Waldis

@tresiwald

Behavioral and Internal Interpretability πŸ”Ž Incoming PostDoc TΓΌbingen University | PhD Student at @ukplab.bsky.social, TU Darmstadt/Hochschule Luzern

149
Followers
644
Following
8
Posts
20.11.2024
Joined

Latest posts by Andreas Waldis @tresiwald

Thanks a lot to everyone for the support, guidance, mentoring, collaboration, and great moments over the past years! πŸ™ Without you, this journey wouldn't have been such a pleasure β€” and now excited to see what the future brings! πŸš€

03.03.2026 19:14 πŸ‘ 5 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Aligned Probing: Relating Toxic Behavior and Model Internals

Questions? Discussion? Reach out to us.
@dippedrusk.com @a-lauscher.bsky.social Dietrich Klakow @igurevych.bsky.social
Full paper & code: alignedprobing.github.io
(7/7)

27.01.2026 13:02 πŸ‘ 4 πŸ” 2 πŸ’¬ 0 πŸ“Œ 0

Bonus: 4 case studies in the paper
πŸ’  DPO detoxification impacts upper layers but loses general info
πŸ’  Toxicity varies across prompt formulations, but internals stay stable
πŸ’  Insights hold under quantization
πŸ’  Encoding patterns emerge early in pre-training
(6/🧡)

27.01.2026 13:02 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
analysis of the correlation between toxic behavior and information about toxicity and causal evidence from skipping model layers

Connecting behavioral and internal results shows that more info about input toxicity correlates with less toxic outputs. With layer-wise intervention, we show this effect is causalβ€”layer skipping increases toxicity up to +0.16.
(5/🧡)

27.01.2026 13:02 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
comparison of how pre-trained and instruction-tuned LMs encode toxicity internally

Instruction-tuning makes LMs less toxic:
πŸ“ˆ More info about input toxicity
πŸ“‰ Less info about output toxicity
Strongest for contextual dimensions like Threat, so instruction-tuning seems to affect semantics, not just keywords.
(4/🧡)

27.01.2026 13:02 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
an overview of the four probing scenarios we use to measure toxicity information within LMs

We show where LMs encode toxicity:
πŸ’ Lower layers encode most info
πŸ’ Output toxicity detectable in input tokens
πŸ’ Input toxicity propagates to output
πŸ’ Context-dependent dimensions (e.g., Threat) peak in higher layers than word-sensitive ones (e.g., Sexually Explicit)
(3/🧡)

27.01.2026 13:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
table of evaluation results of the toxic behavior across six toxicity dimensions

Across 6 LMs:
πŸ“ˆ Outputs +0.27 more toxic than human continuations
πŸ“ˆ Input-output toxicity correlates strongly for both toxic (+0.30) and non-toxic (+0.32) prompts
LMs replicate and amplify what they're fed.
(2/🧡)

27.01.2026 13:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
simplified overview of our aligned probing setup, where we join the behavioral and internal evaluation of LMs' toxicity

LMs that "know more" about toxicity are less toxic!
Our #TACL πŸ“„ connects behavior and internals:
πŸ’  LMs amplify toxicity beyond humans
πŸ’  Information about toxicity peaks in lower layers
πŸ’  Bypassing these layers increases toxicity
More detailsπŸ‘‡ #NLProc #interpretability (1/🧡)

27.01.2026 13:01 πŸ‘ 11 πŸ” 5 πŸ’¬ 1 πŸ“Œ 0
Schedule for the INTERPLAY workshop at COLM on October 10th, Room 518C.

09:00 am: Opening
09:10 am: Invited Talks by Sarah Wiegreffe and John Hewitt
10:20 am: Paper Presentations

Lunch Break

01:00 pm: Invited Talks by Aaron Mueller and Kyle Mahowald
02:10 pm: Poster Session
03:20 pm: Roundtable Discussion
04:50 pm: Closing


✨ The schedule for our INTERPLAY workshop at COLM is live! ✨
πŸ—“οΈ October 10th, Room 518C
πŸ”Ή Invited talks from @sarah-nlp.bsky.social John Hewitt @amuuueller.bsky.social @kmahowald.bsky.social
πŸ”Ή Paper presentations and posters
πŸ”Ή Closing roundtable discussion

Join us in MontrΓ©al! @colmweb.org

09.10.2025 17:30 πŸ‘ 3 πŸ” 4 πŸ’¬ 0 πŸ“Œ 0
Call for Pre-Reviewed Papers, Interplay Workshop at COLM: July 10th - submissions due. July 24th - acceptance notification. October 10th - workshop day.

Missed a spot? If you have a pre-reviewed paper from ARR or COLM that focuses on the INTERPLAY between LM internals and behavior, there is a shortcut to presenting at our @colmweb.org workshop! ✨
Join us in MontrΓ©al! πŸ‡¨πŸ‡¦

CfP: shorturl.at/sBomu
OpenReview: shorturl.at/WwWhg

#nlproc #interpretability

08.07.2025 09:06 πŸ‘ 5 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0
Mor Geva and Anna Ivanova will talk at the INTERPLAY workshop.

Delighted that ✨Mor Geva (@megamor2.bsky.social) and ✨Anna Ivanova (@neuranna.bsky.social) will complete our speaker line-up and talk about the INTERPLAY of model internals and behavior.

Be there and submit by June 30th πŸ“„
shorturl.at/sBomu

See you in πŸ‡¨πŸ‡¦ @colmweb.org
#nlproc #interpretability

24.06.2025 13:06 πŸ‘ 5 πŸ” 3 πŸ’¬ 0 πŸ“Œ 0