
Andreas Waldis

@tresiwald

Behavioral and Internal Interpretability πŸ”Ž Incoming PostDoc TΓΌbingen University | PhD Student at @ukplab.bsky.social, TU Darmstadt/Hochschule Luzern

149
Followers
644
Following
8
Posts
20.11.2024
Joined

Latest posts by Andreas Waldis @tresiwald

Thanks a lot to everyone for the support, guidance, mentoring, collaboration, and great moments over the past years! πŸ™ Without you, this journey wouldn't have been such a pleasure β€” and now excited to see what the future brings! πŸš€

03.03.2026 19:14 πŸ‘ 5 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Aligned Probing: Relating Toxic Behavior and Model Internals

Questions? Discussion? Reach out to us.
@dippedrusk.com @a-lauscher.bsky.social Dietrich Klakow @igurevych.bsky.social
Full paper & code: alignedprobing.github.io
(7/7)

27.01.2026 13:02 πŸ‘ 4 πŸ” 2 πŸ’¬ 0 πŸ“Œ 0

Bonus: 4 case studies in the paper
πŸ’  DPO detoxification impacts upper layers but loses general info
πŸ’  Toxicity varies across prompt formulations, but internals stay stable
πŸ’  Insights hold under quantization
πŸ’  Encoding patterns emerge early in pre-training
(6/🧡)

27.01.2026 13:02 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
analysis of the correlation between toxic behavior and information about toxicity and causal evidence from skipping model layers

Connecting behavioral and internal results shows that more info about input toxicity correlates with less toxic outputs. With layer-wise intervention, we show this effect is causalβ€”layer skipping increases toxicity up to +0.16.
(5/🧡)

27.01.2026 13:02 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
comparison of how pre-trained and instruction-tuned LMs encode toxicity internally

Instruction-tuning makes LMs less toxic:
πŸ“ˆ More info about input toxicity
πŸ“‰ Less info about output toxicity
Strongest for contextual dimensions like Threat, so instruction-tuning seems to affect semantics, not just keywords.
(4/🧡)

27.01.2026 13:02 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
an overview of the four probing scenarios we use to measure toxicity information within LMs

We show where LMs encode toxicity:
πŸ’ Lower layers encode most info
πŸ’ Output toxicity detectable in input tokens
πŸ’ Input toxicity propagates to output
πŸ’ Context-dependent dimensions (e.g., Threat) peak in higher layers than word-sensitive ones (e.g., Sexually Explicit)
(3/🧡)

27.01.2026 13:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
table of evaluation results of the toxic behavior across six toxicity dimensions

Across 6 LMs:
πŸ“ˆ Outputs +0.27 more toxic than human continuations
πŸ“ˆ Input-output toxicity correlates strongly for both toxic (+0.30) and non-toxic (+0.32) prompts
LMs replicate and amplify what they're fed.
(2/🧡)

27.01.2026 13:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
simplified overview of our aligned probing setup, where we join the behavioral and internal evaluation of LMs' toxicity

LMs that "know more" about toxicity are less toxic!
Our #TACL πŸ“„ connects behavior and internals:
πŸ’  LMs amplify toxicity beyond humans
πŸ’  Information about toxicity peaks in lower layers
πŸ’  Bypassing these layers increases toxicity
More detailsπŸ‘‡ #NLProc #interpretability (1/🧡)

27.01.2026 13:01 πŸ‘ 11 πŸ” 5 πŸ’¬ 1 πŸ“Œ 0
Schedule for the INTERPLAY workshop at COLM on October 10th, Room 518C.

09:00 am: Opening
09:10 am: Invited Talks by Sarah Wiegreffe and John Hewitt
10:20 am: Paper Presentations

Lunch Break

01:00 pm: Invited Talks by Aaron Mueller and Kyle Mahowald
02:10 pm: Poster Session
03:20 pm: Roundtable Discussion
04:50 pm: Closing


✨ The schedule for our INTERPLAY workshop at COLM is live! ✨
πŸ—“οΈ October 10th, Room 518C
πŸ”Ή Invited talks from @sarah-nlp.bsky.social John Hewitt @amuuueller.bsky.social @kmahowald.bsky.social
πŸ”Ή Paper presentations and posters
πŸ”Ή Closing roundtable discussion

Join us in MontrΓ©al! @colmweb.org

09.10.2025 17:30 πŸ‘ 3 πŸ” 4 πŸ’¬ 0 πŸ“Œ 0
Call for Pre-Reviewed Papers, Interplay Workshop at COLM: July 10th - submissions due. July 24th - acceptance notification. October 10th - workshop day.

Missed a spot? If you have a pre-reviewed paper from ARR or COLM that focuses on the INTERPLAY between LM internals and behavior, there is a shortcut to presenting at our @colmweb.org workshop! ✨
Join us in MontrΓ©al! πŸ‡¨πŸ‡¦

CfP: shorturl.at/sBomu
OpenReview: shorturl.at/WwWhg

#nlproc #interpretability

08.07.2025 09:06 πŸ‘ 5 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0
Mor Geva and Anna Ivanova will talk at the INTERPLAY workshop.

Delighted that ✨Mor Geva (@megamor2.bsky.social) and ✨Anna Ivanova (@neuranna.bsky.social) will complete our speaker line-up and talk about the INTERPLAY of model internals and behavior.

Be there and submit by June 30th πŸ“„
shorturl.at/sBomu

See you in πŸ‡¨πŸ‡¦ @colmweb.org
#nlproc #interpretability

24.06.2025 13:06 πŸ‘ 5 πŸ” 3 πŸ’¬ 0 πŸ“Œ 0