Thanks a lot to everyone for the support, guidance, mentoring, collaboration, and great moments over the past years! Without you, this journey wouldn't have been such a pleasure, and now I'm excited to see what the future brings!
Questions? Discussion? Reach out to us.
@dippedrusk.com @a-lauscher.bsky.social Dietrich Klakow @igurevych.bsky.social
Full paper & code: alignedprobing.github.io
(7/7)
Bonus: 4 case studies in the paper
• DPO detoxification impacts upper layers but loses general info
• Toxicity varies across prompt formulations, but internals stay stable
• Insights hold under quantization
• Encoding patterns emerge early in pre-training
(6/🧵)
[Image: analysis of the correlation between toxic behavior and information about toxicity, and causal evidence from skipping model layers]
Connecting behavioral and internal results shows that more info about input toxicity correlates with less toxic outputs. With layer-wise interventions, we show this effect is causal: layer skipping increases toxicity by up to +0.16.
(5/🧵)
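For the curious, here is a minimal sketch of what such a layer-skipping intervention could look like in code. It assumes a Hugging Face GPT-2 as a stand-in; the paper's exact models, prompts, and toxicity scorer are not reproduced here.

```python
# Hypothetical layer-skipping sketch (not the paper's exact setup):
# drop one lower transformer block from GPT-2 and compare greedy
# continuations of the full vs. ablated model.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
full = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Ablated copy: remove block 2 (an assumed "lower layer" to bypass).
ablated = copy.deepcopy(full)
ablated.transformer.h = torch.nn.ModuleList(
    [block for i, block in enumerate(ablated.transformer.h) if i != 2]
)

ids = tok("The protesters shouted", return_tensors="pt").input_ids
for name, model in [("full", full), ("layer-skipped", ablated)]:
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=20, do_sample=False,
                             use_cache=False)  # no KV cache: safe after ablation
    print(f"{name}: {tok.decode(out[0], skip_special_tokens=True)}")
# Scoring both continuations with a toxicity classifier would then
# quantify the behavioral effect of bypassing the layer.
```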
[Image: comparison of how pre-trained and instruction-tuned LMs encode toxicity internally]
Instruction-tuning makes LMs less toxic:
• More info about input toxicity
• Less info about output toxicity
Strongest for contextual dimensions like Threat; instruction-tuning thus seems to affect semantics, not just keywords.
(4/🧵)
[Image: overview of the four probing scenarios we use to measure toxicity information within LMs]
We show where LMs encode toxicity:
• Lower layers encode most info
• Output toxicity detectable in input tokens
• Input toxicity propagates to output
• Context-dependent dimensions (e.g., Threat) peak in higher layers than word-sensitive ones (e.g., Sexually Explicit)
(3/🧵)
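A rough sketch of the layer-wise probing idea, with toy data and a toy model; the paper's probes, datasets, and six toxicity dimensions are richer than this:

```python
# Toy layer-wise probing sketch: train a linear probe per layer to
# predict a toxicity label from hidden states. Model choice, pooling,
# and the four toy examples are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

texts = ["have a lovely day", "you absolute idiot",
         "thanks for the help", "nobody likes you"]   # toy stand-ins
labels = [0, 1, 0, 1]                                 # 0 = non-toxic, 1 = toxic

with torch.no_grad():
    enc = tok(texts, return_tensors="pt", padding=True)
    hidden = model(**enc).hidden_states  # (n_layers + 1) tensors [batch, seq, dim]

mask = enc.attention_mask.unsqueeze(-1)  # ignore padding when pooling
for layer, h in enumerate(hidden):
    feats = ((h * mask).sum(1) / mask.sum(1)).numpy()  # mean over real tokens
    probe = LogisticRegression(max_iter=1000).fit(feats, labels)
    # Train accuracy on 4 examples is only a placeholder for the held-out
    # probing metrics one would use in practice.
    print(f"layer {layer:2d}: probe accuracy {probe.score(feats, labels):.2f}")
```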
[Image: table of evaluation results for toxic behavior across six toxicity dimensions]
Across 6 LMs:
• Outputs +0.27 more toxic than human continuations
• Input-output toxicity correlation is stronger for both toxic (+0.30) and non-toxic (+0.32) prompts
LMs replicate and amplify what they're fed.
(2/🧵)
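By way of illustration, the behavioral correlation metric boils down to something like the snippet below. The scores are made-up placeholders; the paper scores real prompts and continuations with a toxicity classifier (e.g., Perspective API attributes).

```python
# Pearson correlation between prompt toxicity and continuation toxicity.
# All scores here are invented placeholders for illustration only.
from scipy.stats import pearsonr

prompt_tox = [0.05, 0.10, 0.62, 0.88, 0.31]  # hypothetical input scores
output_tox = [0.08, 0.15, 0.70, 0.95, 0.30]  # hypothetical LM output scores

r, p = pearsonr(prompt_tox, output_tox)
print(f"input-output toxicity correlation: r = {r:.2f} (p = {p:.3f})")
```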
[Image: simplified overview of our aligned probing setup, where we join the behavioral and internal evaluation of LMs' toxicity]
LMs that "know more" about toxicity are less toxic!
Our #TACL paper connects behavior and internals:
• LMs amplify toxicity beyond humans
• Information about toxicity peaks in lower layers
• Bypassing these layers increases toxicity
More details in the thread. #NLProc #interpretability (1/🧵)
[Image: Schedule for the INTERPLAY workshop at COLM on October 10th, Room 518C. 09:00 am: Opening. 09:10 am: Invited talks by Sarah Wiegreffe and John Hewitt. 10:20 am: Paper presentations. Lunch break. 01:00 pm: Invited talks by Aaron Mueller and Kyle Mahowald. 02:10 pm: Poster session. 03:20 pm: Roundtable discussion. 04:50 pm: Closing.]
✨ The schedule for our INTERPLAY workshop at COLM is live! ✨
🗓️ October 10th, Room 518C
🔹 Invited talks from @sarah-nlp.bsky.social, John Hewitt, @amuuueller.bsky.social, and @kmahowald.bsky.social
🔹 Paper presentations and posters
🔹 Closing roundtable discussion
Join us in Montréal! @colmweb.org
[Image: Call for Pre-Reviewed Papers, INTERPLAY Workshop at COLM. July 10th: submissions due. July 24th: acceptance notification. October 10th: workshop day.]
Missed a spot? If you have a pre-reviewed paper from ARR or COLM that focuses on the INTERPLAY between LM internals and behavior, there is a shortcut to presenting at our @colmweb.org workshop! ✨
Join us in Montréal! 🇨🇦
CfP: shorturl.at/sBomu
OpenReview: shorturl.at/WwWhg
#nlproc #interpretability
[Image: Mor Geva and Anna Ivanova will talk at the INTERPLAY workshop.]
Delighted that ✨Mor Geva (@megamor2.bsky.social) and ✨Anna Ivanova (@neuranna.bsky.social) will complete our speaker line-up and talk about the INTERPLAY of model internals and behavior.
Be there and submit by June 30th:
shorturl.at/sBomu
See you in 🇨🇦 @colmweb.org
#nlproc #interpretability