Special thanks to my fantastic collaborator and primary author Amogh Mannekote for all his great work in making this paper/project happen!
We introduce a framework for evaluating (b), finding that popular models do NOT consistently apply their learned world models when simulating social behavior. The upshot: even when models "know" how people might behave in a given situation, they often fail to apply it in actual simulations!
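To make the evaluation concrete, here's a toy sketch of the belief-behavior comparison (the `query_llm` helper, prompts, and cooperate/defect framing are illustrative stand-ins, not the paper's actual protocol):

```python
from collections import Counter

def elicited_belief(persona, scenario, query_llm):
    """Ask the model directly how likely the persona is to cooperate."""
    prompt = (f"{persona}\n{scenario}\n"
              "On a scale from 0 to 1, how likely is this person to cooperate? "
              "Answer with a single number.")
    return float(query_llm(prompt))

def simulated_rate(persona, scenario, query_llm, n=50):
    """Role-play the persona n times and count cooperative actions."""
    prompt = (f"You are: {persona}\n{scenario}\n"
              "Respond with exactly one word: COOPERATE or DEFECT.")
    actions = Counter(query_llm(prompt) for _ in range(n))
    return actions["COOPERATE"] / n

# Alignment gap: |stated belief - observed behavior rate|. A model that
# "practices what it preaches" keeps this gap small across personas/scenarios.
```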
For LLM social simulations to be useful, models must both (a) learn faithful world models re: how various people might realistically behave in different circumstances; and (b) simulate behavior consistent with that world model.
With all the attention on "agentic LLM social simulations", how do we know if simulated behaviors are realistic? Come by our poster at the #COLM #SocialSim workshop at noon-1pm to find out! (More details in 🧵, or at openreview.net/forum?id=1BD...)
Special thanks to my fantastic collaborators @sewoong-sam-lee.bsky.social, Amogh Mannekote, Marc E. Canby, Julia Hockenmaier, @guohaoli.bsky.social, Kristy Boyer, ChengXiang Zhai, Bonnie J. Dorr, and @frapintoml.bsky.social!
Paper 2: Do Role-Playing Agents Practice What They Preach? Belief-Behavior Alignment in LLM-Based Simulations of Human Trust (SocialSim workshop; openreview.net/forum?id=1BD...)
Paper 1: Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality (main conference; openreview.net/forum?id=Xhd...)
In Montreal at COLM 2025 presenting two papers -- DM me if you'd like to meet! Happy to chat all things NLP, interpretability, or cognitive science; I'm also actively looking for Research Scientist roles (graduating May 2026).
It was a real pleasure to work with my fantastic collaborators at @oxfordtvg.bsky.social on this project 🤝 already looking forward to our future work in this direction!
#OOD #generalization #LLM #steering #ICML
*Come by our poster today to hear more!* It's Tue Jul 15 at 11am-1:30pm (East Exhibition Hall A-B #E-2800). You can also visit our project page at tomalamb.github.io/focus-instru... for more details and links!
This forces models to learn both (a) explicit relationships between latent features and task behaviors 🎯🛠️ and (b) how to dynamically steer generation based on those relationships 🤖
The core idea is to train LLMs to generate different responses to the same task instances by conditioning on "focus"/"ignore" instructions 💡
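A minimal sketch of what that conditioning can look like as training data (field names and instruction templates here are hypothetical; see the paper for the actual construction):

```python
def make_fit_examples(task_instance):
    """Build two training examples from one task instance: same input,
    different target depending on which feature the instruction selects."""
    x = task_instance["text"]
    return [
        {   # steer toward the causal feature
            "instruction": (f"Focus on {task_instance['causal_feature']} and "
                            f"ignore {task_instance['spurious_feature']}."),
            "input": x,
            "target": task_instance["label_from_causal"],
        },
        {   # steer toward the spurious feature instead
            "instruction": (f"Focus on {task_instance['spurious_feature']} and "
                            f"ignore {task_instance['causal_feature']}."),
            "input": x,
            "target": task_instance["label_from_spurious"],
        },
    ]
```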
Great news: we developed an approach to improve instruction tuning so that the "how"/steering instructions DO work, and it even generalizes to unseen features and tasks! 🎉
This means it's ineffective to simply ask models to focus on the "right" (causal 🎯) features and ignore the "wrong" (spurious/biased) ones, which can lead to poor generalization and biased behaviors 😬 Wouldn't it be cool if that DID work, though? 🤔
Traditional instruction tuning teaches LLMs to perform open-ended tasks given text instructions 💬🤖🛠️ But standard techniques are ineffective for controlling (steering) HOW models should perform the task
#ICML2025 paper presentation TODAY (Tue morning): Focus Instruction Tuning -- updating LLM instruction tuning with adaptive test-time steerability 🤖
🧵
Come by our lightning talk at 3:40pm or our poster session at 4pm to hear more (both are located in the East Ballroom A/B). Hope to see you there!
But interpretability methods can sometimes be unreliable 😬 In our second paper (openreview.net/forum?id=tmp...), we define and measure their reliability, finding that concept removal methods are unreliable and counterfactual methods have key tradeoffs between different experimental goals
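For a flavor of what a concept-removal reliability check can look like, here's a self-contained toy (a single linear-projection step on synthetic data; this is not the paper's method or experiments):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))     # stand-in for model representations
y = (X[:, 0] > 0).astype(int)       # a "concept" that lives along one axis

# Fit a linear probe for the concept, then project its direction out.
probe = LogisticRegression(max_iter=1000).fit(X, y)
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
X_removed = X - np.outer(X @ w, w)  # remove the component along w

print("probe acc before:", LogisticRegression(max_iter=1000).fit(X, y).score(X, y))
print("probe acc after: ", LogisticRegression(max_iter=1000).fit(X_removed, y).score(X_removed, y))
# On this idealized data, one projection drives the probe to chance; in real
# models the concept is spread across many directions, and whether "removal"
# actually removed anything is exactly the reliability question.
```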
Models fail to generalize under distribution shift if they rely on spurious features In CALM (openreview.net/forum?id=x6Z...), we study whether models rely more on spurious or causal features for a range of tasks -- TLDR: they rely on both, leading to high performance ceilings but low floors!
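Here's a hedged sketch of the kind of spurious-vs-causal stress test this involves (the `model` callable and data fields are hypothetical; CALM's actual tasks and metrics are in the paper):

```python
def feature_reliance(model, dataset):
    """Compare accuracy when a spurious feature agrees vs. conflicts with
    the causal (label-determining) feature."""
    aligned     = [ex for ex in dataset if ex["spurious"] == ex["label"]]
    conflicting = [ex for ex in dataset if ex["spurious"] != ex["label"]]
    acc = lambda split: sum(model(ex["input"]) == ex["label"] for ex in split) / len(split)
    # High accuracy on `aligned` but low on `conflicting` indicates spurious
    # reliance; the spread between the two is the ceiling-vs-floor gap.
    return acc(aligned), acc(conflicting)
```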
How can we interpret what features LLMs use to perform a given task? 🤔 And how do we know if our interpretation is correct?
Excited to be presenting 2 papers + an oral on these questions in the #InterpretableAI workshop at #neurips2024 📢 -- come by our posters/talk to hear more!
Check out our project page at arshiahemmat.github.io/illusionbench/
Special thanks to my fabulous co-authors Arshia Hemmat, Tom Lamb, @dydyydyyyd.bsky.social, Phil Torr, Ashkan Khakzar, and @frapintoml.bsky.social -- loved working with you all, and can't wait for our next paper! 🎉
I'm excited to be presenting our paper -- Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models -- today at NeurIPS (West Ballroom A-D, Poster 5202). Hope to see you there!
Shape perception is fundamental to human vision 👁️📷 but years of research on shape vs. texture bias has relied on benchmarks that are simplistic relative to today's best VLMs 🤖🧠 It's time for a new dataset generated with methods as powerful as the models we're testing! 🦾
Introducing 🪄 IllusionBench 🎩, our multimodal shape recognition benchmark at #NeurIPS2024
🎯 Can vision-language models recognize these shapes? (❌ nope!)
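For a flavor of the evaluation, here's an illustrative zero-shot probe with CLIP via HuggingFace Transformers (the image file and shape labels are made up; this is not the official IllusionBench evaluation script):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image whose global layout traces one of the candidate shapes.
image = Image.open("scene_shaped_like_a_dog.png")
shapes = ["a dog", "a cat", "an airplane", "a guitar"]
texts = [f"a scene arranged in the shape of {s}" for s in shapes]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

# If the model leans on texture/content over global shape, the probability
# mass lands on the wrong class.
print(dict(zip(shapes, probs[0].tolist())))
```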