Just read the paper -- super cool, thanks for sharing!! Extremely relevant to + in line with our findings. Seems like a common thread here about the negative effects of RLVR on output variation...
Paper + data + code: grvkamath.github.io/probcopa-dem... | Major thanks to my co-authors, without whom this work would not have been possible: Sreenath Madathil, @sebschu.bsky.social, Marie-Catherine de Marneffe and @sivareddyg.bsky.social! (10/10)
Takeaway: reasoning LLMs are getting better and better on math and code (deterministic reasoning tasks). But we should also evaluate them on open-ended, inherently uncertain everyday reasoning! (9/10)
Ensembling all 8 models helps close the gap with human response distributions, but still doesn't reach human-human baseline similarity. (8/10)
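Not from the paper, just an illustration of the kind of comparison described above: pool ratings from several models for one item, bin them, and measure distance to the human distribution (Jensen-Shannon distance here; the paper's actual similarity metric may differ). All ratings and model names in the snippet are made up.

```python
# Minimal sketch (assumptions): compare an ensemble of model ratings to human
# ratings for one item by binning 0-100 likelihood scores and computing
# Jensen-Shannon distance. All ratings and model names below are hypothetical.
import numpy as np
from scipy.spatial.distance import jensenshannon

BINS = np.linspace(0, 100, 11)  # ten equal-width bins over the rating scale

def to_dist(ratings):
    """Turn a list of 0-100 ratings into a normalized histogram."""
    counts, _ = np.histogram(ratings, bins=BINS)
    return counts / counts.sum()

human_ratings = [35, 60, 55, 70, 40, 65, 50, 45, 80, 55]            # hypothetical
model_ratings = {"model_a": [90, 95, 90], "model_b": [85, 90, 95]}  # hypothetical

ensemble = [r for rs in model_ratings.values() for r in rs]
print("JS distance, ensemble vs. humans:", jensenshannon(to_dist(ensemble), to_dist(human_ratings)))
print("JS distance, model_a vs. humans: ", jensenshannon(to_dist(model_ratings["model_a"]), to_dist(human_ratings)))
```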
We also looked inside reasoning chains. 90/100 sampled chains showed models explicitly enumerating alternative scenarios, a consistent reasoning pattern. Longer reasoning chains also correlate with more human judgment variation. (7/10)
More strikingly: for almost every item in our dataset, humans showed more response variation than models. Increasing temperature is not enough; models devolve into outputting random tokens before reaching human-level variation. (6/10)
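To make the temperature point concrete, here's a rough sketch of that kind of sweep. sample_rating() is a simulated stand-in for a real model call and its numbers are invented; the point is only the shape of the experiment, i.e. sample many ratings per temperature and compare their spread to the human spread for the same item.

```python
# Sketch only: measure how spread out a model's likelihood ratings are at
# increasing sampling temperatures. sample_rating() is a hypothetical,
# simulated stand-in for one real model call.
import random
import statistics

def sample_rating(prompt: str, temperature: float) -> float:
    """Hypothetical stand-in; replace with a real model client.
    Simulates the reported behaviour: ratings cluster tightly even at high temperature."""
    return min(100.0, max(0.0, random.gauss(90, 2 + 3 * temperature)))

def rating_spread(prompt: str, temperature: float, n: int = 30) -> float:
    """Standard deviation of n sampled ratings at one temperature."""
    return statistics.stdev(sample_rating(prompt, temperature) for _ in range(n))

item = "There was an accident on the highway. How likely is it that traffic was worse than usual?"
for t in (0.2, 0.7, 1.0, 1.5, 2.0):
    print(t, round(rating_spread(item, t), 1))  # compare to the spread of human ratings for the same item
```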
Models do okay at the extremes: for inferences that humans deem very likely or very unlikely, model responses cluster in similar regions. But where humans are more uncertain, this similarity breaks down. (5/10)
We tested 8 contemporary reasoning LLMs (GPT-5, Gemini-3, Kimi-K2-Thinking, Claude Sonnet-4.5, and more). They show a clear pattern: unlike humans, models almost never return medium likelihood scores. (4/10)
Human ratings are graded and varied: people do not always commit to hard judgments, and they vary in their exact judgments of inference likelihood. (3/10)
Humans reason in non-deterministic settings all the time. "There was an accident on the highway → traffic was worse than usual" is likely, but not certain. We built ProbCOPA, a dataset of 210 such inferences each rated by 25–30 people, to study this. (2/10)
🚨New Paper!🚨 How do reasoning LLMs handle inferences that have no deterministic answer? We find that they diverge from humans in some significant ways, and fail to reflect human uncertainty… 🧵(1/10)
Super cool interpretability work from @bennokrojer.bsky.social, which I think is also relevant to anyone interested in how word meanings are represented in LLMs!
The top shows the title and authors of the paper: "Whither symbols in the era of advanced neural networks?" by Tom Griffiths, Brenden Lake, Tom McCoy, Ellie Pavlick, and Taylor Webb. At the bottom is text saying "Modern neural networks display capacities traditionally believed to require symbolic systems. This motivates a re-assessment of the role of symbols in cognitive theories." In the middle is a graphic illustrating this text by showing three capacities: compositionality, productivity, and inductive biases. For each one, there is an illustration of a neural network displaying it. For compositionality, the illustration is DALL-E 3 creating an image of a teddy bear skateboarding in Times Square. For productivity, the illustration is novel words produced by GPT-2: "IKEA-ness", "nonneotropical", "Brazilianisms", "quackdom", "Smurfverse". For inductive biases, the illustration is a graph showing that a meta-learned neural network can learn formal languages from a small number of examples.
🤖 🧠 NEW PAPER ON COGSCI & AI 🧠 🤖
Recent neural networks capture properties long thought to require symbols: compositionality, productivity, rapid learning
So what role should symbols play in theories of the mind? For our answer...read on!
Paper: arxiv.org/abs/2508.05776
1/n
Examples of word sense probability over the time range of the corpus.
Using congressional speeches as a corpus, researchers quantify how younger and older adults adopt new meanings for words as language changes. Older people may be a bit slower to change, but can show considerable linguistic flexibility. In PNAS: www.pnas.org/doi/10.1073/...
My latest column for @thenewworldmag.bsky.social looks at the question of how new meanings for words spread in the population.
www.thenewworld.co.uk/philip-ball-...
What's most likely is that this IS a factor for a portion of our more recent data, but not enough to affect the main finding here (across a range of words and decades). Tyvm for the interest in this!!
Cool article that's relevant: www.newyorker.com/magazine/200...
Very valid q! It's likely a confound for some of the more recent data, but not most. (i) lots of the "speeches" are in fact shorter replies and remarks; (ii) the professionalization of speech-writing evolved over the 20th century, but we see no change in speakers' adoption behavior over time.
tysm, means a lot coming from you!
Ultimately, we hope the insights from this work spur more work that uses tools from NLP to answer questions about human language.
Massive thanks to co-auths: Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social and @dallascard.bsky.social!
Paper: bit.ly/4fcWfma. (12/12)
Limitations: Congressional speech is time-annotated linguistic data, from thousands of speakers whose ages are known, across over a century -- rare, required properties for this study. But Congress was and is not socially representative. Plus: what about other languages and societies? (11/12)
...while at a methodological level, they suggest that sociolinguists should avoid relying too much on apparent-time differences (i.e. using older speakers as a window into the past) to identify ongoing semantic shifts. (10/12)
Our findings have both conceptual and methodological implications. At a conceptual level, they suggest that the social dynamics of word meaning change are generation-agnostic, and that speakers are capable of adapting their lexicon well into adulthood (unlike, e.g., their phonology)... (9/12)
These findings extend to the level of the individual: members of Congress who gave speeches over a long enough period of time showed significant changes in how they used some of our target words, mimicking population-level trends in word meaning change. (8/12)
Overall, we find that age has very little effect -- older speakers lag slightly behind younger ones, but match their word usage within just a few years; in some cases, they even lead change. Semantic change appears driven almost purely by time, with only minor inter-generational differences. (7/12)
Finally, we use Generalized Additive Mixed-effect Models (GAMMs) to model the likelihood of a word being used in a specific sense, given the year of its use and a speaker's age at the time, while accounting for other inter-speaker variation. (6/12)
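For anyone who wants to play with the idea, here's a rough Python approximation on synthetic data. The paper's GAMMs were presumably fit with dedicated mixed-model GAM software (e.g. mgcv in R); pyGAM's factor term only mimics a per-speaker effect, and every number below is invented.

```python
# Rough approximation of the modelling idea, on synthetic data (not the paper's
# setup or data): smooth effects of year and speaker age on the probability of
# using a word in its newer sense, plus a per-speaker factor term.
import numpy as np
from pygam import LogisticGAM, s, f

rng = np.random.default_rng(0)
n = 500
year = rng.integers(1873, 2011, n)      # year the speech was given
age = rng.integers(30, 80, n)           # speaker's age at that time
speaker = rng.integers(0, 25, n)        # speaker id, coded as an integer

# Synthetic outcome: the newer sense becomes more likely over time,
# with only a small effect of speaker age (mirroring the reported finding).
p_new = 1 / (1 + np.exp(-((year - 1940) / 25 - (age - 55) / 200)))
y = rng.binomial(1, p_new)              # 1 = word used in the newer sense

X = np.column_stack([year, age, speaker])
gam = LogisticGAM(s(0) + s(1) + f(2)).fit(X, y)  # smooths for year and age, factor for speaker
gam.summary()
```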