Just read the paper -- super cool, thanks for sharing!! Extremely relevant to + in line with our findings. Seems like a common thread here about the negative effects of RLVR on output variation...
Paper + data + code: grvkamath.github.io/probcopa-dem... | Major thanks to my co-authors, without whom this work would not have been possible: Sreenath Madathil, @sebschu.bsky.social, Marie-Catherine de Marneffe and @sivareddyg.bsky.social! (10/10)
Takeaway: reasoning LLMs are getting better and better on math and code (deterministic reasoning tasks). But we should also evaluate them on open-ended, inherently uncertain everyday reasoning! (9/10)
Ensembling all 8 models helps close the gap with human response distributions, but still doesn't reach human-human baseline similarity. (8/10)
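Not from the paper, just an illustration of the kind of comparison described above: pool ratings from several models for one item, bin them, and measure distance to the human distribution (Jensen-Shannon distance here; the paper's actual similarity metric may differ). All ratings and model names in the snippet are made up.

```python
# Minimal sketch (assumptions): compare an ensemble of model ratings to human
# ratings for one item by binning 0-100 likelihood scores and computing
# Jensen-Shannon distance. All ratings and model names below are hypothetical.
import numpy as np
from scipy.spatial.distance import jensenshannon

BINS = np.linspace(0, 100, 11)  # ten equal-width bins over the rating scale

def to_dist(ratings):
    """Turn a list of 0-100 ratings into a normalized histogram."""
    counts, _ = np.histogram(ratings, bins=BINS)
    return counts / counts.sum()

human_ratings = [35, 60, 55, 70, 40, 65, 50, 45, 80, 55]            # hypothetical
model_ratings = {"model_a": [90, 95, 90], "model_b": [85, 90, 95]}  # hypothetical

ensemble = [r for rs in model_ratings.values() for r in rs]
print("JS distance, ensemble vs. humans:", jensenshannon(to_dist(ensemble), to_dist(human_ratings)))
print("JS distance, model_a vs. humans: ", jensenshannon(to_dist(model_ratings["model_a"]), to_dist(human_ratings)))
```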
We also looked inside reasoning chains. 90/100 sampled chains showed models explicitly enumerating alternative scenarios, a consistent reasoning pattern. Longer reasoning chains also correlate with more human judgment variation. (7/10)
More strikingly: for almost every item in our dataset, humans showed more response variation than models. Increasing temperature is not enough; models devolve into outputting random tokens before reaching human-level variation. (6/10)
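To make the temperature point concrete, here's a rough sketch of that kind of sweep. sample_rating() is a simulated stand-in for a real model call and its numbers are invented; the point is only the shape of the experiment, i.e. sample many ratings per temperature and compare their spread to the human spread for the same item.

```python
# Sketch only: measure how spread out a model's likelihood ratings are at
# increasing sampling temperatures. sample_rating() is a hypothetical,
# simulated stand-in for one real model call.
import random
import statistics

def sample_rating(prompt: str, temperature: float) -> float:
    """Hypothetical stand-in; replace with a real model client.
    Simulates the reported behaviour: ratings cluster tightly even at high temperature."""
    return min(100.0, max(0.0, random.gauss(90, 2 + 3 * temperature)))

def rating_spread(prompt: str, temperature: float, n: int = 30) -> float:
    """Standard deviation of n sampled ratings at one temperature."""
    return statistics.stdev(sample_rating(prompt, temperature) for _ in range(n))

item = "There was an accident on the highway. How likely is it that traffic was worse than usual?"
for t in (0.2, 0.7, 1.0, 1.5, 2.0):
    print(t, round(rating_spread(item, t), 1))  # compare to the spread of human ratings for the same item
```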
Models do okay at the extremes: for inferences that humans deem very likely or very unlikely, model responses cluster in similar regions. But where humans are more uncertain, this similarity breaks down. (5/10)
We tested 8 contemporary reasoning LLMs (GPT-5, Gemini-3, Kimi-K2-Thinking, Claude Sonnet-4.5, and more). They show a clear pattern: unlike humans, models almost never return medium likelihood scores. (4/10)
Human ratings are graded and varied: people do not always commit to hard judgments, and they vary in their exact judgments of inference likelihood. (3/10)
Humans reason in non-deterministic settings all the time. "There was an accident on the highway → traffic was worse than usual" is likely, but not certain. We built ProbCOPA, a dataset of 210 such inferences each rated by 25–30 people, to study this. (2/10)
🚨New Paper!🚨 How do reasoning LLMs handle inferences that have no deterministic answer? We find that they diverge from humans in some significant ways, and fail to reflect human uncertainty… 🧵(1/10)
Super cool interpretability work from @bennokrojer.bsky.social, which I think is also relevant to anyone interested in how word meanings are represented in LLMs!
The top shows the title and authors of the paper: "Whither symbols in the era of advanced neural networks?" by Tom Griffiths, Brenden Lake, Tom McCoy, Ellie Pavlick, and Taylor Webb. At the bottom is text saying "Modern neural networks display capacities traditionally believed to require symbolic systems. This motivates a re-assessment of the role of symbols in cognitive theories." In the middle is a graphic illustrating this text by showing three capacities: compositionality, productivity, and inductive biases. For each one, there is an illustration of a neural network displaying it. For compositionality, the illustration is DALL-E 3 creating an image of a teddy bear skateboarding in Times Square. For productivity, the illustration is novel words produced by GPT-2: "IKEA-ness", "nonneotropical", "Brazilianisms", "quackdom", "Smurfverse". For inductive biases, the illustration is a graph showing that a meta-learned neural network can learn formal languages from a small number of examples.
🤖 🧠 NEW PAPER ON COGSCI & AI 🧠 🤖
Recent neural networks capture properties long thought to require symbols: compositionality, productivity, rapid learning
So what role should symbols play in theories of the mind? For our answer...read on!
Paper: arxiv.org/abs/2508.05776
1/n
Examples of word sense probability over the time range of the corpus.
Using congressional speeches as a corpus, researchers quantify how younger and older adults adopt new meanings for words as language changes. Older people may be a bit slower to change, but can show considerable linguistic flexibility. In PNAS: www.pnas.org/doi/10.1073/...
My latest column for @thenewworldmag.bsky.social looks at the question of how new meanings for words spread in the population.
www.thenewworld.co.uk/philip-ball-...
What's most likely is that this IS a factor for a portion of our more recent data, but not enough to affect the main finding here (across a range of words and decades). Tyvm for the interest in this!!
Cool article that's relevant: www.newyorker.com/magazine/200...
Very valid q! It's likely a confound for some of the more recent data, but not most. (i) lots of the "speeches" are in fact shorter replies and remarks; (ii) the professionalization of speech-writing evolved over the 20th century, but we see no change in speakers' adoption behavior over time.
tysm, means a lot coming from you!
Ultimately, we hope the insights from this work spur more work that uses tools from NLP to answer questions about human language.
Massive thanks to co-auths: Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social and @dallascard.bsky.social!
Paper: bit.ly/4fcWfma. (12/12)
Limitations: Congressional speech is time-annotated linguistic data, from thousands of speakers whose ages are known, across over a century -- rare, required properties for this study. But Congress was and is not socially representative. Plus: what about other languages and societies? (11/12)
...while at a methodological level, they suggest that sociolinguists should avoid relying too much on apparent-time differences (i.e. using older speakers as a window into the past) to identify ongoing semantic shifts. (10/12)
Our findings have both conceptual and methodological implications. At a conceptual level, they suggest that the social dynamics of word meaning change are generation-agnostic, and that speakers are capable of adapting their lexicon well into adulthood (unlike, e.g., their phonology)... (9/12)
These findings extend to the level of the individual: members of Congress who gave speeches over a long enough period of time showed significant changes in how they used some of our target words, mimicking population-level trends in word meaning change. (8/12)
Overall, we find that age has very little effect -- older speakers lag slightly behind younger ones, but match their word usage within just a few years; in some cases, they even lead change. Semantic change appears driven almost purely by time, with only minor inter-generational differences. (7/12)
Finally, we use Generalized Additive Mixed-effect Models (GAMMs) to model the likelihood of a word being used in a specific sense, given the year of its use and a speaker's age at the time, while accounting for other inter-speaker variation. (6/12)
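For anyone who wants to play with the idea, here's a rough Python approximation on synthetic data. The paper's GAMMs were presumably fit with dedicated mixed-model GAM software (e.g. mgcv in R); pyGAM's factor term only mimics a per-speaker effect, and every number below is invented.

```python
# Rough approximation of the modelling idea, on synthetic data (not the paper's
# setup or data): smooth effects of year and speaker age on the probability of
# using a word in its newer sense, plus a per-speaker factor term.
import numpy as np
from pygam import LogisticGAM, s, f

rng = np.random.default_rng(0)
n = 500
year = rng.integers(1873, 2011, n)      # year the speech was given
age = rng.integers(30, 80, n)           # speaker's age at that time
speaker = rng.integers(0, 25, n)        # speaker id, coded as an integer

# Synthetic outcome: the newer sense becomes more likely over time,
# with only a small effect of speaker age (mirroring the reported finding).
p_new = 1 / (1 + np.exp(-((year - 1940) / 25 - (age - 55) / 200)))
y = rng.binomial(1, p_new)              # 1 = word used in the newer sense

X = np.column_stack([year, age, speaker])
gam = LogisticGAM(s(0) + s(1) + f(2)).fit(X, y)  # smooths for year and age, factor for speaker
gam.summary()
```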