Mario Sanz

@msanz

PhD student in #NLProc

Joined: 16.11.2024

Latest posts by Mario Sanz @msanz

Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string "Answer:" to facilitate automated answer extraction via next...

What looks like a trivial formatting choice can actually alter research conclusions, so mind the gap!

Big thanks to my co-authors @minhducbui.bsky.social & Katharina von der Wense!

📄 Read the full paper here: arxiv.org/abs/2509.15020

26.09.2025 09:18 ๐Ÿ‘ 0 ๐Ÿ” 0 ๐Ÿ’ฌ 1 ๐Ÿ“Œ 0

Surprisingly, this small detail:
✅ Shifts model accuracy by up to 11%
✅ Changes which model tops the leaderboard – raising serious concerns about comparability of LLM leaderboards in prior work
✅ Affects calibration (reliability of confidence estimates)

26.09.2025 09:18 ๐Ÿ‘ 1 ๐Ÿ” 0 ๐Ÿ’ฌ 1 ๐Ÿ“Œ 0

In our #EMNLP2025 paper we study how the space before the answer letter (e.g., "A" vs. "␣A") is tokenized.

Practice is currently split: no community-wide standard exists, and even popular evaluation frameworks differ.

26.09.2025 09:18 ๐Ÿ‘ 0 ๐Ÿ” 0 ๐Ÿ’ฌ 1 ๐Ÿ“Œ 0

๐Ÿง Evaluating your LLM with multiple-choice question answering?

🧵 A tiny space in the prompt can make accuracy jump by 11% – and even reshuffle model rankings.

#EMNLP2025 #NLP #AI #LLM #Evaluation

26.09.2025 09:18 ๐Ÿ‘ 2 ๐Ÿ” 0 ๐Ÿ’ฌ 2 ๐Ÿ“Œ 0