What looks like a trivial formatting choice can actually alter research conclusions, so mind the gap!
Big thanks to my co-authors @minhducbui.bsky.social & Katharina von der Wense!
📖 Read the full paper here: arxiv.org/abs/2509.15020
Surprisingly, this small detail:
✅ Shifts model accuracy by up to 11%
✅ Changes which model tops the leaderboard, raising serious concerns about the comparability of LLM leaderboards in prior work
✅ Affects calibration (the reliability of confidence estimates)
In our #EMNLP2025 paper we study how the space before the answer letter (e.g., "A" vs. "␣A") is tokenized.
Practice is currently split: no community-wide standard exists, and even popular evaluation frameworks differ.
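A minimal sketch of the ambiguity (hypothetical prompt and variable names, not the paper's code): the full scored string is identical either way, but the two conventions disagree about where the space lives.

```python
# Sketch of the formatting ambiguity (hypothetical example, not the paper's code).
# The same multiple-choice item can be scored two ways that differ only in
# whether the space before the answer letter belongs to the prompt or the answer.

prompt = "Q: What is 2 + 2?\nA. 3\nB. 4\nAnswer:"

# Convention 1: the space is part of the prompt; the model is scored on "B".
prompt_with_space = prompt + " "
target_no_space = "B"

# Convention 2: the prompt ends at the colon; the model is scored on " B".
prompt_no_space = prompt
target_with_space = " B"

# The concatenated strings are byte-identical...
assert prompt_with_space + target_no_space == prompt_no_space + target_with_space

# ...but a BPE-style tokenizer typically merges a leading space into the
# following letter, so "B" and " B" become different tokens and the model
# assigns them different probabilities.
```

Because the merged token "␣B" and the bare token "B" are distinct vocabulary entries, the choice of convention changes which token's probability is compared across answer options.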
🧐 Evaluating your LLM with multiple-choice question answering?
🧵 A tiny space in the prompt can make accuracy jump by 11%, and even reshuffle model rankings.
#EMNLP2025 #NLP #AI #LLM #Evaluation