CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing
Mastering fine-grained visual recognition, essential in many expert domains, can require that specialists undergo years of dedicated training. Modeling the progression of such expertise in humans rema...
From medicine to geo-guessing, humans can get incredibly good at solving visual recognition tasks.
But how is this skill learned, and can we model its progression?
We present CleverBirds, accepted #NeurIPS2025, a large-scale benchmark for visual knowledge tracing.
arxiv.org/abs/2511.08512
1/5
12.11.2025 15:29
I will be at EMNLP next week presenting this work on November 7th! Reach out to me with any questions :))
Work done with my advisor, Mirella Lapata!
Preprint: arxiv.org/pdf/2505.14627
#EMNLP2025 #multimodallearning #scalableoversight #visionlanguagemodels #nlproc
01.11.2025 19:29
Unlike previous work on debate, where models are assigned an answer to argue for, we only instruct the models to argue for opinions they believe to be true. This is not only more efficient but also lets us extract reasoning data that can update the models' beliefs.
01.11.2025 19:29
RQ3: Where do debate or consultancy fail?
Our analysis shows that judges benefit when the experts argue for diverse opinions!
The red quadrant marks cases where the judge is persuaded more often than they should be (i.e., the experts are being deceptive).
01.11.2025 19:29
RQ2: Can debate be used as a reliable mechanism for yielding quality reasoning data?
Yes! We show that reasoning data obtained from debate, in a completely unsupervised manner, instills reasoning ability in the expert vision-language models.
01.11.2025 19:29
Excited to share my first work as a PhD student at EdinburghNLP that I will be presenting at EMNLP!
RQ1: Can we achieve scalable oversight across modalities via debate?
Yes! We show that debate between VLMs leads to better-quality answers on reasoning tasks.
01.11.2025 19:29