I'll be in Suzhou 🇨🇳 at #EMNLP this week presenting "What Has Been Lost with Synthetic Evaluation?", joint work with @anamarasovic.bsky.social & @lasha.bsky.social! 🎉
📍Findings Session 1 - Hall C
📅 Wed, November 5, 13:00 - 14:00
arxiv.org/abs/2505.22830
🧠 Can large language models build the very benchmarks used to evaluate them?
In “What Has Been Lost with Synthetic Evaluation?”, Ana Marasović (@anamarasovic.bsky.social) and collaborators ask what happens when LLMs start generating the datasets used to test their reasoning. (1/6🧵)
More results and analysis can be found in the paper.
We welcome any discussion. Thanks for reading!!
We hope our work inspires future research on questions such as:
- Can further prompt refinement improve the difficulty of synthetic data?
- What other axes (e.g., representativeness, diversity) are affected when LLMs are used to generate benchmarks?
Key takeaways:
- While LLM-generated evals may be 𝑣𝑎𝑙𝑖𝑑, as a whole they lose crucial aspects of complexity.
- LLMs are promising where complexity is less critical, but human annotators remain vital for benchmarks that assess real-world generalization & nuanced scenarios.
But are these instances similarly difficult?
We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models.
We find that performance is consistently higher on the generated versions of the datasets, i.e., the synthetic instances are easier.
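To make the comparison concrete, here is a minimal sketch (not our released code; the split layout and names are illustrative placeholders) of how the per-model gap between synthetic and human-written accuracy could be computed:

```python
from statistics import mean

def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return mean(p == g for p, g in zip(preds, golds))

def difficulty_gaps(predictions, golds):
    """predictions: {model: {"human": [...], "synthetic": [...]}}
    golds: {"human": [...], "synthetic": [...]} (hypothetical layout)."""
    gaps = {}
    for model, splits in predictions.items():
        human_acc = accuracy(splits["human"], golds["human"])
        synth_acc = accuracy(splits["synthetic"], golds["synthetic"])
        # A positive gap means the model scores higher on synthetic
        # data, consistent with synthetic instances being easier.
        gaps[model] = synth_acc - human_acc
    return gaps
```

A consistently positive gap across the model suite is the pattern described above.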
We perform a human study and even find that LLM-generated data is preferred!
We ask NLP researchers to act as dataset creators and elicit their preferences between synthetic and human-authored data.
We examine both the 𝑣𝑎𝑙𝑖𝑑𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑖𝑐𝑢𝑙𝑡𝑦 of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP.
We find that validity is not an issue: we are able to get LLMs to generate instances that are highly valid according to our dataset specifications.
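As a rough illustration (a sketch, not our actual pipeline), generation with validity checking can be framed as a generate-then-validate loop, where `call_llm` and `meets_spec` are hypothetical placeholders for a model API call and the dataset-specific checks:

```python
def generate_valid_instances(spec_prompt, n_wanted, call_llm, meets_spec,
                             max_tries=1000):
    """Keep sampling until n_wanted instances satisfy the spec."""
    kept, tries = [], 0
    while len(kept) < n_wanted and tries < max_tries:
        tries += 1
        candidate = call_llm(spec_prompt)  # one generated QA instance
        if meets_spec(candidate):          # validity per the dataset spec
            kept.append(candidate)
    return kept
```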
LLMs are increasingly used to create challenging benchmarks that are then used to evaluate LLMs.
Is this a valid approach to evaluation construction? Do we lose anything in this process?
𝐖𝐡𝐚𝐭 𝐇𝐚𝐬 𝐁𝐞𝐞𝐧 𝐋𝐨𝐬𝐭 𝐖𝐢𝐭𝐡 𝐒𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧?
(arxiv.org/abs/2505.22830)
I'm happy to announce that the preprint of my first project is now online! Developed with the amazing support of @lasha.bsky.social & @anamarasovic.bsky.social