I'll be in Suzhou 🇨🇳 at #EMNLP this week presenting "What Has Been Lost with Synthetic Evaluation?", joint work with @anamarasovic.bsky.social & @lasha.bsky.social! 🎉
📍Findings Session 1 - Hall C
📅 Wed, November 5, 13:00 - 14:00
arxiv.org/abs/2505.22830
🧠 Can large language models build the very benchmarks used to evaluate them?
In “What Has Been Lost with Synthetic Evaluation?”, Ana Marasović (@anamarasovic.bsky.social) and collaborators ask what happens when LLMs start generating the datasets used to test their reasoning. (1/6🧵)
More results and analysis can be found in the paper.
We welcome any discussion. Thanks for reading!!
We hope our work inspires future research on questions such as:
- Can further prompt refinement improve the difficulty of synthetic data?
- What other axes (e.g., representativeness, diversity) are affected when LLMs are used to generate benchmarks?
Key takeaways:
- While LLM-generated evals may be 𝑣𝑎𝑙𝑖𝑑, as a whole they lose crucial aspects of complexity.
- LLMs are promising where complexity is less critical, but human annotators remain vital for benchmarks that assess real-world generalization & nuanced scenarios.
But are these instances similarly difficult?
We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models.
We find that performance is consistently higher on the generated versions of the datasets, i.e., the synthetic instances are easier.
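To make the comparison concrete, here is a minimal sketch (not our released code; the split layout and names are illustrative placeholders) of how the per-model gap between synthetic and human-written accuracy could be computed:

```python
from statistics import mean

def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return mean(p == g for p, g in zip(preds, golds))

def difficulty_gaps(predictions, golds):
    """predictions: {model: {"human": [...], "synthetic": [...]}}
    golds: {"human": [...], "synthetic": [...]} (hypothetical layout)."""
    gaps = {}
    for model, splits in predictions.items():
        human_acc = accuracy(splits["human"], golds["human"])
        synth_acc = accuracy(splits["synthetic"], golds["synthetic"])
        # A positive gap means the model scores higher on synthetic
        # data, consistent with synthetic instances being easier.
        gaps[model] = synth_acc - human_acc
    return gaps
```

A consistently positive gap across the model suite is the pattern described above.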
We perform a human study and even find that LLM-generated data is preferred!
We ask NLP researchers to act as dataset creators and elicit their preferences between synthetic and human-authored data.
We examine both the 𝑣𝑎𝑙𝑖𝑑𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑖𝑐𝑢𝑙𝑡𝑦 of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP.
We find that validity is not an issue: we are able to get LLMs to generate instances that are highly valid according to our dataset specifications.
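As a rough illustration (a sketch, not our actual pipeline), generation with validity checking can be framed as a generate-then-validate loop, where `call_llm` and `meets_spec` are hypothetical placeholders for a model API call and the dataset-specific checks:

```python
def generate_valid_instances(spec_prompt, n_wanted, call_llm, meets_spec,
                             max_tries=1000):
    """Keep sampling until n_wanted instances satisfy the spec."""
    kept, tries = [], 0
    while len(kept) < n_wanted and tries < max_tries:
        tries += 1
        candidate = call_llm(spec_prompt)  # one generated QA instance
        if meets_spec(candidate):          # validity per the dataset spec
            kept.append(candidate)
    return kept
```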
LLMs are increasingly used to create challenging benchmarks that are then used to evaluate LLMs.
Is this a valid approach to evaluation construction? Do we lose anything in this process?
𝐖𝐡𝐚𝐭 𝐇𝐚𝐬 𝐁𝐞𝐞𝐧 𝐋𝐨𝐬𝐭 𝐖𝐢𝐭𝐡 𝐒𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧?
(arxiv.org/abs/2505.22830)
I'm happy to announce that the preprint of my first project is now online! Developed with the amazing support of @lasha.bsky.social & @anamarasovic.bsky.social