
Alex Gill

@agill32

NLP researcher at U of U

390 Followers · 362 Following · 9 Posts · Joined 11.11.2024

Latest posts by Alex Gill @agill32


Folks, I don’t know how it’s possible, but it gets funnier.

21.11.2025 15:19 👍 473 🔁 107 💬 16 📌 31

I'll be in Suzhou 🇨🇳 at #EMNLP this week presenting "What has been Lost with Synthetic Evaluation?" done with @anamarasovic.bsky.social & @lasha.bsky.social! 🎉

📍Findings Session 1 - Hall C
📅 Wed, November 5, 13:00 - 14:00

arxiv.org/abs/2505.22830

03.11.2025 11:03 👍 11 🔁 2 💬 0 📌 1

🧠 Can large language models build the very benchmarks used to evaluate them?
In “What Has Been Lost with Synthetic Evaluation”, Ana Marasović (@anamarasovic.bsky.social) and collaborators ask what happens when LLMs start generating the datasets used to test their reasoning. (1/6🧵)

20.10.2025 16:01 👍 9 🔁 3 💬 2 📌 0

More results and analysis can be found in the paper.

We welcome any discussion, thanks for reading!!

04.06.2025 22:24 👍 0 🔁 0 💬 0 📌 0

We hope that our work will inspire future research into:

- Can further prompt review improve the difficulty of synthetic data?

- What other axes (representativeness, diversity) are affected when using LLMs to generate benchmarks?

04.06.2025 22:24 👍 0 🔁 0 💬 1 📌 0

Key takeaways:

- While LLM-generated evals may be 𝑣𝑎𝑙𝑖𝑑, as a whole they lose crucial aspects of complexity.

- LLMs are promising where complexity is less critical, but human annotators remain vital for benchmarks assessing real-world generalization & nuanced scenarios.

04.06.2025 22:24 👍 0 🔁 0 💬 1 📌 0

But are these instances similarly difficult?

We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models.

We find that performance is consistently higher on generated versions of the datasets.

04.06.2025 22:24 👍 0 🔁 0 💬 1 📌 0
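The comparison described in this post could be sketched roughly as follows: score each model's accuracy on the human-written and LLM-generated versions of a benchmark, then look at the gap. The model names, predictions, and gold answers below are made up for illustration; the paper's actual evaluation setup may differ.

```python
# Hypothetical sketch of the difficulty comparison: for each model, compute
# accuracy on human-written vs. synthetic splits and the synthetic-minus-human gap.

def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(preds)

def difficulty_gap(model_preds, human_golds, synth_golds):
    """Per-model accuracy on each split plus the synthetic-minus-human gap."""
    report = {}
    for model, (human_preds, synth_preds) in model_preds.items():
        acc_h = accuracy(human_preds, human_golds)
        acc_s = accuracy(synth_preds, synth_golds)
        report[model] = {"human": acc_h, "synthetic": acc_s, "gap": acc_s - acc_h}
    return report

# Toy data: two invented models, four questions per split.
human_golds = ["yes", "no", "no", "yes"]
synth_golds = ["no", "no", "yes", "yes"]
model_preds = {
    "model-a": (["yes", "no", "yes", "no"], ["no", "no", "yes", "yes"]),
    "model-b": (["yes", "yes", "no", "yes"], ["no", "no", "yes", "no"]),
}

report = difficulty_gap(model_preds, human_golds, synth_golds)
for model, r in report.items():
    print(model, r)
```

A consistently positive gap across models would match the thread's finding that the generated versions of the datasets are easier.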

We perform a human study and even find that LLM-generated data is preferred!

We ask NLP researchers to act as dataset creators and gather preferences between synthetic and human-authored data.

04.06.2025 22:24 👍 0 🔁 0 💬 1 📌 0

We examine both the 𝑣𝑎𝑙𝑖𝑑𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑖𝑐𝑢𝑙𝑡𝑦 of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP.

We find that validity is not an issue. We are able to get LLMs to generate instances that are highly valid according to our dataset specs.

04.06.2025 22:24 👍 0 🔁 0 💬 1 📌 0

We are increasingly seeing LLMs being used to create challenging benchmarks that are then used for evaluating LLMs.

Is this a valid approach to evaluation construction? Do we lose anything in this process?

04.06.2025 22:24 👍 0 🔁 0 💬 1 📌 0

𝐖𝐡𝐚𝐭 𝐇𝐚𝐬 𝐁𝐞𝐞𝐧 𝐋𝐨𝐬𝐭 𝐖𝐢𝐭𝐡 𝐒𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧?

(arxiv.org/abs/2505.22830)

I'm happy to announce that the preprint release of my first project is online! Developed with the amazing support of @lasha.bsky.social & @anamarasovic.bsky.social

04.06.2025 22:24 👍 11 🔁 4 💬 1 📌 1