The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!
Call for papers is out. Topics include:
🐟 LMs as evaluators
🐠 Living benchmarks
🍣 Eval with humans
and more
New for 2026: Opinion & Statement Papers!
Full CFP: gem-workshop.com/call-for-pap...
27.01.2026 19:17
OpeNLGauge comes in two variants: a prompt-based ensemble and a smaller fine-tuned model, both built exclusively on open-weight LLMs (with openly released training data!).
Thanks @tuetschek.bsky.social and @mlango.bsky.social!
23.08.2025 16:39
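To make the prompt-based ensemble idea concrete, here is a minimal Python sketch: several open-weight instruct models are asked for a 1-5 rating on a criterion and their scores are averaged. The endpoint, model names, and prompt wording are illustrative assumptions (e.g., models served behind an OpenAI-compatible API such as vLLM), not OpeNLGauge's actual setup.

```python
# Minimal sketch of a prompt-based ensemble judge. Endpoint, model names,
# and prompt are assumptions for illustration, not the paper's setup.
import re
import statistics
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g. vLLM) hosting the models locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Hypothetical ensemble members; any open-weight instruct models would do.
MODELS = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

PROMPT = (
    "On a scale of 1-5, rate the {criterion} of the following output.\n"
    "Output: {output}\n"
    "Answer with a single digit."
)

def ensemble_score(output: str, criterion: str) -> float:
    """Average the 1-5 ratings returned by each ensemble member."""
    scores = []
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": PROMPT.format(criterion=criterion, output=output)}],
            temperature=0.0,  # deterministic judging
        )
        # Take the first digit 1-5 in the reply; skip replies without one.
        match = re.search(r"[1-5]", resp.choices[0].message.content)
        if match:
            scores.append(int(match.group()))
    return statistics.mean(scores)

print(ensemble_score("The cat sat on teh mat.", "fluency"))
```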
We introduce an explainable metric for evaluating a wide range of natural language generation tasks, without any need for reference texts. Given an evaluation criterion, the metric provides fine-grained assessments of the output by highlighting and explaining problematic spans in the text.
23.08.2025 16:37
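A minimal sketch of what such reference-free, span-level evaluation can look like: ask an open-weight LLM to mark problematic spans for a given criterion and explain each one. The prompt, JSON schema, and model name below are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of criterion-conditioned span annotation with an open-weight LLM.
# Prompt, schema, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

PROMPT = """You are evaluating the {criterion} of a generated text.
List every problematic span as a JSON array:
[{{"span": "<exact substring>", "explanation": "<why it is problematic>"}}]
Return [] if there are no problems.

Text: {text}"""

def annotate_spans(text: str, criterion: str,
                   model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(criterion=criterion, text=text)}],
        temperature=0.0,
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return []  # the model did not follow the schema

for err in annotate_spans("Paris is the capital of Germany.", "factual accuracy"):
    print(f'- "{err["span"]}": {err["explanation"]}')
```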
Our paper "OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs" has been accepted to #INLG2025 conference!
You can read the preprint here: arxiv.org/abs/2503.11858
23.08.2025 16:36
#ACL2025NLP in Vienna 🇦🇹 starts today, with 23 🤯 @ufal-cuni.bsky.social folks presenting their work at both the main conference and the workshops. Check out our main conference papers today and on Wednesday 👇
28.07.2025 07:27
Today, @tuetschek.bsky.social presented his team's work on evaluating LLM text generation with both human annotation frameworks and LLM-based metrics. Their approach tackles the benchmark data leakage problem, showing how to obtain unseen data for unbiased LLM testing.
30.04.2025 12:02
Large Language Models as Span Annotators
How do LLMs compare to human crowdworkers in annotating text spans? 🧑🤖
And how can span annotation help us with evaluating texts?
Find out in our new paper: llm-span-annotators.github.io
Arxiv: arxiv.org/abs/2504.08697
15.04.2025 11:10
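One concrete way to compare LLM and crowdworker span annotations is character-level overlap. The self-contained sketch below computes precision, recall, and F1 over the characters each annotator marked; the paper may well use different agreement measures, so treat this purely as an illustration of the comparison problem.

```python
# Sketch: agreement between two span annotations via character-level F1.
# An illustrative measure, not necessarily the one used in the paper.

def to_char_set(spans: list[tuple[int, int]]) -> set[int]:
    """Expand (start, end) spans into the set of character offsets they cover."""
    return {i for start, end in spans for i in range(start, end)}

def span_f1(pred: list[tuple[int, int]], gold: list[tuple[int, int]]) -> float:
    p, g = to_char_set(pred), to_char_set(gold)
    if not p or not g:
        return float(p == g)  # both empty -> perfect agreement
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    return 2 * precision * recall / (precision + recall)

# An LLM marked chars 10-25 as problematic; a crowdworker marked 12-30.
print(span_f1([(10, 25)], [(12, 30)]))  # ~0.79
```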