- Preprint: arxiv.org/abs/2502.14429
- Code: github.com/zouharvi/COM...
- Models: huggingface.co/collections/...
Thanks to @maikezufle.bsky.social Béni @juliuscheng.bsky.social @mrinmaya.bsky.social @jan-niehues.bsky.social and EAMT for sponsoring this research.
See you in Rabat for EACL 2026!
09.02.2026 08:06
👍 1
🔁 0
💬 0
📌 0
Where do we need so much QE? In search. Generating 1000 hypotheses is cheap; finding the best one isn't. We treat this as a multi-armed bandit where pulling an arm corresponds to getting a more accurate quality estimate for that hypothesis.
UCB works much better than either pruning with log-probabilities or always running full QE.
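A toy sketch of the bandit framing in Python (the names, the UCB1 formula choice, and the `estimate` interface are illustrative assumptions, not the paper's implementation):

```python
import math

def ucb_select(candidates, estimate, budget, c=1.0):
    """Pick the best hypothesis under a pull budget.

    Each "pull" calls estimate(i), a cheap noisy quality estimate for
    candidate i; UCB1 decides which candidate to refine next.
    """
    k = len(candidates)
    n = [0] * k        # pulls per candidate
    mean = [0.0] * k   # running mean quality estimate

    for t in range(1, budget + 1):
        def ucb(i):
            if n[i] == 0:              # pull every arm at least once
                return float("inf")
            # exploit high means, explore under-sampled arms
            return mean[i] + c * math.sqrt(2 * math.log(t) / n[i])

        i = max(range(k), key=ucb)
        x = estimate(i)
        n[i] += 1
        mean[i] += (x - mean[i]) / n[i]  # incremental mean update

    return candidates[max(range(k), key=lambda i: mean[i])]
```

With a noiseless estimator this reduces to picking the arm with the highest score; the interesting regime is when each extra pull tightens a noisy estimate instead of running the full metric on every hypothesis.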
09.02.2026 08:06
👍 1
🔁 0
💬 1
📌 0
Sometimes the metric already knows the right answer at layer 4. Why compute the rest? We attach the quality and confidence regressors to every layer, and if the self-confidence* is high, we exit early. Faster evaluation.
*Here ê predicts the difference from the model's own final-layer prediction.
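A minimal sketch of the early-exit rule, assuming per-layer quality predictions and per-layer self-predicted errors ê are already computed (the function name and threshold value are illustrative):

```python
def early_exit_prediction(layer_scores, layer_errors, threshold=0.05):
    """Return the first per-layer quality prediction whose
    self-predicted absolute error falls below `threshold`.

    layer_scores : quality predictions from the regressor at each layer
    layer_errors : self-predicted errors ê at each layer
    """
    for layer, (score, err) in enumerate(zip(layer_scores, layer_errors)):
        if err < threshold:
            return score, layer       # confident: skip the remaining layers
    return layer_scores[-1], len(layer_scores) - 1  # full forward pass
```

In the real model the encoder would run layer by layer, checking the confidence head after each layer, so a confident early exit actually saves the remaining compute.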
09.02.2026 08:06
👍 0
🔁 0
💬 1
📌 0
We train COMET with an additional regressor that predicts its own expected error, simply: L2(y, ŷ) + β·L2(|y−ŷ|, ê). When the model is uncertain, it predicts a high absolute error ê. This costs almost nothing and is much faster than MC Dropout.
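The loss above can be written out directly; a sketch in plain Python (β's value and the variable names are illustrative):

```python
def confidence_loss(y, y_hat, e_hat, beta=0.5):
    """Quality regression plus instant-confidence term:
    L2(y, ŷ) + β · L2(|y − ŷ|, ê).

    y     : gold quality score
    y_hat : predicted quality score ŷ
    e_hat : self-predicted absolute error ê
    """
    quality_term = (y - y_hat) ** 2
    # the error head is trained to match the actual absolute error
    confidence_term = (abs(y - y_hat) - e_hat) ** 2
    return quality_term + beta * confidence_term
```

When ê is perfectly calibrated (ê = |y − ŷ|) the second term vanishes, so a well-trained error head directly reads off the metric's uncertainty at inference time.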
09.02.2026 08:06
👍 0
🔁 0
💬 1
📌 0
Quality estimation metrics (automatic metrics) are amazing. Truly. We would like to use them everywhere. That gets compute-expensive very quickly. We also don't know when they don't know.
In "Early-Exit and Instant Confidence Translation Quality Estimation" (at EACL26) we fix that.
09.02.2026 08:06
👍 15
🔁 3
💬 1
📌 1
Yes, LabelStudio is a great general-purpose tool (while Pearmut aims at specific workflows).
However, LabelStudio has limitations with tutorials/attention checks, and with assigning annotation tasks to lay people (they have to register to annotate, I believe?).
28.01.2026 14:55
👍 0
🔁 0
💬 0
📌 1
Thanks! Experience reports should be more common.
Pearmut was created out of frustration with setting up humeval with good defaults using existing tools.
In the paper, we have 5 researchers trying to set up humeval on 5 different platforms and reporting on setup time, ease of use, and customizability.
28.01.2026 14:43
👍 2
🔁 0
💬 1
📌 0
🥜 Platform for Efficient Annotation of Natural Utterances and Translation? 😁
28.01.2026 14:23
👍 1
🔁 0
💬 0
📌 0
Thanks to all my friends who helped bring this to life. 🙂
Get in touch if you'd like to help with human evaluation for your paper/work! 🖐️
28.01.2026 13:39
👍 0
🔁 0
💬 2
📌 0
GitHub - zouharvi/pearmut: Platform for Evaluating and Reviewing of Multilingual Tasks
The CLI gives you magic links: dashboard to monitor progress, and annotation links to distribute to your annotators.
Pearmut is open-source and extensible with many exciting features coming. 🍏
github.com/zouharvi/pea...
28.01.2026 13:39
👍 1
🔁 0
💬 1
📌 0
Get started with the following commands:
pip install pearmut
# Download example campaign
wget raw.githubusercontent.com/zouharvi/pea...
# Load and start
pearmut add esa.json
pearmut run
28.01.2026 13:39
👍 1
🔁 0
💬 1
📌 0
The tool supports multiple annotation protocols for translation and multilingual tasks out of the box:
- direct assessment (with custom sliders),
- ESA, MQM,
- contrastive evaluation, video/audio/image, attention checks, tutorials, statistically sound model comparison, etc.
28.01.2026 13:39
👍 0
🔁 0
💬 1
📌 0
How often is human evaluation skipped in papers/workflows just because it's too difficult to set up? Yet even a small humeval can give so much more signal than automatic metrics.
Introducing Pearmut, Human Evaluation of Translation Made Trivial🍐 arxiv.org/pdf/2601.02933
28.01.2026 13:39
👍 18
🔁 0
💬 1
📌 0
Speech Translation Metrics track
Home of the IWSLT conference and SIGSLT.
Join us and build a model that predicts human annotations of quality based on source speech and its textual translation.
iwslt.org/2026/metrics
Effort led by @maikezufle.bsky.social, @marinecarpuat.bsky.social, @hjhan.bsky.social, @matteo-negri.bsky.social, and others. 🙂
14.01.2026 18:04
👍 2
🔁 0
💬 0
📌 0
Have you ever wondered how speech translation gets evaluated? Sadly, most speech evaluation falls back on text-based metrics. Let's do better!
At IWSLT 2026, we’re launching the first-ever ✨Speech Translation Metrics Shared Task ✨!
14.01.2026 18:04
👍 8
🔁 1
💬 1
📌 1
Dissatisfied with EACL paper decisions? Fret not and submit your paper with ARR reviews to the Multilingual Multicultural Evaluation workshop at EACL (either archival or non-archival) until January 5th. 🔍🙂
multilingual-multicultural-evaluation.github.io
03.01.2026 12:53
👍 3
🔁 0
💬 0
📌 1
Why is the self-attention masked only diagonally?
23.12.2025 12:58
👍 3
🔁 0
💬 1
📌 0
From Insights to Impact
Ph.D. Thesis, Center for Language and Cognition (CLCG), University of Groningen
Now onwards to making language models transparent and trustworthy for everyone! 🚀
For those curious to know more about my thesis:
- Web-optimized version: gsarti.com/phd-thesis/
- PDF: research.rug.nl/en/publicati...
- Steal my Quarto template: github.com/gsarti/phd-t...
16.12.2025 12:21
👍 10
🔁 2
💬 0
📌 0
Multilingual Multicultural Evaluation Workshop
LLMs in every language? Prove it. Showcase your work on rigorous, efficient, scalable, culture-aware multilingual benchmarking.
Do you have work on resources, metrics & methodologies for evaluating multilingual systems?
Share it at the MME workshop 🕵️ co-located at EACL.
Direct submission deadline in 10 days (December 19th)!
multilingual-multicultural-evaluation.github.io
10.12.2025 09:42
👍 7
🔁 1
💬 0
📌 0
With great power comes warning underfull vbox badness 10000.
09.12.2025 11:11
👍 1
🔁 0
💬 0
📌 0
I'm abandoning LaTeX and my next ACL paper will be in Typst (I fantasize).
09.12.2025 11:11
👍 2
🔁 0
💬 1
📌 0
Guilty of this. 🤓
ctan.math.illinois.edu/macros/latex...
09.12.2025 00:40
👍 2
🔁 0
💬 1
📌 0
NLP evaluation is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting hair done without knowing Chinese.
Yes, you got 67 BLEU points, but is the resulting hair slaying? 💇
See the result on one datapoint (my head) at EMNLP.
03.11.2025 05:49
👍 8
🔁 1
💬 0
📌 0
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴♀️🤡
28.10.2025 17:13
👍 4
🔁 0
💬 0
📌 0
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? arxiv.org/abs/2501.18251
- Estimating Machine Translation Difficulty arxiv.org/abs/2508.10175
- COMET-poly: Machine Translation Metric Grounded in Other Candidates arxiv.org/abs/2508.18549
28.10.2025 09:45
👍 2
🔁 0
💬 1
📌 0
Let's talk about eval (automatic or human) and multilinguality at #EMNLP in Suzhou! 🇨🇳
- Efficient evaluation (Nov 5, 16:30, poster session 3)
- MT difficulty (Nov 7, 12:30, findings 3)
- COMET-poly (Nov 8, 11:00, WMT)
(DM to meet 🌿 )
28.10.2025 09:45
👍 18
🔁 2
💬 4
📌 0
...real interesting research problems I was passionate about and planning my research future.
You should apply to these fellowships, even if it's for the exercise of periodically refining your research statement.
24.10.2025 12:32
👍 1
🔁 0
💬 1
📌 0
Google PhD Fellowships 2025
Yutong Chen, Benedict Schlüter and Vilém Zouhar, all three of them doctoral students at the Department of Computer Science, have been awarded the Google PhD Fellowship. The programme was created to re...
Grateful to receive the Google PhD Fellowship in NLP! 🙂
I am not secretive about having applied to 4 similar fellowships during my PhD before and not succeeding. Still, refining my research statement (part of the application) helped me tremendously in finding out the...
inf.ethz.ch/news-and-eve...
24.10.2025 12:32
👍 14
🔁 0
💬 1
📌 0
Congratulations, doctor! 🤓
22.10.2025 16:14
👍 1
🔁 0
💬 0
📌 0