Lastly, get in touch if you're interested in an upcoming project, the Last Translation Benchmark. 😊
This work already has a follow-up (arxiv.org/pdf/2509.26619), where we apply bandits to finding the most difficult-to-translate domains on the internet, and an upcoming follow-up with a gradient-based approach to the problem.
The result is more natural, diverse, and, most importantly, more difficult-to-translate texts than what you can scrape from the Internet or zero-shot from LLMs. Read more in "Generating Difficult-to-Translate Texts" from my Google internship, to be presented in Morocco in two weeks.
Human experts approach this by probing the MT system interactively to find its weaknesses. We mimic this in MT breaker, an LLM-based way to find which texts break your MT.
arxiv.org/pdf/2509.26592
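The interactive loop can be sketched roughly like this (all names here are hypothetical; the paper's actual setup differs in the details): an LLM proposes candidate source texts, the MT system under test translates them, a quality metric scores the translations, and the hardest texts are fed back as inspiration for the next round.

```python
def find_breaking_texts(propose, translate, qe_score, rounds=5, keep=3):
    """Adversarial loop: `propose(examples)` asks an LLM for new candidate
    source texts (conditioned on previously hard examples), `translate`
    is the MT system under test, and `qe_score` rates each translation.
    We keep the lowest-scoring (hardest) texts each round."""
    hard = []
    for _ in range(rounds):
        candidates = propose(hard)
        scored = [(qe_score(t, translate(t)), t) for t in candidates]
        scored.sort()                       # lowest QE score = hardest
        hard = [t for _, t in scored[:keep]]
    return hard
```

This is only the skeleton of the idea; the interesting part is how `propose` steers the LLM toward genuinely difficult phenomena rather than degenerate noise.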
Machine translation is tough to evaluate, partly because most of what you throw at it is too easy. That doesn't at all mean translation is solved; we're just not doing a good job of finding interesting inputs.
- Preprint: arxiv.org/abs/2502.14429
- Code: github.com/zouharvi/COM...
- Models: huggingface.co/collections/...
Thanks to @maikezufle.bsky.social Béni @juliuscheng.bsky.social @mrinmaya.bsky.social @jan-niehues.bsky.social and EAMT for sponsoring this research.
See you in Rabat for EACL 2026!
Where do we need to use so much QE? In search. Generating 1000 hypotheses is cheap. Finding the best one isn't. We treat this as a multi-armed bandit where pulling the arm corresponds to getting a more accurate estimate.
UCB works much better than either pruning with LogProb or always running full QE.
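A minimal sketch of the bandit view (function and variable names are my own, not the paper's): each hypothesis is an arm, a pull returns a cheap noisy QE estimate, and UCB decides which hypothesis deserves a more accurate estimate next.

```python
import math


def ucb_best_hypothesis(hypotheses, noisy_qe, budget, c=1.0):
    """Pick the best of `hypotheses` under a fixed budget of QE calls.

    noisy_qe(h) returns a cheap, noisy quality estimate of hypothesis h;
    each extra "pull" refines our running estimate of its true quality.
    """
    n = [1] * len(hypotheses)                 # pulls per arm so far
    mean = [noisy_qe(h) for h in hypotheses]  # running mean QE estimate
    for t in range(len(hypotheses), budget):
        # UCB rule: exploit high means, explore under-sampled arms
        ucb = [m + c * math.sqrt(math.log(t + 1) / k) for m, k in zip(mean, n)]
        i = max(range(len(hypotheses)), key=lambda j: ucb[j])
        mean[i] = (mean[i] * n[i] + noisy_qe(hypotheses[i])) / (n[i] + 1)
        n[i] += 1
    return hypotheses[max(range(len(hypotheses)), key=lambda j: mean[j])]
```

The point is that the QE budget concentrates on the few promising hypotheses instead of being spent uniformly on all 1000.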
Sometimes the metric already knows the right answer at layer 4. Why compute the rest? We attach the two regressors to every layer, and if the self-confidence* is high, we exit early. Faster evaluation.
*Here ê is the predicted difference from the model's own final prediction.
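The early-exit logic, sketched with illustrative names (per-layer score and confidence heads; the real model's architecture is more involved): run the encoder layer by layer and stop as soon as the predicted error ê of the current score is small enough.

```python
def early_exit_predict(layers, score_heads, conf_heads, x, max_error=0.05):
    """Run `layers` sequentially. Each layer i has a quality regressor
    (score_heads[i]) and a self-confidence regressor (conf_heads[i]) that
    predicts e_hat, the absolute difference between this layer's score
    and the final layer's score. Exit as soon as e_hat is small enough."""
    h = x
    for layer, score, conf in zip(layers, score_heads, conf_heads):
        h = layer(h)
        y_hat = score(h)
        e_hat = conf(h)           # predicted |y_hat - final prediction|
        if e_hat < max_error:     # confident enough: exit early
            return y_hat, True
    return y_hat, False           # had to use the full model
```

For easy inputs the loop exits after a few layers, which is where the speedup comes from.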
We train COMET with an additional regressor that predicts its own expected error, simply: L2(y, ŷ) + β⋅L2(|y − ŷ|, ê). When it's uncertain, it predicts a high absolute error ê. This costs almost nothing and is much faster than MC Dropout.
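The objective above, written out as a plain function (β is a hyperparameter; y is the human score, ŷ the metric's prediction, ê its predicted error):

```python
def instant_confidence_loss(y, y_hat, e_hat, beta=1.0):
    """L2(y, y_hat) + beta * L2(|y - y_hat|, e_hat):
    the metric learns the quality score and, jointly, its own
    expected absolute error e_hat (its "instant confidence")."""
    score_loss = (y - y_hat) ** 2
    conf_loss = (abs(y - y_hat) - e_hat) ** 2
    return score_loss + beta * conf_loss
```

Both terms are computed from a single forward pass, which is why this is so much cheaper than sampling-based uncertainty like MC Dropout.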
Quality estimation metrics (automatic metrics) are amazing. Truly. We would like to use them everywhere. That gets compute-expensive very quickly. We also don't know when they don't know.
In "Early-Exit and Instant Confidence Translation Quality Estimation" (at EACL26) we fix that.
Yes, LabelStudio is a great general purpose tool (while pearmut aims at specific workflows).
However, LabelStudio has limitations with tutorials/attention checks and with assigning annotation tasks to lay people (they have to register to annotate, I believe?).
Thanks! Experience reports should be more common.
Pearmut was created out of frustration with setting up humeval using existing tools, and it ships with good defaults.
In the paper, 5 researchers try to set up humeval on 5 different platforms and report on setup time, ease of use, and customizability.
🥜 Platform for Efficient Annotation of Natural Utterances and Translation? 😁
Thanks to all my friends who helped bring this to life. 🙂
Get in touch if you'd like to help with human evaluation for your paper/work! 🖐️
The CLI gives you magic links: dashboard to monitor progress, and annotation links to distribute to your annotators.
Pearmut is open-source and extensible with many exciting features coming. 🍏
github.com/zouharvi/pea...
Get started with the following commands:
pip install pearmut
# Download example campaign
wget raw.githubusercontent.com/zouharvi/pea...
# Load and start
pearmut add esa.json
pearmut run
The tool supports multiple annotation protocols of translation and multilingual tasks out of the box:
- direct assessment (with custom sliders),
- ESA, MQM,
- contrastive evaluation, video/audio/image, attention checks, tutorials, statistically sound model comparison, etc.
How often is human evaluation skipped in papers/workflows just because it's too difficult to set up? Yet even small humeval can give so much more signal than automatic metrics.
Introducing Pearmut: Human Evaluation of Translation Made Trivial 🍐 arxiv.org/pdf/2601.02933
Join us and build a model that predicts human annotations of quality based on source speech and its textual translation.
iwslt.org/2026/metrics
Effort lead by @maikezufle.bsky.social, @marinecarpuat.bsky.social, @hjhan.bsky.social, @matteo-negri.bsky.social, and others. 🙂
Have you ever wondered how speech translation gets evaluated? Sadly, most speech evaluation downgrades to text-based metrics. Let’s do better!
At IWSLT 2026, we're launching the first-ever ✨ Speech Translation Metrics Shared Task ✨!
Dissatisfied with EACL paper decisions? Fret not: submit your paper with ARR reviews to the Multilingual Multicultural Evaluation workshop at EACL (either archival or non-archival) until January 5th. 🔍🙂
multilingual-multicultural-evaluation.github.io
Why is the self-attention masked only diagonally?
Now onwards to making language models transparent and trustworthy for everyone! 🚀
For those curious to know more about my thesis:
- Web-optimized version: gsarti.com/phd-thesis/
- PDF: research.rug.nl/en/publicati...
- Steal my Quarto template: github.com/gsarti/phd-t...
Do you have work on resources, metrics & methodologies for evaluating multilingual systems?
Share it at the MME workshop 🕵️ co-located at EACL.
Direct submission deadline in 10 days (December 19th)!
multilingual-multicultural-evaluation.github.io
- The Word template has been dropped recently. aclrollingreview.org/discontinuat...
- Working on Typst template. DM'd.
github.com/acl-org/acl-...
With great power comes the warning: Underfull \vbox (badness 10000).
I'm abandoning LaTeX and my next ACL paper will be in Typst (I fantasize).
Guilty of this. 🤓
ctan.math.illinois.edu/macros/latex...
NLP evaluation is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting hair done without knowing Chinese.
Yes, you got 67 BLEU points, but is the resulting hair slaying? 💇
See the result on one datapoint (my head) at EMNLP.
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴♀️🤡