Lastly, get in touch if you're interested in an upcoming project, the Last Translation Benchmark. 😊
This work already has a follow-up (arxiv.org/pdf/2509.26619), where we apply bandits to finding the most difficult-to-translate domains on the internet, and an upcoming follow-up with a gradient-based approach to the problem.
The result is more natural, diverse, and, most importantly, more difficult-to-translate texts than what you can scrape from the Internet or zero-shot from LLMs. Read more in "Generating Difficult-to-Translate Texts" from my Google internship, to be presented in Morocco in two weeks.
Human experts approach this by probing the MT system interactively to find its weaknesses. We mimic this in MT breaker, an LLM-based way to find which texts break your MT.
arxiv.org/pdf/2509.26592
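The interactive loop can be sketched roughly like this (all names here are hypothetical; the paper's actual setup differs in the details): an LLM proposes candidate source texts, the MT system under test translates them, a quality metric scores the translations, and the hardest texts are fed back as inspiration for the next round.

```python
def find_breaking_texts(propose, translate, qe_score, rounds=5, keep=3):
    """Adversarial loop: `propose(examples)` asks an LLM for new candidate
    source texts (conditioned on previously hard examples), `translate`
    is the MT system under test, and `qe_score` rates each translation.
    We keep the lowest-scoring (hardest) texts each round."""
    hard = []
    for _ in range(rounds):
        candidates = propose(hard)
        scored = [(qe_score(t, translate(t)), t) for t in candidates]
        scored.sort()                       # lowest QE score = hardest
        hard = [t for _, t in scored[:keep]]
    return hard
```

This is only the skeleton of the idea; the interesting part is how `propose` steers the LLM toward genuinely difficult phenomena rather than degenerate noise.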
Machine translation is tough to evaluate, partly because most of what you throw at it is too easy. That doesn't at all mean translation is solved; we're just not doing a good job of finding interesting inputs.
- Preprint: arxiv.org/abs/2502.14429
- Code: github.com/zouharvi/COM...
- Models: huggingface.co/collections/...
Thanks to @maikezufle.bsky.social Béni @juliuscheng.bsky.social @mrinmaya.bsky.social @jan-niehues.bsky.social and EAMT for sponsoring this research.
See you in Rabat for EACL 2026!
Where do we need to use so much QE? In search. Generating 1000 hypotheses is cheap. Finding the best one isn't. We treat this as a multi-armed bandit where pulling the arm corresponds to getting a more accurate estimate.
UCB works much better than either pruning with LogProb or always running full QE.
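A minimal sketch of the bandit view (function and variable names are my own, not the paper's): each hypothesis is an arm, a pull returns a cheap noisy QE estimate, and UCB decides which hypothesis deserves a more accurate estimate next.

```python
import math


def ucb_best_hypothesis(hypotheses, noisy_qe, budget, c=1.0):
    """Pick the best of `hypotheses` under a fixed budget of QE calls.

    noisy_qe(h) returns a cheap, noisy quality estimate of hypothesis h;
    each extra "pull" refines our running estimate of its true quality.
    """
    n = [1] * len(hypotheses)                 # pulls per arm so far
    mean = [noisy_qe(h) for h in hypotheses]  # running mean QE estimate
    for t in range(len(hypotheses), budget):
        # UCB rule: exploit high means, explore under-sampled arms
        ucb = [m + c * math.sqrt(math.log(t + 1) / k) for m, k in zip(mean, n)]
        i = max(range(len(hypotheses)), key=lambda j: ucb[j])
        mean[i] = (mean[i] * n[i] + noisy_qe(hypotheses[i])) / (n[i] + 1)
        n[i] += 1
    return hypotheses[max(range(len(hypotheses)), key=lambda j: mean[j])]
```

The point is that the QE budget concentrates on the few promising hypotheses instead of being spent uniformly on all 1000.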
Sometimes the metric already knows the right answer at layer 4. Why compute the rest? We attach the two regressors to every layer, and if the self-confidence* is high, we exit early. Faster evaluation.
*Here ê is the predicted difference from the model's own final prediction.
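The early-exit logic, sketched with illustrative names (per-layer score and confidence heads; the real model's architecture is more involved): run the encoder layer by layer and stop as soon as the predicted error ê of the current score is small enough.

```python
def early_exit_predict(layers, score_heads, conf_heads, x, max_error=0.05):
    """Run `layers` sequentially. Each layer i has a quality regressor
    (score_heads[i]) and a self-confidence regressor (conf_heads[i]) that
    predicts e_hat, the absolute difference between this layer's score
    and the final layer's score. Exit as soon as e_hat is small enough."""
    h = x
    for layer, score, conf in zip(layers, score_heads, conf_heads):
        h = layer(h)
        y_hat = score(h)
        e_hat = conf(h)           # predicted |y_hat - final prediction|
        if e_hat < max_error:     # confident enough: exit early
            return y_hat, True
    return y_hat, False           # had to use the full model
```

For easy inputs the loop exits after a few layers, which is where the speedup comes from.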
We train COMET with an additional regressor that predicts its own expected error, simply: L2(y, ŷ) + β⋅L2(|y − ŷ|, ê). When it's uncertain, it predicts a high absolute error ê. This costs almost nothing and is much faster than MC Dropout.
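The objective above, written out as a plain function (β is a hyperparameter; y is the human score, ŷ the metric's prediction, ê its predicted error):

```python
def instant_confidence_loss(y, y_hat, e_hat, beta=1.0):
    """L2(y, y_hat) + beta * L2(|y - y_hat|, e_hat):
    the metric learns the quality score and, jointly, its own
    expected absolute error e_hat (its "instant confidence")."""
    score_loss = (y - y_hat) ** 2
    conf_loss = (abs(y - y_hat) - e_hat) ** 2
    return score_loss + beta * conf_loss
```

Both terms are computed from a single forward pass, which is why this is so much cheaper than sampling-based uncertainty like MC Dropout.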
Quality estimation metrics (automatic metrics) are amazing. Truly. We would like to use them everywhere. That gets compute-expensive very quickly. We also don't know when they don't know.
In "Early-Exit and Instant Confidence Translation Quality Estimation" (at EACL26) we fix that.
Yes, LabelStudio is a great general purpose tool (while pearmut aims at specific workflows).
However, LabelStudio has limitations with tutorials/attention checks and with assigning annotation tasks to lay people (they have to register to annotate, I believe?).
Thanks! Experience reports should be more common.
Pearmut was created out of frustration with setting up humeval using existing tools, and it ships with good defaults.
In the paper, 5 researchers try to set up humeval on 5 different platforms and report on setup time, ease of use, and customizability.
🥜 Platform for Efficient Annotation of Natural Utterances and Translation? 😁
Thanks to all my friends who helped bring this to life. 🙂
Get in touch if you'd like to help with human evaluation for your paper/work! 🖐️
The CLI gives you magic links: dashboard to monitor progress, and annotation links to distribute to your annotators.
Pearmut is open-source and extensible with many exciting features coming. 🍏
github.com/zouharvi/pea...
Get started with the following commands:
pip install pearmut
# Download example campaign
wget raw.githubusercontent.com/zouharvi/pea...
# Load and start
pearmut add esa.json
pearmut run
The tool supports multiple annotation protocols of translation and multilingual tasks out of the box:
- direct assessment (with custom sliders),
- ESA, MQM,
- contrastive evaluation, video/audio/image, attention checks, tutorials, statistically sound model comparison, etc.
How often is human evaluation skipped in papers/workflows just because it's too difficult to set up? Yet even small humeval can give so much more signal than automatic metrics.
Introducing Pearmut: Human Evaluation of Translation Made Trivial 🍐 arxiv.org/pdf/2601.02933
Join us and build a model that predicts human annotations of quality based on source speech and its textual translation.
iwslt.org/2026/metrics
Effort lead by @maikezufle.bsky.social, @marinecarpuat.bsky.social, @hjhan.bsky.social, @matteo-negri.bsky.social, and others. 🙂
Have you ever wondered how speech translation gets evaluated? Sadly, most speech evaluation downgrades to text-based metrics. Let’s do better!
At IWSLT 2026, we're launching the first-ever ✨ Speech Translation Metrics Shared Task ✨!
Dissatisfied with EACL paper decisions? Fret not: submit your paper with ARR reviews to the Multilingual Multicultural Evaluation workshop at EACL (either archival or non-archival) until January 5th. 🔍🙂
multilingual-multicultural-evaluation.github.io
Why is the self-attention masked only diagonally?
Now onwards to making language models transparent and trustworthy for everyone! 🚀
For those curious to know more about my thesis:
- Web-optimized version: gsarti.com/phd-thesis/
- PDF: research.rug.nl/en/publicati...
- Steal my Quarto template: github.com/gsarti/phd-t...
Do you have work on resources, metrics & methodologies for evaluating multilingual systems?
Share it at the MME workshop 🕵️ co-located at EACL.
Direct submission deadline in 10 days (December 19th)!
multilingual-multicultural-evaluation.github.io
- The Word template has been dropped recently. aclrollingreview.org/discontinuat...
- Working on Typst template. DM'd.
github.com/acl-org/acl-...
With great power comes the warning: Underfull \vbox (badness 10000).
I'm abandoning LaTeX and my next ACL paper will be in Typst (I fantasize).
Guilty of this. 🤓
ctan.math.illinois.edu/macros/latex...
NLP evaluation is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting hair done without knowing Chinese.
Yes, you got 67 BLEU points, but is the resulting hair slaying? 💇
See the result on one datapoint (my head) at EMNLP.
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴♀️🤡