
Vilém Zouhar @ EACL

@zouharvi

PhD @ ETH Zürich | working on (multilingual) evaluation of NLP | on the academic job market | go #vegan | https://vilda.net

3,584
Followers
1,443
Following
288
Posts
25.07.2023
Joined

Latest posts by Vilém Zouhar @ EACL @zouharvi

- Preprint: arxiv.org/abs/2502.14429
- Code: github.com/zouharvi/COM...
- Models: huggingface.co/collections/...

Thanks to @maikezufle.bsky.social Béni @juliuscheng.bsky.social @mrinmaya.bsky.social @jan-niehues.bsky.social and EAMT for sponsoring this research.

See you in Rabat for EACL 2026!

09.02.2026 08:06 👍 1 🔁 0 💬 0 📌 0

Where do we need so much QE? In search. Generating 1000 hypotheses is cheap; finding the best one isn't. We treat this as a multi-armed bandit where pulling an arm corresponds to getting a more accurate quality estimate for that hypothesis.

UCB works much better than pruning with LogProb or always running full QE.
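The bandit loop can be sketched roughly as follows. This is a toy simulation, not the paper's implementation: the candidate qualities and the noisy estimates are made up (standing in for cheap partial QE passes), and `ucb_best_candidate` is a hypothetical name.

```python
import math
import random

def ucb_best_candidate(true_scores, budget, c=0.3, seed=0):
    """Find the best candidate via UCB: each 'pull' draws a noisy quality
    estimate; candidates with few pulls get an exploration bonus."""
    rng = random.Random(seed)
    n = len(true_scores)
    counts = [0] * n
    means = [0.0] * n

    def pull(i):
        # Noisy estimate of candidate i's quality
        # (stand-in for one cheap, partial QE forward pass).
        return true_scores[i] + rng.gauss(0, 0.05)

    # Initialise: pull every arm once.
    for i in range(n):
        means[i] = pull(i)
        counts[i] = 1

    for t in range(n, budget):
        # Upper confidence bound: running mean + exploration bonus.
        ucb = [means[i] + c * math.sqrt(math.log(t + 1) / counts[i])
               for i in range(n)]
        i = max(range(n), key=lambda j: ucb[j])
        x = pull(i)
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]  # incremental mean update

    return max(range(n), key=lambda j: means[j])

best = ucb_best_candidate([0.2, 0.5, 0.8, 0.55], budget=100)
```

The constant `c` trades exploration against exploitation: larger values spend more of the budget refining uncertain candidates.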

09.02.2026 08:06 👍 1 🔁 0 💬 1 📌 0

Sometimes the metric already knows the right answer at layer 4, so why compute the rest? We attach the two regressors (quality and error) to every layer, and if the self-confidence* is high, we exit early. Faster evaluation.

*In this case ê predicts the difference from the model's own final prediction.
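As a rough sketch of the control flow (hypothetical names: `score_head` and `error_head` stand in for the two per-layer regressors, and the real model operates on hidden states, not scalars):

```python
def early_exit_score(layer_states, score_head, error_head, threshold=0.05):
    """Evaluate layer by layer; stop once the self-estimated error is small.

    layer_states: per-layer representations (assumed non-empty)
    score_head:   predicts quality from a layer's representation
    error_head:   predicts |deviation from the final-layer prediction|
    """
    for depth, h in enumerate(layer_states):
        score = score_head(h)     # quality estimate at this layer
        conf_err = error_head(h)  # instant confidence: predicted error
        if conf_err < threshold:
            return score, depth   # confident enough -> exit early
    return score, depth           # no early exit: full depth used

# Toy heads: the "hidden states" shrink with depth, so predicted error falls.
score, depth = early_exit_score(
    [3, 2, 1, 0],
    score_head=lambda h: h * 0.1,
    error_head=lambda h: h * 0.02,
)
```

In the toy run the predicted error drops below the threshold at the second layer, so the remaining layers are never computed.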

09.02.2026 08:06 👍 0 🔁 0 💬 1 📌 0

We train COMET with an additional regressor that predicts its own expected error, simply: L2(y, ŷ) + β · L2(|y − ŷ|, ê). When the model is uncertain, it predicts a high absolute error ê. This costs almost nothing and is much faster than MC Dropout.
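In code, the combined objective might look like this minimal sketch (plain Python on scalars; a real implementation would use tensors and detach the error target from the gradient):

```python
def instant_confidence_loss(y, y_hat, e_hat, beta=1.0):
    """L2(y, ŷ) + β · L2(|y − ŷ|, ê) on scalars.

    The second term trains ê to predict the model's own absolute error,
    so a high ê at inference time flags an uncertain prediction.
    """
    quality_loss = (y - y_hat) ** 2
    # Target for the error head is the absolute quality error
    # (detached from the gradient in a real framework).
    error_loss = (abs(y - y_hat) - e_hat) ** 2
    return quality_loss + beta * error_loss
```

For example, with y = 1.0 and ŷ = 0.5, an error head that predicts ê = 0.5 incurs no extra loss, while ê = 0.0 is penalised for overconfidence.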

09.02.2026 08:06 👍 0 🔁 0 💬 1 📌 0

Quality estimation (automatic metrics) is amazing. Truly. We would like to use it everywhere. That gets compute-expensive very quickly. We also don't know when the metrics don't know.

In "Early-Exit and Instant Confidence Translation Quality Estimation" (at EACL 2026) we fix that.

09.02.2026 08:06 👍 15 🔁 3 💬 1 📌 1

Yes, LabelStudio is a great general-purpose tool (while Pearmut aims at specific workflows).

However, LabelStudio has limitations with tutorials/attention checks, and with assigning annotation tasks to lay people (they have to register to annotate, I believe?).

28.01.2026 14:55 👍 0 🔁 0 💬 0 📌 1

Thanks! Experience reports should be more common.

Pearmut was created out of the frustration of setting up humeval with good defaults using existing tools.

In the paper we have 5 researchers trying to set up humeval using 5 different platforms and reporting on time, ease of use, and customizability.

28.01.2026 14:43 👍 2 🔁 0 💬 1 📌 0

🥜 Platform for Efficient Annotation of Natural Utterances and Translation? 😁

28.01.2026 14:23 👍 1 🔁 0 💬 0 📌 0

Thanks to all my friends who helped bring this to life. 🙂

Get in touch if you'd like to help with human evaluation for your paper/work! 🖐️

28.01.2026 13:39 👍 0 🔁 0 💬 2 📌 0
GitHub - zouharvi/pearmut: Platform for Evaluating and Reviewing of Multilingual Tasks

The CLI gives you magic links: a dashboard to monitor progress, and annotation links to distribute to your annotators.

Pearmut is open-source and extensible with many exciting features coming. 🍏

github.com/zouharvi/pea...

28.01.2026 13:39 👍 1 🔁 0 💬 1 📌 0

Get started with the following commands:

pip install pearmut
# Download example campaign
wget raw.githubusercontent.com/zouharvi/pea...
# Load and start
pearmut add esa.json
pearmut run

28.01.2026 13:39 👍 1 🔁 0 💬 1 📌 0

The tool supports multiple annotation protocols for translation and multilingual tasks out of the box:
- direct assessment (with custom sliders),
- ESA, MQM,
- contrastive evaluation, video/audio/image, attention checks, tutorials, statistically sound model comparison, etc.

28.01.2026 13:39 👍 0 🔁 0 💬 1 📌 0

How often is human evaluation skipped in papers/workflows just because it's too difficult to set up? Yet even a small humeval can give so much more signal than automatic metrics.

Introducing Pearmut: Human Evaluation of Translation Made Trivial 🍐 arxiv.org/pdf/2601.02933

28.01.2026 13:39 👍 18 🔁 0 💬 1 📌 0
Speech Translation Metrics track
Home of the IWSLT conference and SIGSLT.

Join us and build a model that predicts human annotations of quality based on source speech and its textual translation.

iwslt.org/2026/metrics

Effort led by @maikezufle.bsky.social, @marinecarpuat.bsky.social, @hjhan.bsky.social, @matteo-negri.bsky.social, and others. 🙂

14.01.2026 18:04 👍 2 🔁 0 💬 0 📌 0

Have you ever wondered how speech translation gets evaluated? Sadly, most speech evaluation falls back on text-based metrics. Let's do better!

At IWSLT 2026, we’re launching the first-ever ✨Speech Translation Metrics Shared Task ✨!

14.01.2026 18:04 👍 8 🔁 1 💬 1 📌 1

Dissatisfied with EACL paper decisions? Fret not and submit your paper with ARR reviews to the Multilingual Multicultural Evaluation workshop at EACL (either archival or non-archival) until January 5th. 🔍🙂

multilingual-multicultural-evaluation.github.io

03.01.2026 12:53 👍 3 🔁 0 💬 0 📌 1

Why is the self-attention masked only diagonally?

23.12.2025 12:58 👍 3 🔁 0 💬 1 📌 0
From Insights to Impact
Ph.D. Thesis, Center for Language and Cognition (CLCG), University of Groningen

Now onwards to making language models transparent and trustworthy for everyone! 🚀

For those curious to know more about my thesis:
- Web-optimized version: gsarti.com/phd-thesis/
- PDF: research.rug.nl/en/publicati...
- Steal my Quarto template: github.com/gsarti/phd-t...

16.12.2025 12:21 👍 10 🔁 2 💬 0 📌 0
Multilingual Multicultural Evaluation Workshop
LLMs in every language? Prove it. Showcase your work on rigorous, efficient, scalable, culture-aware multilingual benchmarking.

Do you have work on resources, metrics & methodologies for evaluating multilingual systems?

Share it at the MME workshop 🕵️ co-located at EACL.

Direct submission deadline in 10 days (December 19th)!
multilingual-multicultural-evaluation.github.io

10.12.2025 09:42 👍 7 🔁 1 💬 0 📌 0
Typst template · Issue #58 · acl-org/acl-style-files
This is to open discussion for creating a Typst ACL template and supporting it in the ACL Anthology. Typst is a typesetting system which fills lots of the niche of LaTeX but is modern and...

- Word has been dropped recently. aclrollingreview.org/discontinuat...
- Working on Typst template. DM'd.
github.com/acl-org/acl-...

09.12.2025 11:30 👍 0 🔁 0 💬 0 📌 0

With great power comes warning underfull vbox badness 10000.

09.12.2025 11:11 👍 1 🔁 0 💬 0 📌 0

I'm abandoning LaTeX and my next ACL paper will be in Typst (I fantasize).

09.12.2025 11:11 👍 2 🔁 0 💬 1 📌 0

Guilty of this. 🤓
ctan.math.illinois.edu/macros/latex...

09.12.2025 00:40 👍 2 🔁 0 💬 1 📌 0

NLP evaluation is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting hair done without knowing Chinese.

Yes you got 67 BLEU points but is the resulting hair slaying? 💇

See the result on one datapoint (my head) at EMNLP.

03.11.2025 05:49 👍 8 🔁 1 💬 0 📌 0

The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴‍♀️🤡

28.10.2025 17:13 👍 4 🔁 0 💬 0 📌 0

- How to Select Datapoints for Efficient Human Evaluation of NLG Models? arxiv.org/abs/2501.18251

- Estimating Machine Translation Difficulty arxiv.org/abs/2508.10175

- COMET-poly: Machine Translation Metric Grounded in Other Candidates arxiv.org/abs/2508.18549

28.10.2025 09:45 👍 2 🔁 0 💬 1 📌 0

Let's talk about eval (automatic or human) and multilinguality at #EMNLP in Suzhou! 🇨🇳

- Efficient evaluation (Nov 5, 16:30, poster session 3)
- MT difficulty (Nov 7, 12:30, findings 3)
- COMET-poly (Nov 8, 11:00, WMT)

(DM to meet 🌿 )

28.10.2025 09:45 👍 18 🔁 2 💬 4 📌 0

...really interesting research problems I was passionate about, and in planning my research future.

You should apply to these fellowships, even if it's for the exercise of periodically refining your research statement.

24.10.2025 12:32 👍 1 🔁 0 💬 1 📌 0
Google PhD Fellowships 2025
Yutong Chen, Benedict Schlüter and Vilém Zouhar, all three of them doctoral students at the Department of Computer Science, have been awarded the Google PhD Fellowship. The programme was created to re...

Grateful to receive the Google PhD Fellowship in NLP! 🙂

I am not secretive about having applied to 4 similar fellowships during my PhD before, without success. Still, refining my research statement (part of the application) helped me tremendously in finding out the...

inf.ethz.ch/news-and-eve...

24.10.2025 12:32 👍 14 🔁 0 💬 1 📌 0

Congratulations, doctor! 🤓

22.10.2025 16:14 👍 1 🔁 0 💬 0 📌 0