The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!
Call for papers is out. Topics include:
🐟 LMs as evaluators
🐠 Living benchmarks
🍣 Eval with humans
and more
New for 2026: Opinion & Statement Papers!
Full CFP: gem-workshop.com/call-for-pap...
27.01.2026 19:17
OpeNLGauge comes in two variants: a prompt-based ensemble and a smaller fine-tuned model, both built exclusively on open-weight LLMs (with openly released training data!).
Thanks @tuetschek.bsky.social and @mlango.bsky.social!
23.08.2025 16:39
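To make the prompt-based ensemble idea concrete, here is a minimal Python sketch: several open-weight instruct models are asked for a 1-5 rating on a criterion and their scores are averaged. The endpoint, model names, and prompt wording are illustrative assumptions (e.g., models served behind an OpenAI-compatible API such as vLLM), not OpeNLGauge's actual setup.

```python
# Minimal sketch of a prompt-based ensemble judge. Endpoint, model names,
# and prompt are assumptions for illustration, not the paper's setup.
import re
import statistics
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g. vLLM) hosting the models locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Hypothetical ensemble members; any open-weight instruct models would do.
MODELS = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

PROMPT = (
    "On a scale of 1-5, rate the {criterion} of the following output.\n"
    "Output: {output}\n"
    "Answer with a single digit."
)

def ensemble_score(output: str, criterion: str) -> float:
    """Average the 1-5 ratings returned by each ensemble member."""
    scores = []
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": PROMPT.format(criterion=criterion, output=output)}],
            temperature=0.0,  # deterministic judging
        )
        # Take the first digit 1-5 in the reply; skip replies without one.
        match = re.search(r"[1-5]", resp.choices[0].message.content)
        if match:
            scores.append(int(match.group()))
    return statistics.mean(scores)

print(ensemble_score("The cat sat on teh mat.", "fluency"))
```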
We introduce an explainable metric for evaluating a wide range of natural language generation tasks, without any need for reference texts. Given an evaluation criterion, the metric provides fine-grained assessments of the output by highlighting and explaining problematic spans in the text.
23.08.2025 16:37
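A minimal sketch of what such reference-free, span-level evaluation can look like: ask an open-weight LLM to mark problematic spans for a given criterion and explain each one. The prompt, JSON schema, and model name below are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of criterion-conditioned span annotation with an open-weight LLM.
# Prompt, schema, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

PROMPT = """You are evaluating the {criterion} of a generated text.
List every problematic span as a JSON array:
[{{"span": "<exact substring>", "explanation": "<why it is problematic>"}}]
Return [] if there are no problems.

Text: {text}"""

def annotate_spans(text: str, criterion: str,
                   model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(criterion=criterion, text=text)}],
        temperature=0.0,
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return []  # the model did not follow the schema

for err in annotate_spans("Paris is the capital of Germany.", "factual accuracy"):
    print(f'- "{err["span"]}": {err["explanation"]}')
```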
Our paper "OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs" has been accepted to #INLG2025 conference!
You can read the preprint here: arxiv.org/abs/2503.11858
23.08.2025 16:36
#ACL2025NLP in Vienna 🇦🇹 starts today, with 23 🤯 @ufal-cuni.bsky.social folks presenting their work at both the main conference and the workshops. Check out our main conference papers today and on Wednesday 👇
28.07.2025 07:27
Today, @tuetschek.bsky.social presented his team's work on evaluating LLM text generation with both human annotation frameworks and LLM-based metrics. Their approach tackles the benchmark data leakage problem, showing how to obtain unseen data for unbiased LLM testing.
30.04.2025 12:02
Large Language Models as Span Annotators
How do LLMs compare to human crowdworkers in annotating text spans? 🧑🤖
And how can span annotation help us with evaluating texts?
Find out in our new paper: llm-span-annotators.github.io
Arxiv: arxiv.org/abs/2504.08697
15.04.2025 11:10
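One concrete way to compare LLM and crowdworker span annotations is character-level overlap. The self-contained sketch below computes precision, recall, and F1 over the characters each annotator marked; the paper may well use different agreement measures, so treat this purely as an illustration of the comparison problem.

```python
# Sketch: agreement between two span annotations via character-level F1.
# An illustrative measure, not necessarily the one used in the paper.

def to_char_set(spans: list[tuple[int, int]]) -> set[int]:
    """Expand (start, end) spans into the set of character offsets they cover."""
    return {i for start, end in spans for i in range(start, end)}

def span_f1(pred: list[tuple[int, int]], gold: list[tuple[int, int]]) -> float:
    p, g = to_char_set(pred), to_char_set(gold)
    if not p or not g:
        return float(p == g)  # both empty -> perfect agreement
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    return 2 * precision * recall / (precision + recall)

# An LLM marked chars 10-25 as problematic; a crowdworker marked 12-30.
print(span_f1([(10, 25)], [(12, 30)]))  # ~0.79
```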