I start my last-ever course today, MSc course on Natural Language Generation. My last lecture (on NLG evaluation) will be on 20 April. Hard to believe...
New blog: Questions from readers of my book
A group who is reading my book sent me many questions, some of which we discussed in a call last week. I thought I would share the questions and my responses.
ehudreiter.com/2026/03/03/q...
Great to see that my student Jawwad Baig has submitted his PhD! One of my main goals for 2025-26 is to help 6 PhD students submit before I retire. Halfway through the academic year, and three of the six have now submitted, so on track.
If you're not on the SIGGEN mailing list or in the NLG Discord server, you might not have seen that Barkavi Sundarajan has been leading a reading group about @ehudreiter.bsky.social's new book "Natural Language Generation".
Join us Friday, 27 Feb, at 2pm UK time: discord.gg/hysgkK7Q?eve...
My PhD student Adarsa Sivaprasad is looking for people who have lived experience of IVF to help evaluate an AI chatbot which explains IVF outcome predictions.
What is involved: 45 min online MS Teams call.
Read details and sign up at: tinyurl.com/cc2aepf5
I'd love to see old friends and meet new colleagues at my retirement symposium!
New blog: Don't ignore omissions!
Evaluation of LLMs focuses on accuracy and hallucination. Completeness and omission are also important: does the text include all the key information? Omissions are a huge problem in medical NLG, and in other NLG tasks as well.
ehudreiter.com/2026/02/11/d...
Fascinating paper on problems using LLMs to respond to health queries. LLMs do well on standard medical benchmarks but struggle (for example) to understand symptoms presented in a confused way.
www.nature.com/articles/s41...
Really interesting paper on AI coding assistants, which may be getting worse because of lower-quality training data
spectrum.ieee.org/ai-coding-de...
This is the first conference which my daughter Naomi (who is doing a PhD in medieval history) has helped to organise. Good luck to her and fellow organisers!
Nice talk by my colleague Jakub Zbrzeżny (Aberdeen Divinity Dept) on using LLMs to translate biblical texts into and out of the local Arabic dialect (in Hebron). Basically LLMs understand dialect, but cannot produce it. Will this encourage more young people to abandon their dialect?
Very excited to have a retirement symposium on NLG evaluation! Looking forward to seeing old friends and meeting new people!
Friend asked what I am focusing on in last half-year before retirement. Largely getting 5 PhD students to submit PhDs and have vivas. Fortunately all seem to be on track!
Wrote blog recently that authors should do sanity checks on papers. Readers should as well! Recently read interesting paper, but sanity check showed claims in paper did not match data. Either paper hallucinated or authors sloppy.
ehudreiter.com/2025/12/22/d...
Interesting paper about different techniques to evaluate performance of medical decision support. Concludes that F1 is the worst technique; a shame that it is so heavily used in NLP and AI.
doi.org/10.1016/j.la...
I liked recent Economist article "How to avoid common AI pitfalls in the workplace" (paywall). Little mention of benchmarks or AGI (but does mention contrast between rising benchmark scores and limited real-world impact). Instead focuses on pragmatic issues such as workflow.
The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!
Call for papers is out. Topics include:
🐟 LMs as evaluators
🐠 Living benchmarks
🍣 Eval with humans
and more
New for 2026: Opinion & Statement Papers!
Full CFP: gem-workshop.com/call-for-pap...
New blog: My Eureka moments in research
The most exciting moments of my career were discovering something new and exciting about NLG, etc. I describe a few of my “Eureka” moments. They are what I remember best, much more than acceptance of papers.
ehudreiter.com/2026/01/30/m...
Realised that a lot of the URLs for older papers do not work in my publications page (ehudreiter.com/publications/). For journal papers, I am replacing dead links with DOIs, and will use only DOIs in future!
New blog: Let's use AI to help people manage illness
I am excited by the idea of using AI to help people manage illness and health conditions. This isn't very sexy, but I think there is real potential to improve health outcomes and quality of life.
ehudreiter.com/2026/01/19/l...
Other CS academics I know have done very different things in retirement: remained active in academia as emeritus, joined a startup, done charitable work, moved to a remote spot in the Scottish Highlands, written novels, etc. We did similar things as academics (research and teaching), but very different things in retirement!
AI hallucination is in the UK political news. Israeli fans were banned from a football match, and this ban was based on a report which included hallucinated material made up by MS Copilot
www.theguardian.com/uk-news/2026...
Health experts: Your synthetic text "AI" overviews are misleading, for example see this about liver function tests.
Google: Okay, we'll block "AI" overviews on that query.
The product is fundamentally flawed and cannot be "fixed" by patching query by query.
A short 🧵>>
Nice chat with some of my soon-to-submit PhD students. They all know how to conduct and write up research, have lots of ideas for future work, and have developed networks of collaborators. So they are ready to "leave the nest", which is a good feeling for me as supervisor
New blog (personal): Retirement Plans: Travel and some academics
I hope to retire soon, and many people are asking about my plans. Basically I want to do lots of travel, stay involved in academia, and perhaps do some writing.
ehudreiter.com/2026/01/06/r...
One nice thing about 2025 was that the two publications I was proudest of were single-author! Also many good papers with my students, but I get a special buzz from single-author papers
New blog: Do a sanity check on your experiments
Researchers should do a “sanity” check on experiments. That is, manually inspect some (A) test/train data, (B) model/system output, and (C) evaluation output, looking for anything that seems strange.
ehudreiter.com/2025/12/22/d...
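A minimal sketch of what such a check could look like in practice, assuming the experiment is stored as a list of records that tie together (A) a test input, (B) the system output, and (C) the evaluation score. The record layout and the helper name `sample_for_inspection` are hypothetical and not taken from the blog; the point is only that a small random sample is pulled out to be read by a human.

```python
import random


def sample_for_inspection(records, k=5, seed=0):
    """Return up to k randomly chosen records for manual reading.

    A fixed seed makes the sample reproducible, so a colleague can
    look at the same items.
    """
    rng = random.Random(seed)
    records = list(records)
    return rng.sample(records, min(k, len(records)))


# Hypothetical toy experiment: each record holds (A) a test input,
# (B) the system output, and (C) the evaluation score for that output.
experiment = [
    {"input": f"input {i}", "output": f"output {i}", "score": i / 10}
    for i in range(10)
]

# Print a few records side by side and read them, looking for
# anything that seems strange.
for rec in sample_for_inspection(experiment, k=3):
    print(rec["input"], "|", rec["output"], "|", rec["score"])
```

Random sampling is used here rather than taking the first few items, since the start of a file is not always representative of the rest.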
One of my main goals for 2025-26 is to get 6 PhD students to submit before I retire in summer 2026. So very happy that Nikolay Babakov has submitted and passed his viva, and Iniakpokeikiye Thompson has submitted. Getting there...
Colleague has discovered many bugs (e.g. incorrect annotations) in a respected 8-year-old dataset he is using. Nobody warned him, and hard for him to warn others. Maybe most people just don't care if dataset is deeply flawed, as long as they can compute numbers and beat SOTA...
Making a good LLM benchmark is hard. Avoid data contamination, reward hacking, saturation; ensure construct validity; rigorously test and validate, etc.
Unfortunately, community places little value on above. Want to beat SOTA or competitors, don't care if the benchmark used means anything...