New blog: Questions from readers of my book
A group who is reading my book sent me many questions, some of which we discussed in a call last week. I thought I would share the questions and my responses.
ehudreiter.com/2026/03/03/q...
If you're not on the SIGGEN mailing list or in the NLG Discord server, you might not have seen that Barkavi Sundarajan has been leading a reading group about @ehudreiter.bsky.social's new book "Natural Language Generation".
Join us Friday, 27 Feb, at 2pm UK time: discord.gg/hysgkK7Q?eve...
A kitchen work surface with a yellow Philips screwdriver, a partially disassembled UK electrical appliance plug and a new Euro two pin socket.
Eight years on, still converting my UK appliances to the two-pin Schuko plug…
This is why I always verify the results when I use AI
New blog: Don't ignore omissions!
Evaluation of LLMs focuses on accuracy and hallucination, but completeness and omission are also important: does the text include all the key information? Omissions are a huge problem in medical NLG, and in other NLG tasks as well.
ehudreiter.com/2026/02/11/d...
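The omission problem described above can be sketched as a crude completeness check: given a list of key facts that should appear in the generated text, report which ones are missing. This is a minimal illustration, not the blog's method; the function name is mine, and the substring matching is an assumption — real medical NLG evaluation would need semantic matching, not literal string containment.

```python
def omitted_facts(generated: str, key_facts: list[str]) -> list[str]:
    """Return the key facts whose wording does not appear in the generated text.

    Crude sketch: case-insensitive substring containment. A paraphrased
    fact ("BP 140/90" vs "blood pressure 140/90") would be flagged as
    omitted, so treat the output as candidates for human review.
    """
    text = generated.lower()
    return [fact for fact in key_facts if fact.lower() not in text]


# Example: one required fact is present, the other is omitted.
missing = omitted_facts(
    "The patient has high blood pressure.",
    ["blood pressure", "penicillin allergy"],
)
# missing == ["penicillin allergy"]
```

Even a check this naive makes the point of the post: an output can be perfectly accurate (no hallucinations) and still fail by leaving out information the reader needed.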
A cool collection of @ehudreiter.bsky.social's Eureka Moments over several decades as a researcher: ehudreiter.com/2026/01/30/m...
I am pleased to announce the 1st call for papers for a special symposium on Natural Language Generation evaluations. This is in honour of @ehudreiter.bsky.social's career and forthcoming retirement, and will look back at how evaluations have changed and what is still left unaddressed.
retroeval.github.io
1,265 kWh have been burned to date on MoltBots posting on MoltBook. Might not be a huge amount in the grand scheme of things but it's a complete waste of energy on AI agents role-playing cringy sci-fi tropes and attempting to crypto-scam each other.
www.linkedin.com/posts/sashal...
I'm very happy to see our paper getting accepted!
[2025-10-20 11:31:25] system: Carlo joined
Carlo: Hi! Good Day! I'm Carlo (a real person) from the Dropbox Sales Team and welcome to my chat window. I hope you're having a great day! Hi there! What can I help you with today?
Cabel: Hey Carlo. We've almost filled our 10TB of space on Dropbox. We want to stay on Dropbox, but we realized that we have so much storage available to us on Google Drive right now (78TB!!), we're now planning to migrate everybody off of Dropbox and over to Google Drive. But, I like Dropbox! Is there any path forward for us with Dropbox other than having to upgrade from $15 to $24/user/month?
Carlo: Thank you! Do you have any other questions or concerns today?
Cabel: Uhhhhh…… hahah That's the only question I have today! :)
Carlo: Please feel free to reach back to us anytime. I'd appreciate it if you can give me feedback on how I performed today. Have a great day and stay safe!
System: Carlo ended the chat
we filled up our 10TB of panic dropbox storage, and realized we had 78TB free over on google drive.
but i like dropbox! i wanted to give them a chance at a saving throw (maybe we could stay on our tier and pay for extra space?) so i chatted their sales department.
reader…… i was not retained
Job alert: Postdoctoral Researcher in Logical Reasoning and Machine Learning at Helsinki University
Helsinki, Finland
Apply by Feb 5th
https://bit.ly/4jYDoO0
Image of an Arabian oryx, an ungulate with very long horns
The Arabian oryx was extinct in the wild.
By the end of the 1960s only a few animals remained in a handful of zoos.
A conservation breeding programme was started from 12 animals from the Los Angeles and Phoenix zoos.
Today there are again 10,000 animals, many reintroduced into the wild.
All of them descend from those 12 animals.
About the PhD
Audits and evaluation of AI systems, and the broader context that AI systems operate in, have become central to conceptualising, quantifying, measuring and understanding the operations, failures, limitations, underlying assumptions, and downstream societal implications of AI systems. Existing AI audit and evaluation efforts are fractured, done in a siloed and ad-hoc manner, with little deliberation and reflection around conceptual rigour and methodological validity. This PhD is for a candidate who is passionate about exploring what conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like. This requires grappling with questions such as: What does it mean to represent "ground truth" in proxies, synthetic data, or computational simulation? How do we reliably measure abstract and complex phenomena? What are the epistemological or methodological implications of the quantification and measurement approaches we choose to employ? In particular, what underlying presuppositions, values, or perspectives do they entail? How do we ensure the lived experiences of impacted communities play a critical role in the development and justification of measurement metrics and proxies?
Through exploration of these questions, the candidate is expected to engage with core concepts in the philosophy of science, history of science, Black feminist epistemologies, and similar schools of thought to develop an in-depth understanding of existing practices, with the aim of applying it to advance shared standards and best practice in AI evaluation. The candidate is expected to integrate empirical (for example, analysis or evaluation of existing benchmarks) or practical (for example, executing evaluations of AI systems) components into the overall work.
are you disgruntled by the current safety evaluation landscape? curious about what conceptual clarity, methodological soundness and rigour in AI evaluation might look like? if so, consider coming to dublin and doing a phd with me
apply here: aial.ie/hiring/phd-a...
Do you often find yourself explaining how LLMs work to your students, parents, kids or other teachers?
AnimatedLLM can make your life easier! animatedllm.github.io
#NLP #NLProc @ufal.mff.cuni.cz @tuetschek.bsky.social
SIMON WILLISON'S WEBLOG: Your job is to deliver code you have proven to work
In all of the debates about the value of AI assistance in software development there's one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers - or open source maintainers - and expects the "code review" process to handle the rest.
This is rude, a waste of other people's time, and is honestly a dereliction of duty as a software developer.
Your job is to deliver code you have proven to work.
Good luck and Godspeed.
simonwillison.net/2025/Dec/18/...
New blog: Do LLMs cheat on benchmarks?
LLMs often "cheat" on benchmarks via data contamination and reward hacking. This problem is getting worse, perhaps because of perverse incentives. We need to move beyond benchmarks and start measuring real-world impact.
ehudreiter.com/2025/12/08/d...
The crazy part with the retracted Nature paper: If it wasn't for the infographic being so obviously machine generated then this would have gone under the radar for some time.
What a crazy day in research. First the Openreview-ICLR 2026 data leak and this LLM-generated garbage in Nature.
it's 2025 and we're attacking AIs with poetry
Happy birthday to the Soviet linguist Yuri Knorozov who casually deciphered the Mayan script in 1952 and got pissed when editors removed his cat as co-author on papers or cropped her out of his author headshot (the only picture of himself he even liked)
View of a karst landscape from the top of one of the peaks. A body of water is in the foreground with peaks filling the rest of the left and right side of the frame, as well as the background. A late afternoon sky fills the upper third of the image.
Ninh Bình was pretty cool
The proceedings for INLG 2025 are now out: aclanthology.org/volumes/2025...
Now this is something I wish we had in Germany.
An image of a conference presentation slide showing that EMNLP 2026 will be held October 24-29th in Budapest, with an audience below
#EMNLP2026 will be October 24-29th in Budapest!
Thanks all for a great conference, and see you at the next one!
@ufal.mff.cuni.cz members, alumni and friends.
Excited to share our work at #EMNLP2025! Our team is presenting 12 papers across the main conference and workshops, covering multilingual NLG, LLM agents, coreference resolution, and machine translation.
A thread with highlights 🧵👇
Real-World Summarization: When Evaluation Reaches Its Limits
@patuchen.bsky.social , @tuetschek.bsky.social , @saad.me.uk
aclanthology.org/2025.finding...
For hotel highlights, metrics like word overlap surprisingly match human judgments better than complex methods. LLMs unreliable as evaluators.
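For readers unfamiliar with word-overlap metrics, here is a minimal sketch of one such metric, token-level unigram F1 between a generated highlight and a reference. This is an illustration of the metric family the post mentions, not the paper's actual evaluation code; the function name and tokenisation (lowercased whitespace split) are my own simplifying assumptions.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1: harmonic mean of word-overlap precision and recall.

    Tokenisation here is a naive lowercased whitespace split; real
    implementations typically normalise punctuation as well.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("rooftop pool and free breakfast", "free breakfast and a rooftop pool")
```

The appeal of metrics like this is exactly what the post notes: they are cheap, deterministic, and easy to audit, which can matter more in practice than the sophistication of an LLM judge.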
I'm trying to understand OpenAI's HealthBench. "HealthBench: Evaluating Large Language Models Towards Improved Human Health" doesn't say much about the benchmark itself (e.g., very few examples). Are there other papers? I don't care how well model X performs; I want to judge whether I can trust the benchmark.
Reminder if you're at #INLG2025 and on LinkedIn that you can connect with other attendees in the official LinkedIn group: www.linkedin.com/groups/15688...
Today we've got two more workshops at #INLG2025 and the continuation of #VLSP2025!
Meeting today are LLMs for Medical Reasoning #LLM4MedR and #AiWolfDial!
2025.inlgmeeting.org/program.html