Saad Mahamood

@saad.me.uk

AI Lead at Shopware. Active researcher in the field of Natural Language Generation. Board member of @siggen.bsky.social. Homepage: https://saad.me.uk #NLG #NLP #NLProc.

62 Followers
122 Following
61 Posts
Joined 04.09.2024

Latest posts by Saad Mahamood @saad.me.uk

New blog: Questions from readers of my book

A group who is reading my book sent me many questions, some of which we discussed in a call last week. I thought I would share the questions and my responses.

ehudreiter.com/2026/03/03/q...

03.03.2026 09:18 👍 1 🔁 2 💬 0 📌 0
If you're not on the SIGGEN mailing list or in the NLG Discord server, you might not have seen that Barkavi Sundarajan has been leading a reading group about @ehudreiter.bsky.social's new book "Natural Language Generation".

Join us Friday, 27 Feb, at 2pm UK time: discord.gg/hysgkK7Q?eve...

25.02.2026 14:14 👍 2 🔁 2 💬 1 📌 0
A kitchen work surface with a yellow Phillips screwdriver, a partially disassembled UK electrical appliance plug and a new Euro two-pin socket.

Still converting my UK appliances eight years on to the two-pin Schuko plug…

14.02.2026 14:39 👍 0 🔁 0 💬 0 📌 0
This is why I always verify the results when I use AI

11.02.2026 03:40 👍 63 🔁 7 💬 8 📌 3
New blog: Don't ignore omissions!

Evaluation of LLMs focuses on accuracy and hallucination. Completeness and omission are also important: does the text include all the key information? Omissions are a huge problem in medical NLG, and in other NLG tasks as well.

ehudreiter.com/2026/02/11/d...

11.02.2026 09:45 👍 5 🔁 2 💬 0 📌 0
My Eureka moments in research
The most exciting and rewarding moments of my research career were when I discovered something new and exciting about NLG, language, etc. I describe a few of these "Eureka" moments. I h…

A cool collection of @ehudreiter.bsky.social's Eureka Moments over several decades as a researcher: ehudreiter.com/2026/01/30/m...

05.02.2026 10:20 👍 4 🔁 1 💬 0 📌 0
Symposium on Natural Language Generation Evaluations RetroEval 2026 Aberdeen, United Kingdom, 1-2 June, 2026

I am pleased to announce the 1st call for papers for a special symposium on Natural Language Generation evaluations. Held in honour of @ehudreiter.bsky.social's career and forthcoming retirement, it will look back at how evaluations have changed and what is still left unaddressed.
retroeval.github.io

04.02.2026 19:42 👍 2 🔁 1 💬 0 📌 2
I feel like I'm obliged to climb out from the rock I live under to comment on the energy demands of the latest obsession in AI -- moltbook! 🤖 Doing a bit of digging, it looks like the OpenClaw repo re... | Dr. Sasha...

1,265 kWh have been burned to date on MoltBots posting on MoltBook. Might not be a huge amount in the grand scheme of things but it's a complete waste of energy on AI agents role-playing cringy sci-fi tropes and attempting to crypto-scam each other.

www.linkedin.com/posts/sashal...

04.02.2026 00:43 👍 25 🔁 6 💬 0 📌 1

I'm very happy to see our paper getting accepted 🎉

29.01.2026 17:39 👍 2 🔁 0 💬 0 📌 0
[2025-10-20 11:31:25] system:
Carlo joined

Carlo:
Hi! Good Day! I'm Carlo (a real person) from the Dropbox Sales Team and welcome to my chat window. I hope you're having a great day!

Hi there! What can I help you with today? 

Cabel:
Hey Carlo. We've almost filled our 10TB of space on Dropbox. We want to stay on Dropbox, but we realized that we have so much storage available to us on Google Drive right now (78TB!!), we're now planning to migrate everybody off of Dropbox and over to Google Drive.

But, I like Dropbox! Is there any path forward for us with Dropbox other than having to upgrade from $15 to $24/user/month?

Carlo:
Thank you! Do you have any other questions or concerns today?

Cabel:
Uhhhhh…… hahah That's the only question I have today! :)

Carlo:
Please feel free to reach back to us anytime. I'd appreciate it if you can give me feedback on how I performed today. Have a great day and stay safe!

System:
Carlo ended the chat

we filled up our 10TB of panic dropbox storage, and realized we had 78TB free over on google drive.

but i like dropbox! i wanted to give them a chance at a saving throw (maybe we could stay on our tier and pay for extra space?) so i chatted their sales department.

reader…… i was not retained

28.01.2026 15:50 👍 231 🔁 7 💬 24 📌 3
Job alert: Postdoctoral Researcher in Logical Reasoning and Machine Learning at Helsinki University

📍 Helsinki 🇫🇮
📅 Apply by Feb 5th
🔗 https://bit.ly/4jYDoO0

26.01.2026 07:05 👍 8 🔁 7 💬 0 📌 0
Image of an Arabian oryx, a hoofed animal with very long horns

The Arabian oryx was extinct in the wild.
By the end of the 1960s only a few animals remained in some zoos.
A conservation breeding programme was started with 12 animals from the Los Angeles and Phoenix zoos.
Today there are 10,000 animals again, many of them released into the wild.
All are descended from those 12 animals.

22.01.2026 12:46 👍 1116 🔁 141 💬 35 📌 2
About the PhD

Audits and evaluation of AI systems, and the broader context that AI systems operate in, have become central to conceptualising, quantifying, measuring and understanding the operations, failures, limitations, underlying assumptions, and downstream societal implications of AI systems. Existing AI audit and evaluation efforts are fractured, done in a siloed and ad-hoc manner, and with little deliberation and reflection around conceptual rigour and methodological validity.

This PhD is for a candidate that is passionate about exploring what a conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like. This requires grappling with questions such as:

    What does it mean to represent "ground truth" in proxies, synthetic data, or computational simulation?
    How do we reliably measure abstract and complex phenomena?
    What are the epistemological or methodological implications of quantification and measurement approaches we choose to employ? Particularly, what underlying presuppositions, values, or perspectives do they entail?
    How do we ensure the lived experiences of impacted communities play a critical role in the development and justification of measurement metrics and proxies?

Through exploration of these questions, the candidate is expected to engage with core concepts in the philosophy of science, history of science, Black feminist epistemologies, and similar schools of thought to develop an in-depth understanding of existing practices with the aim of applying it to advance shared standards and best practice in AI evaluation.

The candidate is expected to integrate empirical (for example, through analysis or evaluation of existing benchmarks) or practical (for example, by executing evaluation of AI systems) components into the overall work.

are you disgruntled by the current safety evaluation landscape? curious about what conceptual clarity, methodological soundness and rigour in AI evaluation might look like? if so, consider coming to dublin and doing a phd with me

apply here: aial.ie/hiring/phd-a...

17.12.2025 19:33 👍 79 🔁 54 💬 2 📌 3
AnimatedLLM - Explaining LLMs with Interactive Visualizations Understand how large language models work under the hood.

Do you often find yourself explaining how LLMs work to your students, parents, kids or other teachers?

AnimatedLLM can make your life easier! animatedllm.github.io

#NLP #NLProc @ufal.mff.cuni.cz @tuetschek.bsky.social

18.12.2025 14:41 👍 8 🔁 2 💬 1 📌 2
SIMON WILLISON'S WEBLOG: 

Your job is to deliver code you have proven to work

In all of the debates about the value of AI assistance in software development there's one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers - or open source maintainers - and expects the "code review" process to handle the rest.

This is rude, a waste of other people's time, and is honestly a dereliction of duty as a software developer.

Your job is to deliver code you have proven to work.

Good luck and Godspeed.

simonwillison.net/2025/Dec/18/...
↓

18.12.2025 17:49 👍 2 🔁 1 💬 0 📌 0
New blog: Do LLMs cheat on benchmarks

LLMs often "cheat" on benchmarks via data contamination and reward hacking. This problem is getting worse, perhaps because of perverse incentives. Need to move beyond benchmarks and start measuring real-world impact.

ehudreiter.com/2025/12/08/d...

08.12.2025 06:50 👍 4 🔁 1 💬 0 📌 0

The crazy part with the retracted Nature paper: if it wasn't for the infographic being so obviously machine-generated, this would have gone under the radar for some time.

28.11.2025 15:11 👍 0 🔁 0 💬 0 📌 0

What a crazy day in research. First the OpenReview ICLR 2026 data leak, and now this LLM-generated garbage in Nature.

28.11.2025 15:08 👍 2 🔁 0 💬 1 📌 0

it's 2025 and we're attacking AIs with poetry

20.11.2025 19:25 👍 34 🔁 10 💬 1 📌 2
Happy birthday to the Soviet linguist Yuri Knorozov who casually deciphered the Mayan script in 1952 and got pissed when editors removed his cat as co-author on papers or cropped her out of his author headshot (the only picture of himself he even liked)

19.11.2025 02:54 👍 2106 🔁 559 💬 22 📌 36
View of a karst landscape from the top of one of the peaks. A body of water is in the foreground with peaks filling the rest of the left and right side of the frame, as well as the background. A late afternoon sky fills the upper third of the image.

Ninh Bình was pretty cool

15.11.2025 11:21 👍 4 🔁 1 💬 0 📌 0
Proceedings of the 18th International Natural Language Generation Conference - ACL Anthology

The proceedings for INLG 2025 are now out: aclanthology.org/volumes/2025...

15.11.2025 10:43 👍 2 🔁 1 💬 0 📌 0

Now this is something I wish we had in Germany.

13.11.2025 16:24 👍 1 🔁 0 💬 0 📌 0
An image of a conference presentation slide showing that EMNLP 2026 will be held October 24-29th in Budapest, with an audience below

🌉 #EMNLP2026 will be October 24-29th in Budapest! 🌉

Thanks all for a great conference, and see you at the next one!

07.11.2025 22:41 👍 20 🔁 4 💬 1 📌 3
@ufal.mff.cuni.cz members, alumni and friends.

Excited to share our work at #EMNLP2025! Our team is presenting 12 papers across the main conference and workshops, covering multilingual NLG, LLM agents, coreference resolution, and machine translation.
A thread with highlights 🧵👇

07.11.2025 20:53 👍 12 🔁 2 💬 1 📌 1
Real-World Summarization: When Evaluation Reaches Its Limits
Patrícia Schmidtová, Ondrej Dusek, Saad Mahamood. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025.

📊 Real-World Summarization: When Evaluation Reaches Its Limits
@patuchen.bsky.social, @tuetschek.bsky.social, @saad.me.uk
aclanthology.org/2025.finding...
For hotel highlights, metrics like word overlap surprisingly match human judgments better than complex methods. LLMs unreliable as evaluators.

07.11.2025 20:53 👍 4 🔁 2 💬 1 📌 0

I'm trying to understand OpenAI's HealthBench. "HealthBench: Evaluating Large Language Models Towards Improved Human Health" doesn't say much about the benchmark (e.g., very few examples). Are there other papers? I don't care how well model X performs, I want to judge if I can trust the benchmark.

05.11.2025 14:27 👍 0 🔁 1 💬 1 📌 0
Reminder if you're at #INLG2025 and on LinkedIn that you can connect with other attendees in the official LinkedIn group: www.linkedin.com/groups/15688...

01.11.2025 08:01 👍 0 🔁 1 💬 0 📌 0
INLG2025 The 18th International Natural Language Generation Conference is scheduled to be held in Hanoi, Vietnam from October 29 to November 2, 2025.

Today we've got two more workshops at #INLG2025 and the continuation of #VLSP2025!

Meeting today are LLMs for Medical Reasoning #LLM4MedR and #AiWolfDial!

2025.inlgmeeting.org/program.html

30.10.2025 01:54 👍 0 🔁 2 💬 0 📌 0
In summary.

29.10.2025 03:13 👍 0 🔁 0 💬 0 📌 0