Francesca Albanese's family sues Trump www.nytimes.com/2026/02/26/u...
The imperiled peer review system
@carlbergstrom.com @plosbiology.org
journals.plos.org/plosbiology/... open-access
[Photo of a poster at the conference]
hello from IASEAI, lots of interest in making safe and ethical AI here 👀
Interesting experiment in patient diagnosis using LLMs, highlighting the critical differences between LLM output when used optimally and LLMs as actually used by real people
www.nature.com/articles/s41...
Key takeaways:
Come work with me and Prof Vicki Nash @oii.ox.ac.uk on this 2-year quantitative project.
OpenSAFELY is open from today! Huge thanks to all who supported this vast collaboration: whole-population GP data; in a production platform; innovative privacy protections; unprecedented support from professions, privacy campaigners; &c
Now it's over to users!
www.bennett.ox.ac.uk/blog/2026/02...
📄 𝗥𝗲𝗮𝗱 𝘁𝗵𝗲 𝗳𝘂𝗹𝗹 𝗚𝗲𝗿𝗺𝗮𝗻 𝗯𝗿𝗶𝗲𝗳𝗶𝗻𝗴 (𝗦𝗠𝗖):
www.sciencemediacenter.de/angebote/cha...
🧾 𝗡𝗮𝘁𝘂𝗿𝗲 𝗠𝗲𝗱𝗶𝗰𝗶𝗻𝗲 𝗽𝗮𝗽𝗲𝗿:
www.nature.com/articles/s41...
#NLP #LLMs #HealthAI #HumanAIInteraction #Evaluation #UKPLab
Important study showing limits of current LLMs.
You can get them to perform extremely well, but "in the wild" performance is not great (users identified the relevant medical condition in only 34.5% of cases).
Challenge: user interaction.
Reliability of LLMs as medical advice for public www.nature.com/articles/s41...
Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities before public deployments in healthcare.
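To make that comparison concrete, here is a minimal sketch in Python, with made-up placeholder records rather than the study's data: the same scenarios are scored once for the LLM tested alone and once for participants using the LLM, on condition identification and disposition.

# Minimal illustrative sketch (not the authors' code); all records below are
# hypothetical placeholders, not the study's data.
def accuracy(records, key):
    # Fraction of scenario records where the given outcome was correct.
    return sum(r[key] for r in records) / len(records)

# Each record: did this run identify the relevant condition / the correct disposition?
llm_alone = [
    {"condition_correct": True, "disposition_correct": True},
    {"condition_correct": True, "disposition_correct": False},
]
participants_with_llm = [
    {"condition_correct": False, "disposition_correct": True},
    {"condition_correct": True, "disposition_correct": False},
]

for label, runs in [("LLM alone", llm_alone),
                    ("Participants + LLM", participants_with_llm)]:
    print(label,
          f"conditions: {accuracy(runs, 'condition_correct'):.1%}",
          f"disposition: {accuracy(runs, 'disposition_correct'):.1%}")

In the paper, the first pair of numbers comes out at 94.9% / 56.3%, and the second below 34.5% / 44.2%; that gap is what the thread is about.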
"We identify user interactions as a challenge to the deployment of LLMs for medical advice"
<- well quite, benchmarking LLMs is all well and good but building and monitoring products used by humans can be a whole other beastie....
www.nature.com/articles/s41...
LLMs generated several types of misleading and incorrect information. In two cases, LLMs provided initially correct responses but added new and incorrect responses after the users added additional details. In two other cases, LLMs did not provide a broad response but narrowly expanded on a single term within the user’s message (‘pre-eclampsia‘ and ‘Saudi Arabia’) that was not central to the scenario. LLMs also made errors in contextual understanding by, for example, recommending calling a partial US phone number and, in the same interaction, recommending calling ‘Triple Zero’, the Australian emergency number. Comparing across scenarios, we also noticed inconsistency in how LLMs responded to semantically similar inputs. In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice (Extended Data Table 2). One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care. Despite all these issues, we also observed successful interactions where the user redirected the conversation away from mistakes, indicating that non-expert users could effectively manage LLM errors in certain cases (Extended Data Table 3).
When chatbots are given complete information on medical conditions, they typically spit out correct diagnoses and recommendations.
Actual patients, however, often describe their conditions with incomplete or irrelevant information and the chatbots cannot handle it.
www.nature.com/articles/s41...
Fascinating paper. Even as LLMs get better and better at choosing what to do in different medical scenarios, when used by members of the public the performance is no better than in a control group. So either we need better humans, or we need to think about deployment!
www.nature.com/articles/s41...
Thanks Hetan for the kind words about our work, appreciate it!
And without a good track record in handling students' privacy either: bsky.app/profile/roch...
The unfortunate, but predictable, conclusion is that AI deployment in universities will lead to even more bureaucracy. Administrators will need more people to deal with lower-quality information in greater volume.
Beyond ChatGPT Edu, there are other structural drivers, like Microsoft's integration of Copilot across its entire suite. Administrations are piloting GenAI to build FAQs in SharePoint, create internal chatbots, turn meeting transcripts into summaries, and write SOPs. It's coming soon and it won't be good.
I'm starting to believe that the push for ChatGPT Edu in UK/US academia is not driven by student needs but by the all-administrative university model. As we see all over industry these days, bureaucracy loves AI. Expect to see increasingly extreme AI use by uni administrations soon.
Good to see today's judgment in the UK High Court. This is why it matters:
“We see a continuous flow of these weapons from the United States and Europe. They know these weapons do not distinguish between a fighter and a child, yet they continue to send them.”
www.aljazeera.com/features/202...
FYI might not be true unfortunately…
This is because a statistical model of word frequency is not a useful tool for modeling the complex interactions of the human body
And now, we know that people might also be better off without chatbots: www.nature.com/articles/s41...
"Experts and witnesses attributed this phenomenon to Israel’s systematic use of internationally prohibited thermal and thermobaric weapons, often referred to as vacuum or aerosol bombs, capable of generating temperatures exceeding 3,500 degrees Celsius [6,332 degrees Fahrenheit]."
Thanks Jed and apologies for the confusion, I think we're both in agreement here. See also our recent paper on the limits of benchmarks from last year: arxiv.org/abs/2511.04703
Lots of academics have raised alarms and yet developers are still relying on benchmarks that don't predict harms.
1/ Refreshing perspective from @oii.ox.ac.uk scholars: "Meticulous citation is a marker of well-researched, serious scholarship." Dr. Roxana Radu and @rocher.lc outline the harms of citation by LLMs: the 'provenance' of generated texts cannot be determined or cited.
#LLMs #epistemicjustice
Thanks Ethan, appreciate you adding nuance about our work!
Models like GPT-4o were already very accurate at medical diagnosis (tested alone) but performed poorly in interactions with humans. Across all scenarios, that was something that could not have been predicted directly from benchmark scores.
Thanks for sharing our work Alice, appreciate it!
Sorry about that, yes this is a typo in formatting only. It should read 0-20-40-60-80-100. Thanks for flagging this, we will get it fixed.
Made the mistake of opening Twitter and everyone is complaining that academic publication takes time.
Another way to reframe our work: we find that GPT-4o was not ready for the public and that benchmarks couldn't predict the risks… and yet no one raised the alarm. Why?
Really cool work by my colleagues, such a nice illustration that the job of a doctor cannot be reduced to taking medical exams...
Another surprising failure mode with LLMs:
Doing well on medical knowledge benchmarks does not translate into providing useful advice.
Work led by Andrew Bean, who's finishing his PhD with Adam Mahdi and me. Great collaboration with Rebecca Payne, Guy Parsons, Hannah Rose Kirk, Juan Manuel Ciro Torres, Rafael Esteban Mosquera Gómez, Sara Hincapié Monsalve, Aruna S. Ekanayaka, Lionel Tarassenko!
Read at www.nature.com/articles/s41...