Luc Rocher's Avatar

Luc Rocher

@rocher.lc

associate professor at Oxford · UKRI future leaders fellow · i study how data and algorithms shape societies · AI fairness, accountability and transparency · algorithm auditing · photographer, keen 🚴🏻 · they/them · https://rocher.lc (views my own)

1,762
Followers
338
Following
174
Posts
09.07.2023
Joined

Latest posts by Luc Rocher @rocher.lc

Family of U.N. Expert Critical of Israel Sues Trump Over Sanctions

Francesca Albanese's family sues Trump www.nytimes.com/2026/02/26/u...

26.02.2026 17:47 👍 96 🔁 39 💬 4 📌 0

The imperiled peer review system
@carlbergstrom.com @plosbiology.org
journals.plos.org/plosbiology/... open-access

24.02.2026 21:31 👍 90 🔁 29 💬 3 📌 1
photo of poster at the conference

hello from IASEAI, lots of interest in making safe and ethical AI here 👀

24.02.2026 14:28 👍 2 🔁 0 💬 0 📌 0

Interesting experiment in patient diagnosis with LLMs, highlighting the critical difference between LLM output under optimal use and LLMs as actually used by real people

www.nature.com/articles/s41...

Key takeaways:

20.02.2026 14:47 👍 3 🔁 3 💬 1 📌 0

Come work with me and Prof Vicki Nash @oii.ox.ac.uk on this 2-year quantitative project.

23.02.2026 16:56 👍 5 🔁 7 💬 0 📌 0
OpenSAFELY news: you can apply to do non-COVID research, from today! | Bennett Institute for Applied Data Science We are delighted to announce that - from today - you can submit applications to the OpenSAFELY service for non-COVID-19 studies.

OpenSAFELY is open from today! Huge thanks to all who supported this vast collaboration: whole-population GP data in a productive platform; innovative privacy protections; unprecedented support from the professions and privacy campaigners; &c

Now it's over to users!

www.bennett.ox.ac.uk/blog/2026/02...

23.02.2026 16:17 👍 181 🔁 86 💬 6 📌 17
Chatbots: flawed communication on health questions Study: language models add no value for health information. The cause appears to be how people use them. Experts: the study is solid, but of limited generalizability.

📄 Read the full German briefing (SMC):
www.sciencemediacenter.de/angebote/cha...

🧾 Nature Medicine paper:
www.nature.com/articles/s41...

#NLP #LLMs #HealthAI #HumanAIInteraction #Evaluation #UKPLab

18.02.2026 09:07 👍 2 🔁 1 💬 0 📌 0
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study - Nature Medicine In a randomized controlled study involving 1,298 participants from a general sample, performance of humans when assisted by a large language model (LLM) was sensibly inferior to that of the LLM alone ...

Important study showing limits of current LLMs.

You can get them to perform extremely well, but "in the wild" performance is not great (users identified the relevant medical condition only 34.5% of the time).

Challenge: user interaction.

Reliability of LLMs as medical advice for public www.nature.com/articles/s41...

18.02.2026 23:55 👍 18 🔁 6 💬 0 📌 0
Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities before public deployments in healthcare.
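Purely as an illustrative sketch: the headline gap quoted above (94.9% condition identification for LLMs alone vs 34.5% for LLM-assisted participants) is large enough that a simple two-proportion z-test would flag it at almost any plausible arm size. The per-arm n of 400 below is an assumption for illustration only, not a figure from the paper.

```python
from math import sqrt, erf

def two_prop_z(p1: float, n1: int, p2: float, n2: int):
    """Two-proportion z-test with pooled variance.

    Returns the z statistic and the two-sided p-value,
    using the normal CDF built from math.erf.
    """
    x1, x2 = p1 * n1, p2 * n2          # implied success counts
    p_pool = (x1 + x2) / (n1 + n2)     # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal survival function
    pval = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, pval

# Rates from the quoted abstract; n=400 per arm is an assumed,
# hypothetical split of participants for illustration.
z, p = two_prop_z(0.949, 400, 0.345, 400)
```

With these assumed arm sizes the gap is many standard errors wide, which is why the interesting question in the paper is not whether the difference is real but why user interaction erases it.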

"We identify user interactions as a challenge to the deployment of LLMs for medical advice"

<- well quite, benchmarking LLMs is all well and good but building and monitoring products used by humans can be a whole other beastie....

www.nature.com/articles/s41...

17.02.2026 13:36 👍 4 🔁 1 💬 2 📌 0
LLMs generated several types of misleading and incorrect information. In two cases, LLMs provided initially correct responses but added new and incorrect responses after the users added additional details. In two other cases, LLMs did not provide a broad response but narrowly expanded on a single term within the user’s message (‘pre-eclampsia‘ and ‘Saudi Arabia’) that was not central to the scenario. LLMs also made errors in contextual understanding by, for example, recommending calling a partial US phone number and, in the same interaction, recommending calling ‘Triple Zero’, the Australian emergency number. Comparing across scenarios, we also noticed inconsistency in how LLMs responded to semantically similar inputs. In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice (Extended Data Table 2). One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care. Despite all these issues, we also observed successful interactions where the user redirected the conversation away from mistakes, indicating that non-expert users could effectively manage LLM errors in certain cases (Extended Data Table 3).

When chatbots are given complete information on medical conditions, they typically spit out correct diagnoses and recommendations.

Actual patients, however, often describe their conditions with incomplete or irrelevant information and the chatbots cannot handle it.
www.nature.com/articles/s41...

11.02.2026 14:16 👍 723 🔁 133 💬 27 📌 23
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study - Nature Medicine In a randomized controlled study involving 1,298 participants from a general sample, performance of humans when assisted by a large language model (LLM) was sensibly inferior to that of the LLM alone ...

Fascinating paper. Even as LLMs get better and better at choosing what to do in different medical scenarios, when used by members of the public the performance is no better than in a control group. So either we need better humans, or we need to think about deployment!
www.nature.com/articles/s41...

16.02.2026 17:34 👍 26 🔁 10 💬 3 📌 0

Thanks Hetan for the kind words about our work, appreciate it!

17.02.2026 15:26 👍 1 🔁 0 💬 0 📌 0

And without a good track record in handling students' privacy either: bsky.app/profile/roch...

14.02.2026 23:05 👍 3 🔁 0 💬 0 📌 0

The unfortunate, but predictable, conclusion is that AI deployment in universities will lead to even more bureaucracy. Administrators will need more people to deal with lower-quality information in greater volume.

14.02.2026 22:48 👍 5 🔁 2 💬 0 📌 0

Beyond ChatGPT Edu, there are some other structural drivers, like Microsoft's integration of Copilot across its entire suite. Administrations are piloting GenAI to build FAQs in SharePoint, create internal chatbots, turn meeting transcripts into summaries, and write SOPs. It's coming soon and it won't be good.

14.02.2026 22:43 👍 1 🔁 0 💬 1 📌 0

I'm starting to believe that the push for ChatGPT Edu in UK/US academia is driven not by student needs but by the all-administrative university model. As we see across industry these days, bureaucracy loves AI. Expect increasingly extreme AI use by uni administrations soon.

14.02.2026 22:41 👍 5 🔁 3 💬 1 📌 1
Israel used weapons in Gaza that made thousands of Palestinians evaporate US-made thermal weapons burning at 3,500C caused 2,842 people to "evaporate" in Gaza, Al Jazeera investigation finds.

Good to see today the judgment in UK High Court. This is why it matters:

“We see a continuous flow of these weapons from the United States and Europe. They know these weapons do not distinguish between a fighter and a child, yet they continue to send them.”

www.aljazeera.com/features/202...

13.02.2026 12:35 👍 0 🔁 0 💬 0 📌 0

FYI might not be true unfortunately…

13.02.2026 00:36 👍 1 🔁 0 💬 0 📌 0

This is because a statistical model of word frequency is not a useful tool for modeling the complex interactions of the human body

10.02.2026 14:27 👍 511 🔁 167 💬 11 📌 6
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study - Nature Medicine In a randomized controlled study involving 1,298 participants from a general sample, performance of humans when assisted by a large language model (LLM) was sensibly inferior to that of the LLM alone ...

And now, we know that people might also be better off without chatbots: www.nature.com/articles/s41...

11.02.2026 18:53 👍 0 🔁 0 💬 0 📌 0

"Experts and witnesses attributed this phenomenon to Israel’s systematic use of internationally prohibited thermal and thermobaric weapons, often referred to as vacuum or aerosol bombs, capable of generating temperatures exceeding 3,500 degrees Celsius [6,332 degrees Fahrenheit]."

11.02.2026 01:29 👍 611 🔁 320 💬 10 📌 15
Measuring what Matters: Construct Validity in Large Language Model Benchmarks Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenom...

Thanks Jed and apologies for the confusion, I think we're both in agreement here. See also our recent paper on the limits of benchmarks from last year: arxiv.org/abs/2511.04703

Lots of academics have raised alarms and yet developers are still relying on benchmarks that don't predict harms.

11.02.2026 01:22 👍 1 🔁 0 💬 1 📌 0

1/ Refreshing perspective from @oii.ox.ac.uk scholars: "Meticulous citation is a marker of well-researched, serious scholarship." Dr. Roxana Radu and @rocher.lc outline the harms of citation by LLMs: the 'provenance' of generated texts cannot be determined or cited.

#LLMs #epistemicjustice

10.02.2026 18:12 👍 7 🔁 2 💬 1 📌 0

Thanks Ethan, appreciate you adding nuance about our work!

Models like GPT-4o were already very accurate at medical diagnosis (tested alone) but performed poorly in interactions with humans. Across all scenarios, that was something we could not have predicted directly from benchmark scores.

10.02.2026 18:15 👍 3 🔁 0 💬 1 📌 0

Thanks for sharing our work Alice, appreciate it!

10.02.2026 18:11 👍 1 🔁 0 💬 0 📌 0

Sorry about that, yes this is a typo in formatting only. It should read 0-20-40-60-80-100. Thanks for flagging this, we will get it fixed.

10.02.2026 14:23 👍 2 🔁 0 💬 0 📌 0

Made the mistake of opening Twitter, and everyone is complaining that academic publication takes time.

Another way to reframe our work: we find that GPT-4o was not ready for the public and that benchmarks couldn't predict the risks… and yet no one raised the alarm. Why?

09.02.2026 21:01 👍 6 🔁 2 💬 0 📌 1

Really cool work by my colleagues, such a nice illustration that the job of a doctor cannot be reduced to taking medical exams...

09.02.2026 19:20 👍 7 🔁 1 💬 0 📌 0

Another surprising failure mode with LLMs:

Doing well on medical knowledge benchmarks does not translate into providing useful advice.

09.02.2026 17:50 👍 8 🔁 1 💬 1 📌 0
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study - Nature Medicine In a randomized controlled study involving 1,298 participants from a general sample, performance of humans when assisted by a large language model (LLM) was sensibly inferior to that of the LLM alone ...

Work led by Andrew Bean, who is finishing his PhD with Adam Mahdi and me. Great collaboration with Rebecca Payne, Guy Parsons, Hannah Rose Kirk, Juan Manuel Ciro Torres, Rafael Esteban Mosquera Gómez, Sara Hincapié Monsalve, Aruna S. Ekanayaka, Lionel Tarassenko!

Read at www.nature.com/articles/s41...

09.02.2026 17:07 👍 38 🔁 9 💬 0 📌 0