
Lucy Li

@lucy3

Postdoc at UW NLP 🏔️. #NLProc, computational social science, cultural analytics, responsible AI. she/her. Previously at Berkeley, Ai2, MSR, Stanford. Incoming assistant prof at Wisconsin CS. lucy3.github.io

5,794
Followers
660
Following
357
Posts
17.05.2023
Joined

Latest posts by Lucy Li @lucy3

🗄 history of NLP and the ACL | Are.na

I'm lecturing about the "History of NLP" this week. What should I include? Any favorite anecdotes, images, people, methods? Slides, books, papers, or talks for inspiration or grounding?

I've been maintaining a small collection here: www.are.na/maria-antoni...

10.03.2026 17:41 👍 16 🔁 2 💬 10 📌 0
How to win a best paper award (or, an opinionated take on how to do important research) An opinionated perspective on how to do important research that makes a difference (and sometimes wins awards).

Good post on how to think about honing your skills as an (academic) researcher by Carlini

nicholas.carlini.com/writing/2026...

10.03.2026 19:15 👍 5 🔁 2 💬 0 📌 0
What If Readers Like A.I.-Generated Fiction? If economic and technological transformations have changed our relationship with literature before, they could do so again.

You can find similar but more interesting experiments in Vauhini Vara's recent New Yorker piece, and/or @tuhinchakr.bsky.social's work, and lots of other places!

www.newyorker.com/culture/the-...

arxiv.org/abs/2601.18353

10.03.2026 16:31 👍 5 🔁 2 💬 0 📌 0
A person at a keyboard faced with a chatbot prompt window. Stock illustration.

Only 0.1% of academic papers published since 2023 have explicitly disclosed the use of AI for writing assistance, yet textual analysis suggests that the actual rate of AI use is 40 times higher. The study's authors call for a policy rethink. In PNAS: https://ow.ly/HPb250YrlyB

09.03.2026 23:00 👍 6 🔁 3 💬 1 📌 1

I agree that it's silly to claim that LLMs Can't Do Anything (obviously they can do many things). I also think it's silly to claim LLMs Can Do Everything (obviously they can't).

regardless of how one feels about that, this is a very scary time to need a job and people are reacting accordingly

08.03.2026 10:08 👍 574 🔁 52 💬 10 📌 7
CU faculty, staff and students push back against university-controlled AI rollout Hundreds have signed a letter of dissent arguing that an AI rollout lacked transparency and technical oversight. Others say campus leaders haven't adequately addressed concerns about student privacy, ...

"Faculty on CU Denver and Boulder campuses say the decision was reached without consulting campus experts in AI, ethics or education."

09.03.2026 16:12 👍 89 🔁 29 💬 0 📌 1

When you collect data online, are the results from humans or AI? In a project led by Booth PhD student Grace Zhang, we estimate the prevalence of AI agents on commonly used survey platforms:
osf.io/preprints/ps...
🧵

07.03.2026 20:22 👍 108 🔁 50 💬 4 📌 3

When shaping your research agenda, your objective is to find the weirdest niche possible that still has the potential to change everything.

05.03.2026 01:38 👍 76 🔁 9 💬 4 📌 1
NLP4DH 2026 Conference Welcome to the OpenReview homepage for NLP4DH 2026 Conference

🚨 NLP4DH 2026 deadline has been extended to March 13! Submission link here: openreview.net/group?id=NLP...

03.03.2026 19:33 👍 7 🔁 6 💬 0 📌 2
I'm writing an HCI paper about an AI-powered system. What should I report? Eight Guidelines to Improve Research Quality and Enhance Chance of Acceptance

Writing an HCI paper about an AI-powered system to a venue like UIST 2026 or CHI 2027? Wondering what reviewers expect you to report, and how to approach paper framing and writing? Check out our reporting guidelines: medium.com/p/7c3ae86341...

03.03.2026 16:29 👍 1 🔁 1 💬 0 📌 0
The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student profi...

Without such eval, rushed integration of AI into classrooms may exacerbate existing academic achievement gaps.
See our paper for more (inc. a study where I redrew 300+ images by hand): arxiv.org/abs/2603.00925
@ai2.bsky.social @kylelo.bsky.social

03.03.2026 03:11 👍 0 🔁 0 💬 0 📌 0

We argue that eval around AI for education should be disaggregated in a manner that pinpoints whether models can discern when a student may need pedagogical support, and whether models equitably serve students across different levels of proficiency.

03.03.2026 03:10 👍 1 🔁 1 💬 1 📌 0
Question: How many dots did the student include in their array?
For an erroneous student response: Model answer: 12. True answer: The student didn't include an array. True answer for a non-erroneous student response: The student included 12 dots in their array.
Question: How many squares did the student draw to show the number of cups of red paint?
For an erroneous student response: Model answer: The student drew 9 squares to show the number of cups of red paint. True answer: The student drew 12 squares to represent the cups of red paint.
True answer for a non-erroneous student response: The student drew 9 squares to show the number of cups of red paint.


When models make mistakes, they often assume the student's math solution is correct. Typically, models are trained on "high quality" math so that they can hill-climb on GSM8K, MATH, etc. However, dev pipelines that favor correct math are in tension w/ education, where math errors require extra attention.

03.03.2026 03:09 👍 0 🔁 1 💬 1 📌 0
A bar chart disaggregating results for four VLMs across different question types. Content description QA consistently drives the gap in VLM performance between student responses that contain errors versus those that do not. In addition, questions related to students' correctness and errors are still the most difficult.

We find that this gap is primarily driven by QA related to content description. In addition, VLMs struggle to identify cases when help is needed; the most challenging QA are those related to assessing studentsโ€™ correctness and errors.

03.03.2026 03:09 👍 0 🔁 0 💬 1 📌 0
Title, author list, and two figures from the paper. 
Title: The Aftermath of DrawEduMath: Vision Language Models
Underperform with Struggling Students and Misdiagnose Errors
Authors: Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
Figure 1: On the left is a math problem, where students are asked to draw x < 5/2 on a number line. The right side shows two example student responses that differ in correctness. DrawEduMath pairs each math problem with one student response, and prompts VLMs to answer questions about the student response.
Figure 2: VLMs consistently perform worse on answering DrawEduMath benchmark questions pertaining to erroneous student responses. Performance on non-erroneous student responses is labeled with specific VLMs' names; that same model's performance on erroneous student responses is directly below.

Models are now expert math solvers, and so AI for math education is receiving increasing attention.
Our new preprint evaluates 11 VLMs on our QA benchmark, DrawEduMath. We highlight a startling gap: models perform less well on inputs from K-12 students who need more help. ๐Ÿงต

03.03.2026 03:08 👍 34 🔁 12 💬 4 📌 2

1/7 🧵 The GPT-4 technical report featured detailed calibration curves.

Since then, not a single major model release has reported calibration. The field quietly stopped measuring whether models know what they don't know.

Our new position paper argues this is a mistake. Here's why.

02.03.2026 19:09 👍 8 🔁 2 💬 1 📌 0

Abstract submissions close on March 3rd!

We are also extending a ✨ call for mentored reviewers ✨: if you advise excellent graduate or postdoctoral researchers, you are welcome to recommend them to review for IC2S2 2026. Email IC2S2@uvm.edu to nominate mentored reviewers (or faculty colleagues).

23.02.2026 19:39 👍 14 🔁 12 💬 1 📌 2

CORRECTION: Claude Code launched in February 2025, suggesting a roughly 13% increase above expectations.

26.02.2026 00:47 👍 5 🔁 1 💬 1 📌 2

I remember the time-to-time muttering!! 😮 Curious: Chinese-speaking culture in mainland China, the US, or elsewhere??

25.02.2026 23:05 👍 1 🔁 0 💬 1 📌 0

Agents of Chaos -- what are autonomous OpenClaw agents up to? How do they interact with each other? Read our investigation of OpenClaw at
researchgate.net/publication/...
And an interactive website agentsofchaos.baulab.info
@davidbau.bsky.social @natalieshapira.bsky.social @openclaw-x.bsky.social

24.02.2026 15:04 👍 18 🔁 6 💬 1 📌 1

I'm hiring a postdoc at @cmu.edu (w/ far.ai & @dgrand.bsky.social + @gordpennycook.bsky.social)!

How do LLMs shape human beliefs, and what do we do about it? AI safety meets behavioral science.

Open to technical and social science backgrounds.

23.02.2026 18:46 👍 42 🔁 27 💬 1 📌 3
Anthropic Education Report: The AI Fluency Index Anthropic's AI Fluency Index measures 11 observable behaviors across thousands of Claude.ai conversations to understand how people develop AI collaboration skills.

New research: The AI Fluency Index.

We tracked 11 behaviors across thousands of http://Claude.ai conversations (for example, how often people iterate and refine their work with Claude) to measure how well people collaborate with AI.

Read more: https://www.anthropic.com/research/AI-fluency-index

23.02.2026 15:06 👍 15 🔁 1 💬 0 📌 3

We've alllllmost gotten all the Jan26 ARR reviews in, but I'm still trying to track down new emergency reviewers for papers on the following topics:
1) agents
2) jailbreaking
3) coding
4) RL
5) reasoning
6) LLM for finance
7) AMR
8) alignment
If you can review any (in the next 24-48h) please DM me 🙏🙏🙏

20.02.2026 04:39 👍 3 🔁 9 💬 0 📌 0

I was taught that to have a great job talk narrative, you really only need ~3 high quality papers

20.02.2026 01:54 👍 5 🔁 0 💬 2 📌 0

How horrible to be a CS grad student under pressure to submit multiple first-author papers to every conference deadline, whether they feel ready or not. This serves no one's best interests in the long run (science included). But lots of students appear to be getting advice that it's necessary to compete.

20.02.2026 01:03 👍 71 🔁 8 💬 1 📌 2
Matching sounds to shapes: Evidence of the bouba-kiki effect in naïve baby chicks Humans across multiple languages spontaneously associate the nonwords "kiki" and "bouba" with spiky and round shapes, respectively, a phenomenon named the bouba-kiki effect. To explore the origin of t...

"Humans across multiple languages spontaneously associate the nonwords kiki & bouba with spiky & round shapes, respectively...We tested the bouba-kiki effect in baby chickens. Similar to humans, they spontaneously chose a spiky shape when hearing a kiki sound & a round shape when hearing a bouba." 😲🧪

19.02.2026 19:20 👍 334 🔁 123 💬 13 📌 40

I have a small project that is taking me outside of academia to dip into industry, just ever so briefly.

I engage a lot with AI. I was not at all prepared for how industry is using it. Not. at. all.

This brief little window is definitely helping me better frame my teaching in this new world.

17.02.2026 21:28 👍 49 🔁 6 💬 8 📌 1

My contribution to the discourse, which I've said before and will say again: DH isn't over. DH has won. 1/

17.02.2026 15:46 👍 72 🔁 23 💬 5 📌 11
Bellwether Postdoctoral Scholar - School of Information University of California, Berkeley is hiring. Apply now!

Postdoc positions at UC Berkeley, including with the fabulous Cultural Analytics group: aprecruit.berkeley.edu/JPF05222

16.02.2026 19:10 👍 40 🔁 26 💬 2 📌 1

I asked Gemini to "defend itself," and say what the big benefits of LLMs have been since 2020:

"Since 2020, the volume of digital noise has increased, and LLMs have provided the first reliable shield against it."

15.02.2026 15:18 👍 18 🔁 1 💬 3 📌 1