please let me know if you ever want to chat about any of this. I can't promise I have anything useful to say, but I do have plenty to say about this. And am of course always around to listen.
like sometimes life is art, specifically an absurdist Beckett play
arstechnica.com/tech-policy/...
someone sent me this from the other place and this timeline really is something else
Not a perfect fit to the exact query I don't think, but I like this note as a starting place: lawreview.uchicago.edu/sites/defaul...
@jackbalkin.bsky.social
(lucky for everyone that I'm too lazy to write a blog post)
Yes, I have published at that track before, and related ones. But I'm not eager to again. Getting into that is maybe worth a blog post.
No I did not write/submit this paper to the ICML position paper track. Like many (but of course not all) papers submitted there, I think this is at most a blog post (where "at most" is a very generous upper bound, because the ~300 characters above almost certainly are enough).
Position: ML conferences should consider removing the position paper track
(...and just acknowledge that every scientific paper is articulating at least one position)
(This is all to say, I've been shocked at some of what I've heard coming out of industry. My assumption used to be that they knew a lot more about this than they seem to.)
I think partially yes. There definitely are full-time applied and research people working on data curation as a topic. But there are a ton of gaps/things that might seem surprising here. E.g., making corpus-level decisions doesn't always tell you much about the underlying training data examples.
Am also concerned about this, but it's not clear to me that companies even know everything that's included. I suppose "use it all" is an editorial decision, though.
I just had a paper I reviewed months ago be "desk rejected" by ICLR for this reason. (It's arguably not a desk rejection after 3 reviewers already chimed in.) But, this seems to be where things are headed.
Even if chucking the papers outright is undesirable (hallucination checkers are not error-free), I'm disappointed there's no process at all other than "oops, you can go fix it if you care to."
(though going forward, I wouldn't be sad if I had a bit more compute)
One of my favorite responses to questions about compute in my work this year is "it's expensive, yes, but I had to develop some efficient algos and write some efficient code to make this possible. This work was done at odd hours on 4 A100s shared by a dozen people."
note that i said "ML" and "copyright," which are very specific things that i actually think have very little to do with the anger i'm referring to
it's hard to work at the intersection of ML and copyright because "both sides" of the debate are angry and, in my experience, most haven't done much of the background reading in ML or copyright to have an informed opinion. it's just vibes and anger. i should probably write something up about this.
got to experience the "I did not write that headline" phenomenon firsthand
The article: "Correctly scoping a legal safe harbor for A.I.-generated child sexual abuse material testing is tough."
The headline: "There's One Easy Solution to the A.I. Porn Problem"
After twelve years of work, the world's most beautiful subway station has been inaugurated in Rome: Colosseo, an underground archaeological museum.
It's been quite the experience seeing the responses to this work (across the spectrum). I've been working in this area since 2020 & am very grateful to have amazing collaborators + mentors who've supported me along the way (only a few on bsky) @pamelasamuelson.bsky.social @zephoria.bsky.social
our research on memorization and copyright (with @jtlg.bsky.social ) from 2024: scholarship.kentlaw.iit.edu/cklawreview/vol100/iss1/9/
our research (with @marklemley.bsky.social ) from May on open-weight LLMs like Llama 3.1 70B: arxiv.org/abs/2505.12546
For those interested in the details:
our recent work on production LLMs like Claude 3.7 Sonnet: arxiv.org/abs/2601.02671
The Atlantic posted an article about memorization and generative AI, and it mentions our work on extraction of books from production LLMs and open-weight models.
www.theatlantic.com/technology/2...
The referenced work reflects research with @marklemley.bsky.social @jtlg.bsky.social and others.
Happy you found our work interesting! Linking to the open-weight model extraction paper @marklemley.bsky.social was referring to:
arxiv.org/abs/2505.12546
(Indexing on the word "often")
important disclaimer that our research (and the other papers referenced in this article) don't really capture if they "often just repeat what they have seen elsewhere"
Me too. Like every time I want to move on I get sucked back in.
Eg, there's an information-theoretic sense where the database analogy is correct, but it'll be entirely misunderstood if we go that route bc of common perceptions of what a database is. And so that runs more risk than it's worth imo, since the goal here is wider understanding / conceptual clarity.
Will update you on what we come up with for that law paper that's due in June. I truly don't know how to do this yet.