please let me know if you ever want to chat about any of this. I can't promise I have anything useful to say, but I do have plenty to say about this. And am of course always around to listen.
like sometimes life is art, specifically an absurdist Beckett play
arstechnica.com/tech-policy/...
someone sent me this from the other place and this timeline really is something else
Not a perfect fit to the exact query I don't think, but I like this note as a starting place: lawreview.uchicago.edu/sites/defaul...
@jackbalkin.bsky.social
(lucky for everyone that I'm too lazy to write a blog post)
Yes, I have published at that track before, and related ones. But I'm not eager to again. Getting into that is maybe worth a blog post.
No I did not write/submit this paper to the ICML position paper track. Like many (but of course not all) papers submitted there, I think this is at most a blog post (where "at most" is a very generous upper bound, because the ~300 characters above almost certainly are enough).
Position: ML conferences should consider removing the position paper track
(...and just acknowledge that every scientific paper is articulating at least one position)
(This is all to say, I've been shocked at some of what I've heard coming out of industry. My assumption used to be that they knew a lot more about this than they seem to.)
I think partially yes. There definitely are full-time applied and research people working on data curation as a topic. But there are a ton of gaps/things that might seem surprising here. E.g., making corpus-level decisions doesn't always tell you much about the underlying training data examples.
Am also concerned about this, but it's not clear to me that companies even know everything that's included. I suppose "use it all" is an editorial decision, though.
I just had a paper I reviewed months ago be "desk rejected" by ICLR for this reason. (It's arguably not a desk rejection after 3 reviewers already chimed in.) But, this seems to be where things are headed.
Even if chucking the papers outright is undesirable (hallucination checkers are not error-free), I'm disappointed there's no process at all other than "oops, you can go fix it if you care to."
(though going forward, I wouldn't be sad if I had a bit more compute)
One of my favorite responses to questions about compute in my work this year is "it's expensive, yes, but I had to develop some efficient algos and write some efficient code to make this possible. This work was done at odd hours on 4 A100s shared by a dozen people."
note that i said "ML" and "copyright," which are very specific things that i actually think have very little to do with the anger i'm referring to
it's hard to work at the intersection of ML and copyright because "both sides" of the debate are angry and, in my experience, most haven't done much of the background reading in ML or copyright to have an informed opinion. it's just vibes and anger. i should probably write something up about this.
got to experience the "I did not write that headline" phenomenon firsthand
The article: "Correctly scoping a legal safe harbor for A.I.-generated child sexual abuse material testing is tough."
The headline: "There's One Easy Solution to the A.I. Porn Problem"
After twelve years of work, the world's most beautiful subway station has been inaugurated in Rome: Colosseo, an underground archaeological museum.
It's been quite the experience seeing the responses to this work (across the spectrum). I've been working in this area since 2020 & am very grateful to have amazing collaborators + mentors who've supported me along the way (only a few on bsky) @pamelasamuelson.bsky.social @zephoria.bsky.social
our research on memorization and copyright (with @jtlg.bsky.social ) from 2024: scholarship.kentlaw.iit.edu/cklawreview/vol100/iss1/9/
our research (with @marklemley.bsky.social ) from May on open-weight LLMs like Llama 3.1 70B: arxiv.org/abs/2505.12546
For those interested in the details:
our recent work on production LLMs like Claude 3.7 Sonnet: arxiv.org/abs/2601.02671
The Atlantic posted an article about memorization and generative AI, and it mentions our work on extraction of books from production LLMs and open-weight models.
www.theatlantic.com/technology/2...
The referenced work reflects research with @marklemley.bsky.social @jtlg.bsky.social and others.
Happy you found our work interesting! Linking to the open-weight model extraction paper @marklemley.bsky.social was referring to:
arxiv.org/abs/2505.12546
(Indexing on the word "often")
important disclaimer that our research (and the other papers referenced in this article) don't really capture if they "often just repeat what they have seen elsewhere"
Me too. Like every time I want to move on I get sucked back in.
Eg, there's an information-theoretic sense where the database analogy is correct, but it'll be entirely misunderstood if we go that route bc of common perceptions of what a database is. And so that runs more risk than it's worth imo, since the goal here is wider understanding / conceptual clarity.
Will update you on what we come up with for that law paper that's due in June. I truly don't know how to do this yet.