Excited to share the pre-print for a forthcoming article in NLH with @richardjeanso.bsky.social
Generative AI & Fictionality: How Novels Power Large Language Models
arxiv.org/abs/2603.01220
@dasmiq
Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.
Poster advertising lectures on "Raisonnement Philologique et Modèles Informatiques," starting at 4pm, Thursday, March 12, at 54 Boulevard Raspail, Paris.
Paris friends! Amis parisiens ! This Thursday is the first of four public lectures I'm giving on AI and philology, broadly defined: "Philological Reasoning and Computational Models." The advertisement is in French, but the lectures are in English. I'd also love to meet while I'm here in March! 1/
We have a new article with Digital Classicist Online: Towards a smart edition of Apollodorus
Thanks to Daniel Stoekl and Peter Stokes and the École Pratique des Hautes Études for their kind invitation. I'm looking forward to it! 6/
"Cross-Language Influence and Explainable Translation" considers the process of reading in translation, going beyond the single output of a machine-translation system to support philological reasoning about cross-language textual relationships. 5/
Because texts do not exist in isolation, "Modeling Textual Networks" proceeds to develop models of the relationships between texts and apply them to commentary, historical, and scholarly traditions such as midrash and early modern philology. 4/
"Evidence and Explanations in Text Transcription" turns to the task of transcribing digital images of particular textual witnesses using OCR/HTR and the need/opportunity to provide user-directed explanations for the system's decisions. 3/
"Textual Criticism as Language Modeling" starts with the fundamental philological technology of textual criticism and casts it as a process of building statistical models of a range of possible readings, i.e., language models. 2/
LLM as Critic as Artist
If you've made it this far, you might also want to check out Amber's earlier work on media storms: www.tandfonline.com/doi/abs/10.1..., or my student Ben Litterer's (@blitt.bsky.social) ACL paper on the same topic: aclanthology.org/2023.finding...
For additional details, including coding protocols, teaching resources, and side-by-side case comparisons, you can refer to the accompanying website: www.amber-boydstun.com/catching-fir...
We also discuss additional factors that can influence the course of a storm, such as journalistic gatekeeping, attention fatigue, political activism, and strategic communication online.
For a more in-depth summary, please take a look at Jill's thread here: bsky.app/profile/jill... or read the book!
The book is built around a series of paired case studies -- similar events, where one became a full-fledged media storm and the other did not -- such as the Titan Submersible Implosion vs. the Messenia Migrant Boat Disaster, occurring just days apart in 2023.
The heart of this work uses the fire triangle model (heat, fuel, and oxygen) as a metaphor to characterize the necessary conditions for an event to become a media storm -- those stories that are so pervasive in the news that they are practically inescapable.
I'm a little late in sharing this news, but thanks to the extraordinary efforts of Amber Boydstun, @jilllaufer.bsky.social, and @nlpnoah.bsky.social, our book on media storms, "Catching Fire in the News", is now published and available fully open-access from Cambridge! doi.org/10.1017/9781...
What is the relationship between memorization and generalization in AI? Is there a fundamental tradeoff? In infinitefaculty.substack.com/p/memorizati... I've reviewed some of the evolving perspectives on memorization & generalization in machine learning, from classic perspectives through LLMs.
Wrote a slightly more detailed guide on how to do this with your own collections/materials: danielvanstrien.xyz/posts/2026/r...
New article! "Toward an Ontological Representation of Fictional Characters" by @antoine-bourgois.bsky.social, me, @oseminck.bsky.social & @tpoibeau.bsky.social
doi.org/10.1017/chr....
Nothing fancy here, only sweat & tears. 🧵
In our new preprint, we explain how some salient features of representational geometry in language modeling originate from a single principle - translation symmetry in the statistics of data.
arxiv.org/abs/2602.150...
With Dhruva Karkada, Daniel Korchinski, Andres Nava, & Matthieu Wyart.
I gave a talk at the Google Privacy in ML Seminar last summer on privacy & memorization: "Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training".
It's up on YouTube now if you're interested :)
youtu.be/IzIsHFCqXGo?...
We also show that we are far from done, especially for a complicated language like Old French.
But we
(1) define the issue,
(2) propose a first solution that enables pre-annotation of larger datasets, and
(3) offer an alternative to less trustworthy models that go beyond ATR.
We release:
4.66M silver training samples
🧪 1.8k gold evaluation set: huggingface.co/datasets/com...
🤗 ByT5-based model, 6.7% CER: huggingface.co/comma-projec...
Try it here:
huggingface.co/spaces/comma...
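For readers unfamiliar with the metric behind that 6.7% figure: character error rate (CER) is standardly computed as the character-level edit distance between the model's transcription and the reference, divided by the reference length. A minimal sketch (my own illustration, not the paper's evaluation code):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    # (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    # Character error rate: edit distance normalized by reference length.
    return levenshtein(hypothesis, reference) / len(reference)

# One missing character against a 17-character reference -> 1/17 ~ 5.9% CER.
print(cer("omnium pecatorum", "omnium peccatorum"))
```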
We propose Pre-Editorial Normalization (PEN):
An intermediate layer between:
- graphemic ATR output
- fully edited text
Goal: preserve palaeographic fidelity + enable usability.
Keep two layers, the ATR output and the normalization, with aligned tokens to trace back to the source.
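As a rough illustration of that two-layer idea (field and class names are my own, not the paper's data model): each line keeps both the raw ATR tokens and the normalized tokens, plus an alignment so any normalized token can be traced back to its graphemic source.

```python
from dataclasses import dataclass

@dataclass
class PenLine:
    # Hypothetical container for one transcribed line: the raw (graphemic)
    # ATR layer, the normalized layer, and a token-level alignment mapping
    # each normalized token back to the raw token(s) it came from.
    atr_tokens: list[str]
    norm_tokens: list[str]
    alignment: list[tuple[int, int]]  # (norm_index, atr_index) pairs

    def source_of(self, norm_index: int) -> list[str]:
        # Trace one normalized token back to its ATR source tokens.
        return [self.atr_tokens[a] for n, a in self.alignment if n == norm_index]

# Abbreviated example: "dyaconus" is normalized to "diaconus",
# while the alignment preserves the path back to the raw reading.
line = PenLine(
    atr_tokens=["quo", "dyaconus"],
    norm_tokens=["quo", "diaconus"],
    alignment=[(0, 0), (1, 1)],
)
print(line.source_of(1))  # -> ['dyaconus']
```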
Recent ATR progress, especially with palaeographic datasets like CATMuS, has improved access to medieval sources.
But:
❌ Raw outputs are hard to use
❌ Fully normalized models over-normalize & hallucinate
There's a methodological gap.
If I give you the text
"omnium peccatorum quia ex quo dyaconus quando esset in futurum, stultus esset"
Can you find the ATR error without the manuscript?
Probably not.
ATR models that transcribe and normalize in one go produce text that looks trustworthy, but they prevent such errors from being detected.
New paper:
Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin
Thibault Clérice, @rachelbawden.bsky.social, Anthony Glaise, Ariane Pinche, @dasmiq.bsky.social (2026) arxiv.org/abs/2602.13905
We introduce Pre-Editorial Normalization (PEN).
🧵⬇️
Excited to be co-organizing the #CHI2026 workshop on augmented reading interfaces ✨ Submissions are open for one more week! We want to know what you're working on!
Our open model, proving out specialized RAG LMs over scientific literature, has been published in Nature.
congrats to our lead @akariasai.bsky.social & team of students and Ai2 researchers/engineers
www.nature.com/articles/s41...
Cool postdoc job opportunity! A chance to work with some great English & comp sci scholars at Carnegie Mellon. I appreciate that this ad stresses: the chance to do interesting technical work, on an interesting humanities problem, with the chance to publish in both humanities & comp sci venues. Looks great, apply!