
David Smith

@dasmiq

Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.

5,356
Followers
299
Following
399
Posts
01.09.2023
Joined

Latest posts by David Smith @dasmiq

Generative AI & Fictionality: How Novels Power Large Language Models Generative models, like the one in ChatGPT, are powered by their training data. The models are simply next-word predictors, based on patterns learned from vast amounts of pre-existing text. Since the ...

Excited to share the pre-print for a forthcoming article in NLH with @richardjeanso.bsky.social 🎉

Generative AI & Fictionality: How Novels Power Large Language Models

arxiv.org/abs/2603.01220

09.03.2026 16:41 👍 8 🔁 2 💬 1 📌 0
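The post above describes generative models as next-word predictors trained on patterns in text. A toy sketch of that idea, using simple bigram counts rather than the neural architecture actual LLMs use:

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which
# in a tiny corpus, then predict the most frequent follower.
corpus = "the cat sat on the mat and the cat slept".split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    """Return the most frequently observed next word, or None."""
    counts = followers[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once
```

Real LLMs replace the count table with a neural network that generalizes across contexts, but the training objective is the same: predict the next token from what came before.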
Poster advertising lectures on "Raisonnement Philologique et Modèles Informatiques", starting at 4pm, Thursday, March 12, at 54 Boulevard Raspail, Paris.

Paris friends! Amis parisiens ! This Thursday is the first of four public lectures I'm giving on AI and philology, broadly defined: "Philological Reasoning and Computational Models." The advertisement is in French, but the lectures are in English. I'd also love to meet while I'm here in March! 1/

09.03.2026 08:34 👍 21 🔁 15 💬 1 📌 1

We have a new article with Digital Classicist Online: Towards a smart edition of Apollodorus

09.03.2026 13:14 👍 7 🔁 3 💬 0 📌 0

Thanks to Daniel Stoekl and Peter Stokes and the École Pratique des Hautes Études for their kind invitation. I'm looking forward to it! 6/

09.03.2026 08:34 👍 0 🔁 0 💬 0 📌 0

"Cross-Language Influence and Explainable Translation" considers the process of reading in translation, going beyond the single output of a machine-translation system to support philological reasoning about cross-language textual relationships. 5/

09.03.2026 08:34 👍 0 🔁 0 💬 1 📌 0

Because texts do not exist in isolation, "Modeling Textual Networks" proceeds to develop models of the relationships between texts and apply them to commentary, historical, and scholarly traditions such as midrash and early modern philology. 4/

09.03.2026 08:34 👍 0 🔁 0 💬 1 📌 0

"Evidence and Explanations in Text Transcription" turns to the task of transcribing digital images of particular textual witnesses using OCR/HTR and the need/opportunity to provide user-directed explanations for the systems decisions. 3/

09.03.2026 08:34 👍 0 🔁 0 💬 1 📌 0

"Textual Criticism as Language Modeling" starts with the fundamental philological technology of textual criticism and casts it as a process of building statistical models of a range of possible readings, i.e., language models. 2/

09.03.2026 08:34 👍 0 🔁 0 💬 1 📌 0

LLM as Critic as Artist

06.03.2026 12:12 👍 2 🔁 0 💬 0 📌 0
When it Rains, it Pours: Modeling Media Storms and the News Ecosystem Benjamin Litterer, David Jurgens, Dallas Card. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.

If you've made it this far, you might also want to check out Amber's earlier work on media storms: www.tandfonline.com/doi/abs/10.1..., or my student Ben Litterer's (@blitt.bsky.social) ACL paper on the same topic: aclanthology.org/2023.finding...

22.02.2026 18:00 👍 4 🔁 1 💬 1 📌 0
Catching Fire Appendix (Cambridge Political Communication Element): Appendix for Catching Fire in the News

For additional details, including coding protocols, teaching resources, and side-by-side case comparisons, you can refer to the accompanying website: www.amber-boydstun.com/catching-fir...

22.02.2026 17:59 👍 4 🔁 1 💬 1 📌 0

We also discuss additional factors that can influence the course of a storm, such as journalistic gatekeeping, attention fatigue, political activism, and strategic communication online.
For a more in-depth summary, please take a look at Jill's thread here: bsky.app/profile/jill... or read the book!

22.02.2026 17:58 👍 2 🔁 1 💬 1 📌 0

The book is built around a series of paired case studies -- similar events, where one became a full-fledged media storm and the other did not -- such as the Titan Submersible Implosion vs. the Messenia Migrant Boat Disaster, occurring just days apart in 2023.

22.02.2026 17:58 👍 3 🔁 2 💬 1 📌 0

The heart of this work uses the fire triangle model (heat, fuel, and oxygen) as a metaphor to characterize the necessary conditions for an event to become a media storm -- those stories that are so pervasive in the news that they are practically inescapable.

22.02.2026 17:57 👍 5 🔁 1 💬 1 📌 0
Preview
Catching Fire in the News (Cambridge Core, Politics: General Interest)

I'm a little late in sharing this news, but thanks to the extraordinary efforts of Amber Boydstun, @jilllaufer.bsky.social, and @nlpnoah.bsky.social, our book on media storms, "Catching Fire in the News", is now published and available fully open-access from Cambridge! doi.org/10.1017/9781...

22.02.2026 17:54 👍 29 🔁 8 💬 2 📌 0
Memorization vs. generalization in deep learning: implicit biases, benign overfitting, and more Or: how I learned to stop worrying and love the memorization

What is the relationship between memorization and generalization in AI? Is there a fundamental tradeoff? In infinitefaculty.substack.com/p/memorizati... I’ve reviewed some of the evolving perspectives on memorization & generalization in machine learning, from classic perspectives through LLMs.

18.02.2026 15:54 👍 134 🔁 27 💬 4 📌 5
Re-OCR Your Digitised Collections for ~$0.002/Page – Daniel van Strien A guide to re-processing digitised collections with open-source VLM-based OCR models.

Wrote a slightly more detailed guide on how to do this with your own collections/materials: danielvanstrien.xyz/posts/2026/r...

19.02.2026 14:28 👍 28 🔁 6 💬 4 📌 1
Toward an Ontological Representation of Fictional Characters | Computational Humanities Research | Cambridge Core

New article! "Toward an Ontological Representation of Fictional Characters" by @antoine-bourgois.bsky.social, me, @oseminck.bsky.social & @tpoibeau.bsky.social

doi.org/10.1017/chr....

Nothing fancy here — only sweat & tears. 🧵

20.02.2026 13:35 👍 21 🔁 7 💬 1 📌 0
Preview
Symmetry in language statistics shapes the geometry of model representations Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM rep...

In our new preprint, we explain how some salient features of representational geometry in language modeling originate from a single principle - translation symmetry in the statistics of data.

arxiv.org/abs/2602.150...

With Dhruva Karkada, Daniel Korchinski, Andres Nava, & Matthieu Wyart.

19.02.2026 04:20 👍 37 🔁 8 💬 1 📌 0
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training (YouTube video by Google TechTalks)

I gave a talk at the Google Privacy in ML Seminar last summer on privacy & memorization: "Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training".

It's up on YouTube now if you're interested :)
youtu.be/IzIsHFCqXGo?...

18.02.2026 02:05 👍 2 🔁 2 💬 0 📌 0

We also show that we are far from done, especially for a complicated language like Old French.

But we
(1) define the issue,
(2) propose a first solution that enables pre-annotation of larger datasets, and
(3) offer an alternative to less trustworthy models that go beyond ATR.

17.02.2026 18:11 👍 3 🔁 1 💬 0 📌 0
Pre Editorial Normalization - a Hugging Face Space by comma-project Latin and Old French normalization of CATMuS output

We release:

📚 4.66M silver training samples
🧪 1.8k gold evaluation set huggingface.co/datasets/com...
🤖 ByT5-based model → 6.7% CER huggingface.co/comma-projec...

Try it here 👇
huggingface.co/spaces/comma...

17.02.2026 18:11 👍 4 🔁 1 💬 1 📌 0
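The 6.7% figure above is a character error rate (CER), conventionally the character-level edit distance between the model's output and the gold reference, divided by the reference length. A minimal illustrative sketch of that metric, not the comma-project's actual evaluation code:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # Single-row dynamic-programming edit distance over characters.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            # deletion, insertion, substitution/match
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + cost)
            prev = cur
    return dp[n] / m

print(cer("chevalier", "cheualier"))  # one substitution over 9 chars
```

A lower CER means the normalized output stays closer, character by character, to the editorial reference.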

👉 We propose Pre-Editorial Normalization (PEN):

An intermediate layer between:
📝 graphemic ATR output
📖 fully edited text

Goal: preserve palaeographic fidelity + enable usability.
Keep two layers, ATR output and normalization, with aligned tokens to go back to the source.

17.02.2026 18:11 👍 2 🔁 1 💬 1 📌 0

Recent ATR progress—especially with palaeographic datasets like CATMuS—has improved access to medieval sources.

But:
❌ Raw outputs are hard to use
❌ Fully normalized models over-normalize & hallucinate

There’s a methodological gap.

17.02.2026 18:11 👍 2 🔁 1 💬 1 📌 0

If I give you the text
📚 omnium peccatorum quia ex quo dyaconus quando esset in futurum, stultus esset

Can you find the ATR error without the manuscript?

Probably not.

ATR models that predict text and normalize in one go produce text that looks trustworthy, but they prevent us from detecting such issues.

17.02.2026 18:11 👍 1 🔁 1 💬 2 📌 0

📄 New paper:
Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin

Thibault ClΓ©rice, @rachelbawden.bsky.social , Anthony Glaise, Ariane Pinche, @dasmiq.bsky.social (2026) arxiv.org/abs/2602.13905

We introduce Pre-Editorial Normalization (PEN).

🧵⬇️

17.02.2026 18:11 👍 23 🔁 9 💬 1 📌 2

Excited to be co-organizing the #CHI2026 workshop on augmented reading interfaces 📚✨ Submissions are open for one more week! We want to know what you're working on!

06.02.2026 20:21 👍 10 🔁 2 💬 1 📌 0

our open model proving out specialized rag LMs over scientific literature has been published in nature ✌🏻

congrats to our lead @akariasai.bsky.social & team of students and Ai2 researchers/engineers

www.nature.com/articles/s41...

04.02.2026 22:43 👍 44 🔁 10 💬 2 📌 2

Cool postdoc job opportunity! A chance to work with some great English & comp sci scholars at Carnegie Mellon. Appreciate this ad stresses: chance to do interesting technical work; work on an interesting humanities problem; chance to publish both in humanities & comp sci venues. Looks great, apply!

30.01.2026 18:01 👍 5 🔁 4 💬 0 📌 0