Daniel van Strien (@danielvanstrien)

dots-ocr.py · uv-scripts/ocr at main We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Some models can also predict bounding boxes for images, charts, etc. You would still need to do the extra step of grabbing the images from the bounding boxes but it can work quite well. dots.ocr is the one I've used most for this, i.e. this mode huggingface.co/datasets/uv-...

05.03.2026 15:48 👍 0 🔁 0 💬 1 📌 0
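A minimal sketch of the "extra step" mentioned above: taking predicted bounding boxes and cutting the matching regions out of the page. The `bbox` and `category` keys, the `[x1, y1, x2, y2]` pixel layout, and the "Picture" label are assumptions about what a layout model returns, not something the post confirms. The page is a plain nested list of pixel rows so the example stays dependency-free; with Pillow, the same box tuple could be passed to `Image.crop`.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), assumed pixel coordinates


def clamp_box(bbox: Sequence[float], width: int, height: int) -> Box:
    """Round a predicted box to integers and keep it inside the page."""
    x1, y1, x2, y2 = (int(round(v)) for v in bbox)
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    return x1, y1, x2, y2


def picture_boxes(
    layout: List[Dict],
    width: int,
    height: int,
    categories: Tuple[str, ...] = ("Picture",),  # hypothetical label name
) -> List[Box]:
    """Collect non-empty boxes for the layout items we want to crop."""
    boxes = []
    for item in layout:
        if item.get("category") in categories:
            box = clamp_box(item["bbox"], width, height)
            if box[2] > box[0] and box[3] > box[1]:
                boxes.append(box)
    return boxes


def crop(pixels: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a row-major pixel grid to the given box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in pixels[y1:y2]]
```

Clamping matters because predicted boxes can overshoot the page edge by a few pixels, and a crop with a negative or out-of-range coordinate tends to fail silently rather than loudly.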

Still an early experiment. Would love feedback on whether something like this would be useful for your work!

05.03.2026 14:48 👍 0 🔁 0 💬 1 📌 0

OCR Bench Viewer - a Hugging Face Space by davanstrien This tool shows two OCR results side‑by‑side for the same picture, letting you pick which one is correct or mark a tie. You can filter comparisons by model or by which side won, and your votes cont...

Early results across 3 test collections:

• Library card catalogs → LightOn #1
• Britannica 1771 → GLM #1
• Icelandic PDFs → dots.ocr #1

Different documents, different winners!

Example space: huggingface.co/spaces/davan...

05.03.2026 14:48 👍 1 🔁 0 💬 1 📌 0

GitHub - davanstrien/ocr-bench: Per-collection OCR leaderboards using VLM-as-judge Per-collection OCR leaderboards using VLM-as-judge - davanstrien/ocr-bench

Point it at any Hugging Face dataset and it launches OCR models, compares outputs pairwise using a VLM judge, and publishes an interactive leaderboard.

Inspired by datalab's benchmarks approach, but open source so you can run it on your own collections.

github.com/davanstrien/...

05.03.2026 14:48 👍 2 🔁 0 💬 1 📌 0

Screenshot of plot showing ELO vs parameter count for different OCR models

There is no best VLM OCR model - rankings can flip completely by document type.

I built ocr-bench: run open OCR models on YOUR documents, get a per-collection leaderboard.

VLM-as-judge with Bradley-Terry ELO, all running on @hf.co. No local GPU needed.

05.03.2026 14:48 👍 48 🔁 10 💬 1 📌 1
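To make "Bradley-Terry ELO" concrete, here is a rough sketch of how pairwise judge verdicts could be turned into a ranking. This is not taken from ocr-bench: the iterative update, the `prior` smoothing, and the 1500/400 Elo-style scaling are all assumptions chosen for illustration, and the model names in the usage note are placeholders.

```python
import math
from typing import Dict, Iterable, Tuple


def bradley_terry(
    comparisons: Iterable[Tuple[str, str]],
    iterations: int = 500,
    prior: float = 0.1,  # assumed smoothing: virtual wins per ordered pair
) -> Dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Needs at least two distinct models. Strengths are normalised so that
    their geometric mean is 1.
    """
    comparisons = list(comparisons)
    models = sorted({m for pair in comparisons for m in pair})
    wins = {m: 0.0 for m in models}
    games = {(a, b): 0.0 for a in models for b in models if a != b}
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[(winner, loser)] += 1.0
        games[(loser, winner)] += 1.0
    # Smoothing keeps every strength finite, even for a model with no wins.
    for a in models:
        for b in models:
            if a != b:
                wins[a] += prior
                games[(a, b)] += 2.0 * prior

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for a in models:
            denom = sum(
                games[(a, b)] / (strength[a] + strength[b])
                for b in models
                if b != a
            )
            updated[a] = wins[a] / denom
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def to_elo(
    strength: Dict[str, float], base: float = 1500.0, scale: float = 400.0
) -> Dict[str, float]:
    """Map strengths onto an Elo-like scale (base and scale are assumptions)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}
```

Calling `to_elo(bradley_terry([("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]))` ranks `model_a` first. Because the fit uses only the comparisons it is given, running it separately per collection yields per-collection leaderboards, which is how rankings can flip by document type.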

This sounds great!

02.03.2026 17:58 👍 1 🔁 0 💬 0 📌 0

Docling Docling converts messy documents into structured data and simplifies downstream document and AI processing by detecting tables, formulas, reading order, OCR, and much more.

That sounds great! IIRC screen readers tend to work okay with Markdown format? Might also be worth exploring www.docling.ai if you didn't already.

02.03.2026 15:18 👍 0 🔁 0 💬 1 📌 0

Screenshot of a search UI showing a text box with search results showing index cards next to the ocr for the card

Is it worth re-OCR'ing old library index cards?

Re-OCR'd 453,000 from @bpl.boston.gov's rare books catalogue.

~$50 compute using @huggingface Jobs

BPL's own guide calls their search "extremely unreliable." Does better OCR + semantic search help fix it?

Demo space link below

27.02.2026 17:09 👍 41 🔁 8 💬 1 📌 0

BPL Card Catalog Search - a Hugging Face Space by davanstrien Enter a word or phrase to find items in the BPL Rare Books & Manuscripts card catalog. Choose semantic or keyword mode, set how many results you want, and view the original index‑card images (click...

huggingface.co/spaces/davan...

27.02.2026 17:09 👍 8 🔁 1 💬 1 📌 0

Ran the same OCR models on 68 pages of historic newspaper. Every model hallucinated or looped.

DeepSeek-OCR-2, LightOnOCR-2, GLM-OCR – all melt down on dense newspaper columns.

You can try yourself using this @hf.co dataset: huggingface.co/datasets/dav...

23.02.2026 14:07 👍 20 🔁 3 💬 4 📌 0
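One cheap way to flag the "looped" failure mode described above is to measure how much of an OCR output consists of repeated word n-grams. This is a heuristic sketch only, not the method used in the post: the function names, the n-gram size, and the threshold are assumptions, and legitimately repetitive pages (tables, listings) could trigger it.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Share of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(grams)


def looks_looped(text: str, n: int = 4, threshold: float = 0.3) -> bool:
    """Flag output whose repetition ratio reaches an assumed threshold."""
    return repeated_ngram_ratio(text, n) >= threshold
```

A check like this only catches looping. Hallucinated text that reads fluently would still need comparison against the page image or a judge model.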

Great to hear of some fresh eyes on this task! Think there is a lot that wasn't possible a few years ago that is now.

23.02.2026 17:26 👍 2 🔁 0 💬 0 📌 0

Looking forward to reading it! Looking forward to it even more if it comes with data 😛

23.02.2026 14:26 👍 1 🔁 0 💬 0 📌 0

llama.cpp logo + Hugging Face logo

Llama.cpp joins Hugging Face

github.com/ggml-org/lla...

20.02.2026 14:04 👍 54 🔁 7 💬 2 📌 1

Nice!

20.02.2026 21:27 👍 0 🔁 0 💬 0 📌 0

Re-OCR Your Digitised Collections for ~$0.002/Page – Daniel van Strien A guide to re-processing digitised collections with open-source VLM-based OCR models.

Wrote a slightly more detailed guide on how to do this with your own collections/materials: danielvanstrien.xyz/posts/2026/r...

19.02.2026 14:28 👍 28 🔁 6 💬 4 📌 1

@willwhim.com bsky.app/profile/dani...

19.02.2026 14:28 👍 5 🔁 0 💬 0 📌 0

Yeah, quality is very mixed by language. I have a vague recollection of someone working a lot on Sanskrit OCR using open models on the Hub. Will post if I remember where that was!

19.02.2026 13:17 👍 3 🔁 0 💬 0 📌 0

Will try to write something a bit more detailed for this!

19.02.2026 12:58 👍 2 🔁 0 💬 1 📌 0

uv-scripts/ocr · Datasets at Hugging Face

OCR scripts: huggingface.co/datasets/uv-...
Dataset: huggingface.co/datasets/dav...

19.02.2026 11:29 👍 16 🔁 0 💬 1 📌 0

Screenshot of old vs new ocr. old ocr text is garbled. New ocr much cleaner.

Re-OCR'd the complete 1771 Encyclopaedia Britannica (2,724 pages) with a single command on @hf.co Jobs.

- 0.9B model (GLM-OCR)
- ~$0.002/page
- ~$5 total on an L4 GPU

Before (old Tesseract ocr) → After

19.02.2026 11:29 👍 96 🔁 16 💬 5 📌 6
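A quick sanity check of the cost figures quoted in these posts, using only the numbers given (2,724 pages at ~$0.002/page, and 453,000 cards for ~$50). The helper names are made up for illustration, and the inputs are the posts' own rounded estimates, so the outputs are rough too.

```python
def total_cost(pages: int, cost_per_page: float) -> float:
    """Estimated total spend for a collection at a flat per-page rate."""
    return pages * cost_per_page


def cost_per_item(total: float, items: int) -> float:
    """Back out the per-item rate from a total spend."""
    return total / items


# 1771 Encyclopaedia Britannica: 2,724 pages at ~$0.002/page
britannica_total = total_cost(2724, 0.002)  # about 5.45 dollars

# BPL card catalogue: ~$50 for 453,000 cards
bpl_per_card = cost_per_item(50.0, 453_000)  # about 0.00011 dollars per card

print(f"Britannica total: ~${britannica_total:.2f}")
print(f"BPL per card: ~${bpl_per_card:.5f}")
```

The Britannica estimate comes out slightly above the quoted "~$5 total", which fits both figures being rounded. The per-card rate is far lower than the per-page rate, which is plausible given how little text an index card holds compared with an encyclopaedia page.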

table of contents showing ocr models supported in the repo

The uv-scripts/ocr collection now includes 13 models, including GLM-OCR, a 0.9B model that scores 94.6% on OmniDocBench.

One command to run any of them on your dataset via @hf.co Jobs.

huggingface.co/datasets/uv-...

17.02.2026 15:06 👍 10 🔁 1 💬 0 📌 0

Spaces Configuration Reference

If you have the model ID in the spaces demo code, it will often get picked up automatically. Otherwise, you can specify the model in the `models` field in the space YAML metadata, see huggingface.co/docs/hub/en/...

16.02.2026 10:10 👍 1 🔁 0 💬 1 📌 0

Join us tomorrow for a demo of IIIF Illustration Detector!

Zoom link: iiif.io/community

10.02.2026 17:22 👍 3 🔁 3 💬 0 📌 0

ArXiv New ML Datasets - a Hugging Face Space by librarian-bots This tool lets you search arXiv computer‑science papers that are predicted to present new machine‑learning datasets. Enter a keyword or use semantic search, then narrow results by research category...

Semantic search, confidence filtering, updated weekly using Hugging Face Jobs.

Powered by a fine-tuned ModernBERT classifier. Full dataset stored in Lance format on the Hub with vector embeddings.

huggingface.co/spaces/libra...

09.02.2026 10:13 👍 0 🔁 0 💬 0 📌 0

Datasets and benchmarks drive AI progress, but finding papers that introduce new ones means digging through thousands of arXiv abstracts.

Updated the Dataset Papers on ArXiv app to surface them: 52K+ papers classified as introducing new datasets from 212K CS papers.

09.02.2026 10:13 👍 9 🔁 1 💬 1 📌 0
