
Laurie Burchell

@very-laurie

Senior Research Engineer with the Common Crawl Foundation. (languages ∪ tech) in Dùn Èideann

180 Followers
441 Following
100 Posts
Joined 10.08.2024

Latest posts by Laurie Burchell @very-laurie

Join the EleutherAI Discord Server! The original open science AI research collective. We started the open source LLM movement and have been pushing the boundaries of science ever since. | 33740 members

Happening now! @pjox.bsky.social and I are giving a talk for @eleutherai.bsky.social on CommonLID, a community-driven web domain evaluation dataset for language identification. Join here: discord.gg/aYy3Se7Q?eve...

Paper: arxiv.org/abs/2601.18026

@commoncrawl.bsky.social

25.02.2026 15:56 👍 3 🔁 1 💬 0 📌 0
Language Identification — Multi-Model Demo

Love this widget by Daan van Esch: daanvanesch.nl/langid/index... - compare language ID predictions in your browser!

22.02.2026 13:25 👍 1 🔁 0 💬 1 📌 0
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...

Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026

13.02.2026 19:27 👍 22 🔁 12 💬 1 📌 0

"The true genius is a mind of large general powers, accidentally determined to some particular direction."

~ Samuel Johnson

12.02.2026 15:55 👍 288 🔁 41 💬 10 📌 0
Examples of mislabeled web text by existing LangID systems. A full text version is available on the blog post below.

Language identification still proves to be a challenging task, especially for web data. In collaboration with @mlcommons.org @eleutherai.bsky.social @jhu.edu and 97 community members, we created CommonLID, a new benchmark for LangID for 100+ languages!

10.02.2026 20:44 👍 11 🔁 5 💬 1 📌 0

I'm learning Rust right now and I'd recommend the experimental version of the Rust Book: rust-book.cs.brown.edu. The quizzes help you understand that you don't actually understand ownership 👍

17.01.2026 13:30 👍 2 🔁 0 💬 0 📌 0

I feel obliged to share this masterpiece from my brother:

14.01.2026 11:17 👍 1349 🔁 404 💬 20 📌 8
Laurie Burchell at a lectern presenting her Turing Seminar talk

Laurie Burchell at a lectern, with a blackboard behind her, presenting her Turing Seminar talk

A huge thank you to @very-laurie.bsky.social for delivering a fantastic UoB Turing seminar. Her talk was entitled “Common Crawl: open web data for everybody.”

In this talk, she introduced the @commoncrawl.bsky.social and the data products they offer.

27.11.2025 13:05 👍 6 🔁 2 💬 0 📌 0
Turing Seminar event graphic with details of the event and headshots of Laurie Burchell and Thom Vaughan to the right

The Turing Liaison Team is excited to host @very-laurie.bsky.social and Thom Vaughan to introduce the @commoncrawl.bsky.social and the data products it offers.

📆 26 November
⏰ 13:00 - 14:00
📍C44 Biomedical building, University of Bristol

Find out more: tinyurl.com/mrxp5h2n

21.11.2025 11:04 👍 6 🔁 2 💬 0 📌 0

If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...

10.10.2025 20:52 👍 4 🔁 4 💬 0 📌 0

One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc

09.06.2025 15:44 👍 4 🔁 3 💬 1 📌 1

I've been learning by writing the tutorials 💅

Next up: using CCF's host index commoncrawl.org/blog/introdu...

12.06.2025 14:32 👍 2 🔁 0 💬 0 📌 0
Person dressed as a knight with a trans flag and a cape with a trans flag shield that says “Trans Knights are human Knights”

This absolute icon at Canterbury Pride 🩷🤍🩵

07.06.2025 12:18 👍 16926 🔁 5225 💬 76 📌 119
A six panel cartoon.

Panel one:
Title: Learn to read with Physics
Lesson One: two and three-letter words
(image of a smiling book and particle)

Panel two:
A friendly scientist enters a colourful lab carrying a coffee and papers.
Text: The Doc is in the lab.

Panel three:
She has put on eye protectors and started a machine
Text: She put on the ion ray.

Panel four:
The machine has caused a small fire. She looks aghast.
Text: Pop! The ray lit the gas.

Panel five:
She runs from the burning lab
Text: The lab is too hot!

Panel six:
She sits on the grass. In the background the ruins of the lab smoulder.
Text: The Doc is sad.

My cartoon for this week’s @newscientist.com

08.06.2025 08:17 👍 3947 🔁 671 💬 37 📌 37
1st Workshop on Multilingual Data Quality Signals

✨Call for papers!✨ @commoncrawl.bsky.social and friends are organising the 1st Workshop on Multilingual Data Quality Signals, held in tandem with @colmweb.org. Submit your research on multilingual data quality!

Submission deadline is 23 June, more info: wmdqs.org

28.05.2025 08:04 👍 3 🔁 1 💬 0 📌 0

I'm starting as a Senior Research Engineer with the Common Crawl Foundation today! 😎

26.05.2025 09:29 👍 5 🔁 0 💬 1 📌 0
A' Hobat - Gaelic Books Council
A Gaelic translation of The Hobbit by J.R.R. Tolkien, a tale of an unlikely hobbit who goes on an unexpected journey in the company of the wizard Gandalf.

#Gaelic
#Hobbit
#Ghàidhlig

EEEEEEEEEE!

www.gaelicbooks.org/explore-the-...

24.04.2025 22:18 👍 12 🔁 6 💬 0 📌 1
A comic called "how to draw goose neck". In panel 1 the goose's neck is very short and is labeled "no". In panel 2 the goose's neck is the right size and is labeled "yes". In panel 3 the goose's neck is longer and is labeled "no". In panel 4 the goose's neck is extremely long and terrifyingly snakelike and it's labeled "yessss".

A comic called The Canada Goose: A Role Model For Our Time. In panel 1, the goose is labeled "assertive" and is hissing furiously. In panel 2 it's "brave" and is standing up to a fearsome swan. In panel 3 it's "good parent" and is standing up to a dog while protecting its goslings. In panel 4 it's "team player" and is flying in a beautiful V. In panel 5 there's a close up of its head and it's labeled "crisp, modern aesthetic". In panel 6 it's eating grass and pooping and labeled "high fiber diet".

There's a lot of talk about Canada Geese and whether they're good and my answer is Yes.

16.04.2025 16:07 👍 6160 🔁 869 💬 262 📌 61

omg who is she

24.03.2025 10:19 👍 2 🔁 0 💬 0 📌 0

I'm seeing way too much paper-white skin exposed to the cold over the last few days, Scotland is built different

20.03.2025 14:19 👍 2 🔁 0 💬 0 📌 0

I'm part of this! There's also a paper: arxiv.org/abs/2503.10267

17.03.2025 13:27 👍 6 🔁 3 💬 0 📌 0
28.02.2025 12:39 👍 2094 🔁 335 💬 16 📌 6

(she's very proud)

09.02.2025 11:14 👍 3 🔁 0 💬 0 📌 0

I told my mum that my model was in the top ten most downloaded models on HF last month, big "are ya winning son?" energy

09.02.2025 11:13 👍 2 🔁 0 💬 1 📌 0

how are there 5.7M downloads of a model I didn't advertise, I am suspicious

08.02.2025 13:50 👍 1 🔁 0 💬 1 📌 0
a screenshot of the top-ten most downloaded models on Hugging Face

I'm nerd famous

08.02.2025 13:46 👍 8 🔁 1 💬 1 📌 1

What's the title?

08.02.2025 08:18 👍 0 🔁 0 💬 0 📌 0

Jesus. Twister in Donegal right now.

24.01.2025 07:58 👍 2594 🔁 602 💬 125 📌 55

My replacement PIR finally arrived! Hoping it doesn't short this time 🙏

23.01.2025 11:26 👍 1 🔁 0 💬 0 📌 0