Join the EleutherAI Discord Server!
The original open science AI research collective. We started the open source LLM movement and have been pushing the boundaries of science ever since. | 33740 members
Happening now! @pjox.bsky.social and I are giving a talk for @eleutherai.bsky.social on CommonLID, a community-driven web domain evaluation dataset for language identification. Join here: discord.gg/aYy3Se7Q?eve...
Paper: arxiv.org/abs/2601.18026
@commoncrawl.bsky.social
25.02.2026 15:56
👍 3
🔁 1
💬 0
📌 0
Language Identification — Multi-Model Demo
Love this widget by Daan van Esch: daanvanesch.nl/langid/index... - compare language ID predictions in your browser!
22.02.2026 13:25
👍 1
🔁 0
💬 1
📌 0
"The true genius is a mind of large general powers, accidentally determined to some particular direction."
~ Samuel Johnson
12.02.2026 15:55
👍 288
🔁 41
💬 10
📌 0
Examples of mislabeled web text by existing LangID systems. A full text version is available on the blog post below.
Language identification still proves to be a challenging task, especially for web data. In collaboration with @mlcommons.org @eleutherai.bsky.social @jhu.edu and 97 community members, we created CommonLID, a new benchmark for LangID for 100+ languages!
10.02.2026 20:44
👍 11
🔁 5
💬 1
📌 0
I'm learning Rust right now and I'd recommend the experimental version of the Rust Book: rust-book.cs.brown.edu. The quizzes help you understand that you don't actually understand ownership 👍
17.01.2026 13:30
👍 2
🔁 0
💬 0
📌 0
I feel obliged to share this masterpiece from my brother:
14.01.2026 11:17
👍 1349
🔁 404
💬 20
📌 8
Laurie Burchell at a lectern presenting her Turing Seminar talk
Laurie Burchell at a lectern, with a blackboard behind her, presenting her Turing Seminar talk
A huge thank you to @very-laurie.bsky.social for delivering a fantastic UoB Turing seminar. Her talk was entitled “Common Crawl: open web data for everybody.”
In this talk, she introduced the @commoncrawl.bsky.social and the data products they offer.
27.11.2025 13:05
👍 6
🔁 2
💬 0
📌 0
Turing Seminar event graphic with details of the event and headshots of Laurie Burchell and Thom Vaughan to the right
The Turing Liaison Team is excited to host @very-laurie.bsky.social and Thom Vaughan to introduce the @commoncrawl.bsky.social and the data products it offers.
📆 26 November
⏰ 13:00 - 14:00
📍C44 Biomedical building, University of Bristol
Find out more: tinyurl.com/mrxp5h2n
21.11.2025 11:04
👍 6
🔁 2
💬 0
📌 0
If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...
10.10.2025 20:52
👍 4
🔁 4
💬 0
📌 0
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc
09.06.2025 15:44
👍 4
🔁 3
💬 1
📌 1
I've been learning by writing the tutorials 💅
Next up: using CCF's host index commoncrawl.org/blog/introdu...
12.06.2025 14:32
👍 2
🔁 0
💬 0
📌 0
Person dressed as a knight with a trans flag and a cape with a trans flag shield that says “Trans Knights are human Knights”
This absolute icon at Canterbury Pride 🩷🤍🩵
07.06.2025 12:18
👍 16926
🔁 5225
💬 76
📌 119
A six panel cartoon.
Panel one:
Title: Learn to read with Physics
Lesson One: two and three-letter words
(image of a smiling book and particle)
Panel two:
A friendly scientist enters a colourful lab carrying a coffee and papers.
Text: The Doc is in the lab.
Panel three:
She has put on eye protectors and started a machine
Text: She put on the ion ray.
Panel four:
The machine has caused a small fire. She looks aghast.
Text: Pop! The ray lit the gas.
Panel five:
She runs from the burning lab
Text: The lab is too hot!
Panel six:
She sits on the grass. In the background the ruins of the lab smoulder.
Text: The Doc is sad.
My cartoon for this week’s @newscientist.com
08.06.2025 08:17
👍 3947
🔁 671
💬 37
📌 37
1st Workshop on Multilingual Data Quality Signals
✨Call for papers!✨ @commoncrawl.bsky.social and friends are organising the 1st Workshop on Multilingual Data Quality Signals, held in tandem with @colmweb.org. Submit your research on multilingual data quality!
Submission deadline is 23 June, more info: wmdqs.org
28.05.2025 08:04
👍 3
🔁 1
💬 0
📌 0
I'm starting as a Senior Research Engineer with the Common Crawl Foundation today! 😎
26.05.2025 09:29
👍 5
🔁 0
💬 1
📌 0
A comic called "how to draw goose neck". In panel 1 the goose's neck is very short and is labeled "no". In panel 2 the goose's neck is the right size and is labeled "yes". In panel 3 the goose's neck is longer and is labeled "no". In panel 4 the goose's neck is extremely long and terrifyingly snakelike and it's labeled "yessss".
A comic called The Canada Goose: A Role Model For Our Time. In panel 1, the goose is labeled "assertive" and is hissing furiously. In panel 2 it's "brave" and is standing up to a fearsome swan. In panel 3 it's "good parent" and is standing up to a dog while protecting its goslings. In panel 4 it's "team player" and is flying in a beautiful V. In panel 5 there's a close up of its head and it's labeled "crisp, modern aesthetic". In panel 6 it's eating grass and pooping and labeled "high fiber diet".
There's a lot of talk about Canada Geese and whether they're good and my answer is Yes.
16.04.2025 16:07
👍 6160
🔁 869
💬 262
📌 61
omg who is she
24.03.2025 10:19
👍 2
🔁 0
💬 0
📌 0
I'm seeing way too much paper-white skin exposed to the cold over the last few days, Scotland is built different
20.03.2025 14:19
👍 2
🔁 0
💬 0
📌 0
I'm part of this! There's also a paper: arxiv.org/abs/2503.10267
17.03.2025 13:27
👍 6
🔁 3
💬 0
📌 0
28.02.2025 12:39
👍 2094
🔁 335
💬 16
📌 6
(she's very proud)
09.02.2025 11:14
👍 3
🔁 0
💬 0
📌 0
I told my mum that my model was in the top ten most downloaded models on HF last month, big "are ya winning son?" energy
09.02.2025 11:13
👍 2
🔁 0
💬 1
📌 0
how are there 5.7M downloads of a model I didn't advertise, I am suspicious
08.02.2025 13:50
👍 1
🔁 0
💬 1
📌 0
a screenshot of the top-ten most downloaded models on Hugging Face
I'm nerd famous
08.02.2025 13:46
👍 8
🔁 1
💬 1
📌 1
What's the title?
08.02.2025 08:18
👍 0
🔁 0
💬 0
📌 0
Jesus. Twister in Donegal right now.
24.01.2025 07:58
👍 2594
🔁 602
💬 125
📌 55
My replacement PIR finally arrived! Hoping it doesn't short this time 🙏
23.01.2025 11:26
👍 1
🔁 0
💬 0
📌 0