Workshop on Multilingual Data Quality Signals

@wmdqs

The first iteration of our workshop will be co-located with @colmweb.org 2025 in Montreal. https://wmdqs.org/

10 Followers, 9 Following, 18 Posts, Joined 16.07.2025

Latest posts by Workshop on Multilingual Data Quality Signals @wmdqs

If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...

10.10.2025 20:52 👍 4 🔁 4 💬 0 📌 0

Thank you everyone for coming to WMDQS (pronounced "whim ducks")!

10.10.2025 20:50 👍 3 🔁 2 💬 1 📌 0

Then we had our second poster session for our paper submissions. The full papers are available on our website!

10.10.2025 20:49 👍 2 🔁 0 💬 1 📌 0

After lunch, @sebnagel.bsky.social gave a keynote about the data collected by @commoncrawl.bsky.social!

10.10.2025 20:46 👍 2 🔁 1 💬 1 📌 0

David Adelani gave a keynote about text quality for low-resource languages.

10.10.2025 16:17 👍 3 🔁 0 💬 1 📌 0

We had our first poster session, hearing from some of our shared task participants!

10.10.2025 16:17 👍 3 🔁 0 💬 1 📌 0

We presented the results of our shared task! We received annotations for over 30,000 documents representing over 60 languages. We also showed the results of our LangID dataset and system shared task tracks. Thank you to everyone who participated!

10.10.2025 16:17 👍 3 🔁 0 💬 1 📌 0

We started with a keynote from @juliakreutzer.bsky.social about multilingual fine-tuning data!

10.10.2025 16:17 👍 3 🔁 0 💬 1 📌 0

WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025

10.10.2025 16:17 👍 2 🔁 3 💬 1 📌 0

Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩

09.10.2025 23:16 👍 6 🔁 3 💬 0 📌 0
1st Workshop on Multilingual Data Quality Signals (WMDQS): A workshop addressing multilingual data quality. Held on the 10th October 2025 in Montréal.

See our updated website for more details: wmdqs.org

09.10.2025 20:17 👍 3 🔁 0 💬 0 📌 0

We will also have a session on our shared task, which was about improving language identification models. Participants of the shared task contributed annotations to create a new LangID dataset and also submitted new LangID systems.

09.10.2025 20:17 👍 3 🔁 0 💬 1 📌 0

Our third and final keynote will be from @sebnagel.bsky.social about the data in Common Crawl.

09.10.2025 20:17 👍 3 🔁 0 💬 1 📌 0

Our second keynote will be by David Adelani about text quality for low-resource languages.

09.10.2025 20:17 👍 3 🔁 0 💬 1 📌 0

Our first keynote will be from @juliakreutzer.bsky.social about data for multilingual fine-tuning.

09.10.2025 20:17 👍 3 🔁 0 💬 1 📌 0

In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.

09.10.2025 20:17 👍 3 🔁 3 💬 1 📌 1

If you want to help us improve language and cultural coverage and build an open-source LangID system, please register for our shared task on Language Identification! 💬

Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/

Deadline: July 23, 2025 (AoE) ⏰

21.07.2025 22:40 👍 2 🔁 2 💬 0 📌 0
Common Crawl - Blog - WMDQS Shared Task on Language Identification

The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the 1st shared task on Language Identification for web data.

commoncrawl.org/blog/wmdqs-s...

21.07.2025 22:34 👍 6 🔁 5 💬 0 📌 1
Dynabench

Contribute annotations here: dynabench.org/tasks/text-l...

21.07.2025 18:07 👍 1 🔁 0 💬 0 📌 0

For context: bsky.app/profile/cath...

21.07.2025 18:07 👍 1 🔁 0 💬 1 📌 0

We've added lots more documents/languages and extended the deadline for the first round of annotations until July 23rd. Check out the details below 👇

21.07.2025 18:07 👍 1 🔁 0 💬 1 📌 0

One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc

09.06.2025 15:44 👍 4 🔁 3 💬 1 📌 1