
Mozilla Data Collective

@mozdatacollective

Create. Curate. Control. Mozilla Data Collective wants to rebuild the AI data ecosystem - with communities at the centre. https://datacollective.mozillafoundation.org

106 Followers
17 Following
39 Posts
Joined 03.09.2025

Latest posts by Mozilla Data Collective @mozdatacollective


- the government’s mixed public messaging on AI & copyright is hindering licensing
- the government should make a clear public statement that AI companies operating in the UK need to license their training data (which is the law)

4/5

06.03.2026 08:53 👍 660 🔁 74 💬 2 📌 1

They say:

- the government must not weaken copyright law, and should instead strengthen licensing, transparency & enforcement
- the government should stop prioritising large multinational tech firms

3/5

06.03.2026 08:53 👍 788 🔁 121 💬 3 📌 3

- AI training isn’t ‘learning’ and shouldn’t be treated as such

The House of Lords has been absolutely consistent on this, and they are totally right. Will the government listen?

/end

06.03.2026 08:53 👍 894 🔁 110 💬 11 📌 5
Finnish Public Domain 20th Century Literature Text Corpus | Mozilla Data Collective This corpus contains a curated collection of public domain literature from Finland, featuring works by authors who died between 1901 and 1955. The dataset captures the literary landscape of early 20th...

datacollective.mozillafoundation.org/datasets/cmm... Curated collection of public domain literature from Finland! The dataset captures the literary landscape of early 20th-century Finland and includes independent texts in both of the country's official languages: Finnish (fi) and Swedish (sv).

03.03.2026 11:45 👍 4 🔁 3 💬 1 📌 0
MDC Release Notes - 27.02.26 This week: dataset filtering, enhanced uploader request flow, API improvements, and 20 new datasets!

Discover new multilingual ASR, NLP, and TTS datasets on Mozilla Data Collective this week, and check out some site fixes and API improvements we've made!

community.mozilladatacollective.com/mdc-release-...

27.02.2026 17:31 👍 4 🔁 2 💬 0 📌 0
Mozilla Data Collective Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre.

5/5 If you want to make African languages a part of your AI training data, you can find all of the IADH's TTS uploads and more in our dataset catalogue: datacollective.mozillafoundation.org/datasets

23.02.2026 20:12 👍 0 🔁 0 💬 0 📌 0

4/5 IADH TTS datasets continued:
- Teke-Laali datacollective.mozillafoundation.org/datasets/cmj...
- Beembe datacollective.mozillafoundation.org/datasets/cmj...
- Bomitaba datacollective.mozillafoundation.org/datasets/cmj...

23.02.2026 20:12 👍 0 🔁 0 💬 1 📌 0

3/5 IADH TTS datasets:
- Ewondo datacollective.mozillafoundation.org/datasets/cml...
- Bulu datacollective.mozillafoundation.org/datasets/cml...
- Mbosi datacollective.mozillafoundation.org/datasets/cmj...
- Laari datacollective.mozillafoundation.org/datasets/cmj...

23.02.2026 20:12 👍 0 🔁 0 💬 1 📌 0

2/5 Regional TTS data is a vital resource for building accessible speech synthesis models, developing truly native TTS for regional content, and benchmarking performance on "low-resource languages". The treasure trove of data that IADH has uploaded is invaluable for the preservation of culture.

23.02.2026 20:12 👍 0 🔁 0 💬 1 📌 0

1/5 Let's highlight one of our amazing text-to-speech contributors shaping AI data for African cultures. The Institute of African Digital Humanities has uploaded thousands of TTS audio clips totalling over 6 GB of data for more than 10 locales.

23.02.2026 20:12 👍 0 🔁 1 💬 1 📌 0
TTS Javanese - Ngapak Dialect | Mozilla Data Collective This dataset captures the vibrant and dynamic linguistic variety found along the North Coast (Pantura) of Central Java Province, Indonesia. Unlike the inland varieties of Javanese which are heavily st...

- TTS Javanese - Ngapak Dialect: datacollective.mozillafoundation.org/datasets/cml...

- TTS Central Javanese: datacollective.mozillafoundation.org/datasets/cml...

20.02.2026 12:48 👍 0 🔁 0 💬 0 📌 0
TTS Javanese-Lumajang Dialect | Mozilla Data Collective The Lumajang dialect is a unique variation of the Javanese language spoken across Lumajang Regency in East Java, Indonesia. Locally known as “Arekan”, this dialect emerged from a cultural blend of Jav...

- TTS Javanese-Lumajang Dialect: datacollective.mozillafoundation.org/datasets/cml...

- Bojonegoro Javanese TTS: datacollective.mozillafoundation.org/datasets/cml...

20.02.2026 12:48 👍 0 🔁 0 💬 0 📌 0

Community contributions play an enormous role in the development of multilingual AI.

Here are 4 Javanese datasets to level up your TTS project with genuine, local dialects (all CC-BY-SA-4.0).

Links in the comments!

20.02.2026 12:45 👍 0 🔁 0 💬 2 📌 0
Gojri Literature Corpus | Mozilla Data Collective The Gojri Literature Corpus contains approximately 60,821 tokens of Gojri (Gujari) text drawn from poetry, short stories, narrative prose, and question–answer literary books. It reflects creative writi...

Check it out: datacollective.mozillafoundation.org/datasets/cml...

19.02.2026 17:07 👍 0 🔁 0 💬 0 📌 0

The languages absent from your model are the communities it will fail to serve. Cultural preservation can be powered by technology.

The Gojri (Gujari) Literature Corpus has over 60K tokens of clean, UTF-8 normalized text to make AI more accessible to Gujjar communities.

Link in the comments!

19.02.2026 17:06 👍 1 🔁 1 💬 1 📌 0
World Factbook (JSON) | Mozilla Data Collective This dataset contains the full text of the CIA World Factbook converted into machine-readable JSON. It covers over 260 world entities, organized hierarchically by region (e.g., Africa, Europe). It ca...

Check it out: datacollective.mozillafoundation.org/datasets/cml...

18.02.2026 17:31 👍 0 🔁 0 💬 0 📌 0

LLM hallucination is a data problem as much as a model problem.

Structured, machine-readable World Factbook corpus. Geopolitical, demographic & economic facts for every country – purpose-built for RAG grounding and knowledge graph construction.

Last clean snapshot before the CIA retired the site.

18.02.2026 17:30 👍 2 🔁 1 💬 3 📌 0
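
One way to picture how a hierarchical JSON corpus like the one described above could be prepared for RAG grounding is to flatten it into short, path-labelled passages that a retriever can index. The sketch below is illustrative only: the function name `to_passages` and the sample record are hypothetical placeholders, and the real dataset's schema may differ.

```python
from typing import Any, Iterator, Tuple


def to_passages(node: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Walk nested dicts/lists and yield (path label, leaf text) pairs.

    Hypothetical helper: each leaf value becomes one retrievable passage,
    labelled with the chain of keys that leads to it.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            yield from to_passages(value, path + (str(key),))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from to_passages(value, path + (str(index),))
    else:
        yield " > ".join(path), str(node)


if __name__ == "__main__":
    # Placeholder record, NOT taken from the dataset; it only mimics a
    # region -> entity -> section hierarchy for demonstration.
    sample = {
        "Region A": {
            "Country X": {
                "Geography": {"Climate": "temperate"},
                "Languages": ["Lang 1", "Lang 2"],
            }
        }
    }
    for label, text in to_passages(sample):
        print(f"{label}: {text}")
```

Keeping the key path alongside each value means a retrieved passage carries its own context (which entity and which section it came from), which is the property that makes structured data useful for grounding.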

Hi there, Echo! Can you tell us more about the specific model or architecture you're looking to fine-tune the weights of with Tatar folklore?

18.02.2026 16:06 👍 1 🔁 0 💬 0 📌 0

📣 114 hours of recorded Nahuatl with transcripts!

Get native speech from the municipalities of Zacatlán and Tepetzintla to take your NLP project to the next level.

This is what responsible AI training data looks like.

Explore the collection 👇
datacollective.mozillafoundation.org/datasets?q=n...

17.02.2026 17:37 👍 6 🔁 2 💬 0 📌 0

📚 Over 18K proverbs, 243 tales & 380 legends! Tauren's Tatar Folklore helps tech get beyond multilingual by providing multiCULTURAL high-quality datasets.

⚖️ CC0, free for NLP projects: datacollective.mozillafoundation.org/datasets/cml...

16.02.2026 17:20 👍 5 🔁 3 💬 1 📌 0

@mozilla-fr.bsky.social New datasets for French–African language translation! 🌍

datacollective.mozillafoundation.org/datasets/cml...

datacollective.mozillafoundation.org/datasets/cml...

datacollective.mozillafoundation.org/datasets/cmk...

12.02.2026 08:31 👍 2 🔁 1 💬 0 📌 0

🧑📣 1 million tokens of Balochi just arrived @mozdatacollective.bsky.social! Check out this beautiful dataset of journalistic Western Balochi (Rakhshani) datacollective.mozillafoundation.org/datasets/cml... @mozilla.org

12.02.2026 08:21 👍 6 🔁 1 💬 1 📌 0

Team @mozdatacollective.bsky.social is looking for paper submissions to support their session on Synthetic Speech at @interspeech.bsky.social. Find out if your work is a fit! interspeech2026.org/en-AU/pages/...

06.02.2026 14:33 👍 5 🔁 3 💬 0 📌 0

> Ethical considerations around vulnerable populations, posthumous voice rights, and cultural sensitivities

> Industry perspectives on implementing consent management at scale

> User studies on public perception, trust, and acceptance of #voice #cloning technologies

3/3 FIN

05.02.2026 14:48 👍 1 🔁 0 💬 0 📌 0

The session will explore multiple dimensions of this challenge:

> Technical approaches to #consent verification, watermarking, and authentication in #TTS systems

> Legal frameworks for personality rights, #data protection, and liability in voice synthesis

2/3

05.02.2026 14:48 👍 0 🔁 0 💬 1 📌 0
Safeguarding Synthetic Speech @ Interspeech 2026


Three of our key staff, including @kathyreid.au, are delighted to be chairing a Special Session at @interspeech.bsky.social this year:

Safeguarding Synthetic Speech: Ethical, legal and technical perspectives.

The #CfP is now open and you can learn more at:
safeguardingsyntheticspeech.org

1/3

05.02.2026 14:48 👍 1 🔁 1 💬 1 📌 1
TTS-Tolaki | Mozilla Data Collective The Tolaki language is the predominant language in Southeast Sulawesi Province, Indonesia. It is spoken across the regencies of Kolaka, North Kolaka, Konawe, North Konawe, South Konawe, and East Konaw...

NEW! Text-to-Speech for Tolaki from the amazing community in #Indonesia datacollective.mozillafoundation.org/datasets/cml...

04.02.2026 10:57 👍 2 🔁 1 💬 1 📌 0
Mandar Spontaneous Speech | Mozilla Data Collective Mandar Spontaneous Speech is a representative dataset of the Mandar language and contains a variety of dialects, particularly those used in Majene and Polewali Mandar. It also includes Mandar–Indonesi...

Welcoming Mandar-Indonesian to Mozilla Data Collective! datacollective.mozillafoundation.org/datasets/cml...

03.02.2026 18:52 👍 1 🔁 1 💬 0 📌 0

We just released version 0.3.0 of our Python library with resumable downloads and other efficiency improvements. Downloading a dataset now continues where a previous download left off, which is especially helpful when working with larger datasets. mozilla-data-collective.github.io/datacollecti...

28.01.2026 20:15 👍 4 🔁 1 💬 0 📌 0
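
For readers curious how "continue where a previous download left off" can work in general, here is a minimal sketch of the usual idea: check how many bytes are already on disk, then request only the remainder. This is not the library's code or API. The names `range_header` and `resume_download` and the `fetch(offset, length)` callable are hypothetical stand-ins, with `fetch` playing the role of a ranged HTTP request.

```python
import os
from typing import Callable, Dict


def range_header(offset: int) -> Dict[str, str]:
    """Standard HTTP Range header asking for all bytes from `offset` onward."""
    return {"Range": f"bytes={offset}-"}


def resume_download(fetch: Callable[[int, int], bytes], dest: str,
                    total_size: int, chunk_size: int = 1 << 20) -> int:
    """Append the missing bytes to `dest`; return how many bytes were fetched.

    `fetch(offset, length)` is a hypothetical callable that returns up to
    `length` bytes of the remote file starting at `offset`.
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    if offset > total_size:
        offset = 0  # local file is larger than the remote one: start over
    fetched = 0
    # Append when resuming; otherwise create (or truncate) the file.
    with open(dest, "ab" if offset else "wb") as out:
        while offset < total_size:
            chunk = fetch(offset, min(chunk_size, total_size - offset))
            if not chunk:
                raise IOError("source returned no data before the expected size")
            out.write(chunk)
            offset += len(chunk)
            fetched += len(chunk)
    return fetched
```

In a real client, `fetch` would send `range_header(offset)` with the request and would also need to confirm that the server honoured the range and that the remote file has not changed since the partial download; the sketch leaves both checks out.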

7/7
Explore all datasets → datacollective.mozillafoundation.org/datasets
Get in touch → mozilladatacollective@mozillafoundation.org

28.01.2026 14:28 👍 0 🔁 0 💬 0 📌 0