- the government's mixed public messaging on AI & copyright is hindering licensing
- the government should make a clear public statement that AI companies operating in the UK need to license their training data (which is the law)
4/5
They say:
- the government must not weaken copyright law, and should instead strengthen licensing, transparency & enforcement
- the government should stop prioritising large multinational tech firms
3/5
- AI training isn't "learning" and shouldn't be treated as such
The House of Lords has been absolutely consistent on this, and they are totally right. Will the government listen?
/end
datacollective.mozillafoundation.org/datasets/cmm... Curated collection of public domain literature from Finland! The dataset captures the literary landscape of early 20th-century Finland and includes independent texts in both of the country's official languages: Finnish (fi) and Swedish (sv).
Discover new multilingual ASR, NLP, and TTS datasets on Mozilla Data Collective this week, and check out some site fixes and API improvements we've made!
community.mozilladatacollective.com/mdc-release-...
5/5 If you want to make African languages a part of your AI training data, you can find all of the IADH's TTS uploads and more in our dataset catalogue: datacollective.mozillafoundation.org/datasets
4/5 IADH TTS datasets continued:
- Teke-Laali datacollective.mozillafoundation.org/datasets/cmj...
- Beembe datacollective.mozillafoundation.org/datasets/cmj...
- Bomitaba datacollective.mozillafoundation.org/datasets/cmj...
3/5 IADH TTS datasets:
- Ewondo datacollective.mozillafoundation.org/datasets/cml...
- Bulu datacollective.mozillafoundation.org/datasets/cml...
- Mbosi datacollective.mozillafoundation.org/datasets/cmj...
- Laari datacollective.mozillafoundation.org/datasets/cmj...
2/5 Regional TTS data is a vital resource for building accessible speech synthesis models, creating true-native TTS for regional content, and benchmarking performance on "low-resource languages". The treasure trove of data that IADH uploads is invaluable for the preservation of culture.
1/5 Let's highlight one of our amazing text-to-speech contributors shaping AI data for African cultures. The Institute of African Digital Humanities has uploaded thousands of TTS audio clips totalling over 6 GB of data for more than 10 locales.
- TTS Javanese - Ngapak Dialect: datacollective.mozillafoundation.org/datasets/cml...
- TTS Central Javanese: datacollective.mozillafoundation.org/datasets/cml...
- TTS Javanese-Lumajang Dialect: datacollective.mozillafoundation.org/datasets/cml...
- Bojonegoro Javanese TTS: datacollective.mozillafoundation.org/datasets/cml...
Community contributions play an enormous role in the development of multilingual AI.
Here are 4 Javanese datasets to level up your TTS project with genuine, local dialects (all CC-BY-SA-4.0).
Links in the comments!
Check it out: datacollective.mozillafoundation.org/datasets/cml...
The languages absent from your model are the communities it will fail to serve. Cultural preservation can be powered by technology.
The Gojri (Gujari) Literature Corpus has over 60K tokens of clean, UTF-8 normalized text to make AI more accessible to Gujjar communities.
Link in the comments!
Check it out: datacollective.mozillafoundation.org/datasets/cml...
LLM hallucination is a data problem as much as a model problem.
Structured, machine-readable World Factbook corpus. Geopolitical, demographic & economic facts for every country, purpose-built for RAG grounding and knowledge graph construction.
Last clean snapshot before the CIA retired the site.
Hi there, Echo! Can you tell us more about the specific model or architecture you're looking to fine-tune the weights of with Tatar folklore?
114 hours of recorded Nahuatl with transcripts!
Get native speech from the municipalities of Zacatlán and Tepetzintla to take your NLP project to the next level.
This is what responsible AI training data looks like.
Explore the collection:
datacollective.mozillafoundation.org/datasets?q=n...
Over 18K proverbs, 243 tales & 380 legends! Tauren's Tatar Folklore helps tech get beyond multilingual by providing multiCULTURAL high-quality datasets.
CC0, free for NLP projects: datacollective.mozillafoundation.org/datasets/cml...
@mozilla-fr.bsky.social New datasets for translation between French and African languages!
datacollective.mozillafoundation.org/datasets/cml...
datacollective.mozillafoundation.org/datasets/cml...
datacollective.mozillafoundation.org/datasets/cmk...
1 million tokens of Balochi just arrived @mozdatacollective.bsky.social! Check out this beautiful dataset of journalistic Western Balochi (Rakhshani) datacollective.mozillafoundation.org/datasets/cml... @mozilla.org
Team @mozdatacollective.bsky.social is looking for paper submissions to support their session on Synthetic Speech at @interspeech.bsky.social. Find out if your work is a fit! interspeech2026.org/en-AU/pages/...
> Ethical considerations around vulnerable populations, posthumous voice rights, and cultural sensitivities
> Industry perspectives on implementing consent management at scale
> User studies on public perception, trust, and acceptance of #voice #cloning technologies
3/3 FIN
The session will explore multiple dimensions of this challenge:
> Technical approaches to #consent verification, watermarking, and authentication in #TTS systems
> Legal frameworks for personality rights, #data protection, and liability in voice synthesis
2/3
Safeguarding Synthetic Speech @ Interspeech 2026
3 of our key staff including @kathyreid.au are delighted to be chairing a Special Session at @interspeech.bsky.social this year:
Safeguarding Synthetic Speech: Ethical, legal and technical perspectives.
The #CfP is now open and you can learn more at:
safeguardingsyntheticspeech.org
1/3
NEW! Text-to-Speech for Tolaki from the amazing community in #Indonesia datacollective.mozillafoundation.org/datasets/cml...
Welcoming Mandar-Indonesian to Mozilla Data Collective! datacollective.mozillafoundation.org/datasets/cml...
We just released version 0.3.0 of our Python library with resumable downloads and other efficiency improvements. Downloading a dataset now continues where a previous download left off, which is especially helpful when working with larger datasets. mozilla-data-collective.github.io/datacollecti...
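For anyone curious how resuming a download can work in general, here is a minimal sketch. It is not the library's actual implementation or API: `resumable_download` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever HTTP client you use. The only real-world assumption is the standard HTTP `Range` header, which a server acknowledges with status 206.

```python
import os


def resume_offset(path):
    """Return how many bytes are already on disk (0 if the file is absent)."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def range_header(offset):
    """Build a Range header asking for everything from `offset` onward."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def resumable_download(path, fetch):
    """Continue a partial download at `path`.

    `fetch` is a hypothetical callable: it takes a dict of request headers
    and returns (status_code, iterable_of_byte_chunks).
    """
    offset = resume_offset(path)
    status, chunks = fetch(range_header(offset))
    # 206 means the server honoured the Range request, so we append.
    # Any other status is treated as a full response and overwrites the file.
    mode = "ab" if (offset > 0 and status == 206) else "wb"
    with open(path, mode) as f:
        for chunk in chunks:
            f.write(chunk)
    return os.path.getsize(path)
```

Checking the status code matters: if a server ignores the `Range` header and sends the whole file, appending it to the partial file would corrupt the download.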
7/7
Explore all datasets → datacollective.mozillafoundation.org/datasets
Get in touch → mozilladatacollective@mozillafoundation.org