LIVE SOON: From multilingual to multicultural: how do we escape from LLMs flattening culture?
Starting Mar 11 at 9:00 AM EDT
How do we stop LLMs from flattening culture?
Join EM Lewis-Jong, CEO of @mozdatacollective.bsky.social, and Slow AI's Dr. Sam Illingworth live at 9am ET TODAY as they discuss why language and culture should exist in multitudes.
open.substack.com/live-stream/...
11.03.2026 12:20
Own your AI stack or rent it? @samillingworth.com speaks to @mozdatacollective.bsky.social on Substack Live substack.com/@samillingwo...
11.03.2026 19:26
- the government's mixed public messaging on AI & copyright is hindering licensing
- the government should make a clear public statement that AI companies operating in the UK need to license their training data (which is the law)
4/5
06.03.2026 08:53
They say:
- the government must not weaken copyright law, and should instead strengthen licensing, transparency & enforcement
- the government should stop prioritising large multinational tech firms
3/5
06.03.2026 08:53
- AI training isn’t “learning” and shouldn’t be treated as such
The House of Lords has been absolutely consistent on this, and they are totally right. Will the government listen?
/end
06.03.2026 08:53
MDC Release Notes - 27.02.26
This week: dataset filtering, enhanced uploader request flow, API improvements, and 20 new datasets!
Discover new multilingual ASR, NLP, and TTS datasets on Mozilla Data Collective this week, and check out some site fixes and API improvements we've made!
community.mozilladatacollective.com/mdc-release-...
27.02.2026 17:31
Mozilla Data Collective
Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre.
5/5 If you want to make African languages a part of your AI training data, you can find all of the IADH's TTS uploads and more in our dataset catalogue: datacollective.mozillafoundation.org/datasets
23.02.2026 20:12
4/5 IADH TTS datasets continued:
- Teke-Laali datacollective.mozillafoundation.org/datasets/cmj...
- Beembe datacollective.mozillafoundation.org/datasets/cmj...
- Bomitaba datacollective.mozillafoundation.org/datasets/cmj...
23.02.2026 20:12
3/5 IADH TTS datasets:
- Ewondo datacollective.mozillafoundation.org/datasets/cml...
- Bulu datacollective.mozillafoundation.org/datasets/cml...
- Mbosi datacollective.mozillafoundation.org/datasets/cmj...
- Laari datacollective.mozillafoundation.org/datasets/cmj...
23.02.2026 20:12
2/5 Regional TTS data is a vital resource for building accessible speech synthesis models, producing native-sounding TTS for regional content, and benchmarking performance on "low-resource languages". The treasure trove of data that IADH uploads is invaluable for the preservation of culture.
23.02.2026 20:12
1/5 Let's highlight one of our amazing text-to-speech contributors shaping AI data for African cultures. The Institute of African Digital Humanities has uploaded thousands of TTS audio clips totalling over 6 GB of data for more than 10 locales.
23.02.2026 20:12
Community contributions play an enormous role in the development of multilingual AI.
Here are 4 Javanese datasets to level up your TTS project with genuine, local dialects (all CC-BY-SA-4.0).
Links in the comments!
20.02.2026 12:45
The languages absent from your model are the communities it will fail to serve. Cultural preservation can be powered by technology.
The Gojri (Gujari) Literature Corpus has over 60K tokens of clean, UTF-8 normalized text to make AI more accessible to Gujjar communities.
Link in the comments!
19.02.2026 17:06
LLM hallucination is a data problem as much as a model problem.
Structured, machine-readable World Factbook corpus. Geopolitical, demographic & economic facts for every country, purpose-built for RAG grounding and knowledge graph construction.
Last clean snapshot before the CIA retired the site.
18.02.2026 17:30
Hi there, Echo! Can you tell us more about the specific model or architecture you're looking to fine-tune with Tatar folklore?
18.02.2026 16:06
114 hours of recorded Nahuatl with transcripts!
Get native speech from the municipalities of Zacatlán and Tepetzintla to take your NLP project to the next level.
This is what responsible AI training data looks like.
Explore the collection:
datacollective.mozillafoundation.org/datasets?q=n...
17.02.2026 17:37
Over 18K proverbs, 243 tales & 380 legends! Tauren's Tatar Folklore helps tech get beyond multilingual by providing multiCULTURAL high-quality datasets.
CC0, free for NLP projects: datacollective.mozillafoundation.org/datasets/cml...
16.02.2026 17:20
@mozilla-fr.bsky.social New datasets for translation between French and African languages!
datacollective.mozillafoundation.org/datasets/cml...
datacollective.mozillafoundation.org/datasets/cml...
datacollective.mozillafoundation.org/datasets/cmk...
12.02.2026 08:31
🧡 1 million tokens of Balochi just arrived @mozdatacollective.bsky.social! Check out this beautiful dataset of journalistic Western Balochi (Rakhshani): datacollective.mozillafoundation.org/datasets/cml... @mozilla.org
12.02.2026 08:21
Team @mozdatacollective.bsky.social is looking for paper submissions to support their session on Synthetic Speech at @interspeech.bsky.social. Find out if your work is a fit: interspeech2026.org/en-AU/pages/...
06.02.2026 14:33
> Ethical considerations around vulnerable populations, posthumous voice rights, and cultural sensitivities
> Industry perspectives on implementing consent management at scale
> User studies on public perception, trust, and acceptance of #voice #cloning technologies
3/3 FIN
05.02.2026 14:48
The session will explore multiple dimensions of this challenge:
> Technical approaches to #consent verification, watermarking, and authentication in #TTS systems
> Legal frameworks for personality rights, #data protection, and liability in voice synthesis
2/3
05.02.2026 14:48
Safeguarding Synthetic Speech @ Interspeech 2026
Three of our key staff, including @kathyreid.au, are delighted to be chairing a Special Session at @interspeech.bsky.social this year:
Safeguarding Synthetic Speech: Ethical, legal and technical perspectives.
The #CfP is now open and you can learn more at:
safeguardingsyntheticspeech.org
1/3
05.02.2026 14:48