Fine-Tune a Speech-to-Text Model for Any Language - Including Yours
A step-by-step developer tutorial from Kostis at Mozilla Data Collective
In this video, produced by the Data Nutrition Project and illustrated by Jessica Yurkofsky, you'll learn more about the role of the datasheet and how you can use it to give clear guidance to potential downloaders about how your data can (and can't!) be used.
In mid-April, Mozilla Data Collective's primary domain will change to mozilladatacollective.com
379 new datasets with a Mozilla Common Voice update, improvements to the Python SDK (make sure you update to the latest!) and a preview of an upcoming feature. 👀
Thorsten Müller has created five TTS voice datasets totaling 40 hours of German speech data. In this community-authored post, he speaks to the importance of sharing his voice.
By Kathy Reid · Mozilla Data Collective Ask a Queenslander to say "no worries" into a Home Assistant Voice Preview. There's a decent chance it mishears them. Ask someone from Radelaide and things get worse. Ask anyone who grew up calling things "heaps good" and
Panjebar Semangat, a weekly Javanese-language magazine established before Indonesian independence, is collaborating with Mozilla Data Collective to advance community-governed language dataset frameworks.
Mozilla Data Collective is building towards a multicultural, multilingual, and multimodal future that works for all of us. And over the past few months, we’ve listened as people have flagged what kinds of datasets they need, but are struggling to find. So we’re pleased to announce that MDC
This week: 19 new datasets and a few small changes while we're heads down in some exciting new features that will be coming soon...
We get a lot of questions about how to approach licensing your data for AI training. So to help you share your datasets, we’ve compiled some guidance here. It’s intended to be a living document that we iterate on with our partners and communities.
Overcoming the complexity of AI Mozilla Data Collective helps communities to offer unique, multilingual, multicultural, and multimodal datasets. From transcribed and translated videos of narrated Ekpeye folktales to complex question-answering text pairs for the Georgian language, the diversity of datasets on our platform is core to our mission. But with
News
The institutions that safeguard humanity's cultural memory (galleries, libraries, archives, and museums, collectively known as the GLAM sector) are confronting a paradox that defines the current moment in AI development. Years of careful digitization of their archives have transformed physical collections into vast, machine-readable repositories of human knowledge.
Guide
In this guide, you will learn how to use the MDC Python SDK to download datasets from the Mozilla Data Collective website.
News
This week: dataset filtering, enhanced uploader request flow, API improvements, and 20 new datasets!
News
This week: new features for uploaders, updates to dataset search, and new datasets on MDC!
Guides
A practical guide for communities creating datasets together, with no legal expertise required. Based on our data governance workshop at Mozilla Festival Zambia 2024. Your community has created something valuable: a dataset. Maybe it's voice recordings in your language. Maybe it's traditional knowledge, local photographs, or cultural
News
Mozilla Data Collective has an amazing opportunity for you to get a free ticket to the 2026 Mozilla Festival in beautiful Barcelona, Spain, this November. We are looking for feedback to inform our 2026 roadmap. Help shape the future of ethical data-sharing by filling out the form below
FAQ
If you need to remove a dataset from Mozilla Data Collective after it has been published, you can make your dataset private through the following steps: 1. Sign in to your account, making sure you are using the account that originally published the dataset 2. Go to your Profile >
Guides
You might be sitting on something precious. The modern world runs on data. One unfortunate result is that many of us are unknowingly producing data for third-party companies, who use our content and actions as data points to make AI models that they then sell back
News
The internet belongs to everyone—but right now, it doesn't work for everyone. From the islands of Borneo to the mountains of Pakistan, hundreds of millions of people speak languages that AI simply can't understand. That's a problem we can solve together. At Mozilla
News
The internet belongs to everyone—but right now, it doesn't work for everyone. Millions of people speak languages that AI simply can't understand, and that's a problem we can solve together. At Mozilla Data Collective, we're proud to host a growing collection
News
TL;DR: Update to the newest version of the MDC Python library (0.2.0 or newer) to continue downloading datasets.
FAQ
We review every request to become a data provider on Mozilla Data Collective. Reviewing Uploader Requests If you have not already been in contact with our team about uploading data to Mozilla Data Collective, a member of our team will reach out to you via email to discuss your dataset