Mozilla Data Collective (Page 2)

How to License Your Dataset for AI Training: Some Best Practices

We get a lot of questions about how to approach licensing data for AI training. To help you share your datasets, we’ve compiled some guidance here. It’s intended to be a living document that we iterate on with our partners and communities. What Does It Mean to

Behind the scenes: Integrating MDC datasets into your Python project

Overcoming the complexity of AI. Mozilla Data Collective helps communities offer unique, multilingual, multicultural, and multimodal datasets. From transcribed and translated videos of narrated Ekpeye folktales to complex question-answering text pairs for the Georgian language, the diversity of datasets on our platform is core to our mission. But with

Cultural Heritage and AI: How Institutions Can Reclaim Control of Their Data

The institutions that safeguard humanity's cultural memory (galleries, libraries, archives, and museums, collectively known as the GLAM sector) are confronting a paradox that defines the current moment in AI development. Years of careful digitization have transformed their physical collections into vast, machine-readable repositories of human knowledge.

Using the MDC Python SDK Library to Download Datasets

In this guide, you will learn how to use the MDC Python SDK Library to download datasets from the Mozilla Data Collective website.

MDC Release Notes - 27.02.26

This week: dataset filtering, enhanced uploader request flow, API improvements, and 20 new datasets!

MDC Release Notes - 13.02.26

This week: new features for uploaders, updates to dataset search, and new datasets on MDC!

Your Data, Your Rules: A Community Workshop for Dataset Governance

A practical guide for communities creating datasets together: no legal expertise required. Based on our data governance workshop at Mozilla Festival Zambia 2024. Your community has created something valuable: a dataset. Maybe it's voice recordings in your language. Maybe it's traditional knowledge, local photographs, or cultural

Help Shape the Future of Mozilla Data Collective

Mozilla Data Collective is offering you the chance to get a free ticket to the 2026 Mozilla Festival in Barcelona, Spain, this November. We are looking for feedback to inform our 2026 roadmap. Help shape the future of ethical data-sharing by filling out the form below

FAQ: How can I remove published datasets from MDC?

If you need to remove a dataset from Mozilla Data Collective after it has been published, you can make your dataset private through the following steps: 1. Sign in to your account, making sure you are using the account that originally published the dataset. 2. Go to your Profile >

Turning Your Data Into a Valuable ML Resource Without Giving Up Control

You might be sitting on something precious. The modern world runs on data. One unfortunate result is that many of us unknowingly produce data for third-party companies, which use our content and actions as data points to make AI models that they then sell back

Latest: Asian Language datasets by communities on Mozilla Data Collective

The internet belongs to everyone—but right now, it doesn't work for everyone. From the islands of Borneo to the mountains of Pakistan, hundreds of millions of people speak languages that AI simply can't understand. That's a problem we can solve together. At Mozilla

Latest: African Language datasets by communities on Mozilla Data Collective

The internet belongs to everyone—but right now, it doesn't work for everyone. Millions of people speak languages that AI simply can't understand, and that's a problem we can solve together. At Mozilla Data Collective, we're proud to host a growing collection

News

Updates to MDC REST API and Python Library

TL;DR: Update to the newest version of the MDC Python library (0.2.0 or newer) to continue downloading datasets.

FAQ

FAQ: How long does it take to publish my dataset on MDC?

We review every request to become a data provider on Mozilla Data Collective. Reviewing Uploader Requests: If you have not already been in contact with our team about uploading data to Mozilla Data Collective, a member of our team will reach out to you via email to discuss your dataset

Common Voice

Pashto becomes third-highest language by volume of data in Common Voice v24

Key highlights from the Common Voice v24 Scripted Speech and v2 Spontaneous Speech release.

FAQ

FAQ: How can I contribute to Mozilla Data Collective?

We are a collective of linguists, technologists, activists, researchers and creatives. Whether you’re interested in stewarding data, conducting research, developing new AI and ML technologies, or just want to be part of our community working to make AI all it promises to be - not all it threatens to

FAQ

FAQ: Do I need to be part of an organization to upload a dataset to Mozilla Data Collective?

No, you do not need to be a part of an organization to upload a dataset to Mozilla Data Collective. We recognize that there are many use cases where an individual might want to share datasets that they have created, either on their own or on behalf of a group.

Common Voice

Improving the Spontaneous Speech English dataset: lifting the lid on speech data quality uplift techniques

Firstly, we’d like to thank you for your patience. After introducing Spontaneous Speech early in 2025, we released most locale datasets when the Mozilla Data Collective platform launched in alpha in September of this year. However, upon inspection, the English Spontaneous Speech dataset required some remedial work prior to

Docs

Uploading your dataset to the Mozilla Data Collective Platform

Interested in joining the movement and publishing your dataset on Mozilla Data Collective? This guide will walk you through the steps required, from account creation to submission!

FAQ

FAQ: What does it mean to exclusively host my dataset on Mozilla Data Collective?

When you upload a dataset to Mozilla Data Collective, you have the option to make your dataset exclusive to MDC. Under the default terms of use for data providers on the platform, datasets are exclusive to MDC. Choosing to host your dataset exclusively with MDC means that you do

FAQ

FAQ: What kind of datasets can I publish on Mozilla Data Collective?

Our priority is technology that is more multilingual, multicultural, and multi-modal. We prioritise helping communities unlock content that is not on the web already, and prefer audio, image, and video formats, though we will also accept text documents that advance the above goals. Our expectation is that each dataset is

FAQ

FAQ: What are the main points of the MDC Terms of Use?

About downloading and using datasets. When downloading a dataset, am I getting permission to use it from Mozilla Data Collective or the Data Provider? When you download a dataset from Mozilla Data Collective, you are entering into an agreement with the Data Provider who published the dataset. Data Providers set

News

What do the languages Nahuatl, Bahasa Indonesia and Bulgarian have in common?

Nahuatl, Bahasa Indonesia and Bulgarian all feature in our very first community curated datasets to be uploaded to the Mozilla Data Collective platform.

FAQ

FAQ: Why can’t I download old versions of Common Voice datasets?

In order to better serve our community and to keep up with current changes in best practices for data stewardship, we are changing how previous releases of the Common Voice datasets are accessed. They will now be accessible to interested researchers and others via an email-based request process. You will