Mozilla Data Collective Alpha Goes Live

Mozilla Data Collective is now in live alpha, offering the Common Voice 23.0 datasets.

Mozilla Data Collective

17 Sep 2025 — 1 min read

Mozilla Data Collective, the new platform from Mozilla Foundation that puts communities in control of how datasets are shared, is live in Alpha! Starting today, the latest Common Voice datasets (23.0) are available to download through Mozilla Data Collective. We’re so excited to introduce 149 new languages in this release, alongside an important first: Spontaneous Speech datasets, which include transcribed, spontaneous responses to prompts that help train models on more realistic speech patterns.

Create a Mozilla Data Collective account to download all the datasets. From there, you can use the new datasheets pages to explore and access resources. You can also access the datasets via the API, or work with them programmatically using our new open-source Python library. This means increased global access, more options for dataset downloads and new ways to work with data at scale.

Common Voice datasets are just the beginning! Mozilla Data Collective is a platform to let all dataset communities and creators share their data under their own terms. We’ll be adding more partner datasets soon. If you have a dataset you would like to see on Mozilla Data Collective, tell us about it at mozilladatacollective@mozillafoundation.org.

Mozilla Data Collective wouldn’t exist without the language communities, dataset users and amazing community that makes up Common Voice. Thank you for building with us. We’re excited to hear your feedback, wishlists and ideas for what you want us to build, so get in touch at mozilladatacollective@mozillafoundation.org.

How to License Your Dataset for AI Training: Some Best Practices

We get a lot of questions about how to approach licensing your data for AI training. So to help you share your datasets, we’ve compiled some guidance here. It’s intended to be a living document that we iterate on with our partners and communities. What Does It Mean to

Behind the scenes: Integrating MDC datasets into your Python project

Overcoming the complexity of AI

Mozilla Data Collective helps communities to offer unique, multilingual, multicultural, and multimodal datasets. From transcribed and translated videos of narrated Ekpeye folktales to complex question-answering text pairs for the Georgian language, the diversity of datasets on our platform is core to our mission. But with

Cultural Heritage and AI: How Institutions Can Reclaim Control of Their Data

The institutions that safeguard humanity's cultural memory, galleries, libraries, archives, and museums (collectively known as the GLAM sector) are confronting a paradox that defines the current moment in AI development. Years of careful digitization of their archives have transformed physical collections into vast, machine-readable repositories of human knowledge.

Using the MDC Python SDK Library to Download Datasets

In this guide, you will learn how to use the MDC Python SDK Library to download datasets from the Mozilla Data Collective website.
