

What makes a good dataset sample — and how to create one

In this post, we walk you through how to create a useful dataset sample as a preview of your dataset, and guide you in uploading it to the MDC platform.

15 Datasets for Building a Low-Resource Translation Model in 2026

The problem with "low-resource" machine translation: most production machine-translation systems in 2026 are still trained on a fairly narrow set of language pairs, the 50 or so for which the open web supplies enough parallel text to push BLEU scores into useful territory. Below that line, MT quality

15 Datasets for Building a Production TTS Voice in 2026

A curated list of 15 text-to-speech training datasets for teams shipping production voice models in 2026 covering emotional, multi-speaker, audiobook-derived, non-Latin script, indigenous-language datasets and more.

Open Home Foundation TTS datasets on Mozilla Data Collective

Most voice assistants listen and respond in a handful of languages. Try to build one for your home that speaks your language, though, and you quickly run into a wall: the training data does not exist, or it is locked behind licences that make it unusable for open source projects.

Discover Dataset Insights with the new Data Provider Analytics Portal

Today, we're excited to share a new way for dataset providers to better understand how their datasets are being used on Mozilla Data Collective: the data provider analytics portal.

Building an African Voice for AI: Inside the Institute of African Digital Humanities

When you ask a voice assistant a question in English, French, or Mandarin, the underlying models have been trained on billions of words and millions of hours of speech. Ask the same question in Bafia, Mada, or Suundi, and the technology simply doesn't know how to listen. The

Never Miss a Dataset with the new Dataset Notification Feature

Today, we're excited to share a new way to stay informed about the latest datasets on Mozilla Data Collective: the ability to subscribe to updates about datasets similar to your previous downloads.

CoVoST 2 datasets now available through Mozilla Data Collective

In 2020, Meta introduced a new benchmark dataset based on Mozilla Common Voice. CoVoST 2 has 34 translation directions for audio-to-text machine translation. It is one of the most widely used benchmark datasets for speech translation, with nearly 400 citations on Google Scholar. The dataset is based on version

Press Release

New capabilities expand uploader control over access and compensation, while helping developers discover more representative datasets

Metadata magic: making datasets discoverable and tractable with Croissant

Learn how we automatically generate Croissant metadata to describe datasets on the Mozilla Data Collective platform, making them more discoverable.

Best Practices for Sharing Contact Information on Mozilla Data Collective

In this guide, we'll walk through the different options available on the platform for sharing your contact information with downloaders and setting expectations about how downloaders or other community members can reach out to you.

FAQ: Why Can't I Edit Certain Fields on my Published Dataset?

Fields that make up the terms of use of your dataset cannot be edited after publishing.

Rebuilding the AI data ecosystem - with communities at the centre
