What makes a good dataset sample — and how to create one
In this post, we walk you through how to create a useful dataset sample as a preview of your dataset, and guide you in uploading it to the MDC platform.
In this post, we walk you through how to create a useful dataset sample as a preview of your dataset, and guide you in uploading it to the MDC platform.
The problem with "low-resource" machine translation Most production machine-translation systems in 2026 are still trained on a fairly narrow set of language pairs: the 50 or so for which the open web supplies enough parallel text to push BLEU scores into useful territory. Below that line, MT quality
A curated list of 15 text-to-speech training datasets for teams shipping production voice models in 2026 covering emotional, multi-speaker, audiobook-derived, non-Latin script, indigenous-language datasets and more.
Most voice assistants listen and respond in a handful of languages. Try to build one for your home that speaks your language, though, and you quickly run into a wall: the training data does not exist, or it is locked behind licences that make it unusable for open source projects.
Today, we're excited to share a new way for dataset providers to better understand how their datasets are being used on Mozilla Data Collective with a new data provider analytics portal.
When you ask a voice assistant a question in English, French, or Mandarin, the underlying models have been trained on billions of words and millions of hours of speech. Ask the same question in Bafia, Mada, or Suundi, and the technology simply doesn't know how to listen. The
Today, we're excited to share a new way to stay informed about the latest datasets on Mozilla Data Collective - the ability to subscribe to get updated about similar datasets to your previous downloads.
In 2020, Meta introduced a new benchmark dataset based on Mozilla Common Voice. CoVoST 2 has 34 translation directions for audio to text machine translation. This is one of the most widely-used benchmark datasets for speech translation, with nearly 400 citations on Google Scholar. The dataset is based on version
New capabilities expand uploader control over access and compensation, while helping developers discover more representative datasets
Learn how we automatically generate Croissant metadata to describe datasets on the Mozilla Data Collective platform, making them more discoverable.
In this guide, we'll walk through the different options available on the platform for sharing your contact information with downloaders and setting expectations about how downloaders or other community members can reach out to you.
Fields that make up the terms of use of your dataset cannot be edited after publishing.