Mozilla Data Collective

Compensated Datasets Is Now Available on Mozilla Data Collective

Helping organisations participate more directly in the AI economy while making it easier for AI builders to discover responsibly sourced datasets. Earlier this month, we shared a preview of Compensated Datasets and our vision for creating more transparent ways for organisations to participate in the AI economy while retaining agency

Common Voice segments now available through Mozilla Data Collective

Mozilla Common Voice is a massively multilingual platform for collecting speech data to train automatic speech recognition (ASR). Its mission is simple: to make language technology understand everyone’s mother tongue. But for datasets to be genuinely useful, they also need to be manageable. Many of the larger Common Voice

Mozilla Data Collective datasets now discoverable through CLARIN’s Virtual Language Observatory

New collaboration expands visibility for community-governed language datasets and improves exploration of linguistic resources, services and tools. Mozilla Data Collective datasets are now discoverable through CLARIN’s Virtual Language Observatory, making it easier for researchers, developers and language technology practitioners in Europe to find multilingual and community-centered datasets

Get a Sneak Preview of Mozilla Data Collective’s Compensation Feature!

Mozilla Data Collective was built to redefine how AI data is created, shared, and governed. As part of our mission to be the data sharing platform for human agency and fair value exchange, we have long teased what so many partners and community members have requested: a tangible way to

15 Datasets for Fine-Tuning Whisper on a New Language in 2026

Why fine-tuning Whisper is a dataset problem
OpenAI's Whisper changed what's possible in speech recognition: a single multilingual model with strong zero-shot performance across dozens of languages. But "dozens" is the catch. For the long tail of the world's

How Radio Free Europe/Radio Liberty Also Serves Its Communities Through Its Datasets

For more than 75 years, Radio Free Europe/Radio Liberty (RFE/RL) has promoted democratic values by providing accurate, uncensored news and debate in countries where a free press is threatened. RFE/RL reaches more than 44 million people every week across 18 countries, in 24 languages, including Persian, Russian,

What makes a good dataset sample — and how to create one

In this post, we walk you through how to create a useful dataset sample as a preview of your dataset, and guide you in uploading it to the MDC platform.

15 Datasets for Building a Low-Resource Translation Model in 2026

The problem with "low-resource" machine translation
Most production machine-translation systems in 2026 are still trained on a fairly narrow set of language pairs: the 50 or so for which the open web supplies enough parallel text to push BLEU scores into useful territory. Below that line,

15 Datasets for Building a Production TTS Voice in 2026

A curated list of 15 text-to-speech training datasets for teams shipping production voice models in 2026 covering emotional, multi-speaker, audiobook-derived, non-Latin script, indigenous-language datasets and more.

Open Home Foundation TTS datasets on Mozilla Data Collective

Most voice assistants listen and respond in a handful of languages. Try to build one for your home that speaks your language, though, and you quickly run into a wall: the training data does not exist, or it is locked behind licences that make it unusable for open source projects.

Discover Dataset Insights with the new Data Provider Analytics Portal

Today, we're excited to share a new way for dataset providers to better understand how their datasets are being used on Mozilla Data Collective with a new data provider analytics portal.

Building an African Voice for AI: Inside the Institute of African Digital Humanities

When you ask a voice assistant a question in English, French, or Mandarin, the underlying models have been trained on billions of words and millions of hours of speech. Ask the same question in Bafia, Mada, or Suundi, and the technology simply doesn't know how to listen. The

Rebuilding the AI data ecosystem - with communities at the centre

Latest