Mozilla Data Collective (Page 2)

Mozilla Data Collective datasets now discoverable through CLARIN’s Virtual Language Observatory

New collaboration expands visibility for community-governed language datasets and improves exploration of linguistic resources, services and tools. Mozilla Data Collective datasets are now discoverable through CLARIN’s Virtual Language Observatory, making it easier for researchers, developers and language technology practitioners in Europe to find multilingual and community-centered datasets

Get a Sneak Preview of Mozilla Data Collective’s Compensation Feature!

Mozilla Data Collective was built to redefine how AI data is created, shared, and governed. As part of our mission to be the data sharing platform for human agency and fair value exchange, we have long teased what so many partners and community members have requested: a tangible way to

15 Datasets for Fine-Tuning Whisper on a New Language in 2026

Why fine-tuning Whisper is a dataset problem. OpenAI's Whisper changed what's possible in speech recognition: a single multilingual model with strong zero-shot performance across dozens of languages. But "dozens" is the catch. For the long tail of the world's

How Radio Free Europe/Radio Liberty Also Serves Its Communities Through Its Datasets

For more than 75 years, Radio Free Europe/Radio Liberty (RFE/RL) has promoted democratic values by providing accurate, uncensored news and debate in countries where a free press is threatened. RFE/RL reaches more than 44 million people every week across 18 countries, in 24 languages, including Persian, Russian,

What makes a good dataset sample — and how to create one

In this post, we walk you through how to create a useful dataset sample as a preview of your dataset, and guide you in uploading it to the MDC platform.

15 Datasets for Building a Low-Resource Translation Model in 2026

The problem with "low-resource" machine translation. Most production machine-translation systems in 2026 are still trained on a fairly narrow set of language pairs: the 50 or so for which the open web supplies enough parallel text to push BLEU scores into useful territory. Below that line,

15 Datasets for Building a Production TTS Voice in 2026

A curated list of 15 text-to-speech training datasets for teams shipping production voice models in 2026 covering emotional, multi-speaker, audiobook-derived, non-Latin script, indigenous-language datasets and more.

Open Home Foundation TTS datasets on Mozilla Data Collective

Most voice assistants listen and respond in a handful of languages. Try to build one for your home that speaks your language, though, and you quickly run into a wall: the training data does not exist, or it is locked behind licences that make it unusable for open source projects.

Discover Dataset Insights with the new Data Provider Analytics Portal

Today, we're excited to share a new way for dataset providers to better understand how their datasets are being used on Mozilla Data Collective with a new data provider analytics portal.

Building an African Voice for AI: Inside the Institute of African Digital Humanities

When you ask a voice assistant a question in English, French, or Mandarin, the underlying models have been trained on billions of words and millions of hours of speech. Ask the same question in Bafia, Mada, or Suundi, and the technology simply doesn't know how to listen. The

Never Miss a Dataset with the new Dataset Notification Feature

Today, we're excited to share a new way to stay informed about the latest datasets on Mozilla Data Collective - the ability to subscribe to updates about datasets similar to your previous downloads.

CoVoST 2 datasets now available through Mozilla Data Collective

In 2020, Meta introduced a new benchmark dataset based on Mozilla Common Voice. CoVoST 2 has 34 translation directions for audio-to-text machine translation. This is one of the most widely used benchmark datasets for speech translation, with nearly 400 citations on Google Scholar. The dataset is based on

News

Press Release

New capabilities expand uploader control over access and compensation, while helping developers discover more representative datasets

data

Metadata magic: making datasets discoverable and tractable with Croissant

Learn how we automatically generate Croissant metadata to describe datasets on the Mozilla Data Collective platform, making them more discoverable.

Guides

Best Practices for Sharing Contact Information on Mozilla Data Collective

In this guide, we'll walk through the different options available on the platform for sharing your contact information with downloaders and setting expectations about how downloaders or other community members can reach out to you.

FAQ

FAQ: Why Can't I Edit Certain Fields on my Published Dataset?

Fields that make up the terms of use of your dataset cannot be edited after publishing.

Community Authors

Engendering Voice: Insights from Common Voice

The origin of the Swahili language. Swahili originated from the Indian Ocean and has its roots in the contacts of Arabian traders with the inhabitants of the east coast of Africa over many centuries. Swahili is the lingua franca of most East African countries, spoken largely in countries such as Tanzania,

News

The Mozilla Data Collective Data Assistant is now available in Alpha

We're excited to share that the Mozilla Data Collective Data Assistant is now available in Alpha. Visit https://mozilladatacollective.com/chat to get started.

How Open Licensing is Changing with AI: The NOODL License

Author: Alek Tarkowski. For the last twenty-five years, standardized open licenses were increasingly seen as a main tool for democratizing access to knowledge. Over less than a decade, a relatively narrow set of canonical choices emerged: the Creative Commons licensing stack https://creativecommons.org/licenses/list.en, coupled with

News

Exciting Updates for Mozilla Data Collective

Today we’re excited to announce platform changes to Mozilla Data Collective that bring us closer to fulfilling our promise to give everyone the Data Platform for Human Agency and Fair Value Exchange.

MDC

Request to Access Feature is now Available

With this release, uploaders can now gate access to their datasets, requiring users to request permission and share their email address before downloading.

Guide

Fine-Tune a Speech-to-Text Model for Any Language - Including Yours

A step-by-step developer tutorial from Kostis at Mozilla Data Collective

Guides

Datasheets: The Missing Manual for your Dataset

In this video, produced by the Data Nutrition Project and illustrated by Jessica Yurkofsky, you'll learn more about the role of the datasheet and how you can use it to give clear guidance to potential downloaders about how your data can (and can't!) be used.

News

Upcoming Domain Change on 09 April

In mid-April, Mozilla Data Collective's primary domain will change to mozilladatacollective.com
