News

CoVoST 2 datasets now available through Mozilla Data Collective

In 2020, Meta introduced CoVoST 2, a benchmark dataset based on Mozilla Common Voice that covers 34 translation directions for speech-to-text translation. It is one of the most widely used benchmark datasets for speech translation, with nearly 400 citations on Google Scholar. The dataset is built on version 4.0 of Mozilla Common Voice, and that version is one of the ones we are asked for most often.

As expectations around data privacy and the right to be forgotten have changed, we put access to old versions of Common Voice behind an email-based process. Researchers who want a previous version now have to email us with a description of their project and why they need the data. This procedure is not automatic, and researchers still have to download the whole version 4.0 dataset for each source language in order to extract the subset of clips used in CoVoST.

We recently launched a request-to-access feature on Mozilla Data Collective (MDC), which lets uploaders gate their datasets behind individual access requests. We saw this as an excellent opportunity to streamline access to the CoVoST 2 dataset.

As part of our programme of dataset curation, we are providing experiment-ready versions of CoVoST 2 directly through the MDC platform.

Some of the most requested datasets include:

Browse all the CoVoST 2 datasets

To download CoVoST 2, sign up for an MDC account, find the dataset through our search bar or our Data Assistant, and click “Request to access” for the translation direction you need. You will need to give a short description of your intended use case and wait for the request to be approved; we aim to approve all requests within one business day.

Explore all Mozilla Data Collective datasets →

Get in touch →

15 Datasets for Fine-Tuning Whisper on a New Language in 2026

Why fine-tuning Whisper is a dataset problem

OpenAI's Whisper changed what's possible in speech recognition: a single multilingual model with strong zero-shot performance across dozens of languages. But "dozens" is the catch. But for the long tail of the world's

How Radio Free Europe/Radio Liberty Also Serves Its Communities Through Its Datasets

For more than 75 years, Radio Free Europe/Radio Liberty (RFE/RL) has promoted democratic values by providing accurate, uncensored news and debate in countries where a free press is threatened. RFE/RL reaches more than 44 million people every week across 18 countries, in 24 languages, including Persian, Russian,

What makes a good dataset sample — and how to create one

In this post, we walk you through how to create a useful dataset sample as a preview of your dataset, and guide you in uploading it to the MDC platform.

15 Datasets for Building a Low-Resource Translation Model in 2026

The problem with "low-resource" machine translation

Most production machine-translation systems in 2026 are still trained on a fairly narrow set of language pairs: the 50 or so for which the open web supplies enough parallel text to push BLEU scores into useful territory. Below that line,
