CoVoST 2 datasets now available through Mozilla Data Collective

In 2020, Meta introduced CoVoST 2, a benchmark dataset based on Mozilla Common Voice. CoVoST 2 covers 34 translation directions for audio-to-text machine translation. It is one of the most widely used benchmark datasets for speech translation, with nearly 400 citations on Google Scholar. The dataset is based on version 4.0 of Mozilla Common Voice, which is one of the versions we are asked for most often.

As a result of changing expectations about data privacy and the right to be forgotten, we put access to old versions of Common Voice behind an email-based process. Researchers who want to access previous versions now have to email us with a description of their project and why they need the data. This procedure is not automated, and it still requires researchers to download the whole version 4.0 dataset for each source language in order to extract the subset of clips used in CoVoST 2.

We recently launched a request-to-access feature on Mozilla Data Collective (MDC), which allows uploaders to gate their datasets behind individual access requests. We saw this as an excellent opportunity to streamline access to the CoVoST 2 dataset.

As part of our programme of dataset curation, we are providing experiment-ready versions of CoVoST 2 directly through the MDC platform. 

Some of the most requested datasets include:

Browse all the CoVoST 2 datasets here.

If you want to download CoVoST 2, sign up for an MDC account, search for the dataset through our search bar or our Data Assistant, and click “Request to access” for the translation direction you need. You will need to give a short description of your intended use case and wait for the access request to be approved. We aim to approve all requests within one business day.

Explore all Mozilla Data Collective datasets →

Get in touch →