FAQ: Can I get the Common Voice or other MDC datasets from other platforms like GitHub or Hugging Face?

We have no plans to host Mozilla community datasets through third parties at this time, because doing so makes governance and stewardship extremely challenging. For example, when someone revokes their consent to be included in a dataset, we need a way to remove them from the dataset and update the data listings. When a dataset is mirrored and hosted in multiple places, it becomes difficult to honor these requests and ensure that every available version of the dataset excludes those individuals. For this reason, Mozilla community datasets, including the Mozilla Common Voice datasets, are available exclusively through MDC, and our new terms reflect this. Some of our contributors’ open datasets are available in other places.

We want to make sure that those of you who enjoy Hugging Face’s great model and training features can still use them easily, so we’ve published an API reference page with instructions on how to create access credentials and download datasets programmatically.
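As a rough illustration of what a programmatic download might look like, here is a minimal Python sketch that uses only the standard library. It does not reflect the actual MDC API: the URL is a placeholder, the bearer-token header scheme is an assumption, and the function names are hypothetical. Take the real endpoint, credential format, and header names from the API reference page.

```python
import urllib.request

# Hypothetical placeholder URL; use the download URL given in the API reference.
DATASET_URL = "https://example.invalid/datasets/DATASET_ID/download"


def build_download_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying the access credential.

    Assumption: the credential is sent as a bearer token in the
    Authorization header. The real scheme may differ.
    """
    if not token:
        raise ValueError("missing API token")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def download(url: str, token: str, dest: str, chunk_size: int = 1 << 20) -> int:
    """Stream the response body to dest and return the number of bytes written."""
    request = build_download_request(url, token)
    written = 0
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        # Read in chunks so large archives are never held fully in memory.
        while chunk := response.read(chunk_size):
            out.write(chunk)
            written += len(chunk)
    return written
```

To use it, call download(DATASET_URL, token, "dataset.tar.gz") with a token read from an environment variable or a secrets store rather than hard-coded in the script. Streaming in chunks matters here because dataset archives can be many gigabytes.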
