FAQ: What kind of datasets can I publish on Mozilla Data Collective?

Mozilla Data Collective

25 Nov 2025

Our priority is technology that is more multilingual, multicultural, and multi-modal. We prioritise helping communities unlock content that is not already on the web, and we prefer audio, image, and video formats, though we also accept text documents that advance these goals. We expect each dataset to be prepared (or preparable) in a way that enables its use in machine learning contexts, or to be intended for such use in research, evaluation, training, or similar endeavors. The specific details of how each dataset can be used are up to you, and are set via the terms on your dataset's corresponding datasheet.

Datasets should be organized in a way that makes sense for their contents and intended use. When you upload a dataset to Mozilla Data Collective, you will need to package its contents as a .tar.gz archive and upload it as a single file.
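As a rough illustration of the packaging step, here is a minimal sketch using Python's standard tarfile module. The function names package_dataset and list_archive are hypothetical helpers written for this example, not part of any Mozilla Data Collective tooling, and the folder layout inside the archive is an assumption; organize yours however suits your data.

```python
import tarfile
from pathlib import Path


def package_dataset(dataset_dir: str, archive_path: str) -> Path:
    """Pack a dataset folder into a single gzip-compressed tar archive.

    Hypothetical helper for illustration; the names and layout are assumptions.
    """
    source = Path(dataset_dir).resolve()
    target = Path(archive_path).resolve()

    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    # Writing the archive inside the folder being packed would make the
    # archive try to include itself, so refuse that case up front.
    if source in target.parents:
        raise ValueError("archive_path must be outside dataset_dir")

    # Mode "w:gz" writes a tar archive compressed with gzip (.tar.gz).
    with tarfile.open(target, "w:gz") as archive:
        # arcname stores paths relative to the dataset folder name,
        # rather than the absolute path on this machine.
        archive.add(source, arcname=source.name)
    return target


def list_archive(archive_path: str) -> list[str]:
    """Return the sorted member names of a .tar.gz archive for a quick check."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())
```

Calling package_dataset("my_dataset", "my_dataset.tar.gz") would produce the single file to upload, and list_archive("my_dataset.tar.gz") lets you confirm what ended up inside before uploading. Both paths here are placeholders.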

Datasets must adhere to the Mozilla Data Collective terms of use. When you upload a dataset to Mozilla Data Collective, you are responsible for ensuring that you have the rights to distribute it and that it does not contain any data listed in the Prohibited Data Content section of the terms.

