Call for proposals: MDC is commissioning mission-aligned datasets!

Francis Tyers, Christine Kim

17 Mar 2026 — 3 min read

We're looking to bring together a universe of datasets (photo credit: Andre Moura)

Mozilla Data Collective is building towards a multicultural, multilingual, and multimodal future that works for all of us. Over the past few months, we’ve listened as people have flagged the kinds of datasets they need but are struggling to find.

So we’re pleased to announce that MDC will be commissioning several exciting datasets over the next three months to begin meeting this need (and you can look forward to future rounds as well!). If you represent a small or medium-sized enterprise (SME), nonprofit, or other mission-aligned organisation that works in dataset creation, we want to hear from you! This is a commissioning programme: the resulting datasets will be owned by Mozilla Data Collective, so we are particularly looking for datasets built from scratch.

We welcome all quotes on dataset creation in areas that are unserved or under-served by existing dataset offerings on MDC.

Within our mission of a multilingual, multicultural and multimodal tech future, we seek datasets that speak to common tasks, modalities, and domains:

Egocentric videos of daily tasks oriented towards safe robotics
Agentic workflows and interactions with AI agents
Computer vision and multimodal datasets relating to physical or online safety
Speech recognition in the healthcare domain
Dialogic interactions in the finance and banking domains
Annotated datasets of performing arts, including singing and dancing

This is not an exhaustive list. If you would like to create a dataset that meets a real, identified need, get in touch!

Particular languages (and varieties) of interest are:

Arabic (Morocco, Libya, Algeria, Iraq and Syria)
Latin American Spanish
Brazilian Portuguese
Tagalog and other languages of the Philippines
Sign languages (American, British, French, Spanish)
Kannada
Tamil
Burmese and other languages of Burma (Myanmar)
Croatian
Tigrinya
Vietnamese
Thai
Pashto
Uzbek
German
French
Kurdish

Illustrative examples (for inspiration only):

Dialogic interaction dataset of customer support interactions
Annotated video datasets of public health related lectures
Multimodal dataset of teleoperated robotic arm carrying out tasks like box packing
High-value numerical datasets, for example information relating to pricing or materials
A single-speaker dataset of studio-quality recordings of Tagalog (male or female speaker) in the health domain, aimed at text-to-speech
A collection of recordings of alphanumerics in dialectal Arabic (Morocco, Tunisia, Libya, Algeria) in diverse noise environments and spoken by a diverse population

We expect to move quickly, so we will start reviewing quotes on a first-come, first-served basis from 23rd March 2026. You can expect a response to your quote within 1-2 business days.

Quotes should ideally include the following information on a per-dataset basis:

Description of dataset
Fixed costs associated with the project
Ethical and compliance considerations (we will strongly prefer proposals from close-to-context communities)
Unit price, for example per hour for ASR or TTS data, or per thousand tokens or per interaction for LLM corpora
Estimated volume deliverable per month
Brief outline of annotation process
Brief explanation of the project team and its qualifications
Earliest delivery date

Partners are encouraged to send multiple options. A simple document of at most 1-2 pages per dataset will suffice. Please do not feel the need to spend time on visual components.

Note: It is important to us that datasets are collected in an ethical and compliant manner. Datasets should be fully anonymised and compliant with relevant privacy regulations.

Budget:

We expect most quotes to fall between $1,000 and $25,000 on a per-dataset basis.

Timeline:

Announcement of the call: 17th March
First quotes received: 23rd March or before
Decisions made: on a rolling basis (first decision by 27th March)
Delivery window: 27th March – 27th June (3 months)

When evaluating the quotes, we will take into account the following factors: delivery speed, volume of data, ethics, and price. High quality is assumed 🙂

Please send quotes or requests for clarification directly to mozilladatacollective@mozillafoundation.org with the subject line “Dataset commissioning”.
