Call for proposals: MDC is commissioning mission-aligned datasets!
Mozilla Data Collective is building towards a multicultural, multilingual, and multimodal future that works for all of us. Over the past few months, we’ve listened as people have flagged the kinds of datasets they need but are struggling to find.
So we’re pleased to announce that MDC will be commissioning several exciting datasets over the next three months to begin meeting this need (and you can look forward to future rounds as well!). If you represent a small or medium-sized enterprise (SME), nonprofit, or other mission-aligned organisation that works in dataset creation, we want to hear from you! This is a commissioning programme: the resulting datasets will be owned by Mozilla Data Collective, so we are particularly looking for datasets built from scratch.
We welcome all quotes on dataset creation in areas that are unserved or under-served by existing dataset offerings on MDC.
Within our mission of a multilingual, multicultural and multimodal tech future, we seek datasets that speak to common tasks, modalities, and domains:
- Egocentric videos of daily tasks oriented towards safe robotics
- Agentic workflows and interactions with AI agents
- Computer vision and multimodal datasets relating to physical or online safety
- Speech recognition in the healthcare domain
- Dialogic interactions in the finance and banking domains
- Annotated datasets of performing arts, including singing and dancing
This is not an exhaustive list. If you would like to create a dataset that meets a real, identified need, get in touch!
Particular languages (and varieties) of interest are:
- Arabic (Morocco, Libya, Algeria, Iraq and Syria)
- Latin American Spanish
- Brazilian Portuguese
- Tagalog and other languages of the Philippines
- Sign languages (American, British, French, Spanish)
- Kannada
- Tamil
- Burmese and other languages of Burma (Myanmar)
- Croatian
- Tigrinya
- Vietnamese
- Thai
- Pashto
- Uzbek
- German
- French
- Kurdish
Illustrative examples (for inspiration only):
- Dialogic interaction dataset of customer support interactions
- Annotated video datasets of public health related lectures
- Multimodal dataset of teleoperated robotic arm carrying out tasks like box packing
- High-value numerical datasets, for example information relating to pricing or materials
- A single-speaker dataset of studio-quality recordings of Tagalog (male or female speaker) in the health domain, aimed at text-to-speech
- A collection of recordings of alphanumerics in dialectal Arabic (Morocco, Tunisia, Libya, Algeria) in diverse noise environments and spoken by a diverse population.
We expect to move quickly, so we will start reviewing quotes on a first-come, first-served basis from 23rd March 2026. You can expect a response to your quote within 1-2 business days.
Quotes would ideally include the following information on a per-dataset basis:
- Description of dataset
- Fixed costs associated with the project
- Ethical and compliance considerations (we will strongly prefer proposals from close-to-context communities)
- Unit price, which could be per hour in the case of ASR or TTS, or per thousand tokens or per interaction in the case of LLM corpora
- Estimated volume deliverable per month
- Brief outline of annotation process
- Brief explanation of the project team and its qualifications
- Earliest delivery date
Partners are encouraged to send multiple options. A simple document will suffice, with a maximum of 1-2 pages per dataset. Please do not feel the need to spend time on visual components.
Note: It is important to us that datasets are collected in an ethical and compliant manner. Datasets should be fully anonymised and compliant with relevant privacy regulations.
Budget:
We expect most quotes to fall between $1,000 and $25,000 on a per-dataset basis.
Timeline:
- Announcement of the call: 17th March
- First quotes received: 23rd March or before
- Decisions made: on a rolling basis (first decision by 27th March)
- Delivery window: 27th March – 27th June (3 months)
When evaluating the quotes, we will take into account the following factors: delivery speed, volume of data, ethics, and price. High quality is assumed 🙂
Please send quotes or requests for clarification directly to mozilladatacollective@mozillafoundation.org with the subject line “Dataset commissioning”.