Behind the scenes: Integrating MDC datasets into your Python project
Overcoming the complexity of AI
Mozilla Data Collective helps communities to offer unique, multilingual, multicultural, and multimodal datasets. From transcribed and translated videos of narrated Ekpeye folktales to complex question-answering text pairs for the Georgian language, the diversity of datasets on our platform is core to our mission.
But with the inherent complexity of all these different data formats and representations comes a challenge: how do we make these datasets easily understood by machines?
A Python and pandas solution
As it stands, the most popular programming language for AI development is Python, and one of Python's most popular data manipulation frameworks is pandas. We therefore made it our priority to support automatic integration of MDC-hosted datasets, making them loadable as a pandas DataFrame with a single function call.
From a user and developer perspective, this means that regardless of whether the dataset you want to use is text, video, or audio, in English or in Lingala, for LLM fine-tuning or for Automatic Speech Recognition, the only thing you need to do to integrate it into your Python project is:
from datacollective import load_dataset
dataframe = load_dataset("my-dataset-id")
A single function call downloads, extracts, and parses an MDC dataset into a pandas.DataFrame
As simple as that 😌
Behind the scenes
In this blog post we want to pull back the curtain on the inner workings of the datacollective Python package and show how we make this seamless integration possible.
There are three main questions that need to be addressed when preparing a dataset for processing:
- What kind of task will the dataset be used for? (ASR, TTS, LM, etc.)
- How are the directories and files structured inside the dataset archive?
- How do we map the files and their contents to specific columns in a DataFrame?
We decided the most effective way to answer these questions universally, across all datasets, is by defining a schema.yaml file for each dataset.
Why a declarative YAML file?
There are a few reasons we went with a schema-first approach rather than, say, shipping custom loader code per dataset. The three most prominent ones are:
- Security: you never need to download and execute arbitrary code from the internet on your machine. All behaviour is driven by our open-source datacollective library itself, meaning that the schema only describes what the data looks like, not how to process it.
- Decoupled releases: because all schemas live in a public registry that the SDK queries at runtime, we can add support for new datasets without releasing a new version of the Python package every time. When a new dataset is released, a new schema file is added to the registry and supported seamlessly.
- Human-readable and community-friendly: YAML is easy to read, write, and review, lowering the technical entry barrier for maintaining an accurate registry, without requiring any deep Python expertise.
Answering the three questions
- Task type
The first question, what is this dataset for?, is answered by the task field, which is set by the data provider when uploading the dataset. Currently supported tasks on the MDC platform include Natural Language Processing, Automatic Speech Recognition, Language Identification, Machine Translation, Language Modelling, Large Language Modelling, Natural Language Understanding, Natural Language Generation, Computer-Aided Language Learning, Retrieval-Augmented Generation, Computer Vision, Machine Learning, and Other. For simplicity, we currently ask each uploader to choose a single task per dataset, which helps people who are searching or filtering for a dataset that fits their purposes.
- File structure
The second question, how are the files laid out?, is answered by what we call a loading strategy, which the SDK infers from the fields present in the schema. There are three strategies:
- Index-based (default): a metadata file (CSV / TSV / pipe-delimited) lists each sample. Key fields: format, index_file, columns.
- Multi-split: multiple split files (train, dev, test, etc.) each contain samples. Key fields: root_strategy: "multi_split", splits.
- Paired-glob: each audio file has a matching .txt sidecar and there is no index file at all. Key fields: root_strategy: "paired_glob", file_pattern, audio_extension.
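To make the two non-default strategies concrete, here is a rough sketch of the schema fields each one would use. These fragments are illustrative only: the field names come from the strategy descriptions above, but the values (file names, patterns) are invented.

```yaml
# Hypothetical multi-split schema: each split file contains its own samples
root_strategy: "multi_split"
format: "tsv"
splits:
  train: "train.tsv"
  dev: "dev.tsv"
  test: "test.tsv"
---
# Hypothetical paired-glob schema: no index file; each audio file
# has a .txt sidecar with the same base name
root_strategy: "paired_glob"
file_pattern: "clips/**/*"
audio_extension: ".wav"
```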
- Column mapping
The third question, what goes into each DataFrame column?, is answered by the columns section of the schema. Each key under columns becomes a column name in the resulting DataFrame, and its value tells the SDK where to find the data and how to interpret it:
columns:
audio_path:
source_column: "path" # column name in the index file
dtype: "file_path" # resolved to an absolute path on disk
transcription:
source_column: "sentence"
dtype: "string"
speaker_id:
source_column: "client_id"
dtype: "category"
    optional: true           # silently skipped if the column is missing
Supported dtype values include `string`, `file_path` (resolved to an absolute path), `category` (pandas Categorical), `int`, and `float`.
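As a rough sketch, here is how a column mapping like the one above could be applied with pandas. The mapping, the sample index data, and the dataset root below are invented for illustration; the real datacollective internals may differ.

```python
import pandas as pd
from pathlib import Path

# Hypothetical column mapping mirroring the schema snippet above
mapping = {
    "audio_path": {"source_column": "path", "dtype": "file_path"},
    "transcription": {"source_column": "sentence", "dtype": "string"},
    "speaker_id": {"source_column": "client_id", "dtype": "category", "optional": True},
}

# Stand-in for a parsed index file; note client_id is absent
raw = pd.DataFrame({
    "path": ["a.wav", "b.wav"],
    "sentence": ["hello", "world"],
})

dataset_root = Path("/tmp/dataset")
out = pd.DataFrame()
for name, spec in mapping.items():
    src = spec["source_column"]
    if src not in raw.columns:
        if spec.get("optional"):
            continue  # silently skip missing optional columns
        raise KeyError(f"required column {src!r} missing from index file")
    col = raw[src]
    if spec["dtype"] == "file_path":
        col = col.map(lambda p: str(dataset_root / p))  # resolve to absolute paths
    elif spec["dtype"] == "category":
        col = col.astype("category")
    else:
        col = col.astype("string")
    out[name] = col

print(list(out.columns))  # ['audio_path', 'transcription']
```

Because speaker_id is marked optional and its source column is missing, it is skipped without an error, exactly the behaviour the schema comment describes.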
The schema.yaml in practice
Here is an example of what a schema looks like for the "Ehugbo TTS: biblical text to speech dataset in Ehugbo Language" dataset:
dataset_id: "cmihqro9h0238o207fgg5cmf6"
task: "TTS"
format: "csv"
encoding: "utf-8-sig"
checksum: "c29134fe715a9a794f44c94c36022f548e97d6551658bc02f9540e05c1f5f203"
index_file: "metadata.csv"
base_audio_path: "audios/"
columns:
audio_path:
source_column: "Audio File Name (.wav)"
dtype: "string"
transcription:
source_column: "Transcript"
dtype: "string"
book:
source_column: "Book (audio folder)"
dtype: "category"
speaker_id:
source_column: "Pseudo ID"
dtype: "category"
gender:
source_column: "Gender"
dtype: "category"
optional: true
duration:
source_column: "Duration"
dtype: "float"
optional: true
Schema.yaml describing how to parse an MDC dataset
This tells our datacollective package: "Read metadata.csv as comma-separated (UTF-8 with BOM), map the Audio File Name (.wav) column to audio paths (under the audios/ directory), Transcript to transcriptions, Book (audio folder) and Pseudo ID as categorical metadata columns for book and speaker, with optional Gender and Duration columns when present."
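Once loaded, the result is an ordinary pandas DataFrame, so all of pandas' tooling applies. The snippet below sketches a typical next step; since running load_dataset() requires a network fetch, a stand-in frame with invented values mimics the columns the Ehugbo schema would produce.

```python
import pandas as pd

# Stand-in for the DataFrame that load_dataset() would return
# for the Ehugbo schema above (all values invented)
df = pd.DataFrame({
    "audio_path": ["audios/gen/001.wav", "audios/gen/002.wav", "audios/exo/001.wav"],
    "transcription": ["verse one", "verse two", "verse three"],
    "book": pd.Categorical(["Genesis", "Genesis", "Exodus"]),
    "speaker_id": pd.Categorical(["spk_01", "spk_01", "spk_02"]),
})

# Categorical dtypes make grouping and filtering cheap and explicit
per_book = df.groupby("book", observed=True).size()
genesis_only = df[df["book"] == "Genesis"]
print(per_book.to_dict())  # {'Exodus': 1, 'Genesis': 2}
```

This is why the schema maps metadata columns like Book and Pseudo ID to the category dtype: downstream grouping and per-speaker splits come essentially for free.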
Limitations
As our platform and dataset catalogue are ever-evolving, the current implementation (as of March 2026) has a few limitations:
- Not all tasks are supported yet: the Python SDK currently handles ASR and TTS datasets, but support for additional tasks is actively being worked on and will be released in the coming weeks!
- A dataset can serve multiple tasks, but the current schema model supports declaring a single task per schema. Multi-task schemas are on the roadmap.
- Not every MDC dataset has an associated schema.yaml yet. Coverage is growing, and community contributions are very welcome (see the call to contribute below!).
Next steps
The schema-based loading system is already working in production for more than 350 MDC datasets, and we are excited about where it goes from here. These are the areas we are actively investing in:
- More tasks: the SDK's task registry is designed to be extended with minimal friction: adding a new loader class plus a one-line registration is all it takes. Machine Translation (MT), Language Modelling (LM), and further multimodal tasks are on the near-term roadmap. As the MDC dataset catalogue grows, so will the set of supported tasks.
- More schemas: we are working through the catalogue systematically, and we encourage data providers and community members to contribute schemas for datasets they care about (see the call to action below).
- Broader data format support: the AI ecosystem is evolving rapidly and we want to keep pace with it. We are planning to expand load_dataset() into a universal entry point that can emit data in whichever format best fits your pipeline.
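The loader-class-plus-one-line-registration pattern mentioned under "More tasks" can be sketched roughly as follows. All names here are hypothetical illustrations of the pattern, not the actual datacollective internals.

```python
# Hypothetical task registry: loader classes register themselves under a
# task name, and the SDK dispatches on the schema's `task` field.
LOADERS: dict[str, type] = {}

def register_task(name: str):
    def decorator(cls):
        LOADERS[name] = cls  # the "one-line registration"
        return cls
    return decorator

@register_task("ASR")
class AsrLoader:
    def load(self, schema: dict) -> str:
        return f"loading ASR dataset {schema['dataset_id']}"

@register_task("TTS")
class TtsLoader:
    def load(self, schema: dict) -> str:
        return f"loading TTS dataset {schema['dataset_id']}"

# Dispatch: look up the loader class from the schema's declared task
schema = {"dataset_id": "example-id", "task": "TTS"}
loader = LOADERS[schema["task"]]()
print(loader.load(schema))  # loading TTS dataset example-id
```

Under this kind of design, supporting a new task is a matter of adding one class and one decorator, without touching the dispatch logic.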
Want to contribute?
We'd love your help to make datacollective better for everyone:
"I want to use dataset <X,Y,Z> but it's not supported by load_dataset() yet!" → Open an issue in our repository to request it. We'll try to add support as quickly as possible!
"I have uploaded my dataset to MDC and want to help the community use it by adding support for the load_dataset() function. What can I do?" → Consider opening a Pull Request with the schema.yaml for your dataset. We have step-by-step documentation to guide you through the process in this link.
"I have questions or ideas I want to share with you!" → Feel free to open an issue in our repository if it's something technical, or find us on social media at the bottom of our website. Feedback from our community is incredibly important to us!
If you found this post useful, don't forget to check out: