Behind the scenes: Integrating MDC datasets into your Python project

pandas is one of the most popular and widely used data manipulation libraries in Python. Also, an adorable mammal. Photo by Xiangkun ZHU / Unsplash

Overcoming the complexity of AI

Mozilla Data Collective helps communities offer unique, multilingual, multicultural, and multimodal datasets. From transcribed and translated videos of narrated Ekpeye folktales to complex question-answering text pairs for the Georgian language, the diversity of datasets on our platform is core to our mission.

But with the inherent complexity of all these different data formats and representations comes a challenge: how do we make these datasets easily understood by machines?

A Python and pandas solution

As it stands, the most popular programming language for AI development is Python, and one of Python's most popular data manipulation frameworks is pandas. We therefore made it our priority to support automatic integration of MDC-hosted datasets, making them loadable as a pandas DataFrame with a single function call.

From a user and developer perspective, this means that regardless of whether the dataset you want to use is text, video, or audio, in English or in Lingala, for LLM fine-tuning or for Automatic Speech Recognition, the only thing you need to do to integrate it into your Python project is:

from datacollective import load_dataset
dataframe = load_dataset("my-dataset-id")

A single function call downloads, extracts, and parses an MDC dataset into a pandas DataFrame

As simple as that 😌
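Because the result is a plain pandas DataFrame, everything downstream is standard pandas. Here is a rough sketch of what inspecting an ASR-style result could look like; the DataFrame below is constructed by hand for illustration, with column names borrowed from the schema examples later in this post:

```python
import pandas as pd

# Illustrative stand-in for the DataFrame load_dataset() would return
# for a speech dataset; the values here are made up.
df = pd.DataFrame({
    "audio_path": ["audios/clip_001.wav", "audios/clip_002.wav"],
    "transcription": ["mbote na yo", "sango nini"],
    "speaker_id": pd.Categorical(["spk_01", "spk_02"]),
})

print(df.shape)                  # (2, 3)
print(df.dtypes["speaker_id"])   # category
```

From here you can filter, group, or feed the frame straight into your training pipeline like any other pandas DataFrame.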

Behind the scenes

In this blog post we want to pull back the curtain on the inner workings of the datacollective Python package and show how we make this seamless integration possible.

There are three main questions that need to be addressed when preparing a dataset for processing:

  1. What kind of task will the dataset be used for? (ASR, TTS, LM, etc.)
  2. How are the directories and files structured inside the dataset archive?
  3. How do we map the files and their contents to specific columns in a DataFrame?

We decided the most effective way to answer these questions universally, across all datasets, is by defining a schema.yaml file for each dataset.

Why a declarative YAML file?

There are a few reasons we went with a schema-first approach rather than, say, shipping custom loader code per dataset. The three most prominent ones are:

  • Security: you never need to download and execute arbitrary code from the internet on your machine. All behaviour is driven by our open-source datacollective library itself, meaning the schema only describes what the data looks like, not how to process it.
  • Decoupled releases: because all schemas live in a public registry that the SDK queries at runtime, we can add support for new datasets without releasing a new version of the Python package every time. When a new dataset is released, a new schema file is added in the registry and supported seamlessly. 
  • Human-readable and community-friendly: YAML is easy to read, write, and review, lowering the technical entry barrier for maintaining an accurate registry, without requiring any deep Python expertise.

Answering the three questions

  1. Task type

The first question: what is this dataset for? is answered by the task field, which is set by the data provider when uploading the dataset. Currently supported tasks on the MDC platform include: Natural Language Processing, Automatic Speech Recognition, Language Identification, Machine Translation, Language Modelling, Large Language Modelling, Natural Language Understanding, Natural Language Generation, Computer-Aided Language Learning, Retrieval-Augmented Generation, Computer Vision, Machine Learning, and Other. For simplicity, we currently ask each uploader to choose a single task per dataset, which helps people searching or filtering for datasets that fit their purposes.

  2. File structure

The second question: how are the files laid out? is answered by what we call a loading strategy, which the SDK infers from the fields present in the schema. There are three strategies:

  • Index-based (default): a metadata file (CSV / TSV / pipe-delimited) lists each sample. Key fields: format, index_file, columns.
  • Multi-split: multiple split files (train, dev, test, etc.) each contain samples. Key fields: root_strategy: "multi_split", splits.
  • Paired-glob: each audio file has a matching .txt sidecar and there is no index file at all. Key fields: root_strategy: "paired_glob", file_pattern, audio_extension.
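The inference itself can be thought of as a simple key check over the schema. The sketch below is illustrative only: the function name and exact precedence rules are ours, not the datacollective package's actual code.

```python
def infer_loading_strategy(schema: dict) -> str:
    """Pick a loading strategy from the keys present in a schema dict.

    Illustrative only: the real SDK may use different names and rules.
    """
    # An explicit root_strategy wins over the default.
    strategy = schema.get("root_strategy")
    if strategy in ("multi_split", "paired_glob"):
        return strategy
    # Otherwise fall back to the index-based default.
    if "index_file" in schema:
        return "index_based"
    raise ValueError("schema does not match any known loading strategy")

# Dispatch examples for the three strategies described above:
default = infer_loading_strategy({"index_file": "metadata.csv"})
paired = infer_loading_strategy({"root_strategy": "paired_glob",
                                 "file_pattern": "*.wav"})
```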

  3. Column mapping

The third question: what goes into each DataFrame column? is answered by the columns section of the schema. Each key under columns becomes a column name in the resulting DataFrame, and its value tells the SDK where to find the data and how to interpret it:

columns:
  audio_path:
    source_column: "path"       # column name in the index file
    dtype: "file_path"          # resolved to an absolute path on disk
  transcription:
    source_column: "sentence"
    dtype: "string"
  speaker_id:
    source_column: "client_id"
    dtype: "category"
    optional: true              # silently skipped if the column is missing

Supported dtype values include `string`, `file_path` (resolved to an absolute path), `category` (pandas Categorical), `int`, and `float`.
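Mechanically, applying a columns section boils down to selecting, renaming, and casting columns from the index file. A rough sketch with pandas and a tiny in-memory index file; the mapping loop below is our own illustration, not the SDK's implementation:

```python
import io
import pandas as pd

# Column mapping as it would appear after parsing the schema's columns section.
COLUMNS = {
    "audio_path":    {"source_column": "path",      "dtype": "file_path"},
    "transcription": {"source_column": "sentence",  "dtype": "string"},
    "speaker_id":    {"source_column": "client_id", "dtype": "category",
                      "optional": True},
}

# A miniature index file; note it has no client_id column.
index_csv = io.StringIO("path,sentence\naudios/a.wav,hello\naudios/b.wav,world\n")
raw = pd.read_csv(index_csv)

out = pd.DataFrame()
for name, spec in COLUMNS.items():
    src = spec["source_column"]
    if src not in raw.columns:
        if spec.get("optional"):
            continue                # silently skip missing optional columns
        raise KeyError(f"required column {src!r} missing from index file")
    series = raw[src]
    if spec["dtype"] == "category":
        series = series.astype("category")
    elif spec["dtype"] == "file_path":
        series = series.astype(str)  # the real SDK resolves absolute paths here
    out[name] = series

print(list(out.columns))  # ['audio_path', 'transcription']
```

Because client_id is absent from the index file and speaker_id is marked optional, the column is skipped without an error, exactly the behaviour the optional flag describes.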

The schema.yaml in practice

Here is an example of what a schema looks like for the "Ehugbo TTS: biblical text to speech dataset in Ehugbo Language" dataset:

dataset_id: "cmihqro9h0238o207fgg5cmf6"
task: "TTS"
format: "csv"
encoding: "utf-8-sig"
checksum: "c29134fe715a9a794f44c94c36022f548e97d6551658bc02f9540e05c1f5f203"

index_file: "metadata.csv"
base_audio_path: "audios/"

columns:
  audio_path:
    source_column: "Audio File Name (.wav)"
    dtype: "string"
  transcription:
    source_column: "Transcript"
    dtype: "string"
  book:
    source_column: "Book (audio folder)"
    dtype: "category"
  speaker_id:
    source_column: "Pseudo ID"
    dtype: "category"
  gender:
    source_column: "Gender"
    dtype: "category"
    optional: true
  duration:
    source_column: "Duration"
    dtype: "float"
    optional: true

Schema.yaml describing how to parse an MDC dataset

This tells our datacollective package: "Read metadata.csv as comma-separated (UTF-8 with BOM), map the Audio File Name (.wav) column to audio paths (under the audios/ directory), Transcript to transcriptions, Book (audio folder) and Pseudo ID as categorical metadata columns for book and speaker, with optional Gender and Duration columns when present."
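The checksum field also lets the loader verify a downloaded archive before extracting it. A minimal sketch using Python's standard hashlib, assuming (as the 64-hex-character value suggests) that the checksum is a SHA-256 digest of the archive; the function name is ours, not the SDK's:

```python
import hashlib

def verify_checksum(archive_bytes: bytes, expected_sha256: str) -> bool:
    """Return True if the archive's SHA-256 digest matches the schema's checksum."""
    return hashlib.sha256(archive_bytes).hexdigest() == expected_sha256

# Demonstration with a stand-in payload rather than a real dataset archive.
fake_archive = b"not a real dataset archive"
digest = hashlib.sha256(fake_archive).hexdigest()

ok = verify_checksum(fake_archive, digest)        # matches
tampered = verify_checksum(b"tampered", digest)   # does not match
```

A failed check would abort the load before any parsing happens, so a corrupted or truncated download never silently becomes a DataFrame.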

Limitations

As our platform and datasets catalogue are ever-evolving, our current implementation has a few limitations as of this writing (March 2026):

  • Not all tasks are supported yet: the Python SDK currently handles ASR and TTS datasets, but support for additional tasks is actively being worked on and will be released in the coming weeks!
  • A dataset can serve multiple tasks, but the current schema model supports declaring a single task per schema. Multi-task schemas are on the roadmap.
  • Not every MDC dataset has an associated schema.yaml yet. Coverage is growing, and community contributions are very welcome (see the call to contribute below!).

Next steps

The schema-based loading system is already working in production for more than 350 MDC datasets, and we are excited about where it goes from here. These are the areas we are actively investing in:

  • More tasks: the SDK's task registry is designed to be extended with minimal friction: a new loader class plus a one-line registration is all it takes. Machine Translation (MT), Language Modelling (LM), and further multimodal tasks are on the near-term roadmap. As the MDC dataset catalogue grows, so will the set of supported tasks.
  • More schemas: we are working through the catalogue systematically, and we encourage data providers and community members to contribute schemas for datasets they care about (see the call to action below).
  • Broader data format support: the AI ecosystem is evolving rapidly and we want to keep pace with it. We are planning to expand load_dataset() into a universal entry point that can emit data in whichever format best fits your pipeline.
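The "loader class plus a one-line registration" extension mechanism mentioned above can be sketched as a decorator-driven registry. All names here (TASK_REGISTRY, register_task, BaseLoader) are illustrative, not the datacollective package's actual API:

```python
# A task registry mapping task names to loader classes.
TASK_REGISTRY = {}

def register_task(task_name):
    """Class decorator: the 'one-line registration' for a new loader."""
    def decorator(cls):
        TASK_REGISTRY[task_name] = cls
        return cls
    return decorator

class BaseLoader:
    """Common interface every task-specific loader implements."""
    def load(self, schema: dict):
        raise NotImplementedError

@register_task("ASR")
class ASRLoader(BaseLoader):
    def load(self, schema: dict):
        # A real loader would apply the strategy and column mapping here.
        return f"loading ASR dataset {schema['dataset_id']}"

# load_dataset() can then dispatch on the schema's task field:
loader_cls = TASK_REGISTRY["ASR"]
result = loader_cls().load({"dataset_id": "demo"})
```

With this shape, adding support for a new task really is just writing one class and one decorator line; no dispatch code elsewhere needs to change.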

Want to contribute?

We'd love your help to make datacollective better for everyone:

I want to use dataset <X,Y,Z> but it's not supported by load_dataset() yet!

  → Open an Issue in our repository to request it. We'll try to add support as quickly as possible!

I have uploaded my dataset at MDC and want to help the community use it by adding support for the load_dataset() function. What can I do?

  → Consider opening a Pull Request with the schema.yaml for your dataset. We have step-by-step documentation to guide you through the process.

I have questions or ideas I want to share with you!

  → Feel free to open an issue in our repository if it's something technical. Or you can find us on social media at the bottom of our website. Feedback from our community is incredibly important to us!

If you found this post useful, don't forget to check out: