Using the MDC Python SDK Library to Download Datasets

In this guide, you will learn how to use the MDC Python SDK Library to download datasets from the Mozilla Data Collective website.

Recorded by Kostis Saitas - Zarkias, AI & Data Engineer at Mozilla Data Collective

Prerequisites

  1. Create an account on Mozilla Data Collective and verify your email address

Project Setup

  1. In your profile, create an API credential in /profile/credentials. Ensure that you copy the secret key, as you will not be able to view it again once you close the credential creation window.
  2. Save your API key in your project's .env file as an environment variable.
  3. Install the latest version of the Mozilla Data Collective Python SDK Library. We recommend using a virtual environment:
uv venv .myenv
source .myenv/bin/activate
uv pip install datacollective
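
Before calling the SDK, it can help to confirm that the key from step 2 is actually present in your .env file. The following is a minimal sketch using only the standard library. The variable name MDC_API_KEY is an assumption (the guide does not name it; check the SDK documentation for the name it expects), and read_env_file and has_api_key are hypothetical helpers, not part of the datacollective package.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without an "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_api_key(path=".env", name="MDC_API_KEY"):
    """Return True if the named variable is present and non-empty.

    NOTE: "MDC_API_KEY" is an assumed name used for illustration only.
    """
    return bool(read_env_file(path).get(name))
```

Calling has_api_key() from your project root returns True or False without printing the secret itself, which keeps the key out of logs and terminal history.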

Using the package in your project

In this example, we prepare a dataset for fine-tuning a speech-to-text model by downloading and extracting a Common Voice dataset, then loading it into a pandas DataFrame, using the following code:

from datacollective import load_dataset
dataframe = load_dataset("<YOUR_DATASET_ID_HERE>", download_directory="data")

You will need to replace <YOUR_DATASET_ID_HERE> with the ID or slug of the dataset you want to download. Before you can download a dataset, you must agree to its terms and conditions on the Mozilla Data Collective website.

💡
The interface for agreeing to dataset terms and conditions lets each downloader carefully review the terms set by the dataset provider and confirm that their use case aligns with the intended use of the data.

You can verify that the dataset has been downloaded correctly by printing the first few rows of the DataFrame.

print(dataframe.head(5))

Saving Datasets to Disk

If you want to download a dataset and store it on disk without using it in a specific project, you can do so with save_dataset_to_disk("<YOUR_DATASET_ID_HERE>").

💡
Downloads are automatically resumable, which can be helpful when downloading large datasets.
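
After a download finishes, you may want to check what actually landed on disk. The sketch below lists every file under a directory with its size, using only the standard library. It assumes the files were written to a directory named data, as in the download_directory example earlier. The guide does not state where save_dataset_to_disk writes by default, so adjust the path accordingly. summarize_directory is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path


def summarize_directory(directory="data"):
    """Return a sorted list of (relative_path, size_in_bytes) for each file."""
    root = Path(directory)
    if not root.is_dir():
        # Nothing downloaded yet, or the path is wrong.
        return []
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # "data" is an assumed location; change it to match your setup.
    for name, size in summarize_directory("data"):
        print(f"{size:>12,d}  {name}")
```

An empty result usually means the directory path is wrong or the download has not completed.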

Getting Dataset Details

You can get the metadata associated with a given dataset using get_dataset_details("<YOUR_DATASET_ID_HERE>"), which lets you review important details about a dataset before downloading it.