Using the MDC Python SDK Library to Download Datasets
In this guide, you will learn how to use the MDC Python SDK Library to download datasets from the Mozilla Data Collective website.
Recorded by Kostis Saitas - Zarkias, AI & Data Engineer at Mozilla Data Collective
Prerequisites
- Create an account on Mozilla Data Collective and verify your email address
Project Setup
- In your profile, create an API credential in /profile/credentials. Ensure that you copy the secret key, as you will not be able to view it again once you close the credential creation window.
- Save your API key in your project .env file as an environment variable
- Install the latest version of the Mozilla Data Collective Python SDK Library. We recommend using a virtual environment:
uv venv .myenv
source .myenv/bin/activate
uv pip install datacollective
Using the package in your project
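Before calling the SDK, it can help to confirm that the API key saved in the .env file is visible to your project. The sketch below is a minimal, standard-library-only check. The helper names read_env_file and api_key_is_configured are hypothetical, and the variable name MDC_API_KEY is an assumption, so confirm the exact name in the SDK documentation.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def api_key_is_configured(var_name="MDC_API_KEY", env_path=".env"):
    """Return True if var_name is set in the process environment or the .env file."""
    # Assumption: "MDC_API_KEY" is the name the SDK reads; check the SDK docs.
    if os.environ.get(var_name):
        return True
    return bool(read_env_file(env_path).get(var_name))
```

Calling api_key_is_configured() at the top of a script gives an early, readable failure point instead of an authentication error partway through a download.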
In this example, we prepare a dataset for fine-tuning a speech-to-text model by downloading and extracting a Common Voice dataset and loading it into a pandas DataFrame using the following code:
from datacollective import load_dataset
dataframe = load_dataset("<YOUR_DATASET_ID_HERE>", download_directory="data")
You will need to replace <YOUR_DATASET_ID_HERE> with the dataset ID or slug of the dataset you want to download. Before you can download it, you will need to agree to the terms and conditions for that dataset on the Mozilla Data Collective website.
You can verify that the dataset has been downloaded correctly by printing the first few rows of the dataframe.
print(dataframe.head(5))
Saving Datasets to Disk
If you want to download a dataset and store it on disk without using it in a specific project, you can do so with save_dataset_to_disk("<YOUR_DATASET_ID_HERE>").
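As a rough sketch, the call can be wrapped so that a forgotten placeholder is caught before any download starts. The wrapper save_dataset and its downloader parameter are hypothetical names introduced for illustration. Only the single-argument form of save_dataset_to_disk shown in this guide is used, and importing it from the top-level datacollective package is assumed to work the same way as load_dataset.

```python
def save_dataset(dataset_id, downloader=None):
    """Download a dataset to disk, rejecting an unreplaced placeholder ID."""
    if not dataset_id or dataset_id.startswith("<"):
        raise ValueError("Replace the placeholder with a real dataset ID or slug.")
    if downloader is None:
        # Imported lazily so the helper can be exercised without the SDK installed.
        # Assumption: save_dataset_to_disk is importable from the package root.
        from datacollective import save_dataset_to_disk
        downloader = save_dataset_to_disk
    # Pass through whatever the download function returns.
    return downloader(dataset_id)
```

In normal use you would call save_dataset("your-dataset-id") and let it fall through to the SDK. The downloader argument exists only so the wrapper can be tried out with a stand-in function.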
Getting Dataset Details
You can get the metadata associated with a given dataset using get_dataset_details("<YOUR_DATASET_ID_HERE>"), which can show important details about a dataset before you download it.
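One way to inspect the metadata is to print it as aligned key and value lines. The sketch below assumes that get_dataset_details returns a dict-like mapping, which this guide does not confirm. The helpers format_details and show_dataset_details are hypothetical names, and no particular metadata fields are assumed.

```python
def format_details(details):
    """Render a metadata mapping as aligned 'key : value' lines."""
    if not details:
        return "(no metadata returned)"
    width = max(len(str(key)) for key in details)
    return "\n".join(
        f"{str(key).ljust(width)} : {value}" for key, value in details.items()
    )


def show_dataset_details(dataset_id, fetch_details=None):
    """Fetch metadata for dataset_id and return it as printable text."""
    if fetch_details is None:
        # Imported lazily so the formatter can be tried without the SDK installed.
        # Assumption: get_dataset_details is importable from the package root
        # and returns a mapping.
        from datacollective import get_dataset_details
        fetch_details = get_dataset_details
    return format_details(fetch_details(dataset_id))
```

Printing show_dataset_details("your-dataset-id") before a download is a cheap way to check that the ID points at the dataset you expect.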