Datasheets: The Missing Manual for your Dataset

In this video, produced by the Data Nutrition Project and illustrated by Jessica Yurkofsky, you'll learn more about the role of the datasheet and how you can use it to give clear guidance to potential downloaders about how your data can (and can't!) be used.


When you upload your dataset to Mozilla Data Collective, creating a comprehensive datasheet is a critical part of the process. But what exactly is a datasheet? And how should you think about filling it out?

A comprehensive datasheet makes your dataset more valuable. It puts the data you are sharing in context, so that downloaders understand how it is intended to be used.

The data card is the first preview that a downloader will have of your dataset. By being specific and clear in these fields, you can help people discover your dataset more easily. Once a potential downloader has found their way to your dataset, your datasheet will help them understand whether it's a good fit for their particular use case.

The datasheet for the Jember Javanese Spontaneous Speech Corpus on Mozilla Data Collective

While only some fields in the dataset onboarding form are required, a thorough datasheet is a concrete tool to help individuals and communities share and govern their data.
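As a rough illustration of the required-versus-optional distinction, the sketch below checks a draft datasheet for empty fields. The field names (`title`, `license`, `intended_uses`, and so on) and the `datasheet_gaps` helper are hypothetical, chosen only for this example; they are not the actual fields of the Mozilla Data Collective onboarding form.

```python
# Hypothetical field names for illustration only; the real onboarding
# form may use different fields and different required/optional rules.
REQUIRED = ("title", "description", "license")
OPTIONAL = (
    "intended_uses",
    "out_of_scope_uses",
    "collection_method",
    "known_limitations",
)


def datasheet_gaps(datasheet: dict) -> dict:
    """Report which required and optional fields are missing or blank."""

    def is_blank(key: str) -> bool:
        value = datasheet.get(key)
        return value is None or not str(value).strip()

    return {
        "missing_required": [k for k in REQUIRED if is_blank(k)],
        "missing_optional": [k for k in OPTIONAL if is_blank(k)],
    }


# A draft with an empty license and only one optional field filled in.
draft = {
    "title": "Example Speech Corpus",
    "description": "Spontaneous speech recordings.",
    "license": "",
    "intended_uses": "ASR research",
}

gaps = datasheet_gaps(draft)
print(gaps)
```

A check like this only flags empty fields. Whether what you write is specific enough for a downloader to judge fit is still a matter for human review.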

Additional Resources

Turning Your Data Into a Valuable ML Resource Without Giving Up Control - This guide introduces the principles behind ethical data sharing, ownership, and documentation for community dataset creators

Datasheets for Datasets (Microsoft Research) - Accessible overview of the datasheet concept plus downloadable templates

Data Statements (UW Tech Policy Lab) - Practical framework specifically designed for language and speech datasets, including schema elements covering speaker demographics, annotator demographics, recording quality, and more.

Augmented Datasheets for Speech Datasets (Sony, GitHub Repo) - Sony Research's companion repo to the FAccT 2023 paper, containing the augmented speech datasheet template and completed example datasheets for five datasets including LibriSpeech and Common Voice

Datasheets for Datasets Gebru et al. (2021) - The original paper proposing the datasheet framework, published in Communications of the ACM. Essential reading for understanding the motivation, design decisions, and scope of the datasheet standard

Data Statements for NLP Bender & Friedman (2018) - Proposes a complementary framework specifically for NLP/speech datasets, including speaker demographics, language variety (BCP-47 tags), recording quality, and speech situation; directly applicable to ASR and Common Voice datasets

The Dataset Nutrition Label (2nd gen) Chmielinski et al. (2022) - Offers an overview of the 2020 version of the Dataset Nutrition Label

The Data Nutrition Project is empowering data practitioners and policymakers with tools to improve AI outcomes.

Learn more about the Data Nutrition Project

Augmented Datasheets for Speech Datasets and Ethical Decision-Making Papakyriakopoulos et al. (2023) - Extends Gebru et al.'s framework with speech-specific questions covering language diversity, accent, dialect, speech impairment, data subject protection, and speaker compensation. Grounded in a large literature review of SLT datasets. Published at ACM FAccT 2023. The most directly relevant paper for documenting ASR datasets such as Common Voice

Data Statements: From Technical Concept to Community Practice McMillan-Major, Bender & Friedman (2024) - Empirical follow-up on how practitioners actually use data statements, with refined schema (v3) and community-developed best practices

Navigating Dataset Cards on Hugging Face (Research Analysis, 2024) - Large-scale empirical analysis of 7,400+ dataset cards on the Hub; reveals which documentation fields are most and least often completed and what high-quality cards look like in practice


Animation Credits

Music by: Ricky Valadez
Written by: Sarah Newman & Jessica Yurkofsky
Illustrated by: Jessica Yurkofsky
Read by: Sarah Newman
Sound editing by: Halsey Burgund
Special thanks to Liv Erickson, Katherine Reid, and Francis Tyers
Produced by the Data Nutrition Project in collaboration with Mozilla Data Collective