data

What makes a good dataset sample — and how to create one

In this post, we walk you through how to create a useful dataset sample as a preview of your dataset, and guide you in uploading it to the MDC platform.

A high quality sample dataset helps dataset consumers better understand what large datasets are comprised of - before committing to a heavy download.

Ever downloaded a huge dataset? Watched a Download meter for hours as tens of gigabytes trickle down the internet pipes to your hard drive? Only to then import it into pandas and discover it wasn't quite what you needed?

We share your frustration! That's why providing a high-quality dataset sample is so important, particularly for large datasets. A sample is the "try before you buy" of the data world — a small, representative slice of a larger dataset that lets a potential user assess its suitability before committing to a full download.

In this post, we cover two key pieces: what good sampling practice looks like (and how to do it in code), and how to upload a sample file to accompany your dataset on the Mozilla Data Collective platform.

What makes a good dataset sample?

The case for sampling

Providing a sample alongside a dataset is increasingly considered standard practice in responsible data governance. ISO/IEC 42001:2023, the international standard for AI management systems, emphasises transparency and accessibility in data documentation as a core requirement for trustworthy AI systems. Good sampling practice is part of that picture: it allows downstream users to make informed decisions about whether a dataset meets their needs, before they invest time or compute in a full download.

The FAIR data principles — Findable, Accessible, Interoperable, Reusable — similarly treat dataset transparency as foundational. A sample makes a dataset more assessable, which in turn makes it more reusable. From a practical governance perspective, a sample also gives reviewers, auditors, and ethics boards something concrete to inspect without requiring access to the full dataset.
For large datasets, the download cost alone can be a barrier to evaluation. A well-designed sample removes that barrier and increases the likelihood that your dataset reaches the people who could benefit from it.

Types of sampling and their trade-offs

Not all samples are created equal. The choice of sampling method shapes what a potential user can and cannot infer about the full dataset. The table below summarises the most common approaches.

Sampling method	What it does	Pros	Cons
Random sampling	Selects rows uniformly at random, without regard for any column values	Simple to implement; no assumptions about dataset structure required	May under-represent rare categories (e.g. minority accents, low-frequency labels); sample composition is non-deterministic
Stratified sampling	Samples proportionally from within defined subgroups (strata), such as gender, locale, or age	Preserves the distribution of important categorical variables; more representative of the full dataset	Requires knowing which categories are important before sampling; can be complex if many strata intersect
Systematic sampling	Selects every nth row	Simple and deterministic; useful for ordered datasets	Can introduce bias if the data has periodic structure (e.g. repeated patterns every n rows)
Purposive / judgement sampling	Manually selects rows to illustrate specific properties	Useful for demonstrating edge cases or data quality characteristics	Not statistically representative; may mislead users about dataset composition
Cluster sampling	Randomly selects groups (clusters) and includes all rows from those groups	Efficient for geographically or structurally clustered data	Within-cluster similarity can reduce sample diversity

For most AI and machine learning dataset use cases, stratified sampling is the recommended approach. It ensures that categorical variables that matter for model training — such as speaker demographics, accent, or label distribution — appear in the sample in proportions that reflect the full dataset, rather than being left to chance.

Creating a representative sample with pandas

The examples below assume you are working with a Common Voice-style dataset in .tsv format, with columns including client_id, sentence_id, sentence, age, gender, accents, and locale. The goal is to produce a sample that is representative across the demographic dimensions that matter most for downstream model training and evaluation.

Step 1: Load the dataset

import pandas as pd
df = pd.read_csv('cv-corpus-en.tsv', sep='\t')

# Preview the shape and column names
print(df.shape)print(df.columns.tolist())

Step 2: Inspect the categorical distributions you care about

Before sampling, understand what you have. For a speech dataset, gender and accents are the categories most likely to be unevenly distributed.

print(df['gender'].value_counts(dropna=False))
print(df['accents'].value_counts(dropna=False))

You will almost certainly find that some categories are sparse. The accents column in Common Voice datasets, for example, typically has a long tail: a few accent categories contain thousands of clips, while many contain fewer than a hundred. This is not a problem to hide — it is important information for a potential user, and a good sample should reflect it honestly.

Step 3: Stratified sample across gender and accent

# Drop rows where both gender and accents are null,
# since these cannot be assigned to a stratum
df_stratifiable = df.dropna(subset=['gender', 'accents'], how='all')

# Fill remaining nulls with a placeholder so they form their own stratum
df_stratifiable = df_stratifiable.fillna({
    'gender': 'not_specified',
    'accents': 'not_specified'
})

# Sample proportionally: ~1000 rows, or 1% of the dataset, whichever is smaller
sample_size = min(1000, max(100, int(len(df_stratifiable) * 0.01)))

# groupby + apply with a lambda handles unequal stratum sizes gracefully
sample = (
    df_stratifiable
    .groupby(['gender', 'accents'], group_keys=False)
    .apply(lambda x: x.sample(
        n=min(len(x), max(1, int(sample_size * len(x) / len(df_stratifiable)))),
        random_state=42
    ))
    .reset_index(drop=True)
)

print(f"Sample size: {len(sample)} rows")
print(sample[['gender', 'accents']].value_counts())

Setting `random_state=42` (or any fixed integer) makes the sample reproducible — the same code will produce the same sample each time. This is important for auditability.

Step 4: Include rows with missing demographic data

A sample that silently drops rows with null `gender` or `accents` values may mislead users about how complete the demographic metadata actually is. Consider explicitly including a small number of such rows:

df_no_demo = df[df['gender'].isna() & df['accents'].isna()]
null_sample = df_no_demo.sample(n=min(20, len(df_no_demo)), random_state=42)

sample = pd.concat([sample, null_sample]).reset_index(drop=True)

Step 5: Save the sample

sample.to_csv('cv-corpus-en-sample.tsv', sep='\t', index=False)

Sample datasets are a reflection of the integrity of your data

A sample dataset is a claim about the larger dataset it represents.

If your sample over-represents majority categories — because you used random sampling on an imbalanced dataset — users may form incorrect expectations about how useful the dataset is for training models on minority groups. If your sample under-represents them, users may incorrectly conclude the dataset lacks diversity it actually has.

The most honest approach is stratified sampling with clear documentation of what strata were used, including a note on categories with very few examples. If a particular accent or demographic group has fewer than 10 clips in the full dataset, say so in your dataset's Datasheet.

How to upload a sample dataset on the MDC platform

Once you have exported your sample dataset, compress it as a .tar.gz archive. If you're not sure how to do this, our dataset uploading instructions provide more information.

Next, create a new upload submission (you'll need to be approved to upload datasets first). When creating the upload submission, you'll have the opportunity to upload a sample dataset, as shown below.

Where to upload a sample dataset when creating a dataset submission

After your dataset submission has been submitted and approved, your sample dataset will be available for download from your dataset's page, as shown below:

You can also add a dataset sample to existing datasets by viewing your uploads, selecting one, then editing it.

What makes a good dataset sample — and how to create one

What makes a good dataset sample?

The case for sampling

Types of sampling and their trade-offs

Creating a representative sample with pandas

Step 1: Load the dataset

Step 2: Inspect the categorical distributions you care about

Step 3: Stratified sample across gender and accent

Step 4: Include rows with missing demographic data

Step 5: Save the sample

Sample datasets are a reflection of the integrity of your data

How to upload a sample dataset on the MDC platform

Further reading

Read more

Mozilla Data Collective datasets now discoverable through CLARIN’s Virtual Language Observatory

Get a Sneak Preview of Mozilla Data Collective’s Compensation Feature!

15 Datasets for Fine-Tuning Whisper on a New Language in 2026

How Radio Free Europe/Radio Liberty Also Serves Its Communities Through Its Datasets