We’re Changing Access to Older Versions of Common Voice Datasets
We’re tightening the circulation of old datasets to protect contributors, while keeping a clear, documented path for researchers who need them.
Common Voice has grown to more than 33,000 hours of speech across 137 languages, contributed by hundreds of thousands of volunteers worldwide. Each person who donates their voice to Common Voice is helping technology better reflect how real communities speak.
With those contributions comes a responsibility to do two things well: respect contributors’ rights, including when they ask us to delete their data, and support researchers who need past versions to reproduce results from earlier work.
At times, those goals pull in different directions, creating tension for us in how we store the data we collect.
The right to be forgotten
Roughly every three months, we publish a new Common Voice dataset version (e.g., 23.0 from September 2025) containing all the voice clips and their matching text for each language.
Sometimes, contributors ask us to remove their data for privacy or other reasons. We remove it from the database and leave it out of future dataset versions. We feel strongly that this is the right thing to do.
However, older releases may still contain those contributions, and researchers sometimes need those releases to replicate past results (e.g., a paper that used the 17.0 dataset). Getting the same results from the same data is how they confirm that their implementation is correct.
What’s changing
To better balance privacy and reproducibility, we’re updating access to old datasets:
- Everyone who downloads a dataset will need to use a validated email address.
- You’ll need to agree to basic terms when downloading datasets: please don’t share the files further.
- Current dataset releases will remain self-serve.
- For older dataset versions, you’ll need to get in touch to tell us which version you need and why (for example, to reproduce a paper). We will help you get the right version for your use case.
What it means for the community
We’ll keep honoring deletion requests in future releases and reduce the casual spread of older datasets.
Researchers can still get historical snapshots. The short new ‘request’ step adds accountability, aligns with contributor privacy, fits today’s risk landscape (voice cloning now needs much less data than before), and creates a single point of access for finding the right version with the right terms.
How to request an older dataset version
For the moment, to request an old dataset, just send us an email providing:
- The version number you need
- A short reason (e.g., replicating a paper that used v17)
- Confirmation that you consent to the no‑reshare and research‑only terms
We’ll reply with the next steps.
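If you send requests for several versions, or on behalf of a lab, it may help to draft them consistently. The sketch below assembles the three items listed above into an email using Python’s standard library. It is only an illustration: the function name, subject line, and field labels are assumptions rather than a required format, and the recipient address is a placeholder to replace with the actual Common Voice contact address.

```python
from email.message import EmailMessage


def build_request_email(sender, to_address, version, reason):
    """Draft a request for an older dataset version.

    Hypothetical helper: the layout below is an assumption, not an
    official template. It only covers the three items the post asks for.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to_address  # placeholder; use the real contact address
    msg["Subject"] = f"Request for Common Voice dataset version {version}"
    msg.set_content(
        f"Version needed: {version}\n"
        f"Reason: {reason}\n"
        "I consent to the no-reshare and research-only terms.\n"
    )
    return msg


if __name__ == "__main__":
    draft = build_request_email(
        sender="researcher@example.org",
        to_address="REPLACE-WITH-CONTACT-ADDRESS@example.org",
        version="17.0",
        reason="Replicating a paper that used v17",
    )
    # Print the draft so it can be reviewed or pasted into a mail client.
    print(draft)
```

The script only builds and prints the draft; sending it is left to your usual mail client.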
Get in touch
We’re happy to take feedback and hear how these changes affect your workflow. If centralized access makes your work harder, get in touch and tell us how, along with the context of your setup (classroom, workshop, lab, solo research), your constraints, and the versions you rely on.
This kind of input will shape how we design and develop the platform, so we can keep stewarding the datasets responsibly while continuing to make them easier to access.