15 Datasets for Building a Production TTS Voice in 2026

A curated list of 15 text-to-speech training datasets for teams shipping production voice models in 2026, covering emotional, multi-speaker, audiobook-derived, non-Latin-script, and indigenous-language data, among others.

Why dataset selection matters more than scale

There was a time when the entire conversation about text-to-speech could be reduced to a single question: how many hours of clean single-speaker audio do you have? The standard answer was twenty-four hours of LJSpeech-style read audio, and from that you got a model that sounded acceptable but flat.

That era is over. The teams shipping production TTS have now moved past "more hours of one speaker." They train on deliberate mixtures: emotional speech for prosody, multi-speaker corpora for speaker generalisation, audiobook-derived data for narrative style, dialect-specific data for regional authenticity, non-Latin script data for under-represented writing systems, and indigenous-language data for cultural inclusion. The model's quality is the mixture's quality, and the mixture's quality is a function of which specific datasets you put into it.
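One way to picture a "deliberate mixture" is as a set of sampling probabilities over data sources. The sketch below is only an illustration, not a method described in this article: it assumes a temperature-smoothed weighting (probability proportional to size raised to 1/temperature), a common way to keep small sources from being drowned out by large ones. The function names, the source labels, and the hour counts in the usage note are all hypothetical.

```python
import random


def mixture_weights(sizes, temperature=2.0):
    """Return a sampling probability per source.

    Each probability is proportional to size ** (1 / temperature).
    temperature=1.0 gives plain size-proportional sampling; higher
    values flatten the distribution in favour of small sources.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {name: size ** (1.0 / temperature) for name, size in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_sources(weights, n, seed=0):
    """Draw n source names according to the given probabilities."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)
```

For example, with hypothetical sizes of 20 hours of baseline audio and 2 hours of emotional speech, `mixture_weights({"baseline": 20.0, "emotional": 2.0})` gives the emotional source a larger share than the 1-in-11 it would get under size-proportional sampling. The right temperature is an empirical choice and would need to be tuned per project.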

This article lists fifteen datasets from Mozilla Data Collective that we think belong in a serious TTS training stack in 2026. Some are large and provide an acoustic baseline. Others are small and provide a specific signal (an emotion, a script, a voice profile) that nothing else covers. All of them are downloadable today, and all of them have clear licensing.

The 15 datasets

Thorsten-Voice Dataset 2021.06 Emotional | Contributor: Community | Licence: CC0-1.0 | Size: 380.80 MB | Task: TTS | Format: WAV, CSV
Urdu Multi-Speaker TTS Dataset | Contributor: Community | Licence: CC-BY-NC-4.0 | Size: 514.54 MB | Task: TTS | Format: WEBM, TSV
LibriVox Italian TTS Female Voice | Contributor: MDC Curators | Licence: CC0-1.0 | Size: 61.74 MB | Task: TTS | Format: MP3, TSV
LibriVox Czech TTS Female Voice | Contributor: MDC Curators | Licence: CC0-1.0 | Size: 178.58 MB | Task: TTS | Format: MP3, TXT, TSV
Kokoro Speech Dataset | Contributor: Community | Licence: LibriVox Public Domain | Size: 3.98 GB | Task: TTS | Format: FLAC
Yoruba-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 319.05 MB | Task: TTS | Format: MP3, TSV
Hausa-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 276.90 MB | Task: TTS | Format: MP3, TSV
isiXhosa-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 276.02 MB | Task: TTS | Format: MP3, TSV
Tiv-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 311.58 MB | Task: TTS | Format: MP3, TSV
Duala-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 141.26 MB | Task: TTS | Format: MP3, TSV
Bamun-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 219.97 MB | Task: TTS | Format: MP3, TSV
Saraiki 10 Hours TTS Dataset | Contributor: MirasAI | Licence: CC-BY-NC-SA-4.0 | Size: 584.44 MB | Task: TTS | Format: WEBM, TSV
Chuvash TTS | Contributor: Taruen | Licence: CC-BY-SA-4.0 | Size: 854.02 MB | Task: TTS | Format: PARQUET
Otomí (Hñähñu) TTS Voz Masculina | Contributor: Community | Licence: CC-BY-SA-4.0 | Size: 119.54 MB | Task: TTS | Format: MP3, TXT, TSV
Central Kurdish TTS dataset 1.0 | Contributor: The University of Melbourne | Licence: CC-BY-4.0 | Size: 293.45 MB | Task: TTS | Format: WAV
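
Because the licences in the list above vary, a team might first group the datasets by licence before deciding what goes into a mix. The following is a minimal sketch of that triage. The names and licence strings are copied from the list, while the helper functions and the rule "a licence identifier with an NC component is non-commercial" are our own illustrative assumptions. It checks identifiers only: licences such as NOODL-1.0 carry their own terms, and nothing here replaces reading the licence text.

```python
from collections import Counter

# (dataset name, licence identifier), copied from the list above.
DATASETS = [
    ("Thorsten-Voice Dataset 2021.06 Emotional", "CC0-1.0"),
    ("Urdu Multi-Speaker TTS Dataset", "CC-BY-NC-4.0"),
    ("LibriVox Italian TTS Female Voice", "CC0-1.0"),
    ("LibriVox Czech TTS Female Voice", "CC0-1.0"),
    ("Kokoro Speech Dataset", "LibriVox Public Domain"),
    ("Yoruba-TTS-Dataset", "NOODL-1.0"),
    ("Hausa-TTS-Dataset", "NOODL-1.0"),
    ("isiXhosa-TTS-Dataset", "NOODL-1.0"),
    ("Tiv-TTS-Dataset", "NOODL-1.0"),
    ("Duala-TTS-Dataset", "NOODL-1.0"),
    ("Bamun-TTS-Dataset", "NOODL-1.0"),
    ("Saraiki 10 Hours TTS Dataset", "CC-BY-NC-SA-4.0"),
    ("Chuvash TTS", "CC-BY-SA-4.0"),
    ("Otomí (Hñähñu) TTS Voz Masculina", "CC-BY-SA-4.0"),
    ("Central Kurdish TTS dataset 1.0", "CC-BY-4.0"),
]


def is_non_commercial(licence):
    """True when the identifier has an NC (NonCommercial) component.

    Assumption: this only inspects the identifier string; it does not
    interpret the terms of any licence.
    """
    return "NC" in licence.split("-")


def licence_counts(datasets):
    """Count how many datasets carry each licence identifier."""
    return Counter(licence for _, licence in datasets)


def flagged_non_commercial(datasets):
    """Names of datasets whose licence identifier is marked NC."""
    return [name for name, licence in datasets if is_non_commercial(licence)]
```

Calling `flagged_non_commercial(DATASETS)` returns the entries to review before any commercial use, and `licence_counts(DATASETS)` shows how the fifteen entries split across licences.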

Not Scale but Curation

Production TTS in 2026 is no longer a scale problem but a curation problem. The fifteen datasets above won't train a model on their own, but they will let you build a deliberately diverse training mix that covers prosody, speaker variation, scripts, dialects, and the under-represented languages most commercial TTS still ignores. Mozilla Data Collective exists precisely to provide a platform for that diversity and to unlock multicultural, multilingual datasets from around the world.

