15 Datasets for Building a Production TTS Voice in 2026

A curated list of 15 text-to-speech training datasets for teams shipping production voice models in 2026, covering emotional, multi-speaker, audiobook-derived, non-Latin-script, and indigenous-language data, among others.

Why dataset selection matters more than scale

There was a time when the entire conversation about text-to-speech could be reduced to a single question: how many hours of clean single-speaker audio do you have? The standard answer was twenty-four hours of LJSpeech-style read audio, and from that you got a model that sounded acceptable but flat.

That era is over. The teams shipping production TTS today have moved past "more hours of one speaker." They train on deliberate mixtures: emotional speech for prosody, multi-speaker corpora for speaker generalisation, audiobook-derived data for narrative style, dialect-specific data for regional authenticity, non-Latin-script data for under-represented writing systems, and indigenous-language data for cultural inclusion. The model's quality is the mixture's quality, and the mixture's quality depends on which specific datasets you put into it.
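One way to make "deliberate mixture" concrete is to decide how often each corpus is sampled during training, rather than letting the largest corpus dominate. The sketch below uses temperature-based sampling, a common technique in multilingual training, not something this article prescribes. The corpus names and hour counts are hypothetical placeholders, and `mixture_weights` and `sample_sources` are illustrative helpers rather than part of any library.

```python
import random


def mixture_weights(sizes, temperature=2.0):
    """Return sampling probabilities proportional to size ** (1 / temperature).

    temperature=1.0 samples in proportion to corpus size; higher values
    flatten the distribution so small corpora are seen more often.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {name: size ** (1.0 / temperature) for name, size in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_sources(weights, k, seed=0):
    """Draw k corpus names according to the given probabilities."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=k)


# Hypothetical hours per corpus, chosen only for illustration.
hours = {"baseline_read": 100.0, "emotional": 4.0, "indigenous": 1.0}

proportional = mixture_weights(hours, temperature=1.0)
flattened = mixture_weights(hours, temperature=2.0)
batch_sources = sample_sources(flattened, k=8)

for name in hours:
    print(f"{name}: {proportional[name]:.3f} -> {flattened[name]:.3f}")
```

With these placeholder numbers, raising the temperature from 1.0 to 2.0 lifts the share of the two small corpora at the expense of the baseline. Where to set it is an empirical choice that depends on how much each small signal matters to the target voice.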

This article lists fifteen datasets from Mozilla Data Collective that we think belong in a serious TTS training stack in 2026. Some are large and provide an acoustic baseline. Some are small and provide a specific signal (an emotion, a script, a voice profile) that nothing else covers. All of them are downloadable today, and all of them have clear licensing.

The 15 datasets

  • Thorsten-Voice Dataset 2021.06 Emotional | Contributor: Community | Licence: CC0-1.0 | Size: 380.80 MB | Task: TTS | Format: WAV, CSV
  • Urdu Multi-Speaker TTS Dataset | Contributor: Community | Licence: CC-BY-NC-4.0 | Size: 514.54 MB | Task: TTS | Format: WEBM, TSV
  • LibriVox Italian TTS Female Voice | Contributor: MDC Curators | Licence: CC0-1.0 | Size: 61.74 MB | Task: TTS | Format: MP3, TSV
  • LibriVox Czech TTS Female Voice | Contributor: MDC Curators | Licence: CC0-1.0 | Size: 178.58 MB | Task: TTS | Format: MP3, TXT, TSV
  • Kokoro Speech Dataset | Contributor: Community | Licence: LibriVox Public Domain | Size: 3.98 GB | Task: TTS | Format: FLAC
  • Yoruba-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 319.05 MB | Task: TTS | Format: MP3, TSV
  • Hausa-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 276.90 MB | Task: TTS | Format: MP3, TSV
  • isiXhosa-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 276.02 MB | Task: TTS | Format: MP3, TSV
  • Tiv-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 311.58 MB | Task: TTS | Format: MP3, TSV
  • Duala-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 141.26 MB | Task: TTS | Format: MP3, TSV
  • Bamun-TTS-Dataset | Contributor: Institute of African Digital Humanities | Licence: NOODL-1.0 | Size: 219.97 MB | Task: TTS | Format: MP3, TSV
  • Saraiki 10 Hours TTS Dataset | Contributor: MirasAI | Licence: CC-BY-NC-SA-4.0 | Size: 584.44 MB | Task: TTS | Format: WEBM, TSV
  • Chuvash TTS | Contributor: Taruen | Licence: CC-BY-SA-4.0 | Size: 854.02 MB | Task: TTS | Format: PARQUET
  • Otomí (Hñähñu) TTS Voz Masculina | Contributor: Community | Licence: CC-BY-SA-4.0 | Size: 119.54 MB | Task: TTS | Format: MP3, TXT, TSV
  • Central Kurdish TTS dataset 1.0 | Contributor: The University of Melbourne | Licence: CC-BY-4.0 | Size: 293.45 MB | Task: TTS | Format: WAV
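Because the licences in this list differ, a production team will probably want to sort the datasets by licence before mixing them. The sketch below is one possible first pass. The names and licence strings are copied from the list above, but the three buckets and the `triage` rules are our own assumptions, not an official classification. In particular, NOODL-1.0 is deliberately routed to manual review rather than guessed at, and attribution or share-alike obligations still apply within the "commercial_ok" bucket.

```python
from collections import defaultdict

# (name, licence) pairs copied from the list above.
CATALOGUE = [
    ("Thorsten-Voice Dataset 2021.06 Emotional", "CC0-1.0"),
    ("Urdu Multi-Speaker TTS Dataset", "CC-BY-NC-4.0"),
    ("LibriVox Italian TTS Female Voice", "CC0-1.0"),
    ("LibriVox Czech TTS Female Voice", "CC0-1.0"),
    ("Kokoro Speech Dataset", "LibriVox Public Domain"),
    ("Yoruba-TTS-Dataset", "NOODL-1.0"),
    ("Hausa-TTS-Dataset", "NOODL-1.0"),
    ("isiXhosa-TTS-Dataset", "NOODL-1.0"),
    ("Tiv-TTS-Dataset", "NOODL-1.0"),
    ("Duala-TTS-Dataset", "NOODL-1.0"),
    ("Bamun-TTS-Dataset", "NOODL-1.0"),
    ("Saraiki 10 Hours TTS Dataset", "CC-BY-NC-SA-4.0"),
    ("Chuvash TTS", "CC-BY-SA-4.0"),
    ("Otomí (Hñähñu) TTS Voz Masculina", "CC-BY-SA-4.0"),
    ("Central Kurdish TTS dataset 1.0", "CC-BY-4.0"),
]

# Assumption: licences we treat as allowing commercial use, subject to
# any attribution or share-alike terms they carry.
COMMERCIAL_OK = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "LibriVox Public Domain"}


def triage(licence):
    """Place a licence string into one of three assumed buckets."""
    if "-NC" in licence:
        return "non_commercial"
    if licence in COMMERCIAL_OK:
        return "commercial_ok"
    # Anything not recognised above goes to a human for review.
    return "needs_review"


def group_by_bucket(catalogue):
    """Map each bucket name to the dataset names that fall into it."""
    buckets = defaultdict(list)
    for name, licence in catalogue:
        buckets[triage(licence)].append(name)
    return dict(buckets)


buckets = group_by_bucket(CATALOGUE)
for bucket, names in sorted(buckets.items()):
    print(f"{bucket}: {len(names)} dataset(s)")
```

A triage like this only narrows the reading list; the licence text published with each dataset remains the authority on what a given use requires.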

Not scale but curation

Production TTS in 2026 is no longer a scale problem but a curation problem. The fifteen datasets above won't train a model on their own, but they will let you build a deliberately diverse training mix that covers prosody, speaker variation, scripts, dialects, and the under-represented languages most commercial TTS still ignores. Mozilla Data Collective exists to provide a platform for that diversity and to make multicultural, multilingual datasets from around the world available.
