15 Datasets for Fine-Tuning Whisper on a New Language in 2026

Why fine-tuning Whisper is a dataset problem

OpenAI's Whisper changed what's possible in speech recognition: a single multilingual model with strong zero-shot performance across dozens of languages. But "dozens" is the catch. For the long tail of the world's languages, including some with tens of millions of speakers, Whisper's out-of-the-box output is unusable when judged by word error rate (WER), the most widely used metric for speech recognition.
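To make the metric concrete, here is a minimal sketch of word error rate: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function name is ours, and the sketch assumes plain whitespace tokenisation with no text normalisation; real evaluations usually normalise casing and punctuation first, and some languages need a proper word segmenter.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: whitespace tokenisation is good enough for this illustration.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed 1.0 when the model produces far more words than the reference contains, which is one reason poor zero-shot output can look so bad on this metric.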

Fine-tuning fixes this, and the fine-tuning recipe is well understood by now: take a Whisper checkpoint, feed it paired audio and transcripts in your target language, and the model adapts fast. How fast can be striking: the documentation for the Khmer ASR Cultural Dataset below reports that adding even a few hundred Khmer speech-text pairs to the training mix measurably lowered Whisper Large V2's character error rate, bringing it down to roughly 8%. 

The bottleneck, then, isn't the training code but the data. You need audio that's transcribed accurately, licensed in a way that lets you use it, and ideally varied enough in speaker, dialect, and recording condition that your fine-tuned model generalises beyond the training set.
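One practical way to check that a model generalises beyond its training speakers is to hold out whole speakers rather than random clips. The sketch below is illustrative only: it assumes each clip is a dictionary carrying a `speaker_id` key, which is a hypothetical field name, and not every dataset ships speaker labels.

```python
import random
from collections import defaultdict

def speaker_disjoint_split(clips, test_fraction=0.2, seed=0):
    """Split clips so that no speaker appears in both train and test."""
    # Group clips by speaker ('speaker_id' is an assumed field name).
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip["speaker_id"]].append(clip)

    # Shuffle speakers, not clips, with a fixed seed for reproducibility.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # Hold out at least one speaker for evaluation.
    n_test = max(1, round(len(speakers) * test_fraction))
    test = [c for s in speakers[:n_test] for c in by_speaker[s]]
    train = [c for s in speakers[n_test:] for c in by_speaker[s]]
    return train, test
```

With very few speakers the held-out set takes a large share of the data, so for small corpora it may be worth adjusting `test_fraction` or reporting results over several seeds.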

This article lists 15 automatic speech recognition datasets on Mozilla Data Collective that are well suited to Whisper fine-tuning, spanning a range of language families (Dravidian, Bantu, Uralic) and regions (Mesoamerican, South Asian, European). They range from large studio-production corpora to small, phonetically precise field recordings. Several are time-aligned, which is exactly the format Whisper fine-tuning wants. Pick the one in your target language, or combine several related ones to push a whole language family across the usability threshold.

The 15 datasets

South Asia

Southeast Asia

African Region

Siberia and Northern Eurasia

  • INEL Nganasan Speech Corpus | Steward: University of Hamburg | Licence: CC-BY-NC-SA-4.0 | Size: 1.41 GB, 38.5 hours | Task: ASR | Format: TSV, MP3
  • INEL Kalmyk Speech Corpus | Steward: University of Hamburg | Licence: CC-BY-NC-SA-4.0 | Size: 138.31 MB, 3 hours | Task: ASR | Format: TSV, MP3
  • INEL Evenki Speech Corpus | Steward: University of Hamburg | Licence: CC-BY-NC-SA-4.0 | Size: 103 MB, 2.6 hours | Task: ASR | Format: TSV, MP3
  • INEL Dolgan Speech Corpus | Steward: University of Hamburg | Licence: CC-BY-NC-SA-4.0 | Size: 583.34 MB, 13 hours | Task: ASR | Format: TSV, MP3
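
Datasets in the TSV plus MP3 format listed above typically pair each audio file with its transcript through a tab-separated manifest. The sketch below shows one way to read such a manifest into (audio path, transcript) pairs. The column names `path` and `sentence` are assumptions borrowed from a common convention, not something these dataset listings confirm, so check each dataset's own documentation for the real headers.

```python
import csv
import io
from pathlib import Path

def load_manifest(tsv_file, audio_dir, path_col="path", text_col="sentence"):
    """Read a TSV manifest into (audio_path, transcript) pairs.

    tsv_file is an open text file; path_col and text_col are assumed
    column names and will likely differ between datasets.
    """
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        text = (row.get(text_col) or "").strip()
        if not text:
            continue  # skip clips that have no transcript
        pairs.append((Path(audio_dir) / row[path_col], text))
    return pairs

# Usage with an in-memory manifest standing in for a real file.
sample = "path\tsentence\nclip1.mp3\thello world\nclip2.mp3\t\n"
pairs = load_manifest(io.StringIO(sample), "clips")
```

For a real dataset you would pass `open("manifest.tsv", encoding="utf-8", newline="")` instead of the in-memory sample, where `manifest.tsv` is a placeholder file name.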

Latin America

From scraped to shared: fine-tuning Whisper on community datasets

Fine-tuning Whisper on a new language is one of the highest-leverage things you can do in speech AI right now: a few hours of well-transcribed audio can take a language from unusable to production-viable. For a step-by-step guide to fine-tuning Whisper models on Mozilla Data Collective datasets, see our tutorial. The 15 datasets above give you audio across a deliberately wide spread of language families, including Dravidian (Tamil, Kannada, Malayalam), Bantu (isiZulu), Uralic (Nganasan), Mongolic (Kalmyk), Tungusic (Evenki) and Turkic (Dolgan), and of regions, including Mesoamerica (Cuicatec) and Southeast Asia (Khmer, Javanese).

Mozilla Data Collective is rebuilding the data ecosystem by keeping communities front and centre and giving the speakers of a language, and the organisations that record and preserve it, a say in how their speech datasets get licensed, credited, and applied. For under-served languages, this is a meaningful shift: away from models trained on whatever material can be scraped and extracted from the internet, and toward systems built on datasets that communities have deliberately shared and contributed on terms they set.

Browse all Mozilla Data Collective datasets →

Get in touch →

Join Mozilla Data Collective →