
15 Datasets for Fine-Tuning Whisper on a New Language in 2026

Why fine-tuning Whisper is a dataset problem

OpenAI's Whisper changed what's possible in speech recognition: a single multilingual model with strong zero-shot performance across dozens of languages. But "dozens" is the catch. For the long tail of the world's languages, including some with tens of millions of speakers, Whisper's out-of-the-box word error rate, the most widely reported metric, is too high for the output to be usable.

Fine-tuning fixes this, and the recipe is well understood by now: take a Whisper checkpoint, feed it paired audio and transcripts in your target language, and the model adapts fast. How fast can be striking: the documentation for the Khmer ASR Cultural Dataset below reports that adding even a few hundred Khmer speech-text pairs to the training mix measurably lowered Whisper Large V2's character error rate, bringing it down to roughly 8%.
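Both metrics mentioned above, word error rate and character error rate, are edit distances divided by the length of the reference transcript. The sketch below is a minimal pure-Python version for sanity-checking a model's output. It is not the scoring code used by any of the datasets listed here, and it assumes whitespace-separated words and a character error rate computed with spaces removed. Real evaluations usually also normalise case and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Scoring the same held-out clips before and after fine-tuning with these two functions gives a quick read on whether the added data is helping.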

The bottleneck isn't the training code but the data. You need audio that's transcribed accurately, licensed in a way that lets you use it, and ideally varied enough in speaker, dialect, and recording condition that your fine-tuned model generalises beyond the training set.
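One practical consequence of the generalisation point: if the same speaker appears in both training and evaluation data, the error rate will look better than it is. A minimal sketch of a speaker-disjoint split follows. The `speaker_id` field name is hypothetical, since the datasets below each define their own metadata columns, and some may not identify speakers at all.

```python
import random


def split_by_speaker(rows, test_fraction=0.2, seed=0):
    """Split clips so that no speaker appears in both train and test.

    `rows` is a list of dicts, each with a (hypothetical) "speaker_id" key.
    """
    speakers = sorted({row["speaker_id"] for row in rows})
    random.Random(seed).shuffle(speakers)  # reproducible shuffle
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [r for r in rows if r["speaker_id"] not in test_speakers]
    test = [r for r in rows if r["speaker_id"] in test_speakers]
    return train, test
```

Splitting on speakers rather than on individual clips means the held-out score reflects voices the model has never heard.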

This article lists 15 automatic speech recognition datasets on Mozilla Data Collective that are well suited to Whisper fine-tuning, spanning a range of language families (Dravidian, Bantu, Uralic) and regions (Mesoamerica, South Asia, Southeast Asia, Siberia). They range from large studio-production corpora to small, phonetically precise field recordings. Several are time-aligned, a format that suits Whisper fine-tuning well. Pick the one in your target language, or combine several related ones to push a whole language family across the usability threshold.
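Time alignment matters because Whisper works on audio windows of up to 30 seconds, so long recordings have to be cut into short clips with matching text. The sketch below shows one way to turn an SRT transcript into such chunks. It assumes standard SRT cues (`HH:MM:SS,mmm --> HH:MM:SS,mmm`) and greedy merging of consecutive cues. The function names are illustrative, and the actual files in the time-aligned corpora may need extra cleaning.

```python
import re

TIME_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, transcript) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = TIME_RE.match(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue's text.
                cues.append((start, end, " ".join(lines[i + 1:])))
                break
    return cues


def group_cues(cues, max_seconds=30.0):
    """Greedily merge consecutive cues into chunks of at most max_seconds.

    A single cue longer than max_seconds is kept as its own chunk.
    """
    chunks = []
    for start, end, text in cues:
        if chunks and end - chunks[-1][0] <= max_seconds:
            prev_start, _, prev_text = chunks[-1]
            chunks[-1] = (prev_start, end, f"{prev_text} {text}")
        else:
            chunks.append((start, end, text))
    return chunks
```

Each resulting `(start, end, text)` tuple can then be used to slice the audio file and pair the clip with its transcript.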

The 15 datasets

South Asia

Malayalam Time-Aligned Speech Corpus | Steward: Community | Licence: CC-BY-NC-4.0 | Size: 1.50 GB, 6 hours | Task: ASR | Format: WAV, SRT
Tamil Time-Aligned Speech Dataset | Steward: MirasAI | Licence: CC-BY-NC-SA-4.0 | Size: 37.11 MB, 5 hours | Task: ASR | Format: OGG, SRT
Kannada Time-Aligned Speech Corpus | Steward: MirasAI | Licence: CC-BY-NC-SA-4.0 | Size: 355.77 MB, 5 hours | Task: ASR | Format: OGG, SRT

Southeast Asia

Khmer ASR Cultural Dataset (Version 3 - Part 5) | Steward: DDD-Cambodia | Licence: CC-BY-SA-4.0 | Size: 33.14 GB, 87 hours | Task: ASR | Format: WAV
Khmer ASR Cultural Dataset (Version 3 - Part 7) | Steward: DDD-Cambodia | Licence: CC-BY-SA-4.0 | Size: 29.93 GB, 81 hours | Task: ASR | Format: WAV
Mandar Spontaneous Speech | Steward: Community | Licence: CC-BY-NC-4.0 | Size: 534.45 MB, 10 hours | Task: ASR | Format: MP3, TSV
Jember Javanese Spontaneous Speech Corpus | Steward: Universitas Gadjah Mada | Licence: CC-BY-NC-SA-4.0 | Size: 271.65 MB, 10 hours | Task: ASR | Format: MP3, TSV

Africa

DataTrust Africa: Speech Corpus of Public Radio Recordings from Northern Uganda | Steward: Community | Licence: NOODL-1.0 | Size: 179.82 MB | Task: ASR | Format: WAV, TSV
IsiZulu Second Language Learner Speech Corpus | Steward: Community | Licence: CC-BY-SA-4.0 | Size: 5.26 GB | Task: ASR | Format: WAV, SQLite

Siberia and Northern Eurasia

INEL Nganasan Speech Corpus | Steward: University of Hamburg | Licence: CC-BY-NC-SA-4.0 | Size: 1.41 GB, 38.5 hours | Task: ASR | Format: TSV, MP3
INEL Kalmyk Speech Corpus | Steward: University of Hamburg | Licence: CC-BY-NC-SA-4.0 | Size: 138.31 MB, 3 hours | Task: ASR | Format: TSV, MP3
INEL Evenki Speech Corpus | Steward: University of Hamburg | Licence: CC-BY-NC-SA-4.0 | Size: 103 MB, 2.6 hours | Task: ASR | Format: TSV, MP3
INEL Dolgan Speech Corpus | Steward: University of Hamburg | Licence: CC-BY-NC-SA-4.0 | Size: 583.34 MB, 13 hours | Task: ASR | Format: TSV, MP3

Latin America

Speech Corpus of English Learners from Mexico | Steward: Community | Licence: CC-BY-SA-4.0 | Size: 2.45 GB, 8 hours | Task: ASR | Format: MP3, TSV
Archivo GELED: Muestra general de audios del cuicateco | Steward: Community | Licence: CC-BY-NC-SA-4.0 | Size: 1.85 GB, 3 hours | Task: ASR | Format: WAV, TSV
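The licence field in the entries above decides what you can do with a fine-tuned model. As a rough illustration, the sketch below encodes the 15 entries and filters out those whose Creative Commons identifier carries an NC (non-commercial) clause. The `Dataset` type and helper names are made up for this example, the hours are copied from the listing (left as `None` where no duration is given), and the check is only a string match. It does not interpret NOODL-1.0 or any other licence's full terms, so read each licence before relying on it.

```python
from typing import NamedTuple, Optional


class Dataset(NamedTuple):
    name: str
    licence: str
    hours: Optional[float]  # None where the listing gives no duration


CATALOGUE = [
    Dataset("Malayalam Time-Aligned Speech Corpus", "CC-BY-NC-4.0", 6),
    Dataset("Tamil Time-Aligned Speech Dataset", "CC-BY-NC-SA-4.0", 5),
    Dataset("Kannada Time-Aligned Speech Corpus", "CC-BY-NC-SA-4.0", 5),
    Dataset("Khmer ASR Cultural Dataset (V3, Part 5)", "CC-BY-SA-4.0", 87),
    Dataset("Khmer ASR Cultural Dataset (V3, Part 7)", "CC-BY-SA-4.0", 81),
    Dataset("Mandar Spontaneous Speech", "CC-BY-NC-4.0", 10),
    Dataset("Jember Javanese Spontaneous Speech Corpus", "CC-BY-NC-SA-4.0", 10),
    Dataset("DataTrust Africa: Northern Uganda radio", "NOODL-1.0", None),
    Dataset("IsiZulu Second Language Learner Speech Corpus", "CC-BY-SA-4.0", None),
    Dataset("INEL Nganasan Speech Corpus", "CC-BY-NC-SA-4.0", 38.5),
    Dataset("INEL Kalmyk Speech Corpus", "CC-BY-NC-SA-4.0", 3),
    Dataset("INEL Evenki Speech Corpus", "CC-BY-NC-SA-4.0", 2.6),
    Dataset("INEL Dolgan Speech Corpus", "CC-BY-NC-SA-4.0", 13),
    Dataset("Speech Corpus of English Learners from Mexico", "CC-BY-SA-4.0", 8),
    Dataset("Archivo GELED (Cuicatec)", "CC-BY-NC-SA-4.0", 3),
]


def has_nc_clause(licence):
    """True if the licence identifier contains a non-commercial marker."""
    return "-NC" in licence.upper()


def without_nc_clause(datasets):
    """Datasets whose identifier has no NC marker (not legal advice)."""
    return [d for d in datasets if not has_nc_clause(d.licence)]


def known_hours(datasets):
    """Sum of listed durations, skipping entries with no duration."""
    return sum(d.hours for d in datasets if d.hours is not None)
```

For example, `known_hours(without_nc_clause(CATALOGUE))` totals the listed audio hours that carry no NC marker.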

From scraped to shared: fine-tuning Whisper on community datasets

Fine-tuning Whisper on a new language is one of the highest-leverage things you can do in speech AI right now: a few hours of well-transcribed audio can take a language from unusable to production-viable. For a walkthrough of fine-tuning Whisper models on Mozilla Data Collective datasets, see our step-by-step tutorial. The 15 datasets above give you audio across a deliberately wide spread of language families, including Dravidian (Tamil, Kannada, Malayalam), Bantu (isiZulu), Uralic (Nganasan), Mongolic (Kalmyk), Tungusic (Evenki), and Turkic (Dolgan), and of regions, from Mesoamerica (Cuicatec) to Southeast Asia (Khmer, Javanese).

Mozilla Data Collective is rebuilding the data ecosystem by keeping communities front and centre and giving the speakers of a language, and the organisations that record and preserve it, a say in how their speech datasets get licensed, credited, and applied. For under-served languages, this is a meaningful shift: away from models trained on whatever material can be scraped and extracted from the internet, and toward systems built on datasets that communities have deliberately shared and contributed on terms they set.
