Shared Task: Mozilla Common Voice Spontaneous Speech ASR

Overview

Automatic speech recognition (ASR) has come a long way – but most systems are still trained on polished, read-aloud speech. So we set out to build a model that can handle the messy, beautiful reality of spontaneous responses and languages long ignored by mainstream tech. We’re raising the bar for recognition accuracy and building speech technology that works for everyone, not just the few.

Alongside Mozilla Data Collective’s new Spontaneous Speech datasets, we’re launching a shared task that challenges researchers and developers to push ASR further across 21 underrepresented languages from Africa, Asia, Europe, and the Americas.

The goal of this shared task is to promote the development of robust automatic speech recognition (ASR) systems for spontaneous speech in a number of lower-resource languages that have historically been underrepresented in speech technology research.

Many of the large, widely-used ASR datasets are either read speech (previous releases from Mozilla Common Voice), predominantly English (Switchboard datasets), or both (LibriSpeech, WSJ).

This shared task is based on the recently released spontaneous speech datasets from Mozilla Common Voice. In these datasets, participants freely respond to prompts, and the responses are transcribed and validated. The available spontaneous speech datasets represent a wide range of under-served language communities.

The task will evaluate systems based on overall performance, the best improvement over the baseline for any single language, and resource-constrained system performance, encouraging innovative approaches to handle the nuances of spontaneous speech recognition.

Tasks

The shared task includes one main task:

  • Multilingual ASR Performance (Task 1): The average Word Error Rate (WER) across all languages with training data (i.e., excluding the unseen languages); a minimal scoring sketch follows the subtask list below.

and three subtasks:

  • Best improvement on a single language (Task 2): The largest WER improvement on any single language compared to our baseline system's performance.
  • Model-size-constrained improvement on a single language (Task 3): The best WER improvement over baseline on any language with a model that is less than 500 MB in size.
  • Unseen language ASR (Task 4): In addition to the set of languages that have training data (see the Data section below for details), we also include 5 languages for which no training data will be provided. We will provide the language names, but it is up to the participating teams to find additional data or leverage cross-lingual techniques. The best average WER on the set of unseen languages will win this task. Any additional data used must be openly shareable.
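
For illustration only, here is a minimal sketch of how per-language WER could be averaged for Task 1, assuming the open-source jiwer library and hypothetical reference/hypothesis file paths; the official scoring on CodaBench may differ (e.g., in text normalisation).

    # Illustrative scoring sketch (not the official scorer): average per-language WER.
    # Assumes two-column TSVs: <audio filename> <tab> <transcription>.
    import jiwer  # pip install jiwer

    def read_tsv(path):
        """Return {audio_filename: transcription} from a two-column TSV."""
        entries = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    name, text = line.rstrip("\n").split("\t", 1)
                    entries[name] = text
        return entries

    def language_wer(ref_tsv, hyp_tsv):
        refs = read_tsv(ref_tsv)
        hyps = read_tsv(hyp_tsv)
        order = sorted(refs)
        # A missing hypothesis is treated as a blank transcription (WER contribution of 1.0).
        return jiwer.wer([refs[k] for k in order], [hyps.get(k, "") for k in order])

    # Hypothetical paths; in practice there is one reference/hypothesis pair per language.
    langs = ["aln", "bew", "bxk"]
    scores = [language_wer(f"ref/{lang}.tsv", f"hyp/{lang}.tsv") for lang in langs]
    print(f"Average WER: {sum(scores) / len(scores):.3f}")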

Data

The dataset (available on Mozilla Data Collective here) includes approximately 9 hours of spontaneous speech for each of the 21 languages, drawn from Africa, the Americas, Europe, and Asia. Each language's dataset is available on the Mozilla Data Collective website. The following table lists all 21 languages:

No.  Language                      ISO 639   Region
 1   Bukusu                        bxk       Africa
 2   Chiga                         cgg       Africa
 3   Nubi                          kcn       Africa
 4   Konzo                         koo       Africa
 5   Lendu                         led       Africa
 6   Kenyi                         lke       Africa
 7   Thur                          lth       Africa
 8   Ruuli                         ruc       Africa
 9   Amba                          rwm       Africa
10   Rutoro                        ttj       Africa
11   Kuku                          ukv       Africa
12   Wixárika                      hch       Americas
13   Southwestern Tlaxiaco Mixtec  meh       Americas
14   Michoacán Mazahua             mmc       Americas
15   Papantla Totonac              top       Americas
16   Toba Qom                      tob       Americas
17   Gheg Albanian                 aln       Europe
18   Cypriot Greek                 el-CY     Europe
19   Scots                         sco       Europe
20   Betawi                        bew       Asia
21   Western Penan                 pne       Asia

Additionally, for Task 4, we include 5 languages for which only test data will be released: Adyghe (ady), Kabardian (kbd), Basaa (bas), Puno Quechua (qxp), and Ushojo (ush).

For these languages, teams are encouraged to seek out potentially useful data and/or leverage cross-lingual approaches. Any data used must be openly licensed to facilitate reproducibility.

Prizes

  • Task 1: $5,000 USD
  • Tasks 2-4: $2,000 USD each

Note: Contestants are not eligible to receive prizes if they are on the US Specifically Designated Nationals (SDN) list or if there are sanctions against the contestant’s country such that Mozilla is prohibited from paying them.

Registration

Please register for the competition through the following form.

Important dates

  • 26th September, 2025: Train/Dev data released (via Mozilla Data Collective)
  • 1st October, 2025: Shared task announced
  • 1st December, 2025: Test data released
  • 8th December, 2025: Deadline for submitting final results and system description paper
  • 12th December, 2025: Winners announced

Submission

Once we release the test data (audio only) on 1st December, teams will have one week to submit their system's predicted transcriptions for the relevant tasks on the shared task CodaBench page. Submissions should take the form of a zip file containing three subdirectories, each with a set of one or more tsv files (one per language attempted):

  • multilingual-general
    • aln.tsv
    • bew.tsv
  • small-model
    • aln.tsv
    • bew.tsv
  • unseen-langs
    • ady.tsv
    • bas.tsv

The tsv files should have two columns: the first is the name of the audio file, and the second is the predicted transcription. The score for each task is the average over all of the languages in the respective task (21 for the general and small-model tasks, 5 for the unseen-language task). If you do not submit transcriptions for a given language, we will treat it as a blank transcription, resulting in a WER of 1.0 for that language. For the “Biggest improvement over baseline” tasks, we will automatically select the language from the multilingual-general and small-model submissions that improves the most over our baseline.
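
As an illustration, the following sketch assembles a submission archive in the expected layout using only Python's standard library; the directory and file names follow the example above, while the predictions dictionary, clip names, and transcriptions are hypothetical placeholders.

    # Illustrative packaging sketch: one two-column TSV per language, grouped into
    # the three subdirectories, then zipped for upload to CodaBench.
    import zipfile
    from pathlib import Path

    # predictions[subdirectory][language] = {audio_filename: predicted_transcription}
    # (hypothetical example data)
    predictions = {
        "multilingual-general": {"aln": {"clip_0001.mp3": "example transcription"}},
        "small-model": {"aln": {"clip_0001.mp3": "example transcription"}},
        "unseen-langs": {"ady": {"clip_0042.mp3": "example transcription"}},
    }

    out = Path("submission")
    for subdir, langs in predictions.items():
        for lang, rows in langs.items():
            tsv_path = out / subdir / f"{lang}.tsv"
            tsv_path.parent.mkdir(parents=True, exist_ok=True)
            with open(tsv_path, "w", encoding="utf-8") as f:
                for audio_name, transcription in rows.items():
                    # Column 1: audio file name; column 2: predicted transcription.
                    f.write(f"{audio_name}\t{transcription}\n")

    # Bundle the subdirectories into a single zip file.
    with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for tsv in sorted(out.rglob("*.tsv")):
            zf.write(tsv, tsv.relative_to(out))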

The team with the best performance in each task will be asked to submit their model and a script to perform inference, so that we can reproduce the results (to avoid the possibility of, e.g., post-editing predicted transcriptions to improve performance). If we are unable to reproduce the system performance, the team will be disqualified and we will request the model and inference script from the next-highest-scoring team.

System description papers

Each team must submit, in addition to their system's predicted transcriptions on the test data, a system description paper of 4–8 pages (excluding acknowledgments and references). Please use the ACL Template.

Submissions should omit the author names for review.

Organisers

Programme chairs

  • Francis M. Tyers, Indiana University
  • Robert Pugh, Mozilla Data Collective
  • Anastasia Kuznetsova, Rev.com
  • Jean Maillard, Meta

Programme committee

  • Antonios Anastasopoulos, George Mason University
  • Kathy Reid, Australian National University
  • Miguel del Rio, Rev.com
  • Pooneh Mousavi, MILA
  • Abteen Ebrahimi, University of Colorado, Boulder
  • Ximena Gutierrez Vasquez, UNAM

Advisory committee

  • Emmanuel Ngué Um, University of Yaounde 1
  • Belu Ticona, George Mason University
  • Jennifer Smith, University of Glasgow
  • Joyce Nabende, Makerere University
  • Jonathan Mukiibi, Makerere University
  • Elwin Huaman, Innsbruck University
  • Yacub Fahmilda, Universitas Gadjah Mada
  • Riska Legistari Febri, Universitas Gadjah Mada
  • Murat Topçu, Okan University
  • Rosario de Fátima Alvarez García, Universidad Autónoma Metropolitana
  • Athziri Madeleine Vega Martínez, Universidad Nacional Autónoma de México
  • Marlon Vargas Méndez, Escuela Nacional de Antropología e Historia
  • Antonio Hayuaneme García Mijarez, Nación Wixárika
  • Vivian Stamou, Archimedes Athena Research Centre
  • Meesum Alam, Indiana University
  • Jonathan Lewis-Jong, University of Oxford