Shared Task: Mozilla Common Voice Spontaneous Speech ASR

Overview

Automatic speech recognition (ASR) has come a long way – but most systems are still trained on polished, read-aloud speech. So we set out to build a model that can handle the messy, beautiful reality of spontaneous responses and languages long ignored by mainstream tech. We’re raising the bar for recognition accuracy and building speech technology that works for everyone, not just the few.

Alongside Mozilla Data Collective’s new Spontaneous Speech datasets, we’re launching a shared task that challenges researchers and developers to push ASR further across 21 underrepresented languages from Africa, Asia, Europe, and the Americas.

The goal of this shared task is to promote the development of robust automatic speech recognition (ASR) systems for spontaneous speech in a number of lower-resource languages that have historically been underrepresented in speech technology research.

Many of the large, widely-used ASR datasets are either read speech (previous releases from Mozilla Common Voice), predominantly English (Switchboard datasets), or both (LibriSpeech, WSJ).

This shared task is based on the recently released spontaneous speech datasets from Mozilla Common Voice. In these datasets, participants freely respond to prompts, and the responses are transcribed and validated. The available spontaneous speech datasets represent a wide range of under-served language communities.

The task will evaluate systems based on overall performance, the best improvement over the baseline for any single language, and resource-constrained system performance, encouraging innovative approaches to handle the nuances of spontaneous speech recognition.

Tasks

The shared task includes one main task:

  • Multilingual ASR Performance (Task 1): The average Word Error Rate (WER) across all languages with training data (i.e., excluding the unseen languages); a minimal scoring sketch follows the subtask list below.

and three subtasks:

  • Best improvement on a single language (Task 2): The largest WER improvement on any single language compared to our baseline system's performance.
  • Model-size-constrained improvement on a single language (Task 3): The best WER improvement over baseline on any language with a model that is less than 500 MB in size.
  • Unseen language ASR (Task 4): In addition to the set of languages that have training data (see the Data section below for details), we also include 5 languages for which no training data will be provided. We will provide the language names, but it is up to the participating teams to find additional data or leverage cross-lingual techniques. The best average WER on the set of unseen languages will win this task. Any additional data used must be openly shareable.
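
For illustration only, here is a minimal sketch of how per-language WER could be averaged for Task 1, assuming the open-source jiwer library and hypothetical reference/hypothesis file paths; the official scoring on CodaBench may differ (e.g., in text normalisation).

    # Illustrative scoring sketch (not the official scorer): average per-language WER.
    # Assumes two-column TSVs: <audio filename> <tab> <transcription>.
    import jiwer  # pip install jiwer

    def read_tsv(path):
        """Return {audio_filename: transcription} from a two-column TSV."""
        entries = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    name, text = line.rstrip("\n").split("\t", 1)
                    entries[name] = text
        return entries

    def language_wer(ref_tsv, hyp_tsv):
        refs = read_tsv(ref_tsv)
        hyps = read_tsv(hyp_tsv)
        order = sorted(refs)
        # A missing hypothesis is treated as a blank transcription (WER contribution of 1.0).
        return jiwer.wer([refs[k] for k in order], [hyps.get(k, "") for k in order])

    # Hypothetical paths; in practice there is one reference/hypothesis pair per language.
    langs = ["aln", "bew", "bxk"]
    scores = [language_wer(f"ref/{lang}.tsv", f"hyp/{lang}.tsv") for lang in langs]
    print(f"Average WER: {sum(scores) / len(scores):.3f}")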

Data

The dataset (available on Mozilla Data Collective here) includes approximately 9 hours of spontaneous speech for each of the 21 languages, drawn from Africa, the Americas, Europe, and Asia. Each language's dataset is available on the Mozilla Data Collective website. The following table lists all 21 languages:

No.  Language                      ISO 639   Region
 1   Bukusu                        bxk       Africa
 2   Chiga                         cgg       Africa
 3   Nubi                          kcn       Africa
 4   Konzo                         koo       Africa
 5   Lendu                         led       Africa
 6   Kenyi                         lke       Africa
 7   Thur                          lth       Africa
 8   Ruuli                         ruc       Africa
 9   Amba                          rwm       Africa
10   Rutoro                        ttj       Africa
11   Kuku                          ukv       Africa
12   Wixárika                      hch       Americas
13   Southwestern Tlaxiaco Mixtec  meh       Americas
14   Michoacán Mazahua             mmc       Americas
15   Papantla Totonac              top       Americas
16   Toba Qom                      tob       Americas
17   Gheg Albanian                 aln       Europe
18   Cypriot Greek                 el-CY     Europe
19   Scots                         sco       Europe
20   Betawi                        bew       Asia
21   Western Penan                 pne       Asia

Additionally, for Task 4, we include 5 languages for which only test data will be released: Adyghe (ady), Kabardian (kbd), Basaa (bas), Puno Quechua (qxp), and Ushojo (ush).

For these languages, teams are encouraged to seek out potentially useful data and/or leverage cross-lingual approaches. Any data used must be openly licensed to facilitate reproducibility.

Prizes

  • Task 1: $5,000 USD
  • Tasks 2-4: $2,000 USD each

Note: Contestants are not eligible to receive prizes if they are on the US Specifically Designated Nationals (SDN) list or if there are sanctions against the contestant’s country such that Mozilla is prohibited from paying them.

Registration

Please register for the competition through the following form.

Important dates

  • 26th September, 2025: Train/Dev data released (via Mozilla Data Collective)
  • 1st October, 2025: Shared task announced
  • 1st December, 2025: Test data released
  • 8th December, 2025: Deadline for submitting final results and system description paper
  • 12th December, 2025: Winners announced

Submission

Once we release the test data (audio only) on 1st December, teams will have one week to submit their system's predicted transcriptions for the relevant tasks on the shared task CodaBench page. Submissions should take the form of a zip file containing three subdirectories, each with a set of one or more tsv files (one per language attempted):

  • multilingual-general
    • aln.tsv
    • bew.tsv
  • small-model
    • aln.tsv
    • bew.tsv
  • unseen-langs
    • ady.tsv
    • bas.tsv

The tsv files should have two columns: the first is the name of the audio file, and the second is the predicted transcription. The score for each task is the average over all of the languages in the respective task (21 for the general and small-model tasks, 5 for the unseen-language task). If you do not submit transcriptions for a given language, we will treat it as a blank transcription, resulting in a WER of 1.0 for that language. For the “Biggest improvement over baseline” tasks, we will automatically select the language from the multilingual-general and small-model submissions that improves the most over our baseline.
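
As an illustration, the following sketch assembles a submission archive in the expected layout using only Python's standard library; the directory and file names follow the example above, while the predictions dictionary, clip names, and transcriptions are hypothetical placeholders.

    # Illustrative packaging sketch: one two-column TSV per language, grouped into
    # the three subdirectories, then zipped for upload to CodaBench.
    import zipfile
    from pathlib import Path

    # predictions[subdirectory][language] = {audio_filename: predicted_transcription}
    # (hypothetical example data)
    predictions = {
        "multilingual-general": {"aln": {"clip_0001.mp3": "example transcription"}},
        "small-model": {"aln": {"clip_0001.mp3": "example transcription"}},
        "unseen-langs": {"ady": {"clip_0042.mp3": "example transcription"}},
    }

    out = Path("submission")
    for subdir, langs in predictions.items():
        for lang, rows in langs.items():
            tsv_path = out / subdir / f"{lang}.tsv"
            tsv_path.parent.mkdir(parents=True, exist_ok=True)
            with open(tsv_path, "w", encoding="utf-8") as f:
                for audio_name, transcription in rows.items():
                    # Column 1: audio file name; column 2: predicted transcription.
                    f.write(f"{audio_name}\t{transcription}\n")

    # Bundle the subdirectories into a single zip file.
    with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for tsv in sorted(out.rglob("*.tsv")):
            zf.write(tsv, tsv.relative_to(out))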

The team with the best performance in each task will be asked to submit their model and a script to perform inference, so that we can reproduce the results (to avoid the possibility of, e.g., post-editing predicted transcriptions to improve performance). If we are unable to reproduce the system performance, the team will be disqualified and we will request the model and inference script from the next-highest-scoring team.

System description papers

Each team must submit, in addition to their system's predicted transcriptions on the test data, a system description paper of 4–8 pages (excluding acknowledgments and references). Please use the ACL Template.

Submissions should omit the author names for review.

Organisers

Programme chairs

  • Francis M. Tyers, Indiana University
  • Robert Pugh, Mozilla Data Collective
  • Anastasia Kuznetsova, Rev.com
  • Jean Maillard, Meta

Programme committee

  • Antonios Anastasopoulos, George Mason University
  • Kathy Reid, Australian National University
  • Miguel del Rio, Rev.com
  • Pooneh Mousavi, MILA
  • Abteen Ebrahimi, University of Colorado, Boulder
  • Ximena Gutierrez Vasquez, UNAM

Advisory committee

  • Emmanuel Ngué Um, University of Yaounde 1
  • Belu Ticona, George Mason University
  • Jennifer Smith, University of Glasgow
  • Joyce Nabende, Makerere University
  • Jonathan Mukiibi, Makerere University
  • Elwin Huaman, Innsbruck University
  • Yacub Fahmilda, Universitas Gadjah Mada
  • Riska Legistari Febri, Universitas Gadjah Mada
  • Murat Topçu, Okan University
  • Rosario de Fátima Alvarez García, Universidad Autónoma Metropolitana
  • Athziri Madeleine Vega Martínez, Universidad Nacional Autónoma de México
  • Marlon Vargas Méndez, Escuela Nacional de Antropología e Historia
  • Antonio Hayuaneme García Mijarez, Nación Wixárika
  • Vivian Stamou, Archimedes Athena Research Centre
  • Meesum Alam, Indiana University
  • Jonathan Lewis-Jong, University of Oxford