Improving the Spontaneous Speech English dataset: lifting the lid on speech data quality uplift techniques

Screenshot of the Common Voice Spontaneous Speech 1.0 - English Data Listing at MDC
Visit this link to download the English CV Spontaneous Speech v1.0 Dataset

Firstly, we’d like to thank you for your patience. After introducing Spontaneous Speech early in 2025, we released most locale datasets when the Mozilla Data Collective platform launched in alpha in September of this year. However, upon inspection, the English Spontaneous Speech dataset required some remedial work prior to release. 

In this blog post, we detail the work undertaken by Kostis Saitas Zarkias, one of our ML Engineers, to improve the data quality of Spontaneous Speech English. 

Elicited speech versus spontaneous speech data: what’s the difference? 

In November 2024, the Common Voice platform introduced Spontaneous Speech: new functionality which records data contributors answering a range of questions and then allows those recordings to be transcribed. In contrast to elicited speech, where a speaker reads a sentence which is then recorded and uploaded to the platform, spontaneous speech records a more natural, conversational type of speech. 

There are both benefits and drawbacks of this approach. 

On the plus side, spontaneous speech is of higher value to speech researchers and language technology developers because it reflects how people actually talk in real-world contexts, with disfluencies, false starts, repairs, fillers ("um," "uh"), and overlapping turns. This messy reality is what speech technologies need to handle in practice, whether it's voice assistants, transcription systems, or language models processing conversational input. Additionally, when people speak spontaneously, they display the full range of natural variation in prosody (the rhythm and cadence someone speaks with), emotional expression and conversational dynamics (such as the way that Australian-accented speakers finish their sentences like a question, called a “high-rising terminal”). Elicited speech, in contrast, tends to be more controlled by the speaker, meaning that elicited speech data is missing these natural variations. 

Technologies trained or tested only on elicited speech often perform poorly when confronted with spontaneous speech in real-world situations. Spontaneous speech provides the challenging test cases—incomplete sentences, self-corrections, non-standard constructions, background noise, and simultaneous speakers—that reveal whether systems are truly robust. Having spontaneous speech data available for training can therefore help applications better match their deployment context. 

However, this variability is one of the downsides of working with spontaneous speech. It presents several challenges for speech data quality, and in turn, for researchers and language technologies applying that data to machine learning models. For example, how do researchers or language technology developers know when there is a disfluency in a recording? This has to be represented in a consistent way in speech data to be of use in training machine learning models. 

How have we improved the quality of Spontaneous Speech English? 

Several additional processing steps have been performed on Spontaneous Speech English, resulting in a dataset of much higher quality. Below, we provide a methodical technical description of the enhancements performed to support reproducibility and transparency. 

Audio language identification 

Because the user interface for Spontaneous Speech was slightly different to that used for read speech on Common Voice, there was initially some confusion about language selection, and some speakers contributed to the “English” dataset when they intended to contribute to another language. 

To remove non-English samples from Spontaneous Speech English, all audio files were processed with the Whisper large-v3-turbo model to provide automatic language identification. A threshold of 0.3 confidence was used to identify audio files where the predicted language did not match the expected language (English). Audio files that had a language identification confidence below the threshold were completely removed from the dataset. Audio files that had a language identification confidence above the threshold but did not match the expected language were moved from the ss-corpus-en.tsv to the ss-report-audios-eng.tsv and were tagged as "foreign_language" with a comment indicating the predicted language and the confidence score.
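
As an illustration, the snippet below sketches this language-identification triage using the openai-whisper package. The detect_language call mirrors the step described above, but the file handling, bookkeeping and return values are simplified assumptions rather than the exact pipeline code.

import whisper

EXPECTED_LANG = "en"
CONF_THRESHOLD = 0.3

model = whisper.load_model("large-v3-turbo")

def identify_language(audio_path):
    # Whisper's language ID looks at a 30-second window of log-mel features.
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    lang = max(probs, key=probs.get)
    return lang, probs[lang]

def triage(audio_path):
    # Apply the rules described above: drop low-confidence clips, flag
    # confident non-English clips as "foreign_language", keep the rest.
    lang, conf = identify_language(audio_path)
    if conf < CONF_THRESHOLD:
        return "remove"
    if lang != EXPECTED_LANG:
        return f"foreign_language (predicted {lang}, confidence {conf:.2f})"
    return "keep"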

Disfluency standardization 

In conversational speech, a disfluency is an interruption or an irregularity in the flow of speech. Disfluencies are a completely normal and natural part of how people talk in real-life situations. Common types include filler words like “um” or “ah” that indicate someone is thinking about what to say next; repeated sounds or words, like “I-I think so”; false starts, where someone starts to say one thing and then says another; and prolongations like “Noooo!”. 

In the guidelines for transcribing Spontaneous Speech, contributors are asked to mark disfluencies with a disfluency marker, e.g.: 

Like [disfluency] I dunno, fixing the house, tending to my plants, [disfluency] go on a hike, or go somewhere with nice scenery without too many humans, or no humans at all, preferably.

However, not all disfluencies have been tagged this way, because there is often a lot of ambiguity around what is a disfluency and what could be a named entity (a person, a place or a product) or an acronym. For this specific release, if the contributor did not mark the disfluency with any disfluency marker (brackets), the disfluency has not been standardized. The wrapping characters for a disfluency marker ([ ] / ( ) / { }) have all been standardized to square brackets ([ ]). 

You can see a worked example of how disfluencies are tagged in the example below:

Oh, a lot of things! I have a lot of plans in my head. Like [disfluency] I dunno, fixing the house, tending to my plants, [disfluency] go on a hike, or go somewhere with nice scenery without too many humans, or no humans at all, preferably. And take a great, I dunno, landscape picture with my camera or something like that. But, usually that's the plan, right? Usually I will just stay at home and do nothing.

Unpacking these a little, we see: 

“Like, uhmmm, I dunno” => Like [disfluency] I dunno 
“Uhhh, go on a hike” => [disfluency] go on a hike

For practitioners who use speech data that is transcribed from spontaneous speech, it’s very helpful to have disfluencies represented in a standard, repeatable way. 

The transcription guidelines for Spontaneous Speech within Common Voice instruct transcribers to place disfluencies in brackets, e.g. “[disfluency] yes, I think so”. Processing has been applied to standardize the representation of disfluencies in transcriptions so they are all bracketed consistently. 
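
To make this concrete, here is a minimal sketch of that normalization step using a regular expression. The exact code used in the pipeline may differ; this only covers the bracket variants mentioned above.

import re

# Match a disfluency marker wrapped in [ ], ( ) or { }, in any letter case.
DISFLUENCY_PATTERN = re.compile(r"[\[({]\s*disfluency\s*[\])}]", re.IGNORECASE)

def standardize_disfluencies(transcript):
    # Rewrite every bracketed disfluency marker as "[disfluency]".
    return DISFLUENCY_PATTERN.sub("[disfluency]", transcript)

standardize_disfluencies("Like (disfluency) I dunno, {Disfluency} go on a hike")
# -> "Like [disfluency] I dunno, [disfluency] go on a hike"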

Splitting the data into train, dev and test sets 

Splitting a dataset means dividing it into separate subsets (train, dev, and test) used to build, tune, and evaluate a machine learning model. For Spontaneous Speech datasets, we split based on the speaker. We “fill” the dev and test split quotas first, to ensure as much variety of speakers as possible in the dev and test splits, then put the remainder into the train set. We also want to ensure that the test and dev splits each have a minimum of 1.5 hours of total audio duration. In total, after the language identification filtering, the v1.0 English Spontaneous Speech dataset had 1,368 audio clips recorded; 1,165 of these were transcribed, and 736 were validated by other users as correct (i.e. they received a positive vote from a contributor as a valid audio-transcription pair).

We created the splits on the 1,165 audio-transcription pairs, which led to the following distribution:

  • Train contains 345 audio-transcription pairs from 51 different speakers, with a total of 1.66 hours of audio
  • Dev contains 539 audio-transcription pairs from 38 different speakers, with a total of 1.90 hours of audio
  • Test contains 281 audio-transcription pairs from 42 different speakers, with a total of 1.64 hours of audio

We try to ensure that no speaker appears in more than one of the splits: for example, if Alice’s voice can be heard in the train split, it cannot be found in dev or test. Since we avoid collecting personally identifiable information (PII) from our contributors, we implement this through the attribute “client_id”: a unique identifier automatically assigned to a user session on the Common Voice platform. This means we cannot guarantee that a single client_id is actually a single unique person: the same person could log in from a different account and contribute (so one speaker ends up with two different client_ids), or a group of contributors could use a single account to record their voices (so multiple speakers end up under the same client_id).
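
The sketch below illustrates this kind of speaker-disjoint splitting keyed on client_id, filling dev and test first until each reaches the 1.5-hour minimum and sending the remaining speakers to train. The column names and the simple greedy fill are assumptions for illustration, not the exact splitting code.

import random
from collections import defaultdict

MIN_HELD_OUT_SECONDS = 1.5 * 3600  # minimum 1.5 hours each for dev and test

def split_by_speaker(rows, seed=42):
    # rows: dicts with at least "client_id" and "duration" (in seconds).
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["client_id"]].append(row)

    speakers = list(by_speaker)
    random.Random(seed).shuffle(speakers)

    splits = {"dev": [], "test": [], "train": []}
    totals = {"dev": 0.0, "test": 0.0}

    for speaker in speakers:
        clips = by_speaker[speaker]
        # Fill dev, then test, until each holds at least 1.5 hours of audio;
        # every remaining speaker goes to train. A speaker's clips never
        # appear in more than one split.
        if totals["dev"] < MIN_HELD_OUT_SECONDS:
            target = "dev"
        elif totals["test"] < MIN_HELD_OUT_SECONDS:
            target = "test"
        else:
            target = "train"
        splits[target].extend(clips)
        if target in totals:
            totals[target] += sum(c["duration"] for c in clips)
    return splits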

Generating quality tag annotations

Processing was also done to identify audio samples without transcriptions, audio clips that were very short or very long, and transcriptions in an unexpected orthography (for example, English transcribed using Japanese characters). We’re happy to report there weren’t any of those in English Spontaneous Speech. 

As an example, one audio and related transcription in the English Spontaneous Speech dataset was tagged with short_transcription. Upon inspection we can see that the data contributor provided a one-word answer to the prompt: 

What do you want to do with your next day off?
“Relax”

Depending on the application the data is being used for, this short response may or may not be suitable for inclusion. However, having the quality tag allows the researcher or developer to more easily identify short transcriptions and make this decision. 

Each audio now has an accompanying quality tag to help you select the best data for your project. 
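
For example, once you have downloaded the dataset, you could filter on those tags with a few lines of pandas. The quality-tag column name shown here is an assumption; check the TSV files shipped with the release for the exact schema.

import pandas as pd

corpus = pd.read_csv("ss-corpus-en.tsv", sep="\t")

# Keep only clips that were not flagged with a short transcription,
# e.g. one-word answers such as "Relax".
usable = corpus[corpus["quality_tag"] != "short_transcription"]

print(f"Kept {len(usable)} of {len(corpus)} clips")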

How is Spontaneous Speech data being used? 

While it’s still early days for Spontaneous Speech, data from Common Voice Spontaneous Speech is already being used to help improve automatic speech recognition through the Mozilla Common Voice Spontaneous Speech ASR Shared Task. In this challenge, researchers and developers are asked to improve the accuracy of automatic speech recognition (ASR) models across 21 under-represented languages from Africa, Asia, Europe and the Americas. The data for the Shared Task is now available on the Mozilla Data Collective platform. Please note that only the splitting and quality tags were applied to the Shared Task data; disfluency markers were not standardized. 

Next steps 

We’ll continue to make improvements to data quality in the Mozilla Data Collective platform, and we warmly welcome your feedback on any aspect of data quality via email to mozilladatacollective@mozillafoundation.org.

You can now download Spontaneous Speech English from the Mozilla Data Collective platform at: 

https://datacollective.mozillafoundation.org/datasets/cmihqzerk023co20749miafhq