What do the languages Nahuatl, Bahasa Indonesia and Bulgarian have in common?
At first glance, Western Sierra Puebla Nahuatl (an endangered variety of the indigenous Mexican language Nahuatl, spoken in the state of Puebla), Bahasa Indonesia (the official national language of the Republic of Indonesia, spoken by over 250 million people) and Bulgarian (a Slavic language spoken by nearly 8 million people in south-eastern Europe) might seem to have little in common.
Not so fast!
They're all languages featured in the very first community curated datasets to be uploaded to the Mozilla Data Collective platform. With huge thanks to the organizations and individuals who created them, they're now available for you to explore.
Dimitar - a 1.4-hour corpus of Bulgarian from a single speaker
Different tasks in machine learning require different sorts of data.
For example, automatic speech recognition (ASR) takes spoken audio and predicts written words. The task is to accurately recognize speech from a wide variety of speakers - varying in gender, age and accent - so data from an equally varied pool of speakers is needed to create a robust ASR model.
The complement to ASR is speech synthesis, also called "text to speech" or TTS. TTS models take written words and generate spoken audio. In contrast to ASR models, TTS models need high-quality data from a single speaker.
The Dimitar corpus provides just that data in Bulgarian.
Curated by the Open Home Foundation, the not-for-profit foundation behind the Home Assistant home automation platform, the Dimitar corpus is suitable for training TTS models that create synthesized speech in Bulgarian. For example, it could be used with the Piper TTS system, which is the TTS system the Home Assistant team uses to generate synthetic voices for Home Assistant Voice.
Datasheets help data practitioners understand how to use a dataset
Each dataset uploaded to the Mozilla Data Collective must be accompanied by a datasheet. A datasheet is a form of dataset documentation: a transparent description of a dataset's contents that helps a data practitioner understand the type of data in the dataset, what it's useful for and, equally, what it shouldn't be used for.
Looking at the datasheet for the Dimitar corpus, we can see that it provides specifics useful for building TTS, such as the median characters per sentence and median words per sentence. It also helpfully provides the full Bulgarian alphabet, which helps speech technologists match phonemes - the building blocks of speech - to individual characters.
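As an illustrative sketch (the datasheet's exact methodology isn't specified here), metrics like these are straightforward to compute in Python, assuming each transcript is a single sentence:

```python
import statistics

def sentence_metrics(sentences):
    """Return (median characters per sentence, median words per sentence)."""
    char_counts = [len(s) for s in sentences]
    word_counts = [len(s.split()) for s in sentences]
    return statistics.median(char_counts), statistics.median(word_counts)

# Hypothetical Bulgarian transcripts, one sentence each
sentences = [
    "Добро утро.",
    "Как се казваш?",
    "Времето днес е слънчево и топло.",
]
median_chars, median_words = sentence_metrics(sentences)
print(median_chars, median_words)  # → 14 3
```

Note that whitespace tokenization is a simplification; a real datasheet might use a language-aware tokenizer instead.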
Particular thanks go to Dr. Michael Hansen, Voice Engineering Lead at Nabu Casa, for all his work on this corpus.
Download Dimitar
Podcast Hari Minggoean - a ten-hour corpus of Javanese-accented Bahasa Indonesia featuring code-switching and contemporary Indonesian speech
Derived from the "Hari Minggoean" podcast, this dataset has several features that are attractive to practitioners building ASR models for Indonesian and other Malay languages.
Firstly, the podcast features content tailored to young Indonesian audiences and uses contemporary language. Data on contemporary language use is particularly important: languages change all the time, and ML models need to keep up. Just five years ago, there was no need for ASR to recognize phrases like "Skibidi Ohio rizz"! But if ASR is to recognize the speech of young people accurately, we need data from young people.
Secondly, the dataset features code-switching. In linguistics, code-switching is when a speaker uses two languages or two varieties within the same utterance. Among young people in Indonesia, it's very common to alternate between Bahasa Indonesia and English. However, most speech recognition models are trained on only a single language, and recognition of code-switched speech is still an emerging research area - one this dataset is well suited to support!
Many thanks to Yacub Fahmilda for the excellent work!
Download podcast Hari Minggoean
Listen to the Hari Minggoean podcast
Tetelancingo Nahuatl Corpus: A corpus of audio and annotated transcriptions of Western Sierra Puebla Nahuatl
From Indonesia we go across the Pacific to central Mexico, where PhD candidate and Mozilla Data Collective linguist Robert Pugh and a team of collaborators collected monologues and dialogues from five speakers in a community in Zacatlán de las Manzanas.
What's important for data practitioners is that each recording in this corpus has both a spontaneous and a standard transcription, a Spanish translation of the recording, and word-level language tags. These multiple layers of annotation create a richer dataset that lends itself to many ML tasks.
For example, because each recording has both a Nahuatl transcription and a Spanish translation, it forms a "parallel corpus" - text in two languages with the same semantic meaning - which could be used to build a machine translation model between the two. Machine translation is notoriously difficult for endangered languages, so this dataset is one more step towards speech technologies that work better for more of the world's 7,000 spoken languages.
Additionally, the word-level tags make this an excellent dataset for identifying when speakers switch between languages or variants within the same utterance - again, a useful application for endangered languages.
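To illustrate the idea (this is a hypothetical sketch, not the corpus's actual format), word-level language tags make finding switch points a simple scan: each token carries a tag, and a switch occurs wherever the tag differs from the previous token's.

```python
def switch_points(tagged_tokens):
    """Return token indices where the language tag differs from the previous token."""
    return [
        i
        for i in range(1, len(tagged_tokens))
        if tagged_tokens[i][1] != tagged_tokens[i - 1][1]
    ]

# Hypothetical utterance: (token, language-tag) pairs,
# with made-up tags "nah" (Nahuatl) and "spa" (Spanish)
utterance = [
    ("tok1", "nah"), ("tok2", "nah"),
    ("tok3", "spa"), ("tok4", "spa"),
    ("tok5", "nah"),
]
print(switch_points(utterance))  # → [2, 4]
```

Real annotation schemes vary, but the principle is the same: explicit per-word tags turn code-switch detection into a lookup rather than a modeling problem.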
Download Tetelancingo Nahuatl Corpus
For more information about the dataset and some preliminary experiments, see the paper Ihquin tlahtouah in Tetelahtzincocah: An annotated, multi-purpose audio and text corpus of Western Sierra Puebla Nahuatl.
How do I get started curating and uploading my own dataset to the Mozilla Data Collective platform?
Have you been inspired by these datasets to curate your own? The Mozilla Data Collective lets you host and share your datasets on your own terms.
To upload your own dataset, first create an account, then request upload access. We verify each dataset creator as a quality-assurance measure.
What community datasets will we see next? Stay tuned!