MDC Release Notes - 30.03.26

379 new datasets with a Mozilla Common Voice update, improvements to the Python SDK (make sure you update to the latest!) and a preview of an upcoming feature. 👀

Hello, Mozilla Data Collective! 👋

It's been an exciting and busy time over here, as we have a few ✨major✨ updates coming your way next month. Last week, we landed a core piece of work in the code base that will unlock our first conditional access feature: individual access gating. This feature will give uploaders more direct control over who is allowed to download their datasets. We've also been iterating on our Python SDK as we prepare to support uploads and submissions in a programmatic pipeline.

New Features & Changes

  • Improvements and fixes related to uploading datasets, especially large ones
  • We fixed a bug where draft datasheets couldn't be saved until all required fields were entered
  • SO much stuff I want to share now, but I have to wait until it's publicly available 😉

New Datasets

Anjuman e Katib

Persian Literature Corpus by Najwai Sukhan | Mozilla Data Collective
The Persian Literature Corpus by Najwai Sukhan is a curated collection of Persian (Farsi) literary and educational texts created for research, computational use, and cultural preservation. It contains about 1.26 million tokens across 20 complete works spanning classical literature, poetry, modern prose, educational writing, philosophy, translations, and culturally rooted creative texts. Originally compiled in Microsoft Word format, the corpus was cleaned, normalized, and converted into UTF-8 plain text while preserving original orthography and style. Each file represents a complete work, making the dataset useful for both individual text analysis and broader corpus-level study. The corpus supports corpus linguistics, literary studies, digital humanities, NLP, and Persian language preservation.

Balochistan Educational and Cultural Organization

BECO Brahui Literature Corpus | Mozilla Data Collective
This Brahui literary corpus contains short stories, novels, and other creative literary works, representing a broad range of narrative styles and themes within Brahui literature. The texts reflect both classical and contemporary writing, offering insight into cultural expression and linguistic variation in Brahui. The corpus comprises approximately 355,000 tokens, making it a valuable resource for linguistic research and natural language processing tasks involving an under-resourced language.

EELLAK - GreekFOSS

Istorima | Mozilla Data Collective
Dataset Language: Greek
Dataset Info: This dataset consists of oral history content collected from the Istorima archive, including transcribed interviews and associated metadata. The material reflects personal narratives and life stories, primarily in Greek, covering a wide range of social, cultural, and historical topics.
Metadata Info: This dataset consists of 13,548 oral history interview records, structured as a tabular dataset with mixed data types. Each record includes a unique identifier (id) along with textual fields such as title, summary, transcription, speaker_name, and researcher_name. Additional metadata fields capture thematic and categorical information (themes, tags), geographic references (geonames, interview_place), and temporal attributes (date, published_at). The dataset also includes numerical and boolean features such as duration_minutes, is_age_restricted, and is_on_demand, as well as a language field indicating the interview language.
Dataset Statistics: Words: 96,479,186; Tokens: 138,933,365
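As a rough illustration of how the Istorima metadata fields could be used, here is a minimal Python sketch that filters interview records by language, age restriction, and duration. The field names come from the description above; the sample rows, the "el" language code, and the `select_interviews` helper are hypothetical, and the actual file format of the dataset is not assumed here.

```python
def select_interviews(records, language, min_minutes=0.0):
    """Keep records in `language` that are not age-restricted and are long enough."""
    selected = []
    for rec in records:
        if rec.get("language") != language:
            continue
        if rec.get("is_age_restricted"):
            continue
        if float(rec.get("duration_minutes") or 0) < min_minutes:
            continue
        selected.append(rec)
    return selected


# Hypothetical rows for illustration only; not taken from the dataset.
sample_records = [
    {"id": "a1", "language": "el", "duration_minutes": 42, "is_age_restricted": False},
    {"id": "a2", "language": "el", "duration_minutes": 5, "is_age_restricted": False},
    {"id": "a3", "language": "el", "duration_minutes": 60, "is_age_restricted": True},
]

long_open_interviews = select_interviews(sample_records, "el", min_minutes=10)
```

How the language field is actually encoded is something to check against the dataset's own documentation before relying on a filter like this.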

Institute of African Digital Humanities

Bamun-French Parallel Corpus 2.0 | Mozilla Data Collective
This dataset is an extended and updated version of the ‘Bamun-French Parallel Corpus 1.1’ that is published on the Mozilla Data Collective platform. It is a parallel corpus of 4,444 lines in Bamun and French suitable for machine translation tasks. The text was obtained by transcribing raw audio files. Translations were added to enrich the original corpus. Bamun and French text alignment was performed in the process of creating this dataset. This version of the dataset resolves formatting issues flagged in the original and nearly doubles the number of aligned translation units compared to version 1.1.

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Kalmyk Speech Corpus | Mozilla Data Collective
This dataset is a machine-learning-ready subset of the INEL Kalmyk Corpus (Version 1.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice. The dataset comprises 3 hours and 15 minutes of aligned supervised speech data (1,934 individual clips) across 26 speakers. It strictly populates the primary ‘sentence’ column using the ‘ts’ tier (scientific transcription) to ensure phonetic accuracy, with demographic metadata included where available.
INEL Nganasan Speech Corpus | Mozilla Data Collective
This dataset is a machine-learning-ready subset of the INEL Nganasan Corpus (Version 1.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the highly detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice. The dataset comprises 38 hours and 30 minutes of aligned supervised speech data across 42 speakers. It features demographic metadata where available, and prioritizes Cyrillic transcriptions while falling back to Latin or Phonological tiers to ensure complete text coverage for acoustic modeling.
INEL Evenki Speech Corpus | Mozilla Data Collective
This dataset is a machine-learning-ready subset of the INEL Evenki Corpus (Version 2.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the highly detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice. The dataset comprises 2 hours and 39 minutes of aligned supervised speech data (2,180 individual clips) across 6 speakers. It features demographic metadata where available, and prioritizes Cyrillic transcriptions while falling back to Latin or Phonological tiers to ensure complete text coverage for acoustic modeling.
INEL Dolgan Speech Corpus | Mozilla Data Collective
This dataset is a machine-learning-ready subset of the INEL Dolgan Corpus (Version 2.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the highly detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice. The dataset comprises 13 hours and 5 minutes of perfectly aligned supervised speech data (10,609 individual clips) across recordings spanning from the 1970s to 2017. It features demographic metadata where available, and prioritizes Cyrillic transcriptions while falling back to Latin or Phonological tiers to ensure complete text coverage for acoustic modeling.
INEL Kamas Speech Corpus | Mozilla Data Collective
This dataset is a machine-learning-ready subset of the INEL Kamas Corpus (Version 2.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the highly detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice. The dataset comprises 13 hours and 51 minutes of aligned supervised speech data (13,197 individual clips) across 6 speakers, representing the entirety of available recorded spoken Kamas data. It features demographic metadata where available, and prioritizes Cyrillic transcriptions while falling back to Latin or Phonological tiers to ensure complete text coverage for acoustic modeling.
INEL Selkup Speech Corpus | Mozilla Data Collective
This dataset is a machine-learning-ready subset of the INEL Selkup Corpus (Version 2.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the highly detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice. The dataset comprises 1 hour and 39 minutes of aligned supervised speech data (1,286 individual clips) across 15 speakers, largely originating from the 1960s and 1970s archive of linguist Angelina Kuzmina. It features demographic metadata where available, and prioritizes Cyrillic transcriptions while falling back to Latin or Phonological tiers to ensure complete text coverage for acoustic modeling.
INEL Nenets Speech Corpus | Mozilla Data Collective
This dataset is a machine-learning-ready subset of the INEL Nenets Corpus (Version 1.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice. The dataset comprises over 36 minutes of aligned supervised speech data (447 individual clips). It strictly populates the primary ‘sentence’ column using the ‘st’ tier (Cyrillic source transcription) to ensure orthographic accuracy, with demographic metadata included where available.
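Several of the INEL speech corpora listed above (Nganasan, Evenki, Dolgan, Kamas, Selkup) describe prioritizing Cyrillic transcriptions and falling back to Latin or Phonological tiers. The idea can be sketched as a simple priority lookup. This is only an illustration and not the curators' actual processing code; the tier keys and the `pick_sentence` helper are hypothetical names.

```python
def pick_sentence(tiers, priority=("cyrillic", "latin", "phonological")):
    """Return the first non-empty transcription in priority order, or None."""
    for name in priority:
        text = (tiers.get(name) or "").strip()
        if text:
            return text
    return None


# Placeholder tier contents; the Cyrillic tier is empty, so the Latin tier is used.
example = pick_sentence({"cyrillic": "", "latin": "latin text", "phonological": "phon text"})
```

A fallback like this keeps every clip paired with some text, at the cost of mixing scripts within the ‘sentence’ column.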

LocaleNLP

English Hausa Parallel Corpus | Mozilla Data Collective
This English–Hausa Parallel Corpus is a curated bilingual dataset of 5,000 aligned sentence pairs, translated from English into Hausa and organized into a clean sentence-level format to ensure reliable alignment. The dataset is designed to support machine translation training and evaluation, bilingual lexicon development, and broader linguistic and natural language processing (NLP) research for Hausa, including data-driven language technology development.

The Mozilla Data Collective is now on the AI @ Mozilla Discord Server. Join us for announcements, community events, and more!

Mozilla Common Voice

The Mozilla Common Voice team has released the Spontaneous Speech 3.0 datasets and Scripted Speech 25.0 datasets on Mozilla Data Collective. You can find all of the Common Voice datasets available on the Common Voice organization page.

Common Voice | Mozilla Data Collective
Common Voice is a free, open source platform for community-led data creation. Anyone can preserve, revitalise and elevate their language by sharing, creating and curating text and speech datasets.

UP EEEI - Digital Signal Processing Laboratory

UP - DSP - Philippine Languages Database (UP-DSP-PLD) | Mozilla Data Collective
This dataset contains multilingual text and speech pairs for ten Philippine languages, namely Filipino, English, Cebuano, Kapampangan, Hiligaynon, Ilokano, Bikolano, Waray, and Tausug. The dataset contains over 454 hours of recordings, covering multiple domains, including news, medical, education, tourism, and spontaneous speech. The applicability of the corpus has also been demonstrated in adult and children's ASR, phoneme transcription, voice conversion, and TTS applications.

MDC Curators

Corpus de llenguatge ofensiu en català | Mozilla Data Collective
This dataset consists of sentences tagged as offensive-language in the version 25.0 release of Mozilla Common Voice in Catalan. The sentences are provided with the aim that they promote the development of offensive language detection in Catalan.

Community

Araina Text Corpus (Occitan Aranese) | Mozilla Data Collective
This text corpus includes sentences from three sources: public domain literary texts translated by Antòni Nogués (sourced from institutestudisaranesi.cat), language educational material by Jordi Suïls Subirà, and administrative proceedings from Conselh Generau d’Aran.
Heroes English-Spanish Dubbed Movie Speech Corpus | Mozilla Data Collective
The Heroes corpus contains mapped bilingual (English and Spanish) speech segments from the TV series Heroes. It contains 7,000 single-speaker speech segments extracted from the original and Spanish dubbed versions of 21 episodes. Audio segments are accompanied by subtitle transcriptions and word-level prosodic/paralinguistic information. Each episode directory contains word-level and segment-level information for the whole episode, as well as parallel samples extracted under the segments_eng and segments_spa subdirectories. Each sample is stored as a wave audio file, a text file, and a CSV file containing word timing information and word-level paralinguistic and prosodic features (speaker id, mean f0, mean intensity).
Oro_Word | Mozilla Data Collective
This dataset contains word-level recordings in Afaan Oromoo collected from native speakers to support the development of open-source speech technologies. The dataset is designed for training and evaluating automatic speech recognition (ASR) and text-to-speech (TTS) systems. Each audio file is paired with its corresponding written word and metadata. Afaan Oromoo is a widely spoken Cushitic language in Ethiopia and neighboring regions, but it remains underrepresented in digital language resources. This contribution aims to expand accessible linguistic data, support research and education, and strengthen the presence of Afaan Oromoo in modern AI technologies.
Urdu Multi-Speaker TTS Dataset | Mozilla Data Collective
This dataset is an Urdu text-to-speech corpus designed for speech technology development and related computational research. It contains approximately 10 hours of speech from 3 speakers, including 2 male and 1 female speaker. The data is distributed across 36 zip files, and each zip file includes a folder of audio files along with a CSV file that maps each audio file to its corresponding transcript. The recordings are drawn from the domains of newspaper, literature, and articles, providing a mix of formal, narrative, and informational language suitable for Urdu TTS, corpus creation, and speaker-based speech modeling.
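For the Urdu Multi-Speaker TTS Dataset described above, where each zip file holds a folder of audio plus a CSV mapping audio files to transcripts, a loader might look roughly like the sketch below. The column names `file_name` and `transcript` and the `load_transcripts` helper are assumptions for illustration; check the actual CSV header before using it.

```python
import csv
import io
import zipfile


def load_transcripts(zip_source, audio_column="file_name", text_column="transcript"):
    """Map audio file names to transcripts using the first CSV found in the archive.

    `zip_source` may be a path or a binary file-like object.
    The column names are assumed, not confirmed by the dataset description.
    """
    with zipfile.ZipFile(zip_source) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            return {row[audio_column]: row[text_column] for row in reader}
```

Calling `load_transcripts` once per zip file and merging the resulting dictionaries would give a single lookup table across all 36 archives, assuming file names are unique across them.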