MDC Release Notes - 13.03.26

This week: 19 new datasets and a few small changes while we're heads down on some exciting new features that will be coming soon...

Hello, Mozilla Data Collective! 👋

Over the past two weeks, we've been focusing on infrastructure and on developing some exciting new features that will land in the next couple of weeks 👀. As always, though, we have a few updates to share, plus the latest roundup of new datasets on MDC.

New Features and Changes


Recommended datasets are now visible on individual data listing pages. These related datasets help you discover and explore others similar to the one you're viewing. You can find them on the left side of the page.

It is now possible to report datasets. We review each dataset on the platform before it goes live, but as we grow, we want to give the community trust & safety levers to flag anything that doesn't look right. The report link emails our team, and we'll take a look.


New Datasets

Aim Foundation

Dari Literature Corpus by Anjuman e Adabi Nayestan | Mozilla Data Collective
The Dari Literature Corpus (Anjuman e Adabi Nayestan) is a curated collection of written Dari (Afghan Persian) literary texts totaling about 1 million tokens. It includes prose, poetry, folklore-inspired narratives, and other culturally significant writings from both contemporary and classical traditions. The texts were collected in Microsoft Word and converted into UTF-8 normalized plain text for computational and linguistic research, including corpus linguistics, digital humanities, and NLP.

Collaborative Action for Research &

IBT Torwali Wordlist | Mozilla Data Collective
The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part1 | Mozilla Data Collective
This dataset, curated by Digital Divide Data (DDD), provides high-quality audio recordings and corresponding text transcriptions for the Somali (som) language. The collection includes thousands of unique utterances per language to support diverse acoustic modeling. All transcriptions have undergone a manual verification process to ensure high linguistic accuracy. Recordings feature a balanced mix of genders and various age groups to minimize bias in downstream AI models. This data is specifically designed for training Automatic Speech Recognition (ASR) systems, Text-to-Speech (TTS) synthesis, and general linguistic research for underrepresented African languages.
ddd-kenya-somali-68hrs-asr-part2 | Mozilla Data Collective
This dataset, curated by Digital Divide Data (DDD), provides high-quality audio recordings and corresponding text transcriptions for the Somali (som) language. The collection includes thousands of unique utterances per language to support diverse acoustic modeling. All transcriptions have undergone a manual verification process to ensure high linguistic accuracy. Recordings feature a balanced mix of genders and various age groups to minimize bias in downstream AI models. This data is specifically designed for training Automatic Speech Recognition (ASR) systems, Text-to-Speech (TTS) synthesis, and general linguistic research for underrepresented African languages.
ddd-kenya-somali-68hrs-asr-part3 | Mozilla Data Collective
This dataset, curated by Digital Divide Data (DDD), provides high-quality audio recordings and corresponding text transcriptions for the Somali (som) language. The collection includes thousands of unique utterances per language to support diverse acoustic modeling. All transcriptions have undergone a manual verification process to ensure high linguistic accuracy. Recordings feature a balanced mix of genders and various age groups to minimize bias in downstream AI models. This data is specifically designed for training Automatic Speech Recognition (ASR) systems, Text-to-Speech (TTS) synthesis, and general linguistic research for underrepresented African languages.

Kaleem Art Press

Jhoke Publisher Multan’s Saraiki Newspaper Corpus | Mozilla Data Collective
Jhoke Publishers Multan’s Saraiki Newspaper Corpus is a curated text dataset with about 1.25M tokens (1,258K) of Saraiki content collected from Daily Jhoke Saraiki (Multan, Pakistan) and Jhoke Publishers (Multan, Pakistan). Daily Jhoke Multan (ݙین٘ھ وار جھوک ملتان) is a Saraiki newspaper and publishing house based in Multan. It covers regional news and also publishes Saraiki literature, including major literary and religious works (e.g., a Saraiki Quran translation by Professor Dilshad Kalanchvi). The corpus includes three UTF-8 text files (each treated as a separate genre/domain) and a cleaned version with Unicode normalization, standardized whitespace and punctuation, and removal of stray symbols or markup. The dataset reflects contemporary Saraiki usage across journalistic, literary, cultural, and social domains and supports computational and linguistic research.
Saraiki-English Parallel Corpus | Mozilla Data Collective
This English–Saraiki Parallel Corpus is a curated bilingual dataset of 51,447 aligned sentence pairs (about 0.89 million words in total), translated from English into Saraiki by Kaleem Art Press and cleaned into a consistent sentence-level format for reliable alignment; it is designed to support machine translation training and evaluation, bilingual lexicon and terminology work, and broader linguistic and NLP research for Saraiki, including data-driven language technology development.

Keblagh e Azergi

Elkhani Hazargi Literature Corpus | Mozilla Data Collective
The Hazargi Literature Corpus (Keblagh e Azergi) is a monolingual literary dataset for documenting and supporting computational research on Hazargi (Hazaragi), an eastern Persian (Dari) dialect spoken by Hazara communities in Afghanistan and the diaspora. It contains 12 digitized works (prose, poetry, folklore, drama) converted from Word into UTF-8 normalized plain text while preserving original orthography and dialectal features. Total size: ~0.5M tokens (513,483).

Institute of African Digital Humanities

Mada-French Parallel Corpus 1.0 | Mozilla Data Collective
This dataset comprises a parallel corpus of Mada–French literary text translations totalling 2,154 lines. It is designed to support the benchmarking, training and evaluation of machine translation models for Mada, a language spoken in Cameroon. The corpus provides aligned, sentence and paragraph-level translations that capture the stylistic, lexical and syntactic features of literary Mada discourse and how these are rendered in the local variety of French.

Taruen

Finnish Public Domain 20th Century Literature Text Corpus | Mozilla Data Collective
This corpus contains a curated collection of public domain literature from Finland, featuring works by authors who died between 1901 and 1955. The dataset captures the literary landscape of early 20th-century Finland and includes independent texts in both of the country’s official languages: Finnish (fi) and Swedish (sv). The texts were programmatically extracted from Project Lönnrot, a volunteer-driven digital library. To ensure linguistic relevance for modern NLP tasks, the extraction pipeline strictly filtered for works published in 1901 or later. Language codes for each text were dynamically detected using CLD algorithms. The corpus comprises approximately 69.1 million words across multiple plain text files, with each file prefaced by structured YAML front matter containing relevant metadata (title, author, year, source URL, language), followed by the original project’s boilerplate preamble enclosed in delimiter tags, and finally the literary text proper. All included works are fully in the public domain under Finnish and EU copyright law.
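For readers who want to load these files, here is a minimal Python sketch of separating the YAML front matter from the rest of a file. It assumes the front matter sits between two `---` lines and holds simple one-line `key: value` pairs; the key names in the sample (`title`, `author`, `year`, `source_url`, `language`) are illustrative guesses based on the description above, not the dataset's confirmed schema. The boilerplate preamble is left inside the returned body because its delimiter tags are not specified here.

```python
def split_front_matter(text):
    """Split a document into (metadata dict, remaining text).

    Assumes front matter is delimited by '---' lines and contains only
    flat 'key: value' pairs (an assumption, not the confirmed layout).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter found
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            return meta, "\n".join(lines[i + 1:])
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # closing delimiter missing; treat as plain text


# Hypothetical sample file content, for illustration only.
sample = """---
title: Example Title
author: Example Author
year: 1905
source_url: https://example.org/book/1
language: fi
---
Body text here.
"""

meta, body = split_front_matter(sample)
```

All values come back as strings, so a field such as `year` would need an explicit `int(...)` conversion. For nested or multi-line YAML, a full YAML parser would be the safer choice.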

Mozilla Data Collective is now on reddit! Join us to share your projects, talk data, and contribute your experience and expertise to a growing community of ethical data practitioners.

Visit r/MozillaDataCollective

MDC Community Concierge

Bangor Miami Spanish-English Corpus | Mozilla Data Collective
The Bangor Miami Corpus of Spanish-English bilingual speech contains around 240,000 words over 35 hours of recorded audio conversations. The dataset includes the audios, transcriptions and glosses in CHAT format, and word-level analyses of the transcriptions in .tsv files.
Bangor Patagonia Welsh-Spanish Corpus | Mozilla Data Collective
The Patagonia Welsh-Spanish corpus contains around 195,000 words: 78% Welsh, 17% Spanish, 5% indeterminate (i.e. the relevant word appears in the dictionaries of both main languages). The dataset includes the audios, transcriptions and glosses in CHAT format, and word-level analyses of the transcriptions in .tsv files.
Bangor Siarad Welsh-English Corpus | Mozilla Data Collective
The Siarad Welsh-English corpus contains around 450,000 words: 84% Welsh, 4% English, 13% indeterminate (i.e. the relevant word appears in the dictionaries of both main languages). The dataset includes the audios, transcriptions and glosses in CHAT format, and word-level analyses of the transcriptions in .tsv files.

Community Datasets

Javanese TTS of Banyumasan Dialect | Mozilla Data Collective
This dataset comprises speech data produced by a speaker of the Banyumasan dialect of Javanese (locally known as Ngapak), Central Java Province, Indonesia. All datasets use the informal register (Ngoko) and include various topics.
Kokoro Speech Dataset | Mozilla Data Collective
Kokoro Speech Dataset is a public domain Japanese speech dataset. It contains 43,253 short audio clips of a single speaker reading 14 novels. The format of the metadata is similar to that of LJ Speech so that the dataset is compatible with modern speech synthesis systems. The texts are from Aozora Bunko, which is in the public domain. The audio clips are from the LibriVox project, which is also in the public domain. Readings are estimated by MeCab and UniDic Lite from kanji-kana mixture text. Readings are romanized in a format similar to the one used by Julius. The audio clips were split and transcripts were aligned automatically by Kokoro-Align.
Malayalam Time-Aligned Speech Corpus | Mozilla Data Collective
This dataset is a speaker-organized Malayalam speech corpus consisting of 100 audio recordings and 100 corresponding transcription files in .srt format. The transcriptions are time-aligned and include timestamps matched to the audio. The dataset contains recordings from 5 speakers, including 3 male and 2 female speakers, and the average length of each audio file is approximately 3 minutes. The data is arranged speaker-wise, making it easy to identify and work with each speaker’s recordings and transcriptions separately. This dataset is suitable for automatic speech recognition, forced alignment, speech-text synchronization, subtitle alignment, speech segmentation, and Malayalam speech technology development.
Sundanese TTS | Mozilla Data Collective
The Sundanese TTS dataset represents the Sundanese language using the Priangan Sundanese dialect as the standard Sundanese in West Java province, Indonesia, reflecting both traditional forms and modern variations in everyday communication practices. This dataset can be utilized for linguistic research, cultural documentation, sociolinguistic studies, and the development of regional language technologies involving code-mixing with Indonesian.
TODa: Tamazight Open Dataset | Mozilla Data Collective
Welcome to the Tamazight Open Dataset (TODa), a groundbreaking open-source project dedicated to preserving and advancing the Tamazight language. With its extensive collection of linguistic data, TODa stands as a pioneering collaborative project for Tamazight <=> English translation, specifically designed for Natural Language Processing applications. TODa’s unique approach combines both semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset encompasses a comprehensive collection of linguistic elements, including detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language’s nuances. What sets TODa apart is its inclusive approach to Tamazight’s writing systems. The dataset thoughtfully incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity. Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this meticulously curated dataset, we strive to empower developers and researchers to create innovative NLP solutions that authentically serve the Amazigh-speaking community. We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage participation from the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.
TTS Balinese Language | Mozilla Data Collective
The Balinese TTS dataset is created and narrated by native Balinese speakers with code-mixing in Indonesian. This dataset is designed to showcase the use of the Balinese language in everyday contexts, covering topics such as family, social interactions, and routine community activities. Each recording reflects natural language use by Balinese speakers, thus representing authentic communication in daily life. This dataset can be utilized for linguistic research, the development of automatic speech recognition systems, and other applications focused on the preservation and advancement of the Balinese language.
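
A note for anyone working with the Malayalam Time-Aligned Speech Corpus listed above: its transcriptions come as .srt files, and a small Python sketch like the one below could turn them into (start, end, text) tuples for alignment or segmentation work. It assumes the standard SubRip layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then one or more text lines, with blank lines between cues). The sample text is invented for illustration and is not taken from the corpus.

```python
import re

# SubRip timestamps look like 00:01:02,345 (a dot is tolerated as separator).
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp string to seconds as a float."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, text) cues."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip blocks that do not look like a cue
        start, end = lines[1].split("-->")
        cues.append((to_seconds(start), to_seconds(end),
                     " ".join(lines[2:]).strip()))
    return cues


# Invented sample, for illustration only.
sample = """1
00:00:01,000 --> 00:00:03,500
first segment

2
00:00:03,500 --> 00:00:06,250
second segment
"""

cues = parse_srt(sample)
```

When reading the real files, opening them with `encoding="utf-8"` is a sensible default for Malayalam text, though the actual encoding should be checked against the dataset's documentation.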