Latest: African Language datasets by communities on Mozilla Data Collective

Photo by Annie Spratt / Unsplash

The internet belongs to everyone—but right now, it doesn't work for everyone. Millions of people speak languages that AI simply can't understand, and that's a problem we can solve together.

At Mozilla Data Collective, we're proud to host a growing collection of open datasets that are helping researchers, developers, and communities build language technology for African languages. These datasets represent countless hours of work by linguists, native speakers, and organizations committed to ensuring no language gets left behind.


Speech Recognition

Luhya ASR Data (70 hours) — Digital Divide Data collected this substantial corpus of Luhya speech in Kenya. Native speakers recorded sentences to support automatic speech recognition research for this low-resource language. 13.90 GB, CC-BY-4.0

DhoNam: Dholuo Speech Dataset — The Maseno Centre for Applied Artificial Intelligence built this 51-hour corpus to supercharge ASR for Dholuo, one of Kenya's major indigenous languages spoken by over 4 million people. Created in partnership with the Dholuo community, who helped determine the licensing framework. 2.49 GB, NOODL-1.0

DataTrust Africa: Northern Uganda Radio Corpus — Amara Hub curated over 350 clips from public radio stations including Mega 100 FM, Q FM, Radio Pacis, and Radio Rupiny. Recordings in Acholi, Lango, Lugbara, and Akaramajong are on the way. 179.82 MB, NOODL-1.0

Spoken Congolese French Dataset — Semi-guided interviews from Brazzaville, paired with orthographic transcriptions. A valuable resource for understanding regional French variation in the Republic of the Congo. 3.44 GB, NOODL-1.0


Machine Translation

Adamawa Fulfulde–French Parallel Corpus — The Institute of African Digital Humanities compiled 1,977 lines of Fulfulde narratives with French translations, supporting translation research for this Niger-Congo language spoken across Central Africa. 112.50 KB, NOODL-1.0

Bamun-French Parallel Corpus — Transcribed audio paired with French translations for Shupamem, the language of the Bamun people, spoken in Cameroon. 99.24 KB, NOODL-1.0


Language Documentation

Ewondo Fong Multimodal Dataset — A multimodal resource for the Fong variety of Ewondo, pairing example sentences with audio recordings and French translations. Built to support speech and language technology for under-resourced African languages. 16.80 MB, NOODL-1.0

Ewondo Mbida-Mbani Dataset — Lexical entries from the Mbida-Mbani speech area, each accompanied by illustrative sentences, word-by-word glosses, French translations, and aligned audio. 19.25 MB, NOODL-1.0

Mada Narratives — Seventeen transcribed oral narratives in Mada, an Afro-Asiatic language of Cameroon. These texts capture natural spoken discourse and traditional storytelling. 65.04 KB, NOODL-1.0


Why we're so excited to work with our communities

Every dataset here represents a step toward a more inclusive internet! When speech recognition works for Dholuo speakers, when translation tools support Fulfulde, when AI can process Ewondo—technology becomes a bridge instead of a barrier.

This work isn't done. We need more contributors, more languages, more voices. If you're working on African language data, we'd love to hear from you.

Explore all datasets →

Get in touch →