Building an African Voice for AI: Inside the Institute of African Digital Humanities

When you ask a voice assistant a question in English, French, or Mandarin, the underlying models have been trained on billions of words and millions of hours of speech. Ask the same question in Bafia, Mada, or Suundi, and the technology simply doesn't know how to listen. The Institute of African Digital Humanities (IADH) is helping to change that, one carefully curated dataset at a time.

Who they are

Founded in 2021 as the Institut des Humanités Numériques d'Afrique francophone and now known as the Institute of African Digital Humanities, IADH is a research and practice network applying digital and AI methodologies to humanities research with a focus on Africa. It serves as a collaboration platform for Digital Humanities practitioners across the continent, connecting linguists, technologists, and cultural researchers.

Since late 2024, the Institute has sharpened its focus on a specific and pressing challenge: designing and publishing NLP and machine-learning datasets for African languages, with priority given to those that are low-resourced and underrepresented in today's AI systems.

What they've built

IADH's contribution to language equity in AI is substantial. They have launched 46 under-served African languages on Mozilla Common Voice, including Adamawa Fulfulde, Bafut, Bakoko, Baoulé, Cameroon Pidgin English, Dagbani, Ewondo, Fang, Ghomala, Ibibio, Mada, Medumba, Mungaka, Tupuri, and many others. The majority had no prior digital speech presence at all. In addition, they have contributed dozens of African NLP datasets to Mozilla Data Collective.

The datasets on Mozilla Data Collective

IADH has published 38 original datasets on Mozilla Data Collective, all released under the NOODL-1.0 license and covering more than 20 African languages from Cameroon, Congo, Nigeria, and elsewhere in West Africa. They fall into five broad categories:

Speech datasets for text-to-speech (TTS) and automatic speech recognition (ASR), including Bati ASR, Beembe TTS, Bomitaba TTS, Bulu TTS, Bamun TTS, Ewondo TTS, Hausa TTS, Kituba TTS, Laari ASR, Lingala TTS, Mbosi TTS, Naija TTS, Suundi TTS, Teke-Laali TTS, Yaka TTS, and Yoruba TTS. Sizes range from tens of megabytes to several gigabytes of audio paired with transcripts.

ALCAM multimodal datasets combining International Phonetic Alphabet transcriptions, audio recordings, and French glosses, covering Akoose, Basaa, Bulu, Mvele, Yezoum, and three regional varieties of Ewondo (Yanda, Fong, and Mbida-Mbani).

These are particularly valuable for phonological research and for building models that handle dialectal variation. They also stand to have a significant local impact, because indigenous language education is hindered by a lack of multimodal, multilingual, and multidialectal resources that address practical learning needs. For example, users can listen to word pronunciations and look up word or sentence meanings with the help of French or English translations.

Parallel corpora for machine translation, pairing African languages with French: Adamawa Fulfulde–French, Bamun–French (versions 1.1 and 2.0), Ewondo–French, and Mada–French.

Text corpora of narratives and speech transcripts, including FUB Narratives (Adamawa Fulfulde), Mada Narratives, and Spoken Congolese French, the last of which is over 3 GB and captures a regional variety of French rarely represented in NLP resources.

BOUQuET translation benchmark. Working with the Mozilla Foundation, IADH produced gold-standard human translations of 1,364 sentences across 324 paragraphs into five Cameroonian languages (Basaa, Eton, Duala, Bafia, and Tupuri) for an international AI translation benchmark. Critically, every translation was produced by human language specialists, with no machine translation or generative AI in the loop. These datasets will be available soon on Mozilla Data Collective.

Why it matters

There's a quiet asymmetry baked into modern AI: the languages spoken by hundreds of millions of Africans are often invisible to the models that increasingly mediate access to information, services, and culture. IADH's work is methodical, human-led, and openly licensed, and it is one of the most concrete answers to that problem. Every TTS dataset, every parallel corpus, every transcribed narrative is a real step toward AI systems that can hear, read, and respond in languages they currently cannot.

For researchers and developers working on multilingual NLP, the IADH catalogue on Mozilla Data Collective is worth bookmarking. IADH is keen to partner with developers working on tools and projects using these datasets, particularly those with immediate applications to language learning. For everyone else, it's a reminder that the future of AI doesn't have to be monolingual; it can instead aim to include every language possible.

Explore all Mozilla Data Collective datasets →

Get in touch →

Join Mozilla Data Collective →