Open Home Foundation TTS datasets on Mozilla Data Collective
Most voice assistants listen and respond in a handful of languages. Try to build one for your home that speaks your language, though, and you quickly run into a wall: the training data does not exist, or it is locked behind licences that make it unusable for open source projects.
The Open Home Foundation did something about that.
Their argument was simple: your home is where you are most yourself, which makes it the worst possible place for a corporation to have a data pipeline. The Open Home Foundation was created in 2024 by the team behind Home Assistant, after years of watching the smart home market tilt decisively toward surveillance capitalism. The foundation now stewards over 250 open source projects, standards, and libraries, including Piper, a lightweight TTS engine designed to run locally.
Why this matters
Local voice assistants are only as good as the data used to train them. And voice data raises every question that matters in the current AI economy: who records it, who owns it, who can use it, and under what terms.
The Open Home Foundation answered those questions by releasing 23 TTS datasets through Mozilla Data Collective. Each one is a set of scripted recordings from a single speaker, and together they span multiple languages. All are published under CC0, with no attribution required and no restrictions on commercial use.
This was a specific choice that required a specific infrastructure. You can’t just upload voice recordings to a shared drive and call it ethical data stewardship. You need a platform that can enforce your specific version of open: one that sets access conditions, authenticates downloaders, and gives the people publishing datasets real control over what happens to them.
By publishing on Mozilla Data Collective, dataset uploaders enter a values-aligned community that cares just as much as they do about the data shared.
Open Home datasets include:
- Lili 1.0 – Slovak, ~2 hours, female speaker. A West Slavic language with 5 million native speakers and, until recently, almost no open TTS resources.
- Mihai 1.0 – Romanian, ~2 hours, male speaker. Romania has 19 million people and a growing tech sector. Now there's open TTS data to match.
- Anna 1.0 – Hungarian, ~1.6 hours, female speaker. Hungarian is a Ugric language with no close European relatives, which makes dedicated TTS data especially valuable for model training.
- Dimitar 1.0 – Bulgarian, ~1.4 hours, male speaker. Bulgarian uses the Cyrillic alphabet and sits in a particularly underserved corner of EU language tech.
TTS Datasets in Mozilla Data Collective
The Open Home Foundation's 23 datasets sit alongside a growing collection of TTS and speech data from communities that have historically had no good options for sharing their data on their own terms. For example, Mozilla Data Collective already hosts TTS corpora for under-represented languages from a wide range of regions, including Otomi in the Americas; Punjabi in South Asia; Javanese and Betawi in Southeast Asia; Ewondo, isiXhosa, and Naija in Africa; and Italian, Czech, and Croatian in Europe, alongside many other multilingual TTS datasets.
This matters for TTS specifically because voice synthesis in anything other than English, Mandarin, or a handful of European languages remains badly under-served. The datasets that do exist are scattered across archives, institutional servers, and researchers' hard drives. Mozilla Data Collective is one of the few platforms with the governance model to unlock them, giving the people who created those recordings real control over what happens next.
Data stewardship is a choice, not a default
The Open Home Foundation's datasets are a great contribution toward solving a large problem. Their work has been inspirational to the many communities working with Mozilla Data Collective. The logic they demonstrate is clear: organisations that care about openness have to be intentional about how they share, not just whether they share.
Mozilla Data Collective is a platform that makes intentional stewardship practical. And institutions like the Open Home Foundation have a real impact on how tech is built, because they build and share datasets that reflect the true diversity of users across the world.
If you are working on TTS for a language that is not yet covered, and you have recordings you would like to release, we’d love to hear from you! Email us at support@mozilladatacollective.com.
Explore all Mozilla Data Collective datasets →
Join Mozilla Data Collective →