Kostis Saitas - Zarkias - Mozilla Data Collective

Guide

Fine-Tune a Speech-to-Text Model for Any Language - Including Yours

A step-by-step developer tutorial from Kostis at Mozilla Data Collective


Guide

Behind the scenes: Integrating MDC datasets into your Python project

Overcoming the complexity of AI

Mozilla Data Collective helps communities to offer unique, multilingual, multicultural, and multimodal datasets. From transcribed and translated videos of narrated Ekpeye folktales to complex question-answering text pairs for the Georgian language, the diversity of datasets on our platform is core to our mission. But

Common Voice

Improving the Spontaneous Speech English dataset: lifting the lid on speech data quality uplift techniques

Firstly, we’d like to thank you for your patience. After introducing Spontaneous Speech early in 2025, we released most locale datasets when the Mozilla Data Collective platform launched in alpha in September of this year. However, upon inspection, the English Spontaneous Speech dataset required some remedial work prior to

Docs

Uploading your dataset to the Mozilla Data Collective Platform

Interested in joining the movement and publishing your dataset on Mozilla Data Collective? This guide will walk you through the steps required, from account creation to submission!