Talks

Learn about Mozilla Data Collective from talks that our team has given and join the movement to reclaim your data.

Fine-Tuning a Whisper Model with MDC Datasets

Most speech recognition models were built with English, or a handful of well-resourced languages, in mind. If you speak Khmer, Galician, or any of the hundreds of languages underrepresented in mainstream AI, you've probably hit a wall trying to get accurate transcriptions.

In this tutorial, we'll walk through how to fine-tune OpenAI's Whisper model on your own language using the Mozilla Data Collective platform's datasets or your own custom audio data. Everything runs locally, even on a laptop, keeping your data private.

Beyond Extraction: Building Community-Centered Speech Data

With the advent of deep learning, speech recognition models like OpenAI’s Whisper are now trained on hundreds of thousands of hours of speech data, likely gathered without the consent of the speakers who contributed it. As responsible AI practices grow in prevalence and we continue to advance machine learning-enabled speech technologies, we must define and commit to a set of shared best practices for responsible speech data collection and stewardship.

Watch on opensource.org


Your datasets, under your control: Introducing Mozilla Data Collective

AI has a data crisis. We're running out of quality training data because the entire web has already been harvested by crawlers to train AI models, leading to the "Token Crisis". What’s left? Synthetic data generated en masse that is bland, generic and unrepresentative of the world’s diversity. This data is also problematic for training models, as it can lead to model collapse. Meanwhile, quality datasets from diverse contributors sit unused in silos. Our vision is to encourage the creation of safe, responsible AI that works for everyone by helping communities share authentic, ethical and diverse data. That is a stark contrast to models built by indiscriminately scraping the web and reproducing or synthesising its Anglocentric, white, male biases.

Watch on YouTube