Dev Talks

Learn about Mozilla Data Collective from talks that our team has given and join the movement to reclaim your data.

Beyond Extraction: Building Community-Centered Speech Data

With the advent of deep learning, speech recognition models like OpenAI’s Whisper are now trained on hundreds of thousands of hours of speech data, likely gathered without the consent of the speakers who contributed it. As responsible AI practices grow in prevalence and we continue to advance machine learning-enabled speech technologies, we must define and commit to a set of shared best practices for responsible speech data collection and stewardship.

Watch on opensource.org


Your datasets, under your control: Introducing Mozilla Data Collective

AI has a data crisis. We're running out of quality training data because the entire web has already been harvested by crawlers to train AI models, leading to the "Token Crisis". What’s left? Synthetic data generated en masse, which is bland, generic and unrepresentative of the world’s diversity. This data is also problematic for training models, as it can lead to model collapse. Meanwhile, quality datasets from diverse contributors sit unused in silos. Our vision is to encourage the creation of safe, responsible AI that works for everyone by helping communities to share authentic, ethical and diverse data. That is a stark contrast to models built by indiscriminately scraping the web and reproducing or synthesising its Anglocentric, white, male biases.

Watch on YouTube