The Mozilla Data Collective Data Assistant is now available in Alpha
We're excited to share that the Mozilla Data Collective Data Assistant is now available in Alpha. Visit https://mozilladatacollective.com/chat to get started.
Finding the right dataset for your machine learning project should not take hours. Whether you are building a speech recognition system, training a text-to-speech model, or assembling a machine translation corpus, Mozilla Data Collective holds a substantial and growing library of high-quality, ethically sourced datasets. Until now, navigating that library has required patience and persistence. Today, we are pleased to introduce the Data Assistant: a conversational tool designed to find the right data for you.
Why build the Data Assistant?
Mozilla Data Collective exists to make diverse, community-contributed, diverse datasets accessible to researchers, developers, and organisations working on AI and machine learning-based technologies. But finding the right dataset for your project - dataset discovery - can be labourious. You may know the intended task you're searching for data to train - such as speech recognition or natural language processing, but not which datasets support it. You may know your target language, but not whether suitable recordings exist. You may refine your search several times before arriving at a meaningful result — and even then, you may be uncertain whether you have considered all the options.

The Data Assistant aims to reduce that friction. By understanding your requirements in plain language and matching them against Mozilla Data Collective's dataset catalogue, it reduces the gap between "I need data for this project" and "here are the datasets you should look at."
How do I use the Data Assistant?
You do not need to know the exact name of a dataset, or the precise terminology used in its metadata. You simply describe what you are looking for, in natural language, and the Data Assistant will interpret your requirements and surface the most relevant datasets.

The most effective searches include three elements: your task, your desired language, and any additional requirements around region or recording format. Here are some examples to get you started:
- "I need ASR training data for French, with multiple speakers and a range of recording environments."
- "Can you help me find high-quality single-speaker audio for a Portuguese TTS system?"
- "I'm looking for parallel text corpora for English to Swahili machine translation, ideally covering informal or conversational domains."
- "What datasets are available for training a speech recognition model in a low-resource African language?"
You can also search by region if your project requires data from a specific geography, or by modality if you have constraints around audio quality, transcription format, or dataset size. The Data Assistant supports queries in multiple languages — if you prefer to search in French, Spanish, or another language, you are welcome to do so.
If your initial query is vague or missing key details, the Data Assistant will ask you one or two focused clarifying questions rather than returning a broad or unhelpful set of results. This ensures that the datasets you are recommended are genuinely relevant to your work.
What guardrails does the Data Assistant use to keep me safe?
The Data Assistant is built specifically for dataset discovery on the Mozilla Data Collective platform. It does not offer general-purpose assistance, and it does not draw on information outside the datasets indexed in our catalogue. This narrow focus is intentional: it ensures that every recommendation you receive is grounded in real, available data — and nothing more. No hallucinations here!
Several specific safeguards are in place:
If your search falls outside the scope of dataset discovery, such as a request for general technical advice or content unrelated to Mozilla Data Collective, you will receive a clear explanation of what the Data Assistant is designed to help with, and be guided back towards a relevant query.
If you search for sensitive content — for example, requests involving personal or biometric data, data about children, or adult content — you will not receive a generated response. Instead, you will be redirected to our dataset search page and invited to contact our team directly. Some requests require a human conversation, and the Data Assistant will always route you appropriately.
If no datasets are found matching your search requirements, you will be told so honestly, and guided on how to broaden or refine your criteria. The Data Assistant will not invent dataset names or fabricate results — if the data does not exist in our catalogue, you will be told that clearly.
If your query lacks sufficient detail to return meaningful results, the Data Assistant will ask for clarification before proceeding, so that you are not presented with a list of loosely relevant options.
How do we use the data you enter into the Mozilla Data Assistant?
As the premier platform for ethical data and fair value exchange, we're committed to being open and transparent about how we use your data - and the Mozilla Data Assistant is no exception.
Mozilla Data Assistant is an AI-powered chatbot, and it uses a large language model (LLM) made available through together.ai to help you discover datasets by searching and summarizing dataset listings and datasheet metadata. We change the underlying LLM from time to time depending on what our internal evaluation shows yields the best search results.
When you use the Assistant, your messages - the prompts you type into the chatbot - and limited chat context needed to respond - are processed and transmitted to together.ai to generate an answer, and may be logged by Mozilla Data Collective for security, troubleshooting, and improving the feature, as described in our Privacy Policy.
We don't sell your data to third parties, such as advertisers. However, as always with chatbots, please do not enter personal data, confidential information, or other sensitive content into the Mozilla Data Assistant.
The Data Assistant is available now. Log in to your Mozilla Data Collective account to get started, accept the usage consent, and ask your first question. We hope you enjoy it as much as we do, and we warmly welcome your feedback.