Turning Your Data Into a Valuable ML Resource Without Giving Up Control

You might be sitting on something precious

The modern world runs on data. One unfortunate result is that many of us are unknowingly producing data for third-party companies, which use our content and actions as data points to build AI models that they then sell back to us for a subscription fee.

On the one hand, making more quality data available for ML and AI training can result in better, more representative models. On the other hand, it should be possible to make our data available without losing control of it outright—and it should also be possible to be compensated for this contribution.

At Mozilla Data Collective, we want to make it easier for people and organizations to open up their data to ML and AI model training while giving them control over who accesses their data and how it is used.

Many researchers, organizations, and individuals possess large amounts of data that, with just a little work, could have a big impact on artificial intelligence outcomes. This guide will walk you through some important considerations to help you identify what kind of data you already have, what you can do to maximize its impact for ML research, and how to ensure you benefit from sharing it.

Identifying what you have

Sometimes people or organizations tell us that they want to support a better tech ecosystem, but they don't think they have any datasets. Actually, pretty much everybody does. Whether it's B-roll from your video shoots, archives of documents in different languages, those reports you got translated, audio recordings of oral histories, videos of interviews, telemetry data, or workflow data from all those files you categorized—believe us, you can be part of the everyone-a-builder revolution.

Here's how to start thinking about what you've got.

Define the subject and scope. What's the core domain of your data? Language datasets, opinion polls, multimodal content, home assistant logs? Getting clear on this helps you understand who might find it useful.

Consider uniqueness and value. What problem does this data solve, or what gap does it fill? How is it different from existing public datasets? Data that captures underrepresented languages, specialized domains, or real-world use cases can be especially valuable.

Think through licensing. This is where you can get creative. Your data can be a lever for change. You can invest it in line with your values—give it to an organization you trust, withhold it from someone you don't. Consider ownership and any existing licensing constraints, and think about what terms make sense for your goals. It is also critical to ensure that any data you share fully complies with all relevant terms of service, ownership rights, and ethical guidelines. This includes strictly avoiding the publication of personal information belonging to others without their explicit, informed consent. For more details, see our Terms of Service for data providers.

Explore fair value exchange. Can or should you charge for dataset use? Is there another form of fair value exchange that would make more sense for you—attribution, reciprocal data sharing, or something else?

Investigating quality and scale

To get a sense of the extent to which your data might be useful for machine learning, it's important to assess its size, consistency, representativeness, and metadata.

Scale assessment. There's no one-size-fits-all volume requirement: expectations vary with the task, use case, and model. Here are some rough guidelines for common NLP tasks, followed by a quick way to measure where you stand:

  • Text-to-speech (TTS): more than 5 hours of clear, deliberately read speech recordings
  • Automatic speech recognition (ASR): more than 10 hours of transcribed recordings
  • LLM fine-tuning: more than 500,000 tokens
  • Speech language models: more than 500 hours of untranscribed speech
  • Text classification: more than 1,000 examples
  • Morphosyntactic annotation: more than 10,000 tokens

These are broad guidelines. A dataset for any of these tasks could still be quite useful without meeting these volume recommendations.
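
If you'd like a rough measurement of your own data against these numbers, a few lines of Python can help. This is a minimal sketch assuming plain-text transcripts and standard WAV audio; the directory names are placeholders, and the whitespace-based token count is only an approximation of what a real tokenizer would report.

    import wave
    from pathlib import Path

    # Rough token count: whitespace-separated words across all .txt files.
    # BPE or SentencePiece tokenizers will report somewhat different numbers.
    def count_tokens(text_dir):
        return sum(len(p.read_text(encoding="utf-8").split())
                   for p in Path(text_dir).rglob("*.txt"))

    # Total audio duration in hours across all .wav files.
    def audio_hours(audio_dir):
        total_seconds = 0.0
        for p in Path(audio_dir).rglob("*.wav"):
            with wave.open(str(p), "rb") as w:
                total_seconds += w.getnframes() / w.getframerate()
        return total_seconds / 3600

    print(f"~{count_tokens('transcripts/')} tokens")
    print(f"~{audio_hours('recordings/'):.1f} hours of audio")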

Data integrity. Check for missing values, corrupted files, and duplicate entries. Clean data is usable data.
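
As one concrete way to run these checks on tabular data, here is a minimal sketch using pandas. The file path and the "text" column are hypothetical placeholders for your own dataset.

    import pandas as pd

    df = pd.read_csv("my_dataset.csv")  # placeholder path

    # Missing values per column.
    print(df.isna().sum())

    # Exact duplicate rows.
    print(f"{df.duplicated().sum()} duplicate rows")

    # Drop duplicates and rows missing the main content field.
    clean = df.drop_duplicates().dropna(subset=["text"])  # assumes a 'text' column
    clean.to_csv("my_dataset_clean.csv", index=False)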

Consistency, representativeness, and bias. If your data has human annotations, how were these collected? Do you know the level of inter-annotator agreement? How well does the dataset represent the target phenomena that will be modeled? Given the source of the data and the annotators, what are the likely sources of bias? All human annotations contain biases; understanding their nature is essential for anyone who wants to use the data responsibly.
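
If two annotators labeled the same items, Cohen's kappa is one common way to quantify their agreement. Here is a minimal sketch using scikit-learn with made-up labels; your own annotation exports would replace the two lists.

    from sklearn.metrics import cohen_kappa_score

    # Labels from two annotators on the same ten items (illustrative only).
    annotator_a = ["pos", "neg", "pos", "pos", "neg", "neu", "pos", "neg", "neu", "pos"]
    annotator_b = ["pos", "neg", "pos", "neu", "neg", "neu", "pos", "pos", "neu", "pos"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level

Values above roughly 0.6 are often read as substantial agreement, though what counts as acceptable depends on the difficulty of the task.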

Metadata. An important feature of the datasets published on Mozilla Data Collective is the rich, informative datasheet. Assess what metadata exists for your dataset. Did you create it directly? Good—you should be able to write a great datasheet detailing the process. If your dataset is an extension of existing data (for example, annotating publicly available texts), it's important to have information about the source data as well as the process of extending it. Details like language info, domain, and collection context are valuable in informing potential users about the nature of the dataset. We recommend taking a look at some of our existing datasheets to see what you might want to include in your own.
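
To make this concrete, the sketch below shows the kind of fields a datasheet might capture, expressed as a simple Python record. The field names and values are purely illustrative, not MDC's actual datasheet schema; our upload manual and published datasheets define the real expectations.

    # Illustrative datasheet fields only; not MDC's actual schema.
    datasheet = {
        "name": "Oral History Interviews, 2015-2020",
        "languages": ["sw", "en"],
        "domain": "conversational speech, oral history",
        "collection_context": "recorded at community centers with informed consent",
        "source_data": "original recordings, not derived from an existing dataset",
        "annotation_process": "transcribed by two native speakers; disagreements adjudicated",
        "known_biases": "speakers skew urban and over 40",
        "access_terms": "defined by the data provider via MDC",
    }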

Enhancing usability for ML

Often, data is collected and stored in a way that reflects its original use. A linguist transcribing and annotating language data will store it in a format that works with their preferred annotation program (ELAN, Praat, Transcriber), because that makes the most sense for their workflow. However, the specific machine learning tasks your data could support will likely require—or benefit from—some task-specific considerations.

Segmentation. Think about how to break down large or complex data into manageable, logical segments. This might mean organizing by date, experiment, category, or some other structure that makes the data easier to work with.
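
As a small illustration, here is one way to reorganize a flat folder of recordings into per-year segments based on a date in the filename. The naming pattern is an assumption; adapt the logic to whatever structure suits your data.

    import re
    import shutil
    from pathlib import Path

    src = Path("recordings/")           # flat folder of files (placeholder)
    dst = Path("recordings_by_year/")

    # Assumes filenames like 'interview_2019-03-14.wav'.
    for p in src.glob("*.wav"):
        match = re.search(r"(\d{4})-\d{2}-\d{2}", p.name)
        if match:
            year_dir = dst / match.group(1)
            year_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(p, year_dir / p.name)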

Format and structure. What state is the data in? CSV, JSON, proprietary binary, raw images or video? Is it structured, semi-structured, or unstructured? Is the format consistent throughout, or varied? Getting clear on this helps you understand what preparation work might be needed.
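
A quick inventory of file types is often a good first step. This sketch simply counts file extensions under a directory; the path is a placeholder.

    from collections import Counter
    from pathlib import Path

    # Count file extensions to see how uniform the dataset is.
    formats = Counter(p.suffix.lower()
                      for p in Path("my_data/").rglob("*") if p.is_file())
    for ext, n in formats.most_common():
        print(f"{ext or '(no extension)'}: {n}")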

Support you can expect from MDC

You don't have to figure this out alone. Mozilla Data Collective offers support to help you get your data ready for publication.

Technical guidance. We can assist with data formatting, validation, and storage best practices. If you're unsure how to package your data, we're here to help.

Datasheet guidance. We'll help you understand what to include in your datasheet and point you to our upload manual for detailed expectations.

Visibility. MDC increases the exposure of published datasets to the broader ML community. By publishing with us, your data becomes discoverable to researchers and developers who can put it to good use—on terms you've defined.

Next steps

Ready to contribute? Create an MDC account and start the upload process. If you're worried about packaging your data or have questions about any of the considerations we've discussed here, get in touch at mozilladatacollective@mozillafoundation.org. We're happy to help.

Your data has value. By sharing it thoughtfully—with control over who accesses it and how it's used—you're contributing to more representative ML models and a healthier tech ecosystem. This is collaborative work, and we're glad you're part of it.