How to License Your Dataset for AI Training: Some Best Practices
We get a lot of questions about how to approach licensing your data for AI training. So to help you share your datasets, we’ve compiled some guidance here – it’s intended to be a living document that we iterate on with our partners and communities.
What Does It Mean to License a Dataset for AI training?
When you license a dataset, you are not selling it. You are giving a company permission to use it in specific ways, under specific conditions, in exchange for payment or other agreed terms. You keep ownership. The company gets access. A well-written license can therefore help you protect your rights, help prevent your data from being misused, and ensure you are fairly compensated. A poorly written one can leave you with little control over how your data shapes AI systems for years to come. Because models are then used in downstream applications, a bad deal is hard to ‘walk back’.
Why AI Training Licenses Are Different from Other Types of License
Licensing data for AI training is different from licensing it for research or publishing. When a company trains a model on your dataset, the data is absorbed into the model's parameters. In other words, it is not stored in a folder somewhere that can simply be deleted. This makes the stakes higher, and it means standard licensing templates may not be enough to protect you or your data. Before you begin this process, please make sure you actually have the right to license this data. Think about:
- Did you create this data yourself, or did you collect it from other sources?
- Did the original terms of service or data agreements permit commercial licensing?
- Does the dataset contain personal data?
- Do contributors or creators whose work is in the dataset have any rights you need to account for?
Once you’ve confirmed you’re within your rights to license your data, and that it’s safe to do so, you will need to craft a license that is specific to AI use cases. (Before that, if your dataset has a lot of stakeholders, you should make sure you’ve actually spoken to them. You can read our article here explaining how to run a Community Workshop for Dataset Governance.)
Some Best Practices for Licensing Your Dataset
1. Explicitly Define How the Dataset Can Be Used
Be precise about permitted uses (almost pedantically precise), unless you genuinely want the company to be able to use your data for anything. That might be fine in your context – only you can decide. Can the company use your data to train a general-purpose model? A commercial product? An internal tool only? Can they use it to fine-tune models they then sell to others?
Vague language like "for AI" leaves too much room for interpretation, unless you really do mean anything. We are already seeing fallout in the industry from confusing licenses. Spell out clearly:
- Which models or products the data can be used to train
- Whether the license covers training, fine-tuning, evaluation, or all three
- Whether the resulting AI system can be used commercially or if it’s strictly for research purposes
- Whether the company can sublicense the data to third parties (usually, you want to say no)
If you already have a technology company or non-profit in mind that you want to work with, you can get their feedback on the license: how clear is the language to them? Where is there ambiguity? Bear in mind that they may have their own incentives, and you should come into the conversation clear on your own boundaries.
If you don’t have a specific partner in mind, we’re also happy to connect you with people, or to comment on your license, over here at Mozilla Data Collective: the social enterprise for data agency and fair value exchange.
2. Set Clear Restrictions on Redistribution
The data supply chain doesn’t stop with the company. Once your data is in a company's hands, you want to be clear about where it can go next. For example, your license might explicitly prohibit:
- Sharing or reselling the raw dataset to other parties
- Including your data in open datasets or public releases
- Using the data to train models that are then shared as open weights, if that is a real concern for you
3. Address Data Retention and Deletion
One of the hard problems in AI licensing is what happens to your data after the training run is complete. Your license needs to document:
- How long the user/company can retain copies of your data
- Whether they must delete your data after training is finished
- How deletion will be verified (an audit right might be useful to consider here)
Raw data can be deleted, but the model itself will have already "learned" from your data. Your license should acknowledge this distinction clearly.
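If you do negotiate an audit or verification right, one lightweight technical aid is to record a cryptographic fingerprint of every file you deliver. The sketch below is a minimal illustration in Python (the function names and file paths are our own placeholders, not part of any standard deletion-verification process): both parties can later use the manifest to confirm exactly which files were in scope.

```python
import hashlib
import json
from pathlib import Path

def build_manifest(dataset_dir: str) -> dict:
    """Record a SHA-256 fingerprint for every file delivered to the licensee."""
    manifest = {}
    root = Path(dataset_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest

def save_manifest(manifest: dict, out_path: str) -> None:
    """Write the manifest as JSON, e.g. to attach to the delivery record."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

You would generate the manifest at delivery time and keep a copy; if the company later certifies deletion, the manifest documents precisely which files that certification covers.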
4. Negotiate Attribution and Credit
Depending on your situation, you may want to receive credit when your dataset is used. If attribution matters to you, include it as a contractual requirement. Specify the exact form it should take (for example, in model cards, technical reports, or public announcements). Make sure you also check what your own obligations might require; for example, if the data has other stakeholders with their own rights.
5. Build In Spaces for Checks, Audits, and Verification
This depends a little on your own capacity, but if you’re a larger organisation with some technical resources, you may want to have the right to verify that your data is being used in the way the license permits.
This does not need to be intrusive or heavyweight, but it should be real in order to protect you. Consider including:
- The right to request written confirmation of how the data was used
- The right to commission a third-party audit if you have reasonable concerns
- Requirements for the company to keep records of their data usage
6. Get the Fair Value Exchange Part Right (this is the hardest part)
Pricing datasets is difficult. We maintain a repository of pricing data points and the fee structures of different licensing deals, which we are happy to share with our data providers and allies upon request. There is no single right way to price a dataset license: it depends on factors like the uniqueness, quality, annotation, and potential applications of the data. Common models include:
- Flat fee: a one-time payment for a defined use
- Usage-based pricing: fees tied to the scale of training runs or the number of models trained
- Revenue sharing: a percentage of revenue from AI products trained on your data
- Subscription: ongoing access fees for continued or updated data
Deciding this is very specific to your context (how much data, how often it will be updated, who you expect to use it, where it sits in the training cycle, etc.).
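To make the trade-offs between these models concrete, here is a tiny back-of-the-envelope comparison. Every number in it is invented purely for illustration – it is not a pricing recommendation:

```python
# Illustrative comparison of two common pricing models, using
# entirely made-up numbers; substitute figures from your own context.

def flat_fee_total(fee: float) -> float:
    """A one-time payment: total value is just the fee, regardless of use."""
    return fee

def revenue_share_total(annual_product_revenue: float, share: float, years: int) -> float:
    """A percentage of revenue from AI products trained on your data."""
    return annual_product_revenue * share * years

flat = flat_fee_total(fee=50_000)
shared = revenue_share_total(annual_product_revenue=2_000_000, share=0.01, years=5)
# Under these assumptions, a 1% share of a 2,000,000/year product over five
# years (100,000) exceeds the 50,000 flat fee -- but only if the product
# actually succeeds; the flat fee is certain money today.
```

The general point: revenue sharing shifts risk onto you in exchange for upside, while a flat fee is predictable but capped.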
Do think expansively about what fair value exchange means to you: you might find it’s not money! Or not just money. It might be that the company lets your community use the resulting tools for free for the next five years, or that they second an engineer to you for a period of time, or that they agree to an internship scheme for speakers of your language. Get creative! This type of collaboration agreement won’t necessarily live in the license, but you should think holistically about what would make working with this (or any) organisation a fair and exciting partnership for you. If they’re not thinking about it as a partnership, maybe they aren’t the right fit.
7. Be Clear About Intellectual Property Ownership
Your license should leave no ambiguity about who owns what:
- You retain full ownership of the underlying dataset
- The company/user owns the model they build (this is standard)
- Neither party gains IP rights over the other's pre-existing assets
Consider also making it clear whether any derivative datasets or annotations the company creates from your data belong to them, to you, or to some blended arrangement. For example, if a company cleans, labels, or augments your dataset as part of its process, the resulting enriched data might be very valuable. Decide upfront who it belongs to.
8. Include Ethical Use Clauses
AI training raises real ethical questions. Consider adding clauses that prohibit the use of your data in projects that you would find problematic. Common examples include:
- Surveillance systems or tools for discriminatory profiling
- Models intended to generate disinformation
- Weapons or systems used in armed conflict
These clauses are increasingly common in data licenses and signal that you care about the downstream impact of your work.
9. Agree on a Governing Law and Dispute Resolution Process
Cross-border data deals can get complicated quickly. Be clear about:
- Which country's laws govern the contract
- How disputes will be resolved (negotiation first, then arbitration or litigation)
- Which jurisdiction’s courts would handle any legal proceedings
- Timelines for dispute resolution, expected response times, etc.
10. Get Legal Counsel
AI data licensing is a specialist area of law, and the stakes – financial, ethical, and reputational – are significant. If you don’t already have legal advice or expertise available, consider engaging a lawyer with experience in data licensing and AI before you sign anything.
11. Think about the wider world in which you want to live
Exclusively licensing your data to the single highest bidder may seem appealing. But consider the broader social impact this type of arrangement can have: it tends to gatekeep innovation, locking in monopolies and stifling a more thriving social and economic ecosystem. You might get a larger cheque now, but over the long term, a larger volume of smaller arrangements may in fact be more financially rewarding, and it may certainly help to build a tech future that is more biodiverse and thriving. Platforms like Mozilla Data Collective exist for this diversified sharing context.
You might also want to consider asking for provisions such as the right to open source the datasets for researchers after a set period, or donating them in whole or part to the public domain in the future.
We’d love to hear your feedback, questions, and stories at mozilladatacollective@mozillafoundation.org.