Metadata magic: making datasets discoverable and tractable with Croissant
Learn how we automatically generate Croissant metadata to describe datasets on the Mozilla Data Collective platform, making them more discoverable.
If a dataset exists on the internet and nobody can find it, does it actually exist? It's a slightly philosophical question, but it has very practical implications.
The Mozilla Data Collective platform hosts hundreds of carefully curated, ethically governed datasets — but hosting alone isn't enough. A dataset that can't be found, understood, or assessed for suitability is a dataset that may never be used, which diminishes its value. Given the many hours of expertise that go into collecting, curating and annotating datasets, it's important that we make the "last mile" of discovery as easy as possible.
That's where metadata comes in.
What is metadata, and why does it matter?
Metadata is, in the simplest terms, data about data. It's the structured description that tells a potential user — human or machine — what a dataset contains, how it was collected, who created it, what licence applies to it, and how it's intended to be used.
It's worth distinguishing metadata from the broader concept of dataset documentation. Dataset documentation encompasses everything you might want to know about a dataset: technical specifications, ethical review processes, sample data, usage restrictions, and detailed provenance. Metadata is a structured subset of that documentation — the specific fields and values that can be indexed, searched, and acted upon programmatically. Think of dataset documentation as the full manual, and metadata as the back-cover summary that helps someone decide whether to read the manual in the first place.
For dataset users such as AI and ML engineers, researchers and analysts, metadata answers the essential first questions: Is this dataset in the right language? Does it cover the right domain? Is it licensed in a way that permits my intended use? Without good metadata, a prospective user faces the unappealing prospect of downloading gigabytes of data and importing it into pandas, only to discover it doesn't suit their needs at all.
For AI agents and automated cataloguing systems, metadata is even more critical. A cataloguing agent can only reason about what it can read. Structured, machine-readable metadata enables agents to classify datasets, compare them, assess their fitness for a given task, and route them to the right users — all without human intervention. As agentic AI systems become more prevalent in research and data science workflows, the quality of a dataset's metadata increasingly determines whether that dataset gets discovered at all.
What is the Croissant metadata standard?
The Croissant metadata format is an open, community-built vocabulary for describing machine learning datasets. It was developed under the auspices of MLCommons, the open engineering consortium behind benchmarks such as MLPerf, and is actively maintained by a working group co-chaired by Elena Simperl of King's College London and the Open Data Institute, and Omar Benjelloun of Google. Croissant was originally presented as a research paper in 2024.
Croissant builds on schema.org, the widely adopted vocabulary for structured data on the web. Schema.org provides the foundational types — Dataset, Organization, URL — and Croissant extends them with machine learning-specific fields: encoding formats, versioning, content size, language, and task type. Because Croissant is a schema.org extension, Croissant metadata is intelligible to any system that already understands schema.org, which includes most search engines.
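To make that layering concrete, here is a sketch of what a minimal Croissant-style JSON-LD record could look like. The values are purely illustrative — the dataset name, language and namespace URLs are assumptions for the example, not a real MDC record:

```python
import json

# A minimal Croissant-style JSON-LD record (illustrative values only).
# The schema.org base vocabulary supplies generic fields such as name,
# description and license; Croissant and its RAI extension add
# ML-specific ones such as rai:mlTask.
record = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",       # assumed namespace URL
        "rai": "http://mlcommons.org/croissant/RAI/",  # assumed namespace URL
    },
    "@type": "Dataset",
    "name": "example-speech-corpus",            # hypothetical dataset name
    "description": "A small illustrative speech corpus.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "inLanguage": "sw",                         # ISO 639-1 code for Swahili
    "rai:mlTask": "speech recognition",         # RAI extension field
    "encodingFormat": "audio/wav",              # IANA MIME type
}

# Serialise as the kind of JSON-LD document a crawler would read.
print(json.dumps(record, indent=2))
```

Because the record is plain schema.org-flavoured JSON-LD, a consumer that knows nothing about Croissant can still read the generic fields; only the prefixed ones require the extension vocabularies.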
Over 700,000 datasets on platforms including Hugging Face, Kaggle, and OpenML are already described using Croissant, making it the de facto standard for ML dataset metadata. The format continues to evolve: extensions for geospatial data and life sciences are under active development, and the working group maintains an open-source Python library for validating and consuming Croissant metadata.
One particularly important dimension of Croissant is its RAI (Responsible AI) extension. The RAI namespace expands Croissant's descriptive potential with fields such as rai:mlTask, which captures the downstream machine learning task the dataset is intended for. This is significant for us here at Mozilla Data Collective: our mission centres on ethical, responsible data sharing, and a metadata standard that can encode not just what a dataset contains but what it's for aligns naturally with our platform's values.
Croissant was the right choice for the Mozilla Data Collective platform for several reasons. It is an open standard under active community development, not a proprietary format. Its schema.org lineage means it integrates seamlessly with web-scale discovery infrastructure. And its RAI extensions allow MDC to encode dataset governance information directly into the metadata record — making responsible data sharing a first-class concern, not an afterthought.
How we create Croissant metadata automatically from a Datasheet
Croissant metadata for every public dataset on the Mozilla Data Collective platform is generated programmatically and deterministically from the dataset's Datasheet. This means that as soon as a dataset has a comprehensive Datasheet, its Croissant metadata is created automatically — no manual metadata authoring required. It's a key value-add of hosting your dataset with MDC.
The mapping works as follows:
| Datasheet field | Croissant field | Notes |
|---|---|---|
| Dataset name | name | Direct mapping |
| Long description (or short description as fallback) | description | Prefers long description |
| Creation date | dateCreated | Direct mapping |
| Publication date | datePublished | Included if present |
| URL slug | citeAs | Used as the citation identifier |
| Locale / language | inLanguage | ISO language code |
| Task type | rai:mlTask | RAI extension field |
| Dataset URL | @id and url | JSON-LD identifier and standard URL |
| File format(s) | encodingFormat | Parsed and mapped to IANA MIME types |
| Task + format tokens | keywords | Combined keyword list for search |
| File version number | version | Formatted as a semantic version (e.g. 1.0) |
| File size in bytes | contentSize | Included as a string value |
| Licence URL, abbreviation, or text | license | Three-tier resolution: explicit URL > known abbreviation > verbatim text |
| Organisation | creator and publisher | Mapped to sc:Organization nodes |
The mapping deliberately excludes contact email addresses and personal names to protect contributor privacy and personally identifying information.
This automatic generation approach has an important corollary: the quality of the Croissant metadata is directly proportional to the quality of the dataset's Datasheet. A sparse datasheet — one that skips the long description, omits the task type, or leaves the licence as a free-text string rather than a recognised abbreviation — will produce sparse Croissant metadata. A comprehensive datasheet produces rich, discoverable metadata.
If you're a dataset producer preparing a submission to the Mozilla Data Collective, two resources are essential reading. The first, Datasheets: The Missing Manual for your Dataset, explains what a datasheet is, why it matters, and how to think about the fields. The second, Uploading your dataset to the Mozilla Data Collective Platform, provides field-by-field guidance on completing your submission. The time you invest in a thorough datasheet is time that directly benefits discoverability — for your dataset and for the community that might use it.
The Croissant metadata is served as a JSON-LD document at a public, cacheable endpoint on each dataset's page. It is also embedded directly in the page HTML, which means any crawler that visits a dataset page on the Mozilla Data Collective will automatically find machine-readable metadata waiting for it.
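From a crawler's point of view, consuming that embedded metadata is straightforward: find the JSON-LD script blocks in the page and parse them. Here is a minimal sketch using only the Python standard library, run against a tiny stand-in page rather than a real MDC one:

```python
import json
from html.parser import HTMLParser

# A stand-in for a dataset page with metadata embedded as JSON-LD
# (illustrative HTML, not an actual MDC page).
PAGE = """
<html><head>
<script type="application/ld+json">
{"@type": "Dataset", "name": "example-corpus", "inLanguage": "en"}
</script>
</head><body>...</body></html>
"""

class JsonLdExtractor(HTMLParser):
    """Collect and parse <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            # Script contents may arrive in chunks; join before parsing.
            self.records.append(json.loads("".join(self._buf)))

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

extractor = JsonLdExtractor()
extractor.feed(PAGE)
print(extractor.records[0]["name"])  # -> example-corpus
```

Real crawlers and agents do essentially this at web scale, which is why embedding the metadata in the page itself — rather than only behind an API — is what makes discovery automatic.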
How Croissant metadata is used downstream
The most immediate downstream consumer of MDC's Croissant metadata is Google Dataset Search. Google Dataset Search is a specialised search engine that indexes structured dataset metadata from across the web, enabling researchers and data scientists to discover datasets by topic, language, format, or licence. Because MDC embeds Croissant metadata as schema.org-compatible JSON-LD, Google Dataset Search can index MDC datasets automatically.
You can see this in action right now: a search on Google Dataset Search returns MDC datasets with structured previews — names, descriptions, licences, languages — drawn directly from the Croissant metadata our service generates.
This is the practical payoff of good metadata. A dataset that is well-described in a standard format, embedded in a crawlable page, is a dataset that gets discovered by researchers who didn't know it existed. It shows up in searches. It gets downloaded. It gets used. For language communities whose data is under-represented in AI training sets — a core audience the Mozilla Data Collective was built to serve — that discoverability is not a nice-to-have feature. It is the point.
And as agentic AI systems increasingly mediate how researchers and engineers find and evaluate datasets, the role of structured metadata will only grow. A Croissant record isn't just a search engine artefact. It is a durable, machine-readable description of what a dataset is and what it's for — the kind of description that an autonomous agent can read, reason about, and act on without human intervention.
Good metadata helps take your dataset off the shelf of obscurity and into the hands of builders and makers eager to use it.
For more information on how we generate Croissant metadata, or to learn more about Mozilla Data Collective, drop us a line at support@mozilladatacollective.com.