Cultural Heritage and AI: How Institutions Can Reclaim Control of Their Data
The institutions that safeguard humanity's cultural memory (galleries, libraries, archives, and museums, collectively known as the GLAM sector) are confronting a paradox that defines the current moment in AI development. Years of careful digitization have transformed physical collections into vast, machine-readable repositories of human knowledge. Yet the very openness that makes these collections a gift to the public now exposes them to extraction, commodification, and misuse by an AI industry with an insatiable demand for training data. The result is a fundamental question of data governance: how can GLAM institutions maintain sustainable sovereignty over the datasets they are mandated to steward with care, while staying true to their mission of sharing the world's knowledge with the public?
A Governance Crisis of Extractive Data Harvesting
A 2025 report by GLAM-E Lab co-director Michael Weinberg titled Are AI Bots Knocking Cultural Heritage Offline? documented a disturbing trend: automated bots deployed by AI companies are systematically scraping GLAM collections in aggressive swarms that overwhelm server infrastructure and sometimes knock collections entirely offline.
The report showed that “Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic.” Respondents described servers reaching 100% CPU load within minutes and being rendered inoperable until the bots moved on to their next target; others reported sustained denial-of-service-level attacks from AI scrapers. Wikimedia reported that 65% of its most expensive traffic originated from bots, imposing systemic costs on an institution that depends on public support to remain operational.
This phenomenon operates at two distinct levels:
At the ethical level, it raises foundational questions about the meaning of "open" access in an era of commercial AI development, since institutions rarely consent, tacitly or otherwise, to having their collections strip-mined by profit-driven AI laboratories without attribution, reciprocity, or institutional dialogue. The governance mechanisms currently available are, for all practical purposes, inadequate: robots.txt directives are routinely ignored, and IP blocking is easily circumvented by bots rotating across hundreds of addresses simultaneously.
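To illustrate why robots.txt is so weak a defense, here is a sketch of the kind of opt-out file an institution might publish. The user-agent tokens shown (GPTBot, CCBot, Google-Extended) are real crawler tokens published by their operators, but the exact list any given institution would block is an assumption here; the essential point is that compliance is entirely voluntary.

```text
# robots.txt — a polite request, not an enforcement mechanism.
# Crawlers that choose to comply identify themselves with published
# user-agent tokens; non-compliant bots simply ignore this file.

# OpenAI's training-data crawler
User-agent: GPTBot
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# Google's opt-out token for AI training uses
User-agent: Google-Extended
Disallow: /

# Everyone else, including ordinary search indexing, stays welcome.
User-agent: *
Allow: /
```

A blanket Disallow like this only binds crawlers that honor the Robots Exclusion Protocol (RFC 9309); as the report documents, many AI scrapers do not, which is why institutions are looking beyond voluntary mechanisms.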
At the technical level, it imposes mounting infrastructure costs on institutions that often operate under chronic budget constraints, compounding the generalized funding cuts hitting the cultural and heritage sector worldwide. Institutions may thus find themselves helpless in the face of extractive data harvesting, or pressured into unfavorable data licensing deals as a way to generate new revenue. And because the legal framework governing AI training data remains deeply ambiguous across jurisdictions, GLAM entities seeking new revenue from their existing datasets may struggle to find buyers, or may fear being locked into an exclusivity deal that would prevent them from sharing their data with other institutions.
Mozilla Data Collective: A New Data Stewardship Paradigm
It is precisely within this context that the Mozilla Data Collective (MDC) emerges as a structurally significant intervention for the GLAM sector. Building on two foundational Mozilla projects, Common Voice and the Data Futures Lab, MDC was officially launched at the 15th Mozilla Festival in Barcelona in November 2025. It is backed by multi-million-dollar seed funding and operates as the first social enterprise incubated by the Mozilla Foundation.
MDC offers robust, secure, and controlled access to datasets and amplifies their visibility by featuring them alongside other high-value datasets. Its architecture is designed around a principle that stands in direct contrast to the extractive model currently exploited by commercial AI actors: contributors retain full ownership of their datasets and full control over the terms of access. Institutions can share openly under existing licenses such as Creative Commons or NOODL, or build custom licensing frameworks tailored to their specific governance requirements. They can open data to all, or restrict access to specific categories of downloaders, such as academic researchers, non-commercial users, or values-aligned organizations. The datasets remain the property of their rightful owners: MDC is a self-service platform that gives creators full control. If owners choose to charge for a license to use their data, MDC takes no cut of that revenue; it simply charges downloaders a modest 5% fee to cover the cost of storage and egress. The principle is anti-extractivist: organizations should continue to own their data, and to be its primary beneficiaries, sharing on their own terms.
For GLAM institutions, the implications are transformative. MDC enables institutions to become active, intentional participants in the AI data economy and is designed to accommodate both the organizational and dataset diversity of the GLAM sector. MDC champions the inclusion of multicultural and multilingual data as a foundation for more equitable AI, curating datasets that span from Indonesian podcast audio to Tatar folklore, from public radio in Northern Uganda to a speech corpus of testimonials from Armenian refugees and immigrants.
Institutions retain complete control over how their datasets are used. For example, many opt for the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license but tailor it to suit their particular values. Some forbid the use of their data in systems intended for surveillance, profiling, or repression of individuals or communities, while others deny access to companies with annual revenue above one million USD. Everyone who accesses the datasets is authenticated and bound by legally binding contracts to ensure the data is used as the owning institution intends.
From Digital Fuel to Cultural Infrastructure
The GLAM sector stands at a critical point. The decades of investment that transformed physical collections into digital knowledge are now being leveraged (often without permission, compensation, or acknowledgment) by certain players in the AI industry. The sector's traditional commitment to openness, which has been one of its greatest contributions to public knowledge, has been turned against it by actors for whom cultural heritage is simply another input in the training data supply chain.
The response should not be to retreat into restriction. Cultural heritage belongs to the public, and the GLAM sector's mission is to provide access to it. The challenge is to develop governance frameworks sophisticated enough to honor that mission while asserting the institutional agency necessary to ensure such valuable datasets are used by entities and in ways that follow the values and mission of the data providers.
The Mozilla Data Collective offers such a model: a platform where institutions can share, license, and be compensated for their data on their own terms, under legally binding frameworks, transforming the current asymmetry between cultural institutions and commercial AI into a genuine exchange. It empowers GLAM institutions to be active, valued, and sovereign participants in the AI data ecosystem by allowing them to create, curate, and control their datasets as they see fit, and to govern their data in ways that remain faithful to their mission and sense of purpose.
In an age when the cultural record of humanity is being consumed to build the intelligence systems of the future, it is crucial for galleries, libraries, archives, and museums to have a powerful say in the matter.