How Radio Free Europe/Radio Liberty Also Serves Its Communities Through Its Datasets
For more than 75 years, Radio Free Europe/Radio Liberty (RFE/RL) has promoted democratic values by providing accurate, uncensored news and debate in countries where a free press is threatened. RFE/RL reaches more than 44 million people every week across 18 countries, in 24 languages, including Persian, Russian, Ukrainian, Belarusian, Kyrgyz, Turkmen, Tatar, Chechen, and Georgian. RFE/RL's archive is one of the richest collections of high-quality datasets in low-resource languages in the world.
From journalism to datasets
Most of today's AI models are trained on data scraped from the open web, which is heavily skewed toward English, toward the commercially valuable, and toward content that was never meant to represent the full diversity of human knowledge. Languages like Dari, Pashto, Kyrgyz, Tatar, Belarusian, and Romanian are radically underrepresented. The communities that speak them often go unheard by the models shaping how most of the world reads, listens, and learns.
This is the gap Mozilla Data Collective was built to close. It is a responsible data exchange platform on which uploaders always own their datasets, set the terms of use, and decide who benefits. There is no scraping, no extraction, and no unauthorized repurposing of someone else's work.
RFE/RL has shared 25 datasets through Mozilla Data Collective, making its multilingual journalism available for natural language processing tasks under terms RFE/RL itself defines. For NLP researchers and developers working on translation, speech recognition, summarization, or content moderation in underrepresented languages, this is a rare opportunity: datasets that are editorially rigorous, ethically sourced, and culturally grounded, contributed by an organization that has spent three quarters of a century earning the trust of its audiences.
Some of these datasets include:
- RFE/RL Chechen News Text Corpus
- RFE/RL Kazakh News Text Corpus
- RFE/RL Persian News Text Corpus
- RFE/RL Pashto (Pakistani) News Text Corpus
- RFE/RL Afghan Dari News Text Corpus
- RFE/RL Tajik News Text Corpus
- RFE/RL Azerbaijani News Text Corpus
A model for the rest of the field
The significance goes beyond any single dataset: the way institutions like RFE/RL share their data matters. When an organization with deep linguistic reach chooses a platform built on community ownership and fair value exchange, it sends a signal to every newsroom, archive, library, and museum watching: there is another way.
The current data economy is extractive, opaque, and dominated by a small number of players. The alternative being built on Mozilla Data Collective is community-driven, transparent, and grounded in real cultures rather than convenient ones. Today, Mozilla Data Collective hosts 600+ datasets across 300 languages, contributed by 190 organizations. Each dataset is a small refusal of the idea that language diversity is someone else's problem to solve.
By participating in Mozilla Data Collective, RFE/RL is sharing its work in service of a more representative, more honest digital future. The languages of Kabul, Yerevan, Tashkent, and Tbilisi deserve to be part of how new technologies are built for the generations to come. Thanks to organizations like RFE/RL, they finally can be.
Explore all Mozilla Data Collective datasets →
Join Mozilla Data Collective →