How Radio Free Europe/Radio Liberty Also Serves Its Communities Through Its Datasets

For more than 75 years, Radio Free Europe/Radio Liberty (RFE/RL) has promoted democratic values by providing accurate, uncensored news and debate in countries where a free press is threatened. RFE/RL reaches more than 44 million people every week across 18 countries, in 24 languages, including Persian, Russian, Ukrainian, Belarusian, Kyrgyz, Turkmen, Tatar, Chechen, and Georgian. RFE/RL’s archive is one of the richest collections of high-quality, low-resource language datasets in the world.

From journalism to datasets

Most of today's language technologies are trained on data scraped from the open web, which is heavily skewed toward English, toward the commercially valuable, and toward content that was never meant to represent the full diversity of human knowledge. Languages like Dari, Pashto, Kyrgyz, Tatar, Belarusian, and Romanian are radically underrepresented. The communities that speak them often go unheard by the models shaping how most of the world reads, listens, and learns.

This is the gap Mozilla Data Collective was built to close. It is a responsible data exchange platform on which uploaders always own their datasets, set the terms of use, and decide who benefits. There is no scraping, no extraction, and no unauthorized repurposing of someone else's work.

RFE/RL has shared 25 datasets through Mozilla Data Collective, making its multilingual journalism available for natural language processing tasks under terms RFE/RL itself defines. For NLP researchers and developers working on translation, speech recognition, summarization, or content moderation in underrepresented languages, this is a rare opportunity: datasets that are editorially rigorous, ethically sourced, and culturally grounded, contributed by an organization that has spent three quarters of a century earning the trust of its audiences.

Some of these datasets include:

Browse all of RFE/RL's datasets

A model for the rest of the field

The significance goes beyond any single dataset. The way institutions like RFE/RL share their datasets matters. When an organization with deep linguistic reach chooses a platform built on community ownership and fair value exchange, it sends a signal to every newsroom, archive, library, and museum watching: there is a third way.

The current data economy is extractive, opaque, and dominated by a small number of players. The alternative being built on Mozilla Data Collective is community-driven, transparent, and grounded in real cultures rather than convenient ones. Today, Mozilla Data Collective hosts 600+ datasets across 300 languages, contributed by 190 organizations. Each one is a small refusal of the idea that language diversity is someone else's problem to solve.

By participating in Mozilla Data Collective, RFE/RL is sharing its work in service of a more representative, more honest digital future. The languages of Kabul, Yerevan, Tashkent, and Tbilisi deserve to be part of how new technologies are built for the generations to come. Thanks to organizations like RFE/RL, they finally can be.

Explore all Mozilla Data Collective datasets →

Get in touch →

Join Mozilla Data Collective →
