Latest: Asian language datasets by communities on Mozilla Data Collective
The internet belongs to everyone, but right now it doesn't work for everyone. From the islands of Borneo to the mountains of Pakistan, hundreds of millions of people speak languages that most AI systems still can't understand. That's a problem we can solve together.
At Mozilla Data Collective, we're proud to host a growing collection of open datasets that are helping researchers, developers, and communities build language technology for Asian languages. These datasets represent countless hours of work by linguists, native speakers, and organizations committed to ensuring no language gets left behind.
Here's what's available today.
Large Language Model Training
chinese-cosmopedia — OpenCSG's large-scale Chinese text dataset containing approximately 15 million entries (≈60B tokens) covering encyclopedia, education, and multi-domain content. Cleaned and deduplicated for LLM pretraining. 6.09 GB, Apache-2.0
smoltalk-chinese — A multi-task Chinese conversational dataset covering 19 typical dialogue scenarios, built by OpenCSG. 879.81 MB, Apache-2.0
Machine Translation
English–Punjabi (Shahmukhi) Parallel Corpus: 30,405 professionally translated sentence pairs from Mediamen archives, intended to support machine translation and other Punjabi language technology grounded in contemporary, real-world usage. 1.08 MB, CC-BY-NC-4.0
Multilingual Religious Parallel Corpus — Kaleem Art Press compiled 6,465 aligned sentence units (~0.98 million words) across Arabic, Urdu, Saraiki, Punjabi (Shahmukhi), and English. 2.27 MB, CC-BY-SA-4.0
Text Corpora
Corpus of Panjebar Semangat Javanese-Language Magazine — Three years of popular articles from the Javanese weekly magazine Panjebar Semangat, founded in 1933 by national hero Dr. Soetomo. Reflects contemporary themes in the Mataram style of Javanese. 4.31 MB, CC-BY-SA-4.0
Sindh Line Publishers Corpus — 1.029 million tokens from the Sindhi newspaper Sindh Line (2024-2025), including headlines, editorials, finance news, and advertisements from Karachi, Pakistan. 2.22 MB, CC-BY-SA-4.0
Speech Recognition
Khmer ASR Cultural Dataset — Digital Divide Data curated 37.62 hours of speech-text pairs from native Khmer speakers discussing Cambodian cultural topics. Includes speaker metadata for gender, age group, and origin city. 12.59 GB, CC-BY-SA-4.0
Common Voice Spontaneous Speech 2.0 - Kenyah — Spontaneous spoken phrases in Kenyah, an Austronesian language of Borneo. 212.06 MB, CC0-1.0
Common Voice Spontaneous Speech 2.0 - Melanau — Spontaneous spoken phrases in Melanau, spoken in Sarawak, Malaysia. 208.47 MB, CC0-1.0
Common Voice Spontaneous Speech 2.0 - Western Penan — Spontaneous spoken phrases in Western Penan, a language of the nomadic and semi-nomadic Penan people of Borneo. 247.12 MB, CC0-1.0
Common Voice Spontaneous Speech 2.0 - Kelabit — Spontaneous spoken phrases in Kelabit, spoken in the highlands of Sarawak. 193.77 MB, CC0-1.0
Common Voice Spontaneous Speech 2.0 - Serian Bidayuh — Spontaneous spoken phrases in Serian Bidayuh from Sarawak, Malaysia. 199.91 MB, CC0-1.0
Common Voice Spontaneous Speech 2.0 - Sabah Malay — Spontaneous spoken phrases in Sabah Malay, a regional variety from Malaysian Borneo. 275.80 MB, CC0-1.0
Common Voice Spontaneous Speech 2.0 - Bahasa Malay — Spontaneous spoken phrases in Bahasa Malay from Malaysia. 125.94 MB, CC0-1.0
Common Voice Spontaneous Speech 2.0 - Gorani — Spontaneous spoken phrases in Gorani, an Iranian language spoken in the border regions of Iran and Iraq. 224.46 MB, CC0-1.0
Common Voice Spontaneous Speech 2.0 - Ushojo — Spontaneous spoken phrases in Ushojo, a Dardic language of northern Pakistan. 102.83 MB, CC0-1.0
Why This Matters
Every dataset here represents a step toward a more inclusive internet. When speech recognition works for Khmer speakers, when translation tools support Punjabi, when AI can process Javanese—technology becomes a bridge instead of a barrier.
This collection spans languages spoken by hundreds of millions of people across East Asia, Southeast Asia, South Asia, and West Asia. From major languages like Chinese to endangered languages of Borneo, these datasets help give communities the resources they need to build technology that works for them.
This work isn't done. We need more contributors, more languages, more voices. If you're working on Asian language data, we'd love to hear from you.