15 Datasets for Building a Low-Resource Translation Model in 2026

The problem with "low-resource" machine translation

Most production machine-translation systems in 2026 are still trained on a fairly narrow set of language pairs: the 50 or so for which the open web supplies enough parallel text to push BLEU scores into useful territory. Below that line, MT quality drops off sharply. For a Hausa-to-English translation system to be worth shipping, you need on the order of millions of aligned sentence pairs. For most of the world's 7,000+ languages, that volume of data doesn't exist anywhere and isn't going to be generated through web crawls alone.

What does exist, increasingly, is community-led parallel-data collection. Translators without Borders (now CLEAR Global) produced the TWB Gamayun parallel sentence kits specifically to give humanitarian-language translation projects a starting point. Pakistani publishers, Mexican linguists, Sephardic Jewish revitalisation projects, and Nigerian research groups have built smaller but carefully curated parallel corpora in languages that mainstream MT simply ignores. Combined, these resources don't replace web-scale parallel data, but they're enough to fine-tune an existing multilingual MT base model into something usable for a specific low-resource pair.

The 15 datasets below are the working set for that kind of project in 2026. They range from the TWB Gamayun humanitarian sentence kits (Hausa, Lingala, Tigrinya, Nande, Rohingya, Swahili, Kanuri, Congo Swahili) to South Asian publishing-derived parallel corpora (Saraiki-English, English-Punjabi Shahmukhi), Sephardic Ladino lexical resources for Romance-family transfer, religious-domain multilingual parallel text, and a BOUQuET translation-difficulty evaluation set that lets you benchmark whatever model you end up training.

The datasets

African region

TWB Parallel Sentence kits - Hausa (30k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 1.68 MB | Task: MT | Format: TSV
TWB Parallel Sentence kits - Congo Swahili (25k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 2.18 MB | Task: MT | Format: TSV
TWB Parallel Sentence kits - Nande (15k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 1.26 MB | Task: MT | Format: TSV
TWB Parallel Sentence kits - Swahili (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 347.61 KB | Task: MT | Format: TSV
TWB Parallel Sentence kits - Tigrinya (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 404.75 KB | Task: MT | Format: TSV
TWB Parallel Sentence kits - Lingala (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 494.43 KB | Task: MT | Format: TSV
TWB Parallel Sentence kits - Kanuri (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 358.46 KB | Task: MT | Format: TSV
English Hausa Parallel Corpus | Contributor: LocaleNLP | Licence: CC-BY-NC-4.0 | Size: 164.32 KB | Task: MT | Format: CSV
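
Most of the entries above ship as TSV. As a rough sketch of how such a file might be read into sentence pairs, the snippet below uses only the Python standard library. The two-column layout (source sentence, then target sentence) is an assumption, as are the function name read_parallel_tsv and the placeholder sample text; the real files may include a header row or extra columns, so inspect them before reusing this.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_parallel_tsv(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from a tab-separated stream.

    Assumption: column 0 holds the source sentence and column 1 the
    target sentence. Rows with fewer than two columns, or with an empty
    side after trimming whitespace, are skipped.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # blank or malformed line
        src, tgt = row[0].strip(), row[1].strip()
        if src and tgt:
            yield src, tgt


# Tiny in-memory stand-in for a downloaded kit (placeholder text, not real data).
sample = io.StringIO(
    "source one\ttarget one\n"
    "\n"
    "broken line\n"
    "  source two \ttarget two\n"
)
pairs = list(read_parallel_tsv(sample))
print(pairs)
```

For a file on disk, the same function could be called on a handle opened with open(path, encoding="utf-8", newline=""), where path is whatever location the kit was downloaded to.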

South and Southeast Asia

TWB Parallel Sentence kits - Rohingya (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 358.88 KB | Task: MT | Format: TSV
Saraiki-English Parallel Corpus | Contributor: Kaleem Art Press | Licence: CC-BY-NC-4.0 | Size: 1.92 MB | Task: MT | Format: CSV
English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives) | Contributor: MEDIAMEN | Licence: CC-BY-NC-4.0 | Size: 1.08 MB | Task: MT | Format: CSV
Multilingual Religious Parallel Corpus (Kaleem Art Press) | Contributor: Kaleem Art Press | Licence: CC-BY-SA-4.0 | Size: 2.27 MB | Task: MT | Format: CSV

Europe

Ladino-Spanish Lexical Resources | Contributor: Community | Licence: CC-BY-4.0 | Size: 39.92 KB | Task: MT | Format: TXT
Synthetic Ladino Parallel Corpus | Contributor: Community | Licence: CC-BY-4.0 | Size: 898.32 MB | Task: MT | Format: TSV
Sentence translation difficulty in Spanish - BOUQuET | Contributor: MDC Curators | Licence: CC-BY-SA-4.0 | Size: 55.83 KB | Task: MT | Format: TSV
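
The listing mixes three licence families (CC-BY, CC-BY-NC, CC-BY-SA), which matters when deciding which corpora can go into a given training mix. The sketch below shows one way to screen a catalogue by licence identifier. The dataset names and licences are copied from the listing above (a subset, for brevity); the Dataset class and the helper functions are hypothetical, and the check only inspects the identifier string, so the full licence text remains the authority.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dataset:
    name: str
    licence: str


# Names and licences as given in the listing (subset only).
CATALOGUE: List[Dataset] = [
    Dataset("TWB Parallel Sentence kits - Hausa (30k)", "CC-BY-4.0"),
    Dataset("English Hausa Parallel Corpus", "CC-BY-NC-4.0"),
    Dataset("Saraiki-English Parallel Corpus", "CC-BY-NC-4.0"),
    Dataset("Multilingual Religious Parallel Corpus (Kaleem Art Press)", "CC-BY-SA-4.0"),
    Dataset("Ladino-Spanish Lexical Resources", "CC-BY-4.0"),
]


def allows_commercial_use(licence: str) -> bool:
    """True when the identifier carries no NonCommercial (NC) element."""
    return "NC" not in licence.upper().split("-")


def requires_share_alike(licence: str) -> bool:
    """True when the identifier carries a ShareAlike (SA) element."""
    return "SA" in licence.upper().split("-")


commercial_ok = [d.name for d in CATALOGUE if allows_commercial_use(d.licence)]
share_alike = [d.name for d in CATALOGUE if requires_share_alike(d.licence)]

print("Usable in a commercial mix:", commercial_ok)
print("Carries share-alike terms:", share_alike)
```

Splitting the identifier on hyphens, rather than searching for the substring "NC", avoids false matches if a licence name happens to contain those letters elsewhere.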

Conclusion

A low-resource translation model in 2026 isn't built by waiting for web crawlers to find more text in your target language. It's built by combining the carefully curated parallel corpora that humanitarian organisations, regional publishers, and language communities have produced specifically for this purpose, then fine-tuning a strong multilingual base model on the mix. The 15 datasets above are the working set: not enough on their own to train an MT model from scratch, but more than enough to push a multilingual base model into useful territory for the language pair you care about.
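
The article does not prescribe how to weight corpora of very different sizes when building that mix. One common option, shown here purely as an illustration, is temperature-based sampling: each corpus is sampled with probability proportional to its share of the data raised to the power 1/T, so a higher T gives the smaller corpora more weight. The sizes below are the nominal figures from four kit names (30k, 25k, 15k, 5k), assumed to be sentence-pair counts; the function name and the choice of T = 5 are hypothetical.

```python
from typing import Dict


def sampling_weights(sizes: Dict[str, int], temperature: float = 1.0) -> Dict[str, float]:
    """Return per-corpus sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    T = 1 reproduces size-proportional sampling; larger T moves the
    distribution towards uniform, upweighting the smaller corpora.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    scaled = {name: (n / total) ** (1.0 / temperature) for name, n in sizes.items()}
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}


# Nominal sizes taken from the kit names (assumed to be sentence pairs).
kit_sizes = {"Hausa": 30_000, "Congo Swahili": 25_000, "Nande": 15_000, "Swahili": 5_000}

proportional = sampling_weights(kit_sizes, temperature=1.0)
flattened = sampling_weights(kit_sizes, temperature=5.0)

for name in kit_sizes:
    print(f"{name}: T=1 {proportional[name]:.3f}, T=5 {flattened[name]:.3f}")
```

Whether flattening helps depends on the base model and the evaluation set, so the temperature is best treated as a hyperparameter to tune rather than a fixed value.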

This is exactly the data ecosystem Mozilla Data Collective was built to enable. Mozilla Data Collective's mission is to put communities at the centre by giving the organisations and contributors who built these corpora real control over how their data is licensed and used, rather than ceding that control to whichever model provider happened to scrape it first. For low-resource translation in particular, the communities who speak these languages are the ones who decided to translate the sentences, validate the alignments, and release the data. At Mozilla Data Collective we’re proud to empower these communities by making that work searchable, downloadable, and properly attributed.
