15 Datasets for Building a Low-Resource Translation Model in 2026
The problem with "low-resource" machine translation
Most production machine-translation systems in 2026 are still trained on a fairly narrow set of language pairs: the 50 or so for which the open web supplies enough parallel text to push BLEU scores into useful territory. Below that line, MT quality drops off sharply. For a Hausa-to-English translation system to be worth shipping, you need on the order of millions of aligned sentence pairs. For most of the world's 7,000+ languages, that volume of data doesn't exist anywhere and isn't going to be generated through web crawls alone.
What does exist, increasingly, is community-led parallel-data collection. Translators without Borders (now part of CLEAR Global) ran the TWB Gamayun parallel sentence kits specifically to give humanitarian-language translation projects a starting point. Pakistani publishers, Mexican linguists, Sephardic Jewish revitalisation projects, and Nigerian research groups have built smaller but carefully curated parallel corpora in languages that mainstream MT simply ignores. Combined, these resources don't replace web-scale parallel data, but they're enough to fine-tune an existing multilingual MT base model into something usable for a specific low-resource pair.
The 15 datasets below are the working set for that kind of project in 2026. They range from the TWB Gamayun humanitarian sentence kits (Hausa, Lingala, Tigrinya, Nande, Rohingya, Swahili, Kanuri, Congo Swahili) to South Asian publishing-derived parallel corpora (Saraiki-English, English-Punjabi Shahmukhi), Sephardic Ladino lexical resources for Romance-family transfer, religious-domain multilingual parallel text, and a BOUQuET translation-difficulty evaluation set that lets you benchmark whatever model you end up training.
The datasets
African region
- TWB Parallel Sentence kits - Hausa (30k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 1.68 MB | Task: MT | Format: TSV
- TWB Parallel Sentence kits - Congo Swahili (25k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 2.18 MB | Task: MT | Format: TSV
- TWB Parallel Sentence kits - Nande (15k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 1.26 MB | Task: MT | Format: TSV
- TWB Parallel Sentence kits - Swahili (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 347.61 KB | Task: MT | Format: TSV
- TWB Parallel Sentence kits - Tigrinya (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 404.75 KB | Task: MT | Format: TSV
- TWB Parallel Sentence kits - Lingala (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 494.43 KB | Task: MT | Format: TSV
- TWB Parallel Sentence kits - Kanuri (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 358.46 KB | Task: MT | Format: TSV
- English Hausa Parallel Corpus | Contributor: LocaleNLP | Licence: CC-BY-NC-4.0 | Size: 164.32 KB | Task: MT | Format: CSV
South and Southeast Asia
- TWB Parallel Sentence kits - Rohingya (5k) | Contributor: CLEAR Global | Licence: CC-BY-4.0 | Size: 358.88 KB | Task: MT | Format: TSV
- Saraiki-English Parallel Corpus | Contributor: Kaleem Art Press | Licence: CC-BY-NC-4.0 | Size: 1.92 MB | Task: MT | Format: CSV
- English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives) | Contributor: MEDIAMEN | Licence: CC-BY-NC-4.0 | Size: 1.08 MB | Task: MT | Format: CSV
- Multilingual Religious Parallel Corpus (Kaleem Art Press) | Contributor: Kaleem Art Press | Licence: CC-BY-SA-4.0 | Size: 2.27 MB | Task: MT | Format: CSV
Europe
- Ladino-Spanish Lexical Resources | Contributor: Community | Licence: CC-BY-4.0 | Size: 39.92 KB | Task: MT | Format: TXT
- Synthetic Ladino Parallel Corpus | Contributor: Community | Licence: CC-BY-4.0 | Size: 898.32 MB | Task: MT | Format: TSV
- Sentence translation difficulty in Spanish - BOUQuET | Contributor: MDC Curators | Licence: CC-BY-SA-4.0 | Size: 55.83 KB | Task: MT | Format: TSV
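Most of the entries above ship as TSV or CSV, so a first practical step is getting them into one common shape. The following is a minimal Python sketch of that step, not a description of how any of these files are actually laid out: the column positions (source in the first column, target in the second), the header row, and the sample rows are all assumptions made for illustration, so check each dataset's own documentation before relying on them.

```python
import csv
import io


def read_pairs(handle, fmt, src_col=0, tgt_col=1, has_header=True):
    """Yield (source, target) pairs from an open TSV or CSV file handle.

    Column positions and the header row are assumptions; adjust per dataset.
    """
    delimiter = "\t" if fmt.upper() == "TSV" else ","
    # For TSV, keep quotation marks inside sentences as literal characters.
    quoting = csv.QUOTE_NONE if delimiter == "\t" else csv.QUOTE_MINIMAL
    reader = csv.reader(handle, delimiter=delimiter, quoting=quoting)
    if has_header:
        next(reader, None)
    for row in reader:
        if len(row) <= max(src_col, tgt_col):
            continue  # skip short or malformed rows
        src, tgt = row[src_col].strip(), row[tgt_col].strip()
        if src and tgt:  # drop pairs with an empty side
            yield src, tgt


def dedupe(pairs):
    """Remove exact duplicate pairs (case-insensitive), keeping first occurrences."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (src.casefold(), tgt.casefold())
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique


# Toy in-memory stand-ins for downloaded files (placeholder text, not real data).
tsv_sample = io.StringIO(
    "source\ttarget\n"
    "Hello\ttarget sentence A\n"
    "Hello\ttarget sentence A\n"
    "\ttarget with empty source\n"
)
csv_sample = io.StringIO('source,target\n"Yes, thank you",target sentence B\n')

corpus = dedupe(
    list(read_pairs(tsv_sample, "TSV")) + list(read_pairs(csv_sample, "CSV"))
)
print(corpus)
```

With real files you would pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` stand-ins. The TXT-format lexical resource would need its own parser, which this sketch does not attempt.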
Conclusion
A low-resource translation model in 2026 isn't built by waiting for web crawlers to find more text in your target language. It's built by combining the carefully curated parallel corpora that humanitarian organisations, regional publishers, and language communities have produced specifically for this purpose, then fine-tuning a strong multilingual base model on the mix. The 15 datasets above are the working set: not enough alone to train an MT model from scratch, but more than enough to push a multilingual base model into useful territory for the language pair you care about.
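How to build "the mix" is left open here. One common option, offered as an illustration and not as this article's prescribed method, is temperature-based sampling: draw from each corpus with probability proportional to its size raised to the power 1/T, so that smaller corpora are not drowned out by larger ones. The Python sketch below assumes the counts in the kit names (30k, 15k, 5k) are approximate sentence-pair counts; the function names and the toy corpora are hypothetical.

```python
import random


def mixing_weights(sizes, temperature=2.0):
    """Return sampling probabilities p_i proportional to n_i ** (1 / T).

    T = 1 samples in proportion to corpus size; larger T flattens the
    distribution towards uniform, upweighting small corpora.
    """
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_mix(corpora, n_examples, temperature=2.0, seed=0):
    """Draw n_examples (corpus_name, pair) items using temperature weights."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    weights = mixing_weights({k: len(v) for k, v in corpora.items()}, temperature)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=n_examples)
    return [(name, rng.choice(corpora[name])) for name in picks]


# Approximate sizes read off the kit names above (an assumption).
sizes = {"hausa": 30_000, "nande": 15_000, "kanuri": 5_000}

for t in (1.0, 2.0, 5.0):
    weights = mixing_weights(sizes, temperature=t)
    print(t, {name: round(w, 3) for name, w in weights.items()})

# Toy corpora with placeholder pairs, only to show the sampling call.
corpora = {
    "hausa": [("src h%d" % i, "tgt h%d" % i) for i in range(6)],
    "kanuri": [("src k%d" % i, "tgt k%d" % i) for i in range(2)],
}
batch = sample_mix(corpora, n_examples=8, temperature=2.0)
```

The right temperature is an empirical question: too low and the largest kit dominates, too high and the smallest corpus gets repeated so often that the model may overfit to it.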
This is exactly the data ecosystem Mozilla Data Collective was built to enable. Mozilla Data Collective's mission is to put communities at the centre by giving the organisations and contributors who built these corpora real control over how their data is licensed and used, rather than ceding that control to whichever model provider happened to scrape it first. For low-resource translation in particular, the communities who speak these languages are the ones who decided to translate the sentences, validate the alignments, and release the data. At Mozilla Data Collective we’re proud to empower these communities by making that work searchable, downloadable, and properly attributed.
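Respecting those licence choices is partly a bookkeeping job when you assemble a training mix. As a rough Python sketch, the listing above can be grouped by licence and screened for the NonCommercial (NC) term. The dataset names and licences are copied from the list; the helper names are hypothetical, and the check only looks at the licence string. It says nothing about attribution duties or how ShareAlike (SA) terms apply to a trained model, which you would need to assess separately.

```python
from collections import defaultdict

# (name, licence) pairs copied from the dataset list above.
DATASETS = [
    ("TWB Parallel Sentence kits - Hausa (30k)", "CC-BY-4.0"),
    ("TWB Parallel Sentence kits - Congo Swahili (25k)", "CC-BY-4.0"),
    ("TWB Parallel Sentence kits - Nande (15k)", "CC-BY-4.0"),
    ("TWB Parallel Sentence kits - Swahili (5k)", "CC-BY-4.0"),
    ("TWB Parallel Sentence kits - Tigrinya (5k)", "CC-BY-4.0"),
    ("TWB Parallel Sentence kits - Lingala (5k)", "CC-BY-4.0"),
    ("TWB Parallel Sentence kits - Kanuri (5k)", "CC-BY-4.0"),
    ("English Hausa Parallel Corpus", "CC-BY-NC-4.0"),
    ("TWB Parallel Sentence kits - Rohingya (5k)", "CC-BY-4.0"),
    ("Saraiki-English Parallel Corpus", "CC-BY-NC-4.0"),
    ("English–Punjabi (Shahmukhi) Parallel Sentences Corpus", "CC-BY-NC-4.0"),
    ("Multilingual Religious Parallel Corpus", "CC-BY-SA-4.0"),
    ("Ladino-Spanish Lexical Resources", "CC-BY-4.0"),
    ("Synthetic Ladino Parallel Corpus", "CC-BY-4.0"),
    ("Sentence translation difficulty in Spanish - BOUQuET", "CC-BY-SA-4.0"),
]


def group_by_licence(datasets):
    """Map each licence string to the list of dataset names released under it."""
    groups = defaultdict(list)
    for name, licence in datasets:
        groups[licence].append(name)
    return dict(groups)


def has_noncommercial_term(licence):
    """True if the Creative Commons licence string includes the NC element."""
    return "NC" in licence.upper().split("-")


groups = group_by_licence(DATASETS)
for licence, names in sorted(groups.items()):
    print(f"{licence}: {len(names)} dataset(s)")

# Datasets whose licence string carries no NonCommercial restriction.
without_nc = [name for name, lic in DATASETS if not has_noncommercial_term(lic)]
print(len(without_nc), "of", len(DATASETS), "carry no NC term")
```

Keeping a manifest like this next to the training data also makes it easier to credit each contributor when the resulting model is released.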