This repository collects open source parallel aligned corpora between Catalan and several other languages.
We use these corpora to train the Softcatalà neural translation system for the following language pairs:
- English - Catalan
- German - Catalan
- French - Catalan
- Italian - Catalan
- Japanese - Catalan
- Dutch - Catalan
- Portuguese - Catalan
- Spanish - Catalan
- Occitan - Catalan
- Galician - Catalan
- Basque - Catalan
Corpus files with the .xz extension need to be decompressed with xz.
You can do this easily by typing:
make extract-corpus
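If `make` or the `xz` tool is not available, the same step can be approximated in Python with the standard `lzma` module. This is a minimal sketch, not the Makefile's actual recipe: it assumes the compressed files end in `.xz` and sit somewhere under a directory you pass in, and the function names `extract_xz` and `extract_all` are hypothetical.

```python
import lzma
import shutil
from pathlib import Path


def extract_xz(path: Path) -> Path:
    """Decompress one .xz file next to itself and return the output path."""
    # with_suffix("") drops only the trailing ".xz", e.g. "corpus.ca.xz" -> "corpus.ca"
    target = path.with_suffix("")
    with lzma.open(path, "rb") as src, open(target, "wb") as dst:
        # Stream in chunks so large corpora are never fully loaded into memory
        shutil.copyfileobj(src, dst)
    return target


def extract_all(root: Path) -> list[Path]:
    """Decompress every *.xz file found under root, keeping the originals."""
    return [extract_xz(p) for p in sorted(root.rglob("*.xz"))]
```

Calling `extract_all(Path("."))` from the repository root would decompress every `.xz` file found below it and leave the compressed originals in place.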
For backtranslation you may be interested in a monolingual Catalan corpus. You can create a monolingual corpus by typing:
make build-monolingual
This creates a single Catalan file with all unique strings across all language pairs.
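The idea behind this step, merging the Catalan side of every language pair and keeping each string once, can be sketched in Python as follows. This is an illustration under assumptions, not the code behind `make build-monolingual`: it assumes each Catalan file is UTF-8 plain text with one sentence per line, and the names `unique_lines` and `build_monolingual` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Iterator


def unique_lines(files: Iterable[Path]) -> Iterator[str]:
    """Yield each non-empty line once, in order of first appearance."""
    seen = set()
    for file in files:
        with open(file, encoding="utf-8") as handle:
            for line in handle:
                text = line.strip()
                if text and text not in seen:
                    seen.add(text)
                    yield text


def build_monolingual(files: Iterable[Path], output: Path) -> int:
    """Write the unique lines from files to output and return how many were written."""
    count = 0
    with open(output, "w", encoding="utf-8") as out:
        for text in unique_lines(files):
            out.write(text + "\n")
            count += 1
    return count
```

The set of seen strings is kept in memory, which is simple but may be too costly for very large corpora; in that case sorting on disk and dropping adjacent duplicates is a common alternative.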
We strongly recommend the following sources of aligned Catalan parallel corpora:
- https://opus.nlpl.eu/
- https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
- https://www.statmt.org/cc-aligned/
In addition to these previously available corpora, we have created the following:
- Europarl-catalan
- Tilde-MODEL-catalan
- Open source corpus in several directions using Softcatalà translation tools
See here (In Catalan)
Contact Jordi Mas jmas@softcatala.org
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
property | value |
---|---|
name | Open source aligned text corpus English, German, French, Italian, Japanese, Portuguese, Spanish, Occitan, Galician, Basque, etc to/from Catalan. |
description | Open source aligned text corpus for building NLP applications (e.g. machine translation). Existing corpora have been cleaned up and new corpora have been introduced: Europarl Catalan, Tilde Catalan and open source translation memories. |
sameAs | https://github.com/Softcatala/parallel-catalan-corpus/ |
url | https://github.com/Softcatala/parallel-catalan-corpus/ |
creator | Softcatalà |