
Open source bilingual Catalan corpus used to train machine learning systems


Softcatala/parallel-catalan-corpus


Description

This repository collects open source parallel corpuses that align Catalan with several other languages.

Parallel corpus

We use these corpuses to train the Softcatalà neural translation system:

Corpus files with the .xz extension need to be decompressed with xz.

You can do this easily by typing:

make extract-corpus
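If make is not available, the same decompression can be approximated with Python's standard library. This is a minimal sketch, not what the Makefile target actually runs: it assumes the compressed files end in .xz and that each one should be written next to its archive with the .xz suffix dropped. The function names `extract_xz` and `extract_all` are hypothetical.

```python
import lzma
import shutil
from pathlib import Path


def extract_xz(archive: Path) -> Path:
    """Decompress one .xz file next to itself and return the new path."""
    # Assumption: dropping the final suffix gives the desired output name,
    # e.g. "corpus.txt.xz" becomes "corpus.txt".
    target = archive.with_suffix("")
    with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
        # Stream in chunks so large corpus files are never fully held in memory.
        shutil.copyfileobj(src, dst)
    return target


def extract_all(root: Path) -> list[Path]:
    """Decompress every .xz file found under `root`, recursively."""
    return [extract_xz(path) for path in sorted(root.rglob("*.xz"))]
```

Calling `extract_all(Path("."))` from the repository root would decompress every .xz file below it and leave the original archives in place.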

Catalan monolingual corpus

For backtranslation you may be interested in a monolingual Catalan corpus. You can create a monolingual corpus by typing:

make build-monolingual

This creates a single Catalan file with all unique strings across all language pairs.
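The idea behind the monolingual file can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual implementation behind `make build-monolingual`: it assumes the Catalan side of each language pair is a UTF-8 text file with one sentence per line, and it treats two strings as duplicates only when they match exactly after trimming surrounding whitespace. The function name `build_monolingual` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def build_monolingual(catalan_files: Iterable[Path], output: Path) -> int:
    """Write every unique non-empty line from `catalan_files` to `output`.

    Returns the number of unique sentences written. The first occurrence
    of each sentence is kept, in the order the files are read.
    """
    seen = set()
    with open(output, "w", encoding="utf-8") as out:
        for path in catalan_files:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    sentence = line.strip()
                    # Skip blank lines and sentences already written.
                    if sentence and sentence not in seen:
                        seen.add(sentence)
                        out.write(sentence + "\n")
    return len(seen)
```

Keeping all seen sentences in a set is simple but holds every unique string in memory; for very large corpora an on-disk sort followed by duplicate removal may be a better fit.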

Sources of the corpus used

We strongly recommend the following sources of aligned Catalan parallel corpuses:

On top of these previously available corpuses, we have created the following corpuses:

Do you want to help?

See here (In Catalan)

Contact

Contact Jordi Mas jmas@softcatala.org

Metadescription

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

| property | value |
| --- | --- |
| name | Open source aligned text corpus English, German, French, Italian, Japanese, Portuguese, Spanish, Occitan, Galician, Basque, etc. to/from Catalan. |
| description | Open source aligned text corpus for building NLP applications (e.g. machine translation). Existing corpuses have been cleaned up and new corpuses have been introduced: Europarl Catalan, Tilde Catalan and open source translation memories. |
| sameAs | https://github.com/Softcatala/parallel-catalan-corpus/ |
| url | |
| creator | Softcatalà |
