This repository has been archived by the owner on May 4, 2021. It is now read-only.
Initial public release
Initial public release of the baseline parallel data collection pipeline.
The pipeline is documented in the README and in the documents linked from it.
Phase 1 of the pipeline is an alpha release; Phase 2 is in beta.
Index files for the 2015_32 CommonCrawl, covering the language pairs en↔it, en↔fr, en↔de, en↔es, en↔pt, en↔nl, and en↔ru, are attached to this release as compressed files. These index files are licensed under a Creative Commons Attribution 4.0 International License.