This repository has been archived by the owner on May 4, 2021. It is now read-only.
Final release of the baseline parallel data collection pipeline for the ModernMT project.
The pipeline is documented in the README and in the documents linked from it.
Changes since the initial public release 0.1.0:
- Ensured that the pipeline can be run independently of any ModernMT project infrastructure (we deployed and tested it in the Amazon Web Services us-east region, where the Common Crawl data is hosted)
- Added support for Spanish, Portuguese, Dutch and Russian
- Documentation updates
- Bug fixes
- Documented known issues, limitations and enhancement ideas in the issue tracker
Index files for the 2016_50 Common Crawl for the language pairs en→pt, en→nl and en→ru are attached as compressed files. They exclude page pairs that were already contained in the 2015_32 indices attached to release 0.1.0. The index files are licensed under a Creative Commons Attribution 4.0 International License.