This repository has been archived by the owner on May 4, 2021. It is now read-only.

Compression, not language tagging, seems to be the bottleneck in extract_monolingual.sh #12

Open
achimr opened this issue Aug 18, 2017 · 0 comments

achimr commented Aug 18, 2017

achim     28910  0.0  0.0  14404  1448 ?        SN   14:05   0:00 /bin/bash /home/achim/DataCollection/metadata/extract_monolingual.sh https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim     28914  0.3  0.0 169336  4736 ?        SN   14:05   0:00 curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim     28915  2.7  0.0   4740   620 ?        SN   14:05   0:00 gzip -cd
achim     28916  6.2  0.0  30600  8752 ?        SN   14:05   0:01 python /home/achim/DataCollection/metadata/read_wet.py
achim     28917 21.8  0.0   9632  7868 ?        SN   14:05   0:04 /home/achim/DataCollection/metadata/langsplit --printchunks
achim     28918 98.0  0.5 702652 345948 ?       RN   14:05   0:21 xz -9 -e

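In the pipeline that performs language identification for a CommonCrawl crawl, the xz compression step (`xz -9 -e`) appears to be the performance bottleneck, not the language tagging. In the process listing above, xz is at 98.0% CPU, while langsplit is at 21.8% and every other stage is below 7%.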
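
One way to check this outside the pipeline might be to time xz at a few presets on a sample of text. The sketch below is not taken from extract_monolingual.sh. It uses Python's standard `lzma` module, where preset 9 combined with `lzma.PRESET_EXTREME` corresponds to `xz -9 -e`. The function name `time_xz_presets`, the preset labels, and the sample data are made up for illustration, and the timings it prints depend on the machine and the input.

```python
import lzma
import time


def time_xz_presets(data, presets):
    """Compress `data` once per preset.

    Returns a dict mapping each label to (elapsed_seconds, compressed_size).
    """
    results = {}
    for label, preset in presets.items():
        start = time.perf_counter()
        compressed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must reproduce the input exactly.
        assert lzma.decompress(compressed) == data
        results[label] = (elapsed, len(compressed))
    return results


# Hypothetical sample text; real WET extracts will compress differently.
SAMPLE = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 2000).encode("utf-8")

# Labels name the equivalent xz command-line options.
PRESETS = {
    "xz -1": 1,
    "xz -6": 6,
    "xz -9 -e": 9 | lzma.PRESET_EXTREME,
}

if __name__ == "__main__":
    for label, (seconds, size) in time_xz_presets(SAMPLE, PRESETS).items():
        print(f"{label:10s} {seconds:8.3f} s  {size:8d} bytes")
```

Running this on a representative chunk of langsplit output, rather than the placeholder text, would give a rough idea of how much throughput a lower preset buys and what it costs in output size.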

@achimr achimr self-assigned this Aug 18, 2017