Compression, not language tagging, seems to be the bottleneck in extract_monolingual.sh #12

achimr · 2017-08-18T14:12:55Z

achim     28910  0.0  0.0  14404  1448 ?        SN   14:05   0:00 /bin/bash /home/achim/DataCollection/metadata/extract_monolingual.sh https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim     28914  0.3  0.0 169336  4736 ?        SN   14:05   0:00 curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim     28915  2.7  0.0   4740   620 ?        SN   14:05   0:00 gzip -cd
achim     28916  6.2  0.0  30600  8752 ?        SN   14:05   0:01 python /home/achim/DataCollection/metadata/read_wet.py
achim     28917 21.8  0.0   9632  7868 ?        SN   14:05   0:04 /home/achim/DataCollection/metadata/langsplit --printchunks
achim     28918 98.0  0.5 702652 345948 ?       RN   14:05   0:21 xz -9 -e

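The xz compression step (xz -9 -e, at 98.0% CPU in the process listing above) seems to be the performance bottleneck in the pipeline that performs language identification for a CommonCrawl crawl. By comparison, langsplit, which does the language tagging, uses only 21.8% CPU.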
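
One way to check how much the compression preset matters is to time a few presets on a sample of the pipeline output. The following is a minimal sketch, not part of extract_monolingual.sh: it uses Python's standard lzma module, where preset 9 combined with lzma.PRESET_EXTREME corresponds to xz -9 -e. The function name time_xz_presets and the sample text are hypothetical, and real langsplit output should be substituted for a meaningful measurement.

```python
import lzma
import time


def time_xz_presets(data, presets):
    """Compress `data` once per preset; return {label: (seconds, compressed_size)}."""
    results = {}
    for label, preset in presets.items():
        start = time.perf_counter()
        compressed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must reproduce the input.
        assert lzma.decompress(compressed) == data
        results[label] = (elapsed, len(compressed))
    return results


if __name__ == "__main__":
    # Hypothetical sample text (assumption); replace with real langsplit output.
    sample = ("Some text in one language.\n" * 20000).encode("utf-8")
    presets = {
        "xz -9 -e": 9 | lzma.PRESET_EXTREME,
        "xz -6": 6,
        "xz -1": 1,
    }
    for label, (seconds, size) in time_xz_presets(sample, presets).items():
        print(f"{label}: {seconds:.3f}s, {size} bytes")
```

If a lower preset turns out to be much faster at an acceptable size cost, that would support changing the xz options in the script. The timings depend heavily on the input, so no particular result is claimed here.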

achimr added the enhancement label Aug 18, 2017

achimr self-assigned this Aug 18, 2017
