
Commit

update perf numbers of segment_wiki from a fresh h3 run
piskvorky committed Nov 11, 2017
1 parent a21e7df commit 9e16630
Showing 1 changed file with 7 additions and 7 deletions.
gensim/scripts/segment_wiki.py (14 changes: 7 additions & 7 deletions)
@@ -5,14 +5,14 @@
 # Copyright (C) 2016 RaRe Technologies

 """
-CLI script for extracting plain text out of a raw Wikipedia dump. This is a xml.bz2 file provided by MediaWiki \
-and looks like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 \
-(e.g. 14 GB: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
+CLI script for extracting plain text out of a raw Wikipedia dump. Input is a xml.bz2 file provided by MediaWiki \
+that looks like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 \
+(e.g. 14 GB of https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).

 It streams through all the XML articles using multiple cores (#cores - 1, by default), \
-decompressing on the fly and extracting plain text article sections from each article.
+decompressing on the fly and extracting article plain texts.

-For each article, it prints its title, section names and section contents, in json-line format.
+For each extracted article, it prints its title, section names and plain text section contents, in json-line format.

 Examples
 --------
@@ -21,8 +21,8 @@
 python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz

-Processing the entire English Wikipedia dump takes 2 hours (about 2.5 million articles \
-per hour, on 8 core Intel Xeon E3-1275@3.60GHz).
+Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, \
+or 10 MB of XML per second) on an 8 core Intel i7-7700 @3.60GHz.

 You can then read the created output (~6.1 GB gzipped) with:
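The reading snippet that follows in the docstring is collapsed in this diff view. As a rough guide, a minimal sketch for consuming the output, assuming each line of the .json.gz file is a JSON object with 'title', 'section_titles' and 'section_texts' fields (the field names are an assumption, not visible in this diff):

    # Sketch: read the gzipped JSON-lines output produced by segment_wiki.
    # Field names ('title', 'section_titles', 'section_texts') are assumed;
    # check the script's actual output for the exact keys.
    import gzip
    import json

    with gzip.open('enwiki-latest.json.gz', 'rt', encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)
            print(article['title'])
            for section_title, section_text in zip(article['section_titles'],
                                                   article['section_texts']):
                print('  %s (%d characters)' % (section_title, len(section_text)))
            break  # only peek at the first article

Plain gzip from the standard library is enough here; smart_open, which gensim uses elsewhere, would open the same file transparently.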
