[MRG] Update performance numbers of segment_wiki #1708

Merged 1 commit on Nov 11, 2017
14 changes: 7 additions & 7 deletions gensim/scripts/segment_wiki.py
@@ -5,14 +5,14 @@
# Copyright (C) 2016 RaRe Technologies

"""
-CLI script for extracting plain text out of a raw Wikipedia dump. This is a xml.bz2 file provided by MediaWiki \
-and looks like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 \
-(e.g. 14 GB: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
+CLI script for extracting plain text out of a raw Wikipedia dump. Input is an xml.bz2 file provided by MediaWiki \
+that looks like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 \
+(e.g. 14 GB of https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).

It streams through all the XML articles using multiple cores (#cores - 1, by default), \
-decompressing on the fly and extracting plain text article sections from each article.
+decompressing on the fly and extracting plain text from the articles and their sections.

-For each article, it prints its title, section names and section contents, in json-line format.
+For each extracted article, it prints its title, section names and plain text section contents, in json-line format.

Examples
--------
@@ -21,8 +21,8 @@

python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz

-Processing the entire English Wikipedia dump takes 2 hours (about 2.5 million articles \
-per hour, on 8 core Intel Xeon E3-1275@3.60GHz).
+Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, \
+or 10 MB of XML per second) on an 8 core Intel i7-7700 @3.60GHz.

You can then read the created output (~6.1 GB gzipped) with:
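
The diff is cut off at this point, so the snippet below is only a rough sketch of what reading the output can look like, not the file's own example. It assumes each json line carries title, section_titles and section_texts fields; the standard library gzip and json modules are enough to stream it:

import gzip
import json

# Stream the gzipped json-lines output one article at a time,
# without loading the whole ~6 GB file into memory.
with gzip.open('enwiki-latest.json.gz', 'rt', encoding='utf-8') as fin:
    for line in fin:
        article = json.loads(line)
        print(article['title'])
        for name, text in zip(article['section_titles'], article['section_texts']):
            print('    %s: %d characters' % (name, len(text)))
        break  # stop after the first article; drop this to process the full dump

Because each article is a single json object on its own line, the file can be consumed in one streaming pass or split across workers line by line.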
