From 9e1663095b96093a5e286050118d48abee770a07 Mon Sep 17 00:00:00 2001
From: Radim Řehůřek
Date: Sat, 11 Nov 2017 21:08:50 +0100
Subject: [PATCH] update perf numbers of segment_wiki from a fresh h3 run

---
 gensim/scripts/segment_wiki.py | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/gensim/scripts/segment_wiki.py b/gensim/scripts/segment_wiki.py
index 76fa9b8f15..053dde79e9 100755
--- a/gensim/scripts/segment_wiki.py
+++ b/gensim/scripts/segment_wiki.py
@@ -5,14 +5,14 @@
 # Copyright (C) 2016 RaRe Technologies
 
 """
-CLI script for extracting plain text out of a raw Wikipedia dump. This is a xml.bz2 file provided by MediaWiki \
-and looks like wiki-<YYYYMMDD>-pages-articles.xml.bz2 or wiki-latest-pages-articles.xml.bz2 \
-(e.g. 14 GB: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
+CLI script for extracting plain text out of a raw Wikipedia dump. Input is an xml.bz2 file provided by MediaWiki \
+that looks like wiki-<YYYYMMDD>-pages-articles.xml.bz2 or wiki-latest-pages-articles.xml.bz2 \
+(e.g. 14 GB of https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
 
 It streams through all the XML articles using multiple cores (#cores - 1, by default), \
-decompressing on the fly and extracting plain text article sections from each article.
+decompressing on the fly and extracting article plain texts.
 
-For each article, it prints its title, section names and section contents, in json-line format.
+For each extracted article, it prints its title, section names and plain text section contents, in json-line format.
 
 Examples
 --------
@@ -21,8 +21,8 @@
 
   python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz
 
-Processing the entire English Wikipedia dump takes 2 hours (about 2.5 million articles \
-per hour, on 8 core Intel Xeon E3-1275@3.60GHz).
+Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, \
+or 10 MB of XML per second) on an 8 core Intel i7-7700 @3.60GHz.
 
 You can then read the created output (~6.1 GB gzipped) with:
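
Note on the docstring's last sentence: the json-lines output written by the example command above can be read back with the standard library alone. The sketch below is illustrative only, not part of the patch, and assumes each record carries the keys 'title', 'section_titles' and 'section_texts' (matching the title, section names and section contents described in the docstring):

    import gzip
    import json

    # Stream the gzipped json-lines output produced by segment_wiki;
    # "enwiki-latest.json.gz" is the file name from the example above.
    with gzip.open('enwiki-latest.json.gz', 'rt', encoding='utf-8') as fin:
        for line in fin:
            article = json.loads(line)
            # Assumed record layout: article title plus parallel lists of
            # section titles and section plain texts.
            print(article['title'])
            for heading, text in zip(article['section_titles'], article['section_texts']):
                print('  %s (%d characters)' % (heading, len(text)))
            break  # inspect just the first article

Because each line is one complete article, the whole ~6.1 GB gzipped output can be streamed this way without loading it into memory.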