[MRG] Update performance numbers of segment_wiki #1708

Merged 1 commit on Nov 11, 2017
14 changes: 7 additions & 7 deletions gensim/scripts/segment_wiki.py
@@ -5,14 +5,14 @@
# Copyright (C) 2016 RaRe Technologies

"""
-CLI script for extracting plain text out of a raw Wikipedia dump. This is a xml.bz2 file provided by MediaWiki \
-and looks like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 \
-(e.g. 14 GB: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
+CLI script for extracting plain text out of a raw Wikipedia dump. Input is an xml.bz2 file provided by MediaWiki \
+that looks like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 \
+(e.g. 14 GB of https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).

It streams through all the XML articles using multiple cores (#cores - 1, by default), \
-decompressing on the fly and extracting plain text article sections from each article.
+decompressing on the fly and extracting plain text from the articles and their sections.

-For each article, it prints its title, section names and section contents, in json-line format.
+For each extracted article, it prints its title, section names and plain text section contents, in json-line format.

Examples
--------
@@ -21,8 +21,8 @@

python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz

-Processing the entire English Wikipedia dump takes 2 hours (about 2.5 million articles \
-per hour, on 8 core Intel Xeon E3-1275@3.60GHz).
+Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, \
+or 10 MB of XML per second) on an 8 core Intel i7-7700 @3.60GHz.

You can then read the created output (~6.1 GB gzipped) with:
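
The diff is cut off at this point, so the snippet below is only a rough sketch of what reading the output can look like, not the file's own example. It assumes each json line carries title, section_titles and section_texts fields; the standard library gzip and json modules are enough to stream it:

import gzip
import json

# Stream the gzipped json-lines output one article at a time,
# without loading the whole ~6 GB file into memory.
with gzip.open('enwiki-latest.json.gz', 'rt', encoding='utf-8') as fin:
    for line in fin:
        article = json.loads(line)
        print(article['title'])
        for name, text in zip(article['section_titles'], article['section_texts']):
            print('    %s: %d characters' % (name, len(text)))
        break  # stop after the first article; drop this to process the full dump

Because each article is a single json object on its own line, the file can be consumed in one streaming pass or split across workers line by line.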
