Add segment-wiki script #1483
Conversation
gensim/scripts/segment_wiki.py
Outdated
os.write(sys.stdout.fileno(), u"\t".join(printed_components).encode('utf-8') + b"\n")
# noinspection PyUnresolvedReferences
Noise, please remove (here and elsewhere).
@menshikh-iv the original script was Python 3 only -- has this been tested on Python 2? We're aiming at dual compatibility (e.g. using …).
@piskvorky of course, we need to first check all the other wiki scripts (as suggested in #1584); after that I'll add Python 2 compatibility here.
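For context, a minimal sketch of the kind of dual-compatibility handling this implies -- names and helpers are illustrative, not the PR's actual diff:

```python
# Illustrative only: typical Python 2/3 dual-compatibility measures.
from __future__ import unicode_literals

import os
import sys

from gensim import utils  # utils.to_utf8 converts text to UTF-8 bytes on both Pythons


def write_tsv_line(components):
    """Write one tab-separated line of unicode strings as UTF-8 bytes to stdout."""
    line = "\t".join(components) + "\n"
    # Writing bytes to the raw file descriptor behaves the same on Python 2 and 3.
    os.write(sys.stdout.fileno(), utils.to_utf8(line))
```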
Minor code-style comments; great script!
How was this tested for scale/stability? I remember some issues with multiprocessing.
gensim/scripts/segment_wiki.py
Outdated
article_title<tab>section_heading<tab>section_content<tab>section_heading<tab>section_content

"""
with open(output_file, 'wb') as outfile:
smart_open
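A minimal sketch of the suggested change, assuming the `smart_open` package gensim already depends on (the function and variable names below are placeholders):

```python
from smart_open import smart_open  # API at the time of this PR; picks (de)compression by file extension


def write_sections(output_file, article_stream):
    # Unlike plain open(), smart_open lets callers pass e.g. "enwiki-sections.txt.gz"
    # or a remote URI and still get a transparently compressed stream.
    with smart_open(output_file, 'wb') as outfile:
        for article_title, sections in article_stream:
            printed_components = [article_title]
            for heading, content in sections:
                printed_components.extend([heading, content])
            outfile.write("\t".join(printed_components).encode('utf-8') + b"\n")
```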
done
gensim/scripts/segment_wiki.py
Outdated
The documents are extracted on-the-fly, so that the whole (massive) dump
can stay compressed on disk.

>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # create word->word_id mapping, takes almost 8h
Docstring out of date (different class).
Done (updated all docstrings, converted to numpy-style, removed outdated things).
gensim/scripts/segment_wiki.py
Outdated
Parse the content inside a page tag, returning its content as a list of tokens
(utf8-encoded strings).

Returns a 2-tuple (str, list) -
Neither google nor numpy docstring format.
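For comparison, a numpy-style version of that docstring would look roughly like this (wording illustrative):

```python
def segment(page_xml):
    """Parse the content inside a page tag.

    Parameters
    ----------
    page_xml : str
        Content from the `<page>` tag of the dump.

    Returns
    -------
    (str, list of (str, str))
        Article title and a list of (section heading, section content) pairs.

    """
```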
Done
gensim/scripts/segment_wiki.py
Outdated
"""
elem = cElementTree.fromstring(page_xml)
filter_namespaces = ('0',)
Deserves a comment -- what is this?
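Something like this inline comment would answer the question in place (the comment text is my paraphrase of the MediaWiki namespace convention):

```python
# Namespace "0" is the main/article namespace of a MediaWiki dump; pages in
# other namespaces (Talk, User, Template, ...) are filtered out.
filter_namespaces = ('0',)
```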
About performance: approximately 5 minutes for 100,000 articles in the output file (SSD, i7 6700HQ, ruwiki); sometimes problems with …
if self.lemmatize:
    num_total_tokens += len(utils.lemmatize(section_content))
else:
    num_total_tokens += len(tokenize(section_content))
Btw I think for the purposes of gensim-data, we shouldn't do any tokenization or normalization. We should present the sections "as they are", so people can use their own sentence detection / token detection etc. Only remove newlines and tabs just before printing, because of the output format.
It's easy to go from raw `section_content` => `tokenize`, but impossible to go from `tokenize` => raw. @menshikh-iv
This is only for filtering out very short articles; all content is provided "as is".
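In other words, token counts only gate which articles are long enough to keep, while the emitted text stays raw apart from newline/tab stripping. A sketch under those assumptions (threshold and helper names hypothetical):

```python
from gensim.corpora.wikicorpus import tokenize  # same tokenizer used in the snippet above

ARTICLE_MIN_TOKENS = 50  # hypothetical threshold


def keep_article(sections):
    """Use the token count only as a length filter, never to transform the output."""
    num_total_tokens = sum(len(tokenize(content)) for _, content in sections)
    return num_total_tokens >= ARTICLE_MIN_TOKENS


def printable(text):
    # Newlines and tabs would break the tab-separated output format,
    # so strip only those; no other normalization of the raw content.
    return text.replace('\t', ' ').replace('\n', ' ')
```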
Ah, OK, thanks.
Sorry for the misdirection, here's the right way.
The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.

"""
def __init__(self, fileobj, processes=None, lemmatize=utils.has_pattern(), filter_namespaces=('0',)):
Don't mix `__init__` and class annotations. I propose to annotate `__init__` from now on.
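i.e. put the Parameters block on `__init__` rather than on the class, roughly like this (class name and some descriptions are illustrative):

```python
from gensim import utils


class WikiSectionsCorpus(object):
    """Treat a Wikipedia articles dump as a read-only, streamed corpus of (title, sections)."""

    def __init__(self, fileobj, processes=None, lemmatize=utils.has_pattern(), filter_namespaces=('0',)):
        """
        Parameters
        ----------
        fileobj : file
            Opened handle to the (possibly compressed) dump.
        processes : int, optional
            Number of worker processes to use for parsing.
        lemmatize : bool
            If the `pattern` package is installed, use fancier shallow parsing to get token lemmas;
            otherwise, use simple regexp tokenization.
        filter_namespaces : tuple of str
            Dump namespaces to keep ('0' is the main article namespace).

        """
        self.fileobj = fileobj
        self.processes = processes
        self.lemmatize = lemmatize
        self.filter_namespaces = filter_namespaces
```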
gensim/scripts/segment_wiki.py
Outdated
lemmatize : bool
    If `pattern` package is installed, use fancier shallow parsing to get token lemmas.
    Otherwise, use simple regexp tokenization.
filter_namespaces : tuple(int)
tuple of int
gensim/scripts/segment_wiki.py
Outdated
Yields
------
tuple(str, list of tuple(str, str))
(str, list of (str, str))
gensim/scripts/segment_wiki.py
Outdated
Returns
-------
tuple(str, list of tuple(str, str))
(str, list of (str, str))
gensim/scripts/segment_wiki.py
Outdated
Yields
------
tuple(str, list of tuple(str, str))
(str, list of (str, str))
* add segment wiki script
* fix indentation error
* Add output file and logging + small fixes
* add smart_open
* Add numpy-style docstrings & fix .rst
* Fix types
* Fix docstrings + output file format (json-lines)
* Upd .rst
Continued in #1694
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump and extract sections of pages from it.
CC @piskvorky