-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add article interlinks to the output of gensim.scripts.segment_wiki
. Fix #1712
#1839
Conversation
* New output format is (str, list of (str, str), list of str, reflecting structure (title, [(section_heading, section_content), ...], [interlink, ...]) * `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text as this will give up valuable information for the interlink discovery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start @steremma!
It will be really nice if you add several simple tests for segment_wiki
script (you already have small wiki dump as test data)
gensim/scripts/segment_wiki.py
Outdated
for idx, (article_title, article_sections) in enumerate(article_stream): | ||
output_data = {"title": article_title, "section_titles": [], "section_texts": []} | ||
for idx, (article_title, article_sections, article_interlinks) in enumerate(article_stream): | ||
output_data = {"title": article_title, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use hanging indents (instead of vertical)
gensim/scripts/segment_wiki.py
Outdated
@@ -321,3 +336,15 @@ def get_texts_with_sections(self): | |||
) | |||
|
|||
logger.info("finished running %s", sys.argv[0]) | |||
|
|||
print("-----Now checking output--------\n\n\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? This isn't needed here.
gensim/scripts/segment_wiki.py
Outdated
output_data = {"title": article_title, | ||
"section_titles": [], | ||
"section_texts": [], | ||
"section_interlinks": article_interlinks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't split interlinks by sections, this should name as "interlinks" instead of "section_interlinks".
…re disregarded * Due to preprocessing in `filter_wiki` interlinks containing alternative names had one of the 2 `[` and `]` characters removed. The regex now takes that into account.
* Initiate unit testing for all scripts. * Check for expected len given article filtering (namespace, size in characters and redirections). * Check for yielded title, section headings and texts as well as interlinks yielded from generator. * Check that the same is correctly persisted in JSON. * Fix PEP 8
* Refactored filtering functions in ``wikicorpus.py` so that uncaught markup can be optionally promoted to plain text * Interlink extraction logic moved to `wikicorpus.py` * Unit tests modified accordingly
gensim.scripts.segment_wiki()
gensim.scripts.segment_wiki
. Fix #1712
gensim/corpora/wikicorpus.py
Outdated
Find all interlinks to other articles in the dump. `raw` is either unicode | ||
or utf-8 encoded string. | ||
""" | ||
interlink_regex_capture = r"\[{1,2}(.*?)\]{1,2}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this regex can be compiled outside the function.
…ual interlink text * Used boolean argument with default argument in `filter_wiki`. The default value keeps the old functionality so that existing code does not brake * Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings
@steremma please don't forget to
|
…mand line argument
…into interlinks * Kept documentation improvements from upstream * Kept interlink support and updated signatures from my branch * Added documentation from my extra arguments in correct format
Looks great, thank you @steremma! |
👍 |
…Fix piskvorky#1712 (piskvorky#1839) * promoting the markup gives up information needed to find the intelinks * Add interlinks to the output of `segment_wiki` * New output format is (str, list of (str, str), list of str, reflecting structure (title, [(section_heading, section_content), ...], [interlink, ...]) * `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text as this will give up valuable information for the interlink discovery * Fixed PEP 8 * Refactoring identation and variable names * Removed debugging code from script * Fixed a bug where interlinks with a description or multiple names where disregarded * Due to preprocessing in `filter_wiki` interlinks containing alternative names had one of the 2 `[` and `]` characters removed. The regex now takes that into account. * Now stripping whitespace off section titles * Unit test `gensim.scripts.segment_wiki` * Initiate unit testing for all scripts. * Check for expected len given article filtering (namespace, size in characters and redirections). * Check for yielded title, section headings and texts as well as interlinks yielded from generator. * Check that the same is correctly persisted in JSON. * Fix PEP 8 * Fix Python 3.5 compatibility * Section text now completely clean from wiki markup * Refactored filtering functions in ``wikicorpus.py` so that uncaught markup can be optionally promoted to plain text * Interlink extraction logic moved to `wikicorpus.py` * Unit tests modified accordingly * Added extra logging info to troublehsoot weird Travis behavior * Fix PEP 8 * pin workers for segment_and_write_all_articles * Get rid of debugging stuff * Get rid of global logger * Interlinks are now mapping from the linked article's title to the actual interlink text * Used boolean argument with default argument in `filter_wiki`. The default value keeps the old functionality so that existing code does not brake * Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings * Moved regex outside function * Interlink extraction is now optional and controlled with the `-i` command line argument * PEP 8 long lines * made scripts tests aware of the optional interlinks argument * Updated script help output for interlinks
Interlinks
This PR adds the feature requested in #1712
The
segment_wiki
script now includes the interlinks (links to other articles in the dump) found in each article it processes.Implementation
The interlink markup parsing takes place in the filtered text rather than the raw one in order to avoid code and logic duplication (similar markup would need to be removed again which is already done in
filter_wiki
). In order to achieve this the functionfilter_wiki
must not promote the uncaught markup to plain text. The usefulness of this operation was already doubted (see comment in old version) so I removed the promotion altogether:Removing this line was important because:
Phrases like "Computer Science" which has a meaning as a phrase will lose its meaning if its promoted to plain text (2 tokens)
the
[[
and]]
are the markup used for interlinks. So if we promote them then we lose this informationSince the output of the script and its helper functions changed, I modified the documentation (source comments) accordingly.