Add article interlinks to the output of gensim.scripts.segment_wiki. Fix #1712 #1839
Commits on Jan 13, 2018
- f90cd9c
- Add interlinks to the output of `segment_wiki` (cdfb26a)
  * New output format is (str, list of (str, str), list of str), reflecting the structure (title, [(section_heading, section_content), ...], [interlink, ...])
  * `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text, as this would discard information that is valuable for interlink discovery
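The record structure described in this commit can be illustrated with a minimal sketch that builds one article as (title, sections, interlinks) and round-trips it through a single JSON line. The key names `title`, `section_titles`, `section_texts`, and `interlinks`, the helper names, and the sample article are assumptions made for illustration; they are not taken from this PR.

```python
import json

# Hypothetical article in the (title, [(heading, content), ...], [interlink, ...]) shape.
article = (
    "Example article",
    [("Introduction", "Opening text."), ("History", "Some history text.")],
    ["Linked article A", "Linked article B"],
)


def to_json_line(article):
    """Serialize one article tuple to a single JSON line (assumed key names)."""
    title, sections, interlinks = article
    record = {
        "title": title,
        "section_titles": [heading for heading, _ in sections],
        "section_texts": [content for _, content in sections],
        "interlinks": interlinks,
    }
    return json.dumps(record)


def from_json_line(line):
    """Rebuild the (title, sections, interlinks) tuple from one JSON line."""
    record = json.loads(line)
    sections = list(zip(record["section_titles"], record["section_texts"]))
    return record["title"], sections, record["interlinks"]


restored = from_json_line(to_json_line(article))
```

Storing headings and texts as two parallel lists keeps each JSON line flat, at the cost of having to zip them back together when reading.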
- acc5221
Commits on Jan 15, 2018
- 0057c7b
- 107d7f7
- Fixed a bug where interlinks with a description or multiple names were disregarded (4adcf86)
  * Due to preprocessing in `filter_wiki`, interlinks containing alternative names had one of the two `[` and `]` characters removed. The regex now takes that into account.
- 9bf6b87
- Unit test `gensim.scripts.segment_wiki` (931e138)
  * Initiate unit testing for all scripts.
  * Check for expected len given article filtering (namespace, size in characters and redirections).
  * Check for yielded title, section headings and texts as well as interlinks yielded from generator.
  * Check that the same is correctly persisted in JSON.
  * Fix PEP 8
- cd37315
Commits on Jan 16, 2018
- Section text now completely clean from wiki markup (c681a60)
  * Refactored filtering functions in `wikicorpus.py` so that uncaught markup can be optionally promoted to plain text
  * Interlink extraction logic moved to `wikicorpus.py`
  * Unit tests modified accordingly
- ead5386
- 193861c
- e170c06
- b68507b
- 0884f6d
- 58f63ca
Commits on Jan 20, 2018
- Interlinks are now a mapping from the linked article's title to the actual interlink text (7682f30)
  * Used a boolean argument with a default value in `filter_wiki`. The default keeps the old functionality so that existing code does not break
  * Overriding the default causes interlinks to not be simplified and lets `find_interlinks` create the mappings
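The behaviour described in this commit can be sketched as follows: a boolean argument that defaults to the old link-simplifying behaviour, and a separate extraction step that maps each linked article's title to the text shown for the link. This is a simplified illustration, not gensim's implementation; the regex, the function names ending in `_sketch`, and the `simplify_links` parameter name are all assumptions.

```python
import re

# Matches [[Target]] and [[Target|shown text]]; a simplified pattern, not gensim's.
INTERLINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def find_interlinks_sketch(raw):
    """Map each linked article title to the text displayed for that link."""
    links = {}
    for match in INTERLINK_RE.finditer(raw):
        target = match.group(1).strip()
        # Fall back to the target title when no alternative text is given.
        text = (match.group(2) or target).strip()
        links[target] = text
    return links


def filter_text_sketch(raw, simplify_links=True):
    """Replace links with their display text unless simplify_links is False.

    The default of True stands in for the pre-existing behaviour, so callers
    that do not pass the argument see no change.
    """
    if not simplify_links:
        return raw
    return INTERLINK_RE.sub(lambda m: m.group(2) or m.group(1), raw)


raw = "An [[Anarchism|anarchist]] text citing [[Karl Marx]]."
mapping = find_interlinks_sketch(filter_text_sketch(raw, simplify_links=False))
```

Passing `simplify_links=False` leaves the `[[...]]` markup in place, which is what allows the extraction step to see both the target title and the display text.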
- 3b13d3b
Commits on Jan 25, 2018
- Interlink extraction is now optional and controlled with the `-i` command line argument (e038f52)
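An opt-in flag of this kind is commonly expressed with `argparse` as a `store_true` option, so that interlink extraction stays off unless `-i` is given. The sketch below shows that pattern only; the long option name `--include-interlinks`, the help text, and the `build_parser` helper are assumptions and may differ from the script's actual argument definitions.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical opt-in interlinks flag."""
    parser = argparse.ArgumentParser(description="Sketch of an opt-in interlinks flag.")
    parser.add_argument(
        "-i", "--include-interlinks",  # long name is an assumption
        action="store_true",
        help="Also extract interlinks (off by default).",
    )
    return parser


# Parse explicit argument lists instead of sys.argv so the sketch runs anywhere.
without_flag = build_parser().parse_args([])
with_flag = build_parser().parse_args(["-i"])
```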
- Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim into interlinks (68ca8b1)
  * Kept documentation improvements from upstream
  * Kept interlink support and updated signatures from my branch
  * Added documentation for my extra arguments in correct format
- 94c2b3d
- 3c838a6
Commits on Jan 30, 2018
- 7f9ed71