Add article interlinks to the output of gensim.scripts.segment_wiki. Fix #1712 #1839

Merged 23 commits on Jan 31, 2018

Commits on Jan 13, 2018

  1. f90cd9c
  2. Add interlinks to the output of segment_wiki

    * New output format is (str, list of (str, str), list of str), reflecting the
    structure (title, [(section_heading, section_content), ...], [interlink, ...])
    
    * `filter_wiki` in WikiCorpus no longer promotes uncaught markup to plain text,
    as doing so would discard information needed for interlink discovery
    steremma committed Jan 13, 2018
    cdfb26a
  3. Fixed PEP 8

    steremma committed Jan 13, 2018
    acc5221
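
The output structure named in the "Add interlinks to the output of segment_wiki" commit can be illustrated with a small sketch. The article content, the JSON key names (`title`, `section_titles`, `section_texts`, `interlinks`) and the helper `to_json_line` are assumptions made for illustration; they are not taken from the PR itself.

```python
import json

# Hypothetical article in the (title, sections, interlinks) layout
# described in the commit message; the content is made up.
article = (
    "Anarchism",
    [
        ("Introduction", "Anarchism is a political philosophy ..."),
        ("History", "The earliest uses of the word ..."),
    ],
    ["Political philosophy", "Self-governance"],
)


def to_json_line(article):
    """Serialize one article tuple to a single JSON line (assumed key names)."""
    title, sections, interlinks = article
    record = {
        "title": title,
        "section_titles": [heading for heading, _ in sections],
        "section_texts": [text for _, text in sections],
        "interlinks": interlinks,
    }
    return json.dumps(record)


line = to_json_line(article)
restored = json.loads(line)  # round trip back to a dict
```

Keeping headings and texts in parallel lists is one possible layout; a list of pairs would serialize equally well.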

Commits on Jan 15, 2018

  1. 0057c7b
  2. 107d7f7
  3. Fixed a bug where interlinks with a description or multiple names were disregarded

    * Due to preprocessing in `filter_wiki`, interlinks containing alternative names had
    one of the two `[` and `]` characters removed. The regex now takes that into account.
    steremma committed Jan 15, 2018
    4adcf86
  4. 9bf6b87
  5. Unit test gensim.scripts.segment_wiki

    * Initiate unit testing for all scripts.
    
    * Check for the expected length given article filtering (namespace, size in characters and redirections).
    
    * Check the title, section headings and texts as well as the interlinks yielded by the generator.
    
    * Check that the same is correctly persisted in JSON.
    
    * Fix PEP 8
    steremma committed Jan 15, 2018
    931e138
  6. Fix Python 3.5 compatibility

    steremma committed Jan 15, 2018
    cd37315
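
The interlink bug fix in the Jan 15 commits says the regex had to tolerate links that lost one of their two brackets during preprocessing. A minimal sketch of that idea follows; the pattern `INTERLINK_RE` and the helper `find_link_targets` are hypothetical and do not reproduce the regex used in the PR.

```python
import re

# Assumed pattern: accept one or two opening/closing brackets, because the
# commit message says preprocessing may strip one bracket of each pair.
INTERLINK_RE = re.compile(r"\[{1,2}([^\[\]]+?)\]{1,2}")


def find_link_targets(text):
    """Return the linked article titles found in `text`."""
    targets = []
    for match in INTERLINK_RE.finditer(text):
        # "target|displayed text" -> keep only the target part
        target = match.group(1).split("|", 1)[0].strip()
        if target:
            targets.append(target)
    return targets


sample = "See [[Political philosophy]] and [Self-governance|self-governed] societies."
targets = find_link_targets(sample)
```

Accepting a single bracket also matches other bracketed text, such as external links, so a real implementation would likely need extra filtering.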

Commits on Jan 16, 2018

  1. Section text now completely clean from wiki markup

    * Refactored filtering functions in `wikicorpus.py` so that
    uncaught markup can be optionally promoted to plain text
    
    * Interlink extraction logic moved to `wikicorpus.py`
    
    * Unit tests modified accordingly
    steremma committed Jan 16, 2018
    c681a60
  2. ead5386
  3. Fix PEP 8

    steremma committed Jan 16, 2018
    193861c
  4. e170c06
  5. b68507b
  6. Get rid of debugging stuff

    steremma committed Jan 16, 2018
    0884f6d
  7. Get rid of global logger

    steremma committed Jan 16, 2018
    58f63ca

Commits on Jan 20, 2018

  1. Interlinks are now a mapping from the linked article's title to the actual interlink text

    * Used a boolean argument with a default value in `filter_wiki`. The default keeps the old functionality
    so that existing code does not break
    
    * Overriding the default causes interlinks to not be simplified and lets `find_interlinks` create the mappings
    steremma committed Jan 20, 2018
    7682f30
  2. Moved regex outside function

    steremma committed Jan 20, 2018
    3b13d3b
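
The Jan 20 commits describe two ideas: a boolean argument whose default preserves the old link-simplifying behaviour, and a mapping from each linked title to its displayed text. The sketch below illustrates both under stated assumptions. `LINK_RE`, `clean_text`, `link_mapping` and the parameter name `simplify_links` are hypothetical stand-ins, not the signatures of `filter_wiki` or `find_interlinks`.

```python
import re

# Hypothetical names throughout; this does not reproduce gensim's code.
# Group 1 is the link target, optional group 2 is the displayed text.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def clean_text(raw, simplify_links=True):
    """Replace [[target|text]] with its text unless simplify_links is False."""
    if not simplify_links:
        return raw  # leave links intact so they can be extracted later
    return LINK_RE.sub(lambda m: m.group(2) or m.group(1), raw)


def link_mapping(raw):
    """Map each linked title to the text displayed for it."""
    return {
        m.group(1).strip(): (m.group(2) or m.group(1)).strip()
        for m in LINK_RE.finditer(raw)
    }


raw = "A [[Stateless society|stateless]] form of [[Self-governance]]."
plain = clean_text(raw)                       # default: links simplified
kept = clean_text(raw, simplify_links=False)  # override: links preserved
mapping = link_mapping(kept)
```

Defaulting the flag to the old behaviour means callers that never pass it see no change, which is the compatibility concern the commit message raises.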

Commits on Jan 25, 2018

  1. e038f52
  2. Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim into interlinks
    
    * Kept documentation improvements from upstream
    
    * Kept interlink support and updated signatures from my branch
    
    * Added documentation for my extra arguments in the correct format
    steremma committed Jan 25, 2018
    68ca8b1
  3. PEP 8 long lines

    steremma committed Jan 25, 2018
    94c2b3d
  4. 3c838a6

Commits on Jan 30, 2018

  1. 7f9ed71