Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update WikiCorpus tokenization. Fix #1534 #1537

Merged
merged 7 commits into from
Sep 18, 2017
Merged

Update WikiCorpus tokenization. Fix #1534 #1537

merged 7 commits into from
Sep 18, 2017

Conversation

roopalgarg
Copy link
Contributor

Resolves: #1534

Adding the ability to define:

  1. Define min and max token length
  2. Define min number of tokens for valid articles
  3. Call a custom function to handle tokenization with the configured
    parameter on the class instance
  4. Control if lowercase is desired

Roopal Garg added 2 commits August 16, 2017 19:10
Adding the ability to define:
1. Define min and max token length
2. Define min number of tokens for valid articles
3. Call a custom function to handle tokenization with the configured
parameter on the class instance
4. Control if lowercase is desired
adding a test case to check "lower" parameter with the custom tokenizer
@roopalgarg roopalgarg changed the title WikiCorpus Tokenization issue #1534 WikiCorpus Tokenization issue Aug 17, 2017
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice PR, thanks @roopalgarg !

Please do the final clean up (language, code style) and we can merge (or @menshikh-iv can do it).

"""
Tokenize a piece of text from wikipedia. The input string `content` is assumed
to be mark-up free (see `filter_wiki()`).

set token_min_len, token_max_len as length thresholds for individual tokens
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Start with capital letter + full stop at the end + variables in backticks for varbatim formatting, token_min_len.
Here and elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me fix that.

tokenizer_func, token_min_len, token_max_len, lower = args[-1]
args = args[:-1]
return process_article(args, tokenizer_func=tokenizer_func, token_min_len=token_min_len,
token_max_len=token_max_len, lower=lower)
Copy link
Owner

@piskvorky piskvorky Aug 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent please (vertical is difficult to maintain and read).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about hanging indentation:

def _process_article(args):
    """Should not be called explicitly. use process_article instead."""
    
    tokenizer_func, token_min_len, token_max_len, lower = args[-1]
    args = args[:-1]
    
    return process_article(
        args, tokenizer_func=tokenizer_func, token_min_len=token_min_len, token_max_len=token_max_len, lower=lower
    )

looks good? If so, then there are other functions(init) for WikiCorpus which should be fixed as well(I can do it), if not then please ignore my comment and can you point to what the expectation is?

Copy link
Owner

@piskvorky piskvorky Aug 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd split those arguments into two lines, and capitalize Use, otherwise looks good 👍

@menshikh-iv is working on cleaning up the code style in other parts of gensim, so I wouldn't worry about that. Let's just not introduce any new issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@roopalgarg
Copy link
Contributor Author

Changes done.

@roopalgarg
Copy link
Contributor Author

are there pending changes needed on this?

@piskvorky
Copy link
Owner

@menshikh-iv will review and take the next steps now. Thanks for the quick PR @roopalgarg !

@roopalgarg
Copy link
Contributor Author

awesome! thanks

@roopalgarg
Copy link
Contributor Author

roopalgarg commented Aug 25, 2017

@menshikh-iv @piskvorky wondering if you had a chance to look at this PR yet?

@menshikh-iv menshikh-iv changed the title WikiCorpus Tokenization issue Update WikiCorpus tokenization. Fix #1534 Sep 18, 2017
@menshikh-iv
Copy link
Contributor

menshikh-iv commented Sep 18, 2017

Congratz with your first PR @roopalgarg 👍
Thanks for your fix, follow up the instruction from #1587 and receive nice gifts 🎁

@menshikh-iv menshikh-iv merged commit 2e58a1c into piskvorky:develop Sep 18, 2017
@roopalgarg
Copy link
Contributor Author

awesome! thanks @menshikh-iv and @piskvorky
really happy contributing (specially being my first PR) and looking forward to future contributions :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants