Update WikiCorpus tokenization. Fix #1534 #1537
Adding the ability to:
1. Define min and max token length
2. Define min number of tokens for valid articles
3. Call a custom function to handle tokenization with the configured parameters on the class instance
4. Control if lowercase is desired
adding a test case to check "lower" parameter with the custom tokenizer
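As a rough illustration of the four options listed above, the sketch below shows what a custom tokenizer might look like. It assumes the callable receives the text plus `token_min_len`, `token_max_len` and `lower`, in that order (inferred from the values unpacked in the diff further down). The name `article_min_tokens`, the helper `keep_article`, the regular expression and all default values are made up for this example and are not taken from the PR.

```python
import re

# Hypothetical pattern for this sketch: runs of word characters.
_TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def custom_tokenizer(content, token_min_len=2, token_max_len=15, lower=True):
    """Split `content` into tokens, honouring the length limits and `lower`."""
    if lower:
        content = content.lower()
    return [
        token for token in _TOKEN_RE.findall(content)
        if token_min_len <= len(token) <= token_max_len
    ]


def keep_article(tokens, article_min_tokens=50):
    """Return True if the article has enough tokens to count as valid.

    `article_min_tokens` is an assumed name for option 2 in the description.
    """
    return len(tokens) >= article_min_tokens


text = "Anarchism is a political philosophy"
cased = custom_tokenizer(text, token_min_len=3, token_max_len=15, lower=False)
lowered = custom_tokenizer(text, token_min_len=3, token_max_len=15, lower=True)
```

A function like this would presumably be passed to `WikiCorpus` through the `tokenizer_func` argument that appears in the diff; the full constructor signature is not shown in this thread.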
Very nice PR, thanks @roopalgarg !
Please do the final clean up (language, code style) and we can merge (or @menshikh-iv can do it).
gensim/corpora/wikicorpus.py
Outdated
"""
Tokenize a piece of text from wikipedia. The input string `content` is assumed
to be mark-up free (see `filter_wiki()`).

set token_min_len, token_max_len as length thresholds for individual tokens
Start with a capital letter + full stop at the end + variables in backticks for verbatim formatting: `token_min_len`. Here and elsewhere.
Let me fix that.
gensim/corpora/wikicorpus.py
Outdated
    tokenizer_func, token_min_len, token_max_len, lower = args[-1]
    args = args[:-1]
    return process_article(args, tokenizer_func=tokenizer_func, token_min_len=token_min_len,
                           token_max_len=token_max_len, lower=lower)
Hanging indent please (vertical is difficult to maintain and read).
Question about hanging indentation:
def _process_article(args):
    """Should not be called explicitly. use process_article instead."""
    tokenizer_func, token_min_len, token_max_len, lower = args[-1]
    args = args[:-1]
    return process_article(
        args, tokenizer_func=tokenizer_func, token_min_len=token_min_len, token_max_len=token_max_len, lower=lower
    )
Does this look good? If so, there are other functions (`__init__`) in WikiCorpus that should be fixed as well (I can do it). If not, please ignore my comment; can you point me to what the expectation is?
I'd split those arguments into two lines, and capitalize `Use`; otherwise looks good 👍
@menshikh-iv is working on cleaning up the code style in other parts of gensim, so I wouldn't worry about that. Let's just not introduce any new issues.
Sounds good.
Done.
Changes done.
Are there pending changes needed on this?
@menshikh-iv will review and take the next steps now. Thanks for the quick PR @roopalgarg !
awesome! thanks
@menshikh-iv @piskvorky wondering if you had a chance to look at this PR yet?
Congrats on your first PR @roopalgarg 👍
awesome! thanks @menshikh-iv and @piskvorky
Resolves: #1534