-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Refactor documentation API Reference for gensim.parsing #1684
Conversation
gensim/parsing/preprocessing.py
Outdated
@@ -63,6 +63,26 @@ def strip_punctuation(s): | |||
|
|||
|
|||
def strip_tags(s): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need the empty line on this position (here and everywhere)
gensim/parsing/preprocessing.py
Outdated
|
||
Examples | ||
-------- | ||
>>>from gensim.parsing.preprocessing import strip_tags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>>from
-> >>> from
for all examples (here and everywhere)
… for preprocess_string & preprocess_documents
gensim/parsing/preprocessing.py
Outdated
@@ -125,13 +244,53 @@ def stem_text(text): | |||
|
|||
|
|||
def preprocess_string(s, filters=DEFAULT_FILTERS): | |||
"""Takes string, applies chosen filters to it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is filters? Can you add more "complicated" example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, describe DEFAULT_FILTERS
.
gensim/parsing/preprocessing.py
Outdated
@@ -96,6 +96,26 @@ def strip_short(s, minsize=3): | |||
|
|||
|
|||
def strip_numeric(s): | |||
|
|||
"""Takes string and removes digits from it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coding style: docstring in Python should be in imperative mode: "Do X", not "Does X".
gensim/parsing/preprocessing.py
Outdated
@@ -96,6 +96,26 @@ def strip_short(s, minsize=3): | |||
|
|||
|
|||
def strip_numeric(s): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coding style: no empty line before docstring.
@@ -40,6 +40,26 @@ | |||
|
|||
|
|||
def remove_stopwords(s): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're refactoring, the set of stopwords should be a parameter: remove_stopwords(s, stopwords=STOPWORDS)
.
IIRC @menshikh-iv had plans to remove this entire package, so not sure if this is relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piskvorky as I investigate, this package is actively used, for this reason, this will be moved (and slightly refactored).
gensim/parsing/preprocessing.py
Outdated
s = utils.to_unicode(s) | ||
return RE_PUNCT.sub(" ", s) | ||
|
||
|
||
# unicode.translate cannot delete characters like str can | ||
strip_punctuation2 = strip_punctuation | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That won't work, this is not how docstrings work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need docstring here, this will be removed in refactoring.
gensim/parsing/preprocessing.py
Outdated
@@ -40,6 +40,26 @@ | |||
|
|||
|
|||
def remove_stopwords(s): | |||
"""Takes string, removes all words those are among stopwords. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coding style: docstrings in imperative mode: "Do X", not "Does X".
gensim/parsing/preprocessing.py
Outdated
@@ -40,6 +40,26 @@ | |||
|
|||
|
|||
def remove_stopwords(s): | |||
"""Takes string, removes all words those are among stopwords. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you start the docstring on its own line? Not continue after """
, it's harder to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's numpy-style convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok.
gensim/parsing/preprocessing.py
Outdated
@@ -92,6 +224,27 @@ def strip_non_alphanum(s): | |||
|
|||
|
|||
def strip_multiple_whitespaces(s): | |||
r"""Takes string, removes repeating in a row whitespace characters (spaces, tabs, line breaks) from it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this docstring r"""
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the special case because we used \n
, \r
in Examples
section, from this Sphinx goes mad. Solution: use raw string for this case.
gensim/parsing/preprocessing.py
Outdated
>>> from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation | ||
>>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?" | ||
>>> CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation] | ||
>>> preprocess_string(s,CUSTOM_FILTERS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coding style: space after comma.
gensim/parsing/preprocessing.py
Outdated
@@ -125,17 +317,84 @@ def stem_text(text): | |||
|
|||
|
|||
def preprocess_string(s, filters=DEFAULT_FILTERS): | |||
"""Takes string, applies list of chosen filters to it, where filters are methods from this module. Default list of filters consists of: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text. <function <lambda>> in signature means that we use lambda function for applying methods to filters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coding style: line way too long.
gensim/parsing/preprocessing.py
Outdated
s = utils.to_unicode(s) | ||
for f in filters: | ||
s = f(s) | ||
return s.split() | ||
|
||
|
||
def preprocess_documents(docs): | ||
"""Takes list of strings, splits it into sentences, then applies default filters to every sentence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any splitting into sentences, where does that come from?
gensim/parsing/preprocessing.py
Outdated
u'Better late never, better late.' | ||
|
||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coding style (PEP257): no blank line before or after docstring.
gensim/parsing/preprocessing.py
Outdated
return [preprocess_string(d) for d in docs] | ||
|
||
|
||
def read_file(path): | ||
r"""Reads file in specified directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire function should be removed, it's too trivial.
gensim/parsing/preprocessing.py
Outdated
Returns | ||
------- | ||
list | ||
List of unicode strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't match the example.
gensim/parsing/preprocessing.py
Outdated
|
||
Parameters | ||
---------- | ||
s : str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add description to argument (for example Input string.
), here and anywhere
gensim/parsing/preprocessing.py
Outdated
@@ -125,17 +317,84 @@ def stem_text(text): | |||
|
|||
|
|||
def preprocess_string(s, filters=DEFAULT_FILTERS): | |||
"""Takes string, applies list of chosen filters to it, where filters are methods from this module. Default list of filters consists of: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text. <function <lambda>> in signature means that we use lambda function for applying methods to filters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use references to this, not raw text, i.e.
:func:`~gensim.parsing.preprocessing.strip_tags`
instead of strip_tags
(here and anywhere)
gensim/parsing/preprocessing.py
Outdated
Examples | ||
-------- | ||
>>> from gensim.parsing.preprocessing import read_file | ||
>>> path = "/media/work/october_2017/gensim/gensim/test/test_data/mihalcea_tarau.summ.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This path will works only on your filesystem, utils to retrieve path to test files will be ready very soon
@menshikh-iv |
@bsivavenu lda tutorial. |
gensim/parsing/porter.py
Outdated
return not all(self._cons(i) for i in xrange(self.j + 1)) | ||
|
||
def _doublec(self, j): | ||
"""True <=> j,(j-1) contain a double consonant.""" | ||
"""Check if b[j],b[j-1] contain a double consonant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spaces after ,
(here and anywhere).
gensim/parsing/preprocessing.py
Outdated
@@ -38,77 +38,250 @@ | |||
""" | |||
STOPWORDS = frozenset(w for w in STOPWORDS.split() if w) | |||
|
|||
RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No needed empty lines between regexps + please add comment for each regexp # remove punctuation according to string.punctuation
gensim/parsing/porter.py
Outdated
>>> p = PorterStemmer() | ||
>>> print "b = ", p.b," ,k = ", p.k, " ,j = ", p.j | ||
b = ,k = 0 ,j = 0 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add description for the class field (what's b
, j
, k
)
gensim/parsing/preprocessing.py
Outdated
@@ -37,78 +37,252 @@ | |||
your yours yourself yourselves | |||
""" | |||
STOPWORDS = frozenset(w for w in STOPWORDS.split() if w) | |||
# set of stopwords for :func:`~gensim.parsing.preprocessing.remove_stopwords`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before better than after + return endlines (now it's needed for readability).
LGTM! @CLearERR 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good overall, but needs to be fixed a little bit.
gensim/parsing/porter.py
Outdated
self.b = "" # buffer for word to be stemmed | ||
self.k = 0 | ||
self.j = 0 # j is a general offset into the string | ||
|
||
def _cons(self, i): | ||
"""True <=> b[i] is a consonant.""" | ||
"""Take b[i], check if it is a consonant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious what b[i]
is. You should probably specify attribute b
here and in a class body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In class description we already have:
b : str : is a buffer holding a word to be stemmed. The letters are in b[0], b[1] ... ending at b[k].
Probably it will be enough to add "letter" after "... it is a consonant".
gensim/parsing/porter.py
Outdated
Returns | ||
------- | ||
bool | ||
True, if b[i] is a consonant, otherwise - False. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too wordily, IMHO, it's obvious what this function returns, the description can be omitted (or the "otherwise - False" part at least)
gensim/parsing/porter.py
Outdated
|
||
The main part of the stemming algorithm (https://en.wikipedia.org/wiki/Stemming) | ||
starts in :func:`~gensim.parsing.porter.PorterStemmer`. | ||
b is a buffer holding a word to be stemmed. The letters are in b[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This information should be in Attributes
and Notes
sections of the class/module.
gensim/parsing/porter.py
Outdated
@@ -98,27 +167,130 @@ def _m(self): | |||
i += 1 | |||
|
|||
def _vowelinstem(self): | |||
"""True <=> 0,...j contains a vowel""" | |||
"""Check if b[i] (i = 0,...j) contains a vowel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as in _cons
gensim/parsing/porter.py
Outdated
Returns | ||
------- | ||
bool | ||
True, if b contains a vowel, otherwise - False. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as in _cons
gensim/parsing/preprocessing.py
Outdated
s = utils.to_unicode(s) | ||
s = RE_AL_NUM.sub(r"\1 \2", s) | ||
return RE_NUM_AL.sub(r"\1 \2", s) | ||
|
||
|
||
def stem_text(text): | ||
""" | ||
Return lowercase and (porter-)stemmed version of string `text`. | ||
"""Take string, tranform it into lowercase and (porter-)stemmed version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and again.
gensim/parsing/preprocessing.py
Outdated
@@ -125,13 +327,63 @@ def stem_text(text): | |||
|
|||
|
|||
def preprocess_string(s, filters=DEFAULT_FILTERS): | |||
"""Take string, apply list of chosen filters to it, where filters are methods from this module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and "Take sting" again.
gensim/parsing/preprocessing.py
Outdated
s = utils.to_unicode(s) | ||
for f in filters: | ||
s = f(s) | ||
return s.split() | ||
|
||
|
||
def preprocess_documents(docs): | ||
"""Take list of strings, then apply default filters to every string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant "Take" again. Also, refer to docs
as documents here, not "list of strings". Write something like "Apply default filters to the documents strings."
gensim/parsing/preprocessing.py
Outdated
|
||
Parameters | ||
---------- | ||
docs : list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the description of docs
here though.
gensim/parsing/preprocessing.py
Outdated
|
||
Returns | ||
------- | ||
list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you could write list of (list of str)
to specify the exact type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for being a downer, but we want this to be done properly, right?
gensim/parsing/porter.py
Outdated
b : str | ||
Buffer holding a word to be stemmed. The letters are in b[0], b[1] ... ending at b[k]. | ||
k : int | ||
Readjusted downwards as the stemming progresses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure what that means
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single backticks for k. :)
gensim/parsing/porter.py
Outdated
@@ -98,39 +132,147 @@ def _m(self): | |||
i += 1 | |||
|
|||
def _vowelinstem(self): | |||
"""True <=> 0,...j contains a vowel""" | |||
"""Check if b[i] (i = 0, ... , j) contains a vowel letter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b[0:j + 1]
seems clearer.
gensim/parsing/porter.py
Outdated
return not all(self._cons(i) for i in xrange(self.j + 1)) | ||
|
||
def _doublec(self, j): | ||
"""True <=> j,(j-1) contain a double consonant.""" | ||
"""Check if b[j], b[j - 1] contain a double consonant letter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b[j - 1: j + 1]
:)
gensim/parsing/porter.py
Outdated
"""True <=> i-2,i-1,i has the form consonant - vowel - consonant | ||
and also if the second c is not w,x or y. This is used when trying to | ||
restore an e at the end of a short word, e.g. | ||
"""Check if b[i - 2], b[i - 1], b[i] have the form consonant letter - vowel letter- consonant letter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"have the form consonant letter - vowel letter- consonant letter" - > "make the (consonant, vowel, consonant) pattern"
gensim/parsing/porter.py
Outdated
""" | ||
if i < 2 or not self._cons(i) or self._cons(i - 1) or not self._cons(i - 2): | ||
return False | ||
return self.b[i] not in "wxy" | ||
|
||
def _ends(self, s): | ||
"""True <=> 0,...k ends with the string s.""" | ||
"""Check if sequence of letters b[0], ... , b[k] ends with the string `s`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if b[:k + 1] ends with s
gensim/parsing/porter.py
Outdated
Parameters | ||
---------- | ||
s : str | ||
Input string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be omitted, IMHO
gensim/parsing/porter.py
Outdated
Parameters | ||
---------- | ||
s : str | ||
Input string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks captain :D
gensim/parsing/porter.py
Outdated
|
||
def _setto(self, s): | ||
"""Set (j+1),...k to the characters in the string s, adjusting k.""" | ||
"""Set (j + 1), ... , k based on the characters from the string `s`, adjusting k. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it simply appends s
to the b
. The description is kinda cryptic.
@@ -329,8 +473,7 @@ def _step4(self): | |||
self.k = self.j | |||
|
|||
def _step5(self): | |||
"""Remove a final -e if _m() > 1, and change -ll to -l if m() > 1. | |||
""" | |||
"""Remove a final -e if _m() > 1, and change -ll to -l if m() > 1.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what to put here, but the description "Step 5." would have the same effect. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One little fix and we're done here
gensim/parsing/porter.py
Outdated
and also if the second 'c' is not 'w', 'x' or 'y'. This is used when trying to restore an 'e' | ||
at the end of a short word, e.g. cav(e), lov(e), hop(e), crim(e), | ||
but snow, box, tray. | ||
"""Check if b[j - 2: j + 1] make the (consonant, vowel, consonant) pattern and also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it's "makes", cause it's an interval... Sorry. :[
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, a couple of minor issues.
gensim/parsing/preprocessing.py
Outdated
Returns | ||
------- | ||
str | ||
Unicode string without not a word characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with word characters only?
gensim/parsing/preprocessing.py
Outdated
Returns | ||
------- | ||
list of (list of str) | ||
List of lists, filled by unicode strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Processed documents split by whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's it.
Horaay, first PR about docstring refactoring finished, congratz @CLearERR @anotherbugmaster 🔥 💣 👷♂️ |
…rky#1684) * Added\fixed docstrings for strip_tags in preprocessing.py * Added docstrings for strip_numeric, strip_non_alphanum & strip_multiple_whitespaces * small fixes * Added docstrings for split_alphanum, stem_text, need additional check for preprocess_string & preprocess_documents * Fix for old stringdocs and even more! * Additional changes for preprocessing.py and some refactoring for porter.py * Added references for functions + some common refactoring * Added annotations for porter.py & preprocessing.py * Fixes for annotations * Refactoring for Attributes and Notes fields * Reduced some extra large docstrings * porter.py , function _ends : changed return type from (int) to (bool) * small fix for sections * Cleanup porter.py * Resolve last review * finish with porter, yay! * Fix preprocessing * small changes * Fix review comments
No description provided.