Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Refactor documentation API Reference for gensim.parsing #1684

Merged
merged 19 commits into from
Nov 13, 2017

Conversation

CLearERR
Copy link
Contributor

@CLearERR CLearERR commented Nov 1, 2017

No description provided.

@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Nov 2, 2017
@@ -63,6 +63,26 @@ def strip_punctuation(s):


def strip_tags(s):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need the empty line on this position (here and everywhere)


Examples
--------
>>>from gensim.parsing.preprocessing import strip_tags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>>from -> >>> from for all examples (here and everywhere)

@@ -125,13 +244,53 @@ def stem_text(text):


def preprocess_string(s, filters=DEFAULT_FILTERS):
"""Takes string, applies chosen filters to it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is filters? Can you add more "complicated" example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, describe DEFAULT_FILTERS.

@@ -96,6 +96,26 @@ def strip_short(s, minsize=3):


def strip_numeric(s):

"""Takes string and removes digits from it.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: docstring in Python should be in imperative mode: "Do X", not "Does X".

@@ -96,6 +96,26 @@ def strip_short(s, minsize=3):


def strip_numeric(s):

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: no empty line before docstring.

@@ -40,6 +40,26 @@


def remove_stopwords(s):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're refactoring, the set of stopwords should be a parameter: remove_stopwords(s, stopwords=STOPWORDS).

IIRC @menshikh-iv had plans to remove this entire package, so not sure if this is relevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky as I investigate, this package is actively used, for this reason, this will be moved (and slightly refactored).

s = utils.to_unicode(s)
return RE_PUNCT.sub(" ", s)


# unicode.translate cannot delete characters like str can
strip_punctuation2 = strip_punctuation
"""
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't work, this is not how docstrings work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need docstring here, this will be removed in refactoring.

@@ -40,6 +40,26 @@


def remove_stopwords(s):
"""Takes string, removes all words those are among stopwords.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: docstrings in imperative mode: "Do X", not "Does X".

@@ -40,6 +40,26 @@


def remove_stopwords(s):
"""Takes string, removes all words those are among stopwords.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you start the docstring on its own line? Not continue after """, it's harder to read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok.

@@ -92,6 +224,27 @@ def strip_non_alphanum(s):


def strip_multiple_whitespaces(s):
r"""Takes string, removes repeating in a row whitespace characters (spaces, tabs, line breaks) from it
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this docstring r"""?

Copy link
Contributor

@menshikh-iv menshikh-iv Nov 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the special case because we used \n, \r in Examples section, from this Sphinx goes mad. Solution: use raw string for this case.

>>> from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation
>>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?"
>>> CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
>>> preprocess_string(s,CUSTOM_FILTERS)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: space after comma.

@@ -125,17 +317,84 @@ def stem_text(text):


def preprocess_string(s, filters=DEFAULT_FILTERS):
"""Takes string, applies list of chosen filters to it, where filters are methods from this module. Default list of filters consists of: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text. <function <lambda>> in signature means that we use lambda function for applying methods to filters.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: line way too long.

s = utils.to_unicode(s)
for f in filters:
s = f(s)
return s.split()


def preprocess_documents(docs):
"""Takes list of strings, splits it into sentences, then applies default filters to every sentence.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any splitting into sentences, where does that come from?

u'Better late never, better late.'

"""

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style (PEP257): no blank line before or after docstring.

return [preprocess_string(d) for d in docs]


def read_file(path):
r"""Reads file in specified directory.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire function should be removed, it's too trivial.

Returns
-------
list
List of unicode strings.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't match the example.


Parameters
----------
s : str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add description to argument (for example Input string.), here and anywhere

@@ -125,17 +317,84 @@ def stem_text(text):


def preprocess_string(s, filters=DEFAULT_FILTERS):
"""Takes string, applies list of chosen filters to it, where filters are methods from this module. Default list of filters consists of: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text. <function <lambda>> in signature means that we use lambda function for applying methods to filters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use references to this, not raw text, i.e.

:func:`~gensim.parsing.preprocessing.strip_tags`

instead of strip_tags (here and anywhere)

Examples
--------
>>> from gensim.parsing.preprocessing import read_file
>>> path = "/media/work/october_2017/gensim/gensim/test/test_data/mihalcea_tarau.summ.txt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path will works only on your filesystem, utils to retrieve path to test files will be ready very soon

@bsivavenu
Copy link

bsivavenu commented Nov 6, 2017

@menshikh-iv
Hello i'm reading this doc https://radimrehurek.com/gensim/tut2.html which is not helpful. could you direct me for alternative basic LDA tutorial please

@menshikh-iv
Copy link
Contributor

@bsivavenu lda tutorial.
FYI - for similar questions please use maillist

return not all(self._cons(i) for i in xrange(self.j + 1))

def _doublec(self, j):
"""True <=> j,(j-1) contain a double consonant."""
"""Check if b[j],b[j-1] contain a double consonant.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces after , (here and anywhere).

@@ -38,77 +38,250 @@
"""
STOPWORDS = frozenset(w for w in STOPWORDS.split() if w)

RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)
Copy link
Contributor

@menshikh-iv menshikh-iv Nov 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No needed empty lines between regexps + please add comment for each regexp # remove punctuation according to string.punctuation

>>> p = PorterStemmer()
>>> print "b = ", p.b," ,k = ", p.k, " ,j = ", p.j
b = ,k = 0 ,j = 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add description for the class field (what's b, j, k)

@@ -37,78 +37,252 @@
your yours yourself yourselves
"""
STOPWORDS = frozenset(w for w in STOPWORDS.split() if w)
# set of stopwords for :func:`~gensim.parsing.preprocessing.remove_stopwords`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before better than after + return endlines (now it's needed for readability).

@menshikh-iv
Copy link
Contributor

LGTM! @CLearERR 👍
@anotherbugmaster please review and I'll merge the first doc PR 🔥

Copy link
Contributor

@anotherbugmaster anotherbugmaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good overall, but needs to be fixed a little bit.

self.b = "" # buffer for word to be stemmed
self.k = 0
self.j = 0 # j is a general offset into the string

def _cons(self, i):
"""True <=> b[i] is a consonant."""
"""Take b[i], check if it is a consonant.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious what b[i] is. You should probably specify attribute b here and in a class body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In class description we already have:
b : str : is a buffer holding a word to be stemmed. The letters are in b[0], b[1] ... ending at b[k].

Probably it will be enough to add "letter" after "... it is a consonant".

Returns
-------
bool
True, if b[i] is a consonant, otherwise - False.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too wordily, IMHO, it's obvious what this function returns, the description can be omitted (or the "otherwise - False" part at least)


The main part of the stemming algorithm (https://en.wikipedia.org/wiki/Stemming)
starts in :func:`~gensim.parsing.porter.PorterStemmer`.
b is a buffer holding a word to be stemmed. The letters are in b[0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information should be in Attributes and Notes sections of the class/module.

@@ -98,27 +167,130 @@ def _m(self):
i += 1

def _vowelinstem(self):
"""True <=> 0,...j contains a vowel"""
"""Check if b[i] (i = 0,...j) contains a vowel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in _cons

Returns
-------
bool
True, if b contains a vowel, otherwise - False.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in _cons

s = utils.to_unicode(s)
s = RE_AL_NUM.sub(r"\1 \2", s)
return RE_NUM_AL.sub(r"\1 \2", s)


def stem_text(text):
"""
Return lowercase and (porter-)stemmed version of string `text`.
"""Take string, tranform it into lowercase and (porter-)stemmed version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and again.

@@ -125,13 +327,63 @@ def stem_text(text):


def preprocess_string(s, filters=DEFAULT_FILTERS):
"""Take string, apply list of chosen filters to it, where filters are methods from this module.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and "Take sting" again.

s = utils.to_unicode(s)
for f in filters:
s = f(s)
return s.split()


def preprocess_documents(docs):
"""Take list of strings, then apply default filters to every string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant "Take" again. Also, refer to docs as documents here, not "list of strings". Write something like "Apply default filters to the documents strings."


Parameters
----------
docs : list
Copy link
Contributor

@anotherbugmaster anotherbugmaster Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the description of docs here though.


Returns
-------
list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you could write list of (list of str) to specify the exact type.

Copy link
Contributor

@anotherbugmaster anotherbugmaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for being a downer, but we want this to be done properly, right?

b : str
Buffer holding a word to be stemmed. The letters are in b[0], b[1] ... ending at b[k].
k : int
Readjusted downwards as the stemming progresses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure what that means

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single backticks for k. :)

@@ -98,39 +132,147 @@ def _m(self):
i += 1

def _vowelinstem(self):
"""True <=> 0,...j contains a vowel"""
"""Check if b[i] (i = 0, ... , j) contains a vowel letter.
Copy link
Contributor

@anotherbugmaster anotherbugmaster Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b[0:j + 1] seems clearer.

return not all(self._cons(i) for i in xrange(self.j + 1))

def _doublec(self, j):
"""True <=> j,(j-1) contain a double consonant."""
"""Check if b[j], b[j - 1] contain a double consonant letter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b[j - 1: j + 1] :)

"""True <=> i-2,i-1,i has the form consonant - vowel - consonant
and also if the second c is not w,x or y. This is used when trying to
restore an e at the end of a short word, e.g.
"""Check if b[i - 2], b[i - 1], b[i] have the form consonant letter - vowel letter- consonant letter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"have the form consonant letter - vowel letter- consonant letter" - > "make the (consonant, vowel, consonant) pattern"

"""
if i < 2 or not self._cons(i) or self._cons(i - 1) or not self._cons(i - 2):
return False
return self.b[i] not in "wxy"

def _ends(self, s):
"""True <=> 0,...k ends with the string s."""
"""Check if sequence of letters b[0], ... , b[k] ends with the string `s`.
Copy link
Contributor

@anotherbugmaster anotherbugmaster Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if b[:k + 1] ends with s

Parameters
----------
s : str
Input string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be omitted, IMHO

Parameters
----------
s : str
Input string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks captain :D


def _setto(self, s):
"""Set (j+1),...k to the characters in the string s, adjusting k."""
"""Set (j + 1), ... , k based on the characters from the string `s`, adjusting k.
Copy link
Contributor

@anotherbugmaster anotherbugmaster Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it simply appends s to the b. The description is kinda cryptic.

@@ -329,8 +473,7 @@ def _step4(self):
self.k = self.j

def _step5(self):
"""Remove a final -e if _m() > 1, and change -ll to -l if m() > 1.
"""
"""Remove a final -e if _m() > 1, and change -ll to -l if m() > 1."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what to put here, but the description "Step 5." would have the same effect. :)

Copy link
Contributor

@anotherbugmaster anotherbugmaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One little fix and we're done here

and also if the second 'c' is not 'w', 'x' or 'y'. This is used when trying to restore an 'e'
at the end of a short word, e.g. cav(e), lov(e), hop(e), crim(e),
but snow, box, tray.
"""Check if b[j - 2: j + 1] make the (consonant, vowel, consonant) pattern and also
Copy link
Contributor

@anotherbugmaster anotherbugmaster Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it's "makes", cause it's an interval... Sorry. :[

Copy link
Contributor

@anotherbugmaster anotherbugmaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, a couple of minor issues.

Returns
-------
str
Unicode string without not a word characters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with word characters only?

Returns
-------
list of (list of str)
List of lists, filled by unicode strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Processed documents split by whitespace

Copy link
Contributor

@anotherbugmaster anotherbugmaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's it.

@menshikh-iv
Copy link
Contributor

Horaay, first PR about docstring refactoring finished, congratz @CLearERR @anotherbugmaster 🔥 💣 👷‍♂️

@menshikh-iv menshikh-iv merged commit 0a88a08 into piskvorky:develop Nov 13, 2017
VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
…rky#1684)

* Added\fixed docstrings for strip_tags in preprocessing.py

* Added docstrings for strip_numeric, strip_non_alphanum & strip_multiple_whitespaces

* small fixes

* Added docstrings for split_alphanum, stem_text, need additional check for preprocess_string & preprocess_documents

* Fix for old stringdocs and even more!

* Additional changes for preprocessing.py and some refactoring for porter.py

* Added references for functions + some common refactoring

* Added annotations for porter.py & preprocessing.py

* Fixes for annotations

* Refactoring for Attributes and Notes fields

* Reduced some extra large docstrings

* porter.py , function _ends : changed return type from (int) to (bool)

* small fix for sections

* Cleanup porter.py

* Resolve last review

* finish with porter, yay!

* Fix preprocessing

* small changes

* Fix review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
incubator project PR is RaRe incubator project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants