[WIP] Refactor documentation API Reference for gensim.parsing #1684

CLearERR · 2017-11-01T20:44:31Z

No description provided.


menshikh-iv · 2017-11-02T04:53:29Z

gensim/parsing/preprocessing.py

@@ -63,6 +63,26 @@ def strip_punctuation(s):


 def strip_tags(s):
+


No need for the empty line at this position (here and everywhere).

menshikh-iv · 2017-11-02T04:54:19Z

gensim/parsing/preprocessing.py

+
+    Examples
+    --------
+    >>>from gensim.parsing.preprocessing import strip_tags


>>>from -> >>> from for all examples (here and everywhere)


menshikh-iv · 2017-11-03T04:49:00Z

gensim/parsing/preprocessing.py

@@ -125,13 +244,53 @@ def stem_text(text):


 def preprocess_string(s, filters=DEFAULT_FILTERS):
+    """Takes string, applies chosen filters to it.


What is filters? Can you add a more "complicated" example?

Also, describe DEFAULT_FILTERS.
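To make the question concrete, here is a minimal standalone sketch of a filter chain. It mirrors the loop in preprocess_string quoted in this thread (apply each filter in order, then split on whitespace), but the two filters and their regexps are simplified stand-ins written for illustration, not the gensim implementations.

```python
import re
import string

# Simplified stand-in patterns; assumptions for illustration only.
RE_TAGS = re.compile(r"<[^<>]+>")
RE_PUNCT = re.compile(r"([%s])+" % re.escape(string.punctuation))


def strip_tags_sketch(s):
    """Remove anything that looks like an HTML tag."""
    return RE_TAGS.sub("", s)


def strip_punctuation_sketch(s):
    """Replace each run of punctuation with a single space."""
    return RE_PUNCT.sub(" ", s)


def preprocess_string_sketch(s, filters):
    """Apply every filter in order, then split the result on whitespace."""
    for f in filters:
        s = f(s)
    return s.split()


# A filter is any callable that takes a string and returns a string.
custom_filters = [lambda x: x.lower(), strip_tags_sketch, strip_punctuation_sketch]
tokens = preprocess_string_sketch("<i>Hello</i> <b>World</b>! Nice day, isn't it?", custom_filters)
```

Since each filter only needs to map a string to a string, lambdas and module-level functions can be mixed freely in the list.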

piskvorky · 2017-11-04T23:44:04Z

gensim/parsing/preprocessing.py

@@ -96,6 +96,26 @@ def strip_short(s, minsize=3):


 def strip_numeric(s):
+
+    """Takes string and removes digits from it.


Coding style: docstrings in Python should be in imperative mode: "Do X", not "Does X".

piskvorky · 2017-11-04T23:44:26Z

gensim/parsing/preprocessing.py

@@ -96,6 +96,26 @@ def strip_short(s, minsize=3):


 def strip_numeric(s):
+


Coding style: no empty line before docstring.

piskvorky · 2017-11-04T23:47:21Z

gensim/parsing/preprocessing.py

@@ -40,6 +40,26 @@


 def remove_stopwords(s):


Since we're refactoring, the set of stopwords should be a parameter: remove_stopwords(s, stopwords=STOPWORDS).

IIRC @menshikh-iv had plans to remove this entire package, so not sure if this is relevant.

@piskvorky as far as I have investigated, this package is actively used; for this reason, it will be moved (and slightly refactored).
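A minimal sketch of the suggested signature, assuming the function simply drops whitespace-separated words found in the given set. The tiny STOPWORDS set here is hypothetical and stands in for the module's real one.

```python
# Tiny hypothetical stopword set, standing in for the module-level STOPWORDS.
STOPWORDS = frozenset(["the", "is", "a", "of"])


def remove_stopwords(s, stopwords=STOPWORDS):
    """Remove every word of `s` that is in `stopwords`."""
    return " ".join(w for w in s.split() if w not in stopwords)


default_result = remove_stopwords("the weather is good")
custom_result = remove_stopwords("the weather is good", stopwords={"good"})
```

With the default argument, existing callers keep their behaviour, while new callers can pass their own set.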

piskvorky · 2017-11-04T23:48:06Z

gensim/parsing/preprocessing.py

    s = utils.to_unicode(s)
    return RE_PUNCT.sub(" ", s)


 # unicode.translate cannot delete characters like str can
 strip_punctuation2 = strip_punctuation
+"""


That won't work, this is not how docstrings work.

No need for a docstring here; this will be removed in refactoring.
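A small sketch of why the string literal after the alias has no effect at runtime: Python evaluates the literal and discards it, and the alias keeps the docstring of the function it points to. The function body below is a placeholder, not the real implementation.

```python
def strip_punctuation(s):
    """Replace punctuation characters with spaces."""
    return s  # placeholder body for illustration


strip_punctuation2 = strip_punctuation
"""This literal is evaluated and discarded; it is not attached to the alias."""

# The alias is the very same function object, so it reports the original docstring.
alias_doc = strip_punctuation2.__doc__
same_object = strip_punctuation2 is strip_punctuation
```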

piskvorky · 2017-11-04T23:49:16Z

gensim/parsing/preprocessing.py

@@ -40,6 +40,26 @@


 def remove_stopwords(s):
+    """Takes string, removes all words those are among stopwords.


Coding style: docstrings in imperative mode: "Do X", not "Does X".

piskvorky · 2017-11-04T23:50:17Z

gensim/parsing/preprocessing.py

@@ -40,6 +40,26 @@


 def remove_stopwords(s):
+    """Takes string, removes all words those are among stopwords.


Can you start the docstring on its own line? Not continue after """, it's harder to read.

It's numpy-style convention.

piskvorky · 2017-11-04T23:50:52Z

gensim/parsing/preprocessing.py

@@ -92,6 +224,27 @@ def strip_non_alphanum(s):


 def strip_multiple_whitespaces(s):
+    r"""Takes string, removes repeating in a row whitespace characters (spaces, tabs, line breaks) from it


Why is this docstring r"""?

This is a special case: we used \n, \r in the Examples section, and Sphinx goes mad from this. Solution: use a raw string for this case.
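A minimal sketch of what the r prefix changes at the Python level: in a normal docstring the sequence \n becomes a real line break inside the text, while in a raw docstring it stays as a backslash followed by "n". How Sphinx renders either form is not shown here.

```python
def normal_doc():
    """Example: 'a\nb' contains a line break."""


def raw_doc():
    r"""Example: 'a\nb' contains a line break."""


# The normal docstring holds an actual newline character.
normal_has_newline = "\n" in normal_doc.__doc__
# The raw docstring holds the two characters backslash and 'n' instead.
raw_has_newline = "\n" in raw_doc.__doc__
raw_has_backslash_n = "\\n" in raw_doc.__doc__
```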

piskvorky · 2017-11-04T23:52:09Z

gensim/parsing/preprocessing.py

+    >>> from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation
+    >>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3     weather_is really g00d today, isn't it?"
+    >>> CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
+    >>> preprocess_string(s,CUSTOM_FILTERS)


Coding style: space after comma.

piskvorky · 2017-11-04T23:52:29Z

gensim/parsing/preprocessing.py

@@ -125,17 +317,84 @@ def stem_text(text):


 def preprocess_string(s, filters=DEFAULT_FILTERS):
+    """Takes string, applies list of chosen filters to it, where filters are methods from this module. Default list of filters consists of: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text. <function <lambda>> in signature means that we use lambda function for applying methods to filters.


Coding style: line way too long.

piskvorky · 2017-11-04T23:53:05Z

gensim/parsing/preprocessing.py

    s = utils.to_unicode(s)
    for f in filters:
        s = f(s)
    return s.split()


 def preprocess_documents(docs):
+    """Takes list of strings, splits it into sentences, then applies default filters to every sentence.


I don't see any splitting into sentences, where does that come from?
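A standalone sketch of the point being made: the quoted body of preprocess_documents is a list comprehension over the documents, so each document yields exactly one token list no matter how many sentences it contains. The single lowercase filter is a hypothetical stand-in for DEFAULT_FILTERS.

```python
def preprocess_string_sketch(s, filters):
    """Apply every filter in order, then split the result on whitespace."""
    for f in filters:
        s = f(s)
    return s.split()


def preprocess_documents_sketch(docs, filters=(str.lower,)):
    """Apply the filters to every document; one token list per document."""
    return [preprocess_string_sketch(d, filters) for d in docs]


docs = ["First sentence. Second sentence.", "Another document"]
result = preprocess_documents_sketch(docs)
```

The first document has two sentences but still produces a single list of tokens.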

piskvorky · 2017-11-04T23:54:36Z

gensim/parsing/preprocessing.py

+    u'Better late never, better late.'
+
+    """
+


Coding style (PEP257): no blank line before or after docstring.

piskvorky · 2017-11-04T23:55:49Z

gensim/parsing/preprocessing.py

    return [preprocess_string(d) for d in docs]


 def read_file(path):
+    r"""Reads file in specified directory.


This entire function should be removed, it's too trivial.

piskvorky · 2017-11-04T23:55:58Z

gensim/parsing/preprocessing.py

+    Returns
+    -------
+    list
+        List of unicode strings.


Doesn't match the example.

menshikh-iv · 2017-11-06T15:06:20Z

gensim/parsing/preprocessing.py

+
+    Parameters
+    ----------
+    s : str


Please add a description for the argument (for example, "Input string."), here and everywhere.
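For reference, a sketch of a docstring that follows the conventions requested in this review: imperative summary, no blank line before the docstring, a description for each parameter, and a space after >>> in the example. The regexp and function body are assumptions made to keep the example runnable, not the module's actual code.

```python
import re

RE_NUMERIC_SKETCH = re.compile(r"[0-9]+")  # matches runs of digits (assumed pattern)


def strip_numeric_sketch(s):
    """Remove digits from `s`.

    Parameters
    ----------
    s : str
        Input string.

    Returns
    -------
    str
        String without digits.

    Examples
    --------
    >>> strip_numeric_sketch("0text24gensim365test")
    'textgensimtest'
    """
    return RE_NUMERIC_SKETCH.sub("", s)
```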

menshikh-iv · 2017-11-06T15:09:43Z

gensim/parsing/preprocessing.py

@@ -125,17 +317,84 @@ def stem_text(text):


 def preprocess_string(s, filters=DEFAULT_FILTERS):
+    """Takes string, applies list of chosen filters to it, where filters are methods from this module. Default list of filters consists of: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text. <function <lambda>> in signature means that we use lambda function for applying methods to filters.


Use references for these, not raw text, i.e.

:func:`~gensim.parsing.preprocessing.strip_tags`

instead of strip_tags (here and everywhere)

menshikh-iv · 2017-11-06T15:11:15Z

gensim/parsing/preprocessing.py

+    Examples
+    --------
+    >>> from gensim.parsing.preprocessing import read_file
+    >>> path = "/media/work/october_2017/gensim/gensim/test/test_data/mihalcea_tarau.summ.txt"


This path will work only on your filesystem; utils to retrieve the path to test files will be ready very soon.
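Until such a utility exists, one way to avoid a machine-specific absolute path is to build it from a base directory. The helper below is hypothetical (its name, signature, and the test_data folder layout are assumptions), and it is not the utility mentioned in the comment.

```python
import os


def datapath_sketch(fname, base_dir):
    """Return the path to `fname` inside the `test_data` folder under `base_dir`."""
    return os.path.join(base_dir, "test_data", fname)


# The base directory is passed in instead of being hard-coded to one filesystem.
path = datapath_sketch("mihalcea_tarau.summ.txt", os.path.join("gensim", "test"))
```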

bsivavenu · 2017-11-06T15:15:53Z

@menshikh-iv
Hello, I'm reading this doc https://radimrehurek.com/gensim/tut2.html, which is not helpful. Could you direct me to an alternative basic LDA tutorial, please?

menshikh-iv · 2017-11-06T15:26:26Z

@bsivavenu lda tutorial.
FYI - for similar questions please use the mailing list


menshikh-iv · 2017-11-07T07:13:11Z

gensim/parsing/porter.py

        return not all(self._cons(i) for i in xrange(self.j + 1))

    def _doublec(self, j):
-        """True <=> j,(j-1) contain a double consonant."""
+        """Check if b[j],b[j-1] contain a double consonant.


Spaces after "," (here and everywhere).

menshikh-iv · 2017-11-07T07:15:39Z

gensim/parsing/preprocessing.py

@@ -38,77 +38,250 @@
 """
 STOPWORDS = frozenset(w for w in STOPWORDS.split() if w)

+RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)


No need for empty lines between regexps + please add a comment for each regexp, e.g. # remove punctuation according to string.punctuation
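A sketch of the requested layout: regexps grouped without blank lines between them, each with a one-line comment. RE_PUNCT is the pattern quoted in the diff; the whitespace pattern and both helper functions are assumptions added so the snippet runs on its own.

```python
import re
import string

# remove punctuation according to string.punctuation
RE_PUNCT = re.compile(r"([%s])+" % re.escape(string.punctuation), re.UNICODE)
# collapse runs of whitespace (assumed second pattern, for illustration)
RE_WHITESPACE = re.compile(r"(\s)+", re.UNICODE)


def strip_punctuation_sketch(s):
    """Replace each run of punctuation with a single space."""
    return RE_PUNCT.sub(" ", s)


def strip_multiple_whitespaces_sketch(s):
    """Replace each run of whitespace with a single space."""
    return RE_WHITESPACE.sub(" ", s)


cleaned = strip_multiple_whitespaces_sketch(strip_punctuation_sketch("Hello, world!!"))
```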

menshikh-iv · 2017-11-07T07:17:23Z

gensim/parsing/porter.py

+        >>> p = PorterStemmer()
+        >>> print "b = ", p.b," ,k = ", p.k, " ,j = ", p.j
+        b =    ,k =  0  ,j =  0
+


Add descriptions for the class fields (what are b, j, k).

menshikh-iv · 2017-11-08T05:10:45Z

gensim/parsing/preprocessing.py

@@ -37,78 +37,252 @@
 your yours yourself yourselves
 """
 STOPWORDS = frozenset(w for w in STOPWORDS.split() if w)
+# set of stopwords for :func:`~gensim.parsing.preprocessing.remove_stopwords`.


A comment before the definition is better than after + bring back the blank lines (they're needed for readability now).

menshikh-iv · 2017-11-09T06:31:20Z

LGTM! @CLearERR 👍
@anotherbugmaster please review and I'll merge the first doc PR 🔥

anotherbugmaster

Good overall, but needs to be fixed a little bit.

anotherbugmaster · 2017-11-09T09:26:15Z

gensim/parsing/porter.py

        self.b = ""  # buffer for word to be stemmed
        self.k = 0
        self.j = 0   # j is a general offset into the string

    def _cons(self, i):
-        """True <=> b[i] is a consonant."""
+        """Take b[i], check if it is a consonant.


It's not obvious what b[i] is. You should probably specify attribute b here and in a class body.

In class description we already have:
b : str : is a buffer holding a word to be stemmed. The letters are in b[0], b[1] ... ending at b[k].

Probably it will be enough to add "letter" after "... it is a consonant".

anotherbugmaster · 2017-11-09T09:29:09Z

gensim/parsing/porter.py

+        Returns
+        -------
+        bool
+            True, if b[i] is a consonant, otherwise - False.


Too wordy, IMHO; it's obvious what this function returns, so the description can be omitted (or the "otherwise - False" part at least).

anotherbugmaster · 2017-11-09T09:32:01Z

gensim/parsing/porter.py

+
+The main part of the stemming algorithm (https://en.wikipedia.org/wiki/Stemming)
+starts in :func:`~gensim.parsing.porter.PorterStemmer`.
+b is a buffer holding a word to be stemmed. The letters are in b[0],


This information should be in Attributes and Notes sections of the class/module.

anotherbugmaster · 2017-11-09T09:37:53Z

gensim/parsing/porter.py

@@ -98,27 +167,130 @@ def _m(self):
            i += 1

    def _vowelinstem(self):
-        """True <=> 0,...j contains a vowel"""
+        """Check if b[i] (i = 0,...j) contains a vowel.


Same as in _cons

anotherbugmaster · 2017-11-09T09:38:14Z

gensim/parsing/porter.py

+        Returns
+        -------
+        bool
+            True, if b contains a vowel, otherwise - False.


Same as in _cons

anotherbugmaster · 2017-11-09T09:58:32Z

gensim/parsing/preprocessing.py

    s = utils.to_unicode(s)
    s = RE_AL_NUM.sub(r"\1 \2", s)
    return RE_NUM_AL.sub(r"\1 \2", s)


 def stem_text(text):
-    """
-    Return lowercase and (porter-)stemmed version of string `text`.
+    """Take string, tranform it into lowercase and (porter-)stemmed version.


...and again.

anotherbugmaster · 2017-11-09T09:59:30Z

gensim/parsing/preprocessing.py

@@ -125,13 +327,63 @@ def stem_text(text):


 def preprocess_string(s, filters=DEFAULT_FILTERS):
+    """Take string, apply list of chosen filters to it, where filters are methods from this module.


...and "Take string" again.

anotherbugmaster · 2017-11-09T10:03:55Z

gensim/parsing/preprocessing.py

    s = utils.to_unicode(s)
    for f in filters:
        s = f(s)
    return s.split()


 def preprocess_documents(docs):
+    """Take list of strings, then apply default filters to every string.


Redundant "Take" again. Also, refer to docs as documents here, not "list of strings". Write something like "Apply default filters to the document strings."

anotherbugmaster · 2017-11-09T10:04:15Z

gensim/parsing/preprocessing.py

+
+    Parameters
+    ----------
+    docs : list


Add the description of docs here though.

anotherbugmaster · 2017-11-09T10:06:21Z

gensim/parsing/preprocessing.py

+
+    Returns
+    -------
+    list


Also, you could write list of (list of str) to specify the exact type.

anotherbugmaster

I'm sorry for being a downer, but we want this to be done properly, right?

anotherbugmaster · 2017-11-13T16:57:57Z

gensim/parsing/porter.py

+    b : str
+        Buffer holding a word to be stemmed. The letters are in b[0], b[1] ... ending at b[k].
+    k : int
+        Readjusted downwards as the stemming progresses.


Not quite sure what that means

Single backticks for k. :)

anotherbugmaster · 2017-11-13T17:05:35Z

gensim/parsing/porter.py

@@ -98,39 +132,147 @@ def _m(self):
            i += 1

    def _vowelinstem(self):
-        """True <=> 0,...j contains a vowel"""
+        """Check if b[i] (i = 0, ... , j) contains a vowel letter.


b[0:j + 1] seems clearer.

anotherbugmaster · 2017-11-13T17:07:38Z

gensim/parsing/porter.py

        return not all(self._cons(i) for i in xrange(self.j + 1))

    def _doublec(self, j):
-        """True <=> j,(j-1) contain a double consonant."""
+        """Check if b[j], b[j - 1] contain a double consonant letter.


b[j - 1: j + 1] :)

anotherbugmaster · 2017-11-13T17:09:43Z

gensim/parsing/porter.py

-        """True <=> i-2,i-1,i has the form consonant - vowel - consonant
-        and also if the second c is not w,x or y. This is used when trying to
-        restore an e at the end of a short word, e.g.
+        """Check if b[i - 2], b[i - 1], b[i] have the form consonant letter - vowel letter- consonant letter


"have the form consonant letter - vowel letter- consonant letter" - > "make the (consonant, vowel, consonant) pattern"

anotherbugmaster · 2017-11-13T17:13:27Z

gensim/parsing/porter.py

        """
        if i < 2 or not self._cons(i) or self._cons(i - 1) or not self._cons(i - 2):
            return False
        return self.b[i] not in "wxy"

    def _ends(self, s):
-        """True <=> 0,...k ends with the string s."""
+        """Check if sequence of letters b[0], ... , b[k] ends with the string `s`.


if b[:k + 1] ends with s
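The slice notation suggested in these comments can be checked against the index-by-index wording it replaces. The sketch below uses plain variables b, j and k instead of stemmer attributes, and the ends helper is a simplified stand-in: the real _ends method may do extra bookkeeping that is not shown in this thread.

```python
b = "hopping"
k = len(b) - 1  # index of the last letter in the buffer
j = 3           # a general offset into the string

# b[0], ..., b[j] is the slice b[0:j + 1]
prefix = b[0:j + 1]
same_letters = [b[i] for i in range(j + 1)] == list(prefix)

# b[j - 1], b[j] is the slice b[j - 1:j + 1]
pair = b[j - 1:j + 1]
is_double = pair[0] == pair[1]


def ends(b, k, s):
    """Check if b[:k + 1] ends with the string `s`."""
    return b[:k + 1].endswith(s)


ends_with_ing = ends(b, k, "ing")
ends_with_ing_at_j = ends(b, j, "ing")
```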

anotherbugmaster · 2017-11-13T17:14:58Z

gensim/parsing/porter.py

+        Parameters
+        ----------
+        s : str
+            Input string.


Could be omitted, IMHO

anotherbugmaster · 2017-11-13T17:16:12Z

gensim/parsing/porter.py

+        Parameters
+        ----------
+        s : str
+            Input string.


Thanks captain :D

anotherbugmaster · 2017-11-13T17:20:37Z

gensim/parsing/porter.py


    def _setto(self, s):
-        """Set (j+1),...k to the characters in the string s, adjusting k."""
+        """Set (j + 1), ... , k based on the characters from the string `s`, adjusting k.


Also, it simply appends s to b. The description is kinda cryptic.

anotherbugmaster · 2017-11-13T17:22:53Z

gensim/parsing/porter.py

@@ -329,8 +473,7 @@ def _step4(self):
            self.k = self.j

    def _step5(self):
-        """Remove a final -e if _m() > 1, and change -ll to -l if m() > 1.
-        """
+        """Remove a final -e if _m() > 1, and change -ll to -l if m() > 1."""


Not sure what to put here, but the description "Step 5." would have the same effect. :)

anotherbugmaster

One little fix and we're done here

anotherbugmaster · 2017-11-13T17:50:52Z

gensim/parsing/porter.py

-        and also if the second 'c' is not 'w', 'x' or 'y'. This is used when trying to restore an 'e'
-        at the end of a short word, e.g. cav(e), lov(e), hop(e), crim(e),
-        but snow, box, tray.
+        """Check if b[j - 2: j + 1] make the (consonant, vowel, consonant) pattern and also


Now it's "makes", cause it's an interval... Sorry. :[
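For readers following the (consonant, vowel, consonant) discussion, here is a standalone sketch of the check. The cvc body follows the lines quoted earlier in this thread; the is_consonant helper is an assumption based on the usual Porter rule that "y" counts as a consonant only at the start of the word or after a vowel, since the body of _cons is not shown here.

```python
def is_consonant(b, i):
    """Check if b[i] is a consonant letter (assumed Porter rule for 'y')."""
    ch = b[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(b, i - 1)
    return True


def cvc(b, i):
    """Check if b[i - 2:i + 1] makes the (consonant, vowel, consonant) pattern
    and the last consonant is not 'w', 'x' or 'y'."""
    if i < 2 or not is_consonant(b, i) or is_consonant(b, i - 1) or not is_consonant(b, i - 2):
        return False
    return b[i] not in "wxy"


# Short stems from the quoted docstring: an 'e' may be restored after these...
restorable = [cvc(w, len(w) - 1) for w in ("cav", "lov", "hop", "crim")]
# ...but not after these, because they end in 'w', 'x' or 'y'.
not_restorable = [cvc(w, len(w) - 1) for w in ("snow", "box", "tray")]
```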

anotherbugmaster

Nice, a couple of minor issues.

anotherbugmaster · 2017-11-13T19:00:48Z

gensim/parsing/preprocessing.py

+    Returns
+    -------
+    str
+        Unicode string without not a word characters.


with word characters only?

anotherbugmaster · 2017-11-13T19:04:42Z

gensim/parsing/preprocessing.py

+    Returns
+    -------
+    list of (list of str)
+        List of lists, filled by unicode strings.


Processed documents split by whitespace

anotherbugmaster

That's it.

menshikh-iv · 2017-11-13T19:13:04Z

Hooray, the first PR about docstring refactoring is finished, congrats @CLearERR @anotherbugmaster 🔥 💣 👷‍♂️

…rky#1684)
* Added\fixed docstrings for strip_tags in preprocessing.py
* Added docstrings for strip_numeric, strip_non_alphanum & strip_multiple_whitespaces
* small fixes
* Added docstrings for split_alphanum, stem_text, need additional check for preprocess_string & preprocess_documents
* Fix for old stringdocs and even more!
* Additional changes for preprocessing.py and some refactoring for porter.py
* Added references for functions + some common refactoring
* Added annotations for porter.py & preprocessing.py
* Fixes for annotations
* Refactoring for Attributes and Notes fields
* Reduced some extra large docstrings
* porter.py , function _ends : changed return type from (int) to (bool)
* small fix for sections
* Cleanup porter.py
* Resolve last review
* finish with porter, yay!
* Fix preprocessing
* small changes
* Fix review comments

CLearERR added 2 commits November 2, 2017 01:40

Added\fixed docstrings for strip_tags in preprocessing.py

c2d73f4

Added docstrings for strip_numeric, strip_non_alphanum & strip_multip…

9e956a1

…le_whitespaces

menshikh-iv added the incubator project PR is RaRe incubator project label Nov 2, 2017

menshikh-iv suggested changes Nov 2, 2017


menshikh-iv and others added 2 commits November 2, 2017 12:07

small fixes

aa24e35

Added docstrings for split_alphanum, stem_text, need additional check…

f07f3d5

… for preprocess_string & preprocess_documents

menshikh-iv suggested changes Nov 3, 2017


Fix for old stringdocs and even more!

c45e2e9

piskvorky requested changes Nov 4, 2017


menshikh-iv suggested changes Nov 6, 2017


Additional changes for preprocessing.py and some refactoring for port…

67e82ab

…er.py

menshikh-iv suggested changes Nov 7, 2017


Added references for functions + some common refactoring

fbfe216

menshikh-iv suggested changes Nov 8, 2017


CLearERR added 2 commits November 9, 2017 00:22

Added annotations for porter.py & preprocessing.py

f789d6b

Fixes for annotations

609b02b

menshikh-iv requested a review from anotherbugmaster November 9, 2017 06:31

anotherbugmaster suggested changes Nov 9, 2017


CLearERR added 3 commits November 10, 2017 01:52

Refactoring for Attributes and Notes fields

96ab01f

Reduced some extra large docstrings

4ad3970

porter.py , function _ends : changed return type from (int) to (bool)

46eb64a

menshikh-iv added 2 commits November 13, 2017 18:22

small fix for sections

777c9e3

Cleanup porter.py

4c01c73

anotherbugmaster suggested changes Nov 13, 2017


Resolve last review

8bb3000

anotherbugmaster suggested changes Nov 13, 2017


menshikh-iv added 2 commits November 13, 2017 22:57

finish with porter, yay!

83d6efd

Fix preprocessing

41f42b3

anotherbugmaster suggested changes Nov 13, 2017


menshikh-iv added 2 commits November 14, 2017 00:06

small changes

34edac9

Fix review comments

b579026

anotherbugmaster approved these changes Nov 13, 2017


menshikh-iv approved these changes Nov 13, 2017


menshikh-iv merged commit 0a88a08 into piskvorky:develop Nov 13, 2017

piskvorky mentioned this pull request Apr 30, 2018

Documentation fixes #2037

Open
