[WIP] Refactor documentation API Reference for gensim.parsing #1684

Merged: 19 commits, Nov 13, 2017
Changes from 5 commits
263 changes: 261 additions & 2 deletions gensim/parsing/preprocessing.py


def remove_stopwords(s):
Owner (@piskvorky):
Since we're refactoring, the set of stopwords should be a parameter: remove_stopwords(s, stopwords=STOPWORDS). (A small sketch of this signature follows the function body below.)

IIRC @menshikh-iv had plans to remove this entire package, so not sure if this is relevant.

Contributor:
@piskvorky as I investigated, this package is actively used; for this reason it will be moved (and slightly refactored).

"""Takes string, removes all words that are among stopwords.
Owner (@piskvorky):
Coding style: docstrings in the imperative mood: "Do X", not "Does X".

Owner (@piskvorky):
Can you start the docstring on its own line, rather than continuing right after the """? It's harder to read otherwise.


Owner (@piskvorky):
Hmm, ok.


Parameters
----------
s : str

Returns
-------
str
Unicode string without stopwords.

Examples
--------
>>> from gensim.parsing.preprocessing import remove_stopwords
>>> s = "Better late than never, but better never late."
>>> remove_stopwords(s)
u'Better late never, better late.'

"""

Owner (@piskvorky):
Coding style (PEP 257): no blank line before or after the docstring.

s = utils.to_unicode(s)
return " ".join(w for w in s.split() if w not in STOPWORDS)
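For illustration only, a minimal sketch of what the two suggestions above could look like together (stopwords passed as a parameter, no blank lines around the docstring per PEP 257); this is a sketch under those assumptions, not the final refactored code:

from gensim import utils
from gensim.parsing.preprocessing import STOPWORDS

def remove_stopwords(s, stopwords=STOPWORDS):
    """Remove from `s` all words that appear in `stopwords`."""
    s = utils.to_unicode(s)
    return " ".join(w for w in s.split() if w not in stopwords)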



def strip_punctuation(s):
"""Takes string, replaces all punctuation characters with spaces.

Parameters
----------
s : str

Returns
-------
str
Unicode string without punctuation characters.

Examples
--------
>>> from gensim.parsing.preprocessing import strip_punctuation
>>> s = "A semicolon is a stronger break than a comma, but not as much as a full stop!"
>>> strip_punctuation(s)
u'A semicolon is a stronger break than a comma but not as much as a full stop '

"""

s = utils.to_unicode(s)
return RE_PUNCT.sub(" ", s)


# unicode.translate cannot delete characters like str can
strip_punctuation2 = strip_punctuation
"""
Same as strip_punctuation
"""

Owner (@piskvorky):
That won't work, this is not how docstrings work.

Contributor:
No need for a docstring here; this will be removed in the refactoring.

# def strip_punctuation2(s):
# s = utils.to_unicode(s)
# return s.translate(None, string.punctuation)
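As a side note (illustration only, not part of the diff), the Python 2 behaviour that the comment and the commented-out code above refer to:

import string

byte_str = "Hello, world!"
print(byte_str.translate(None, string.punctuation))  # str.translate can delete characters: 'Hello world'

uni_str = u"Hello, world!"
# uni_str.translate(None, string.punctuation) raises TypeError on unicode objects;
# unicode.translate needs a mapping of code points instead:
print(uni_str.translate({ord(c): None for c in string.punctuation}))  # u'Hello world'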


def strip_tags(s):
"""Takes string and removes tags.

Parameters
----------
s : str
Contributor:
Please add a description to the argument (for example "Input string."), here and everywhere else.


Returns
-------
str
Unicode string without tags.

Examples
--------
>>> from gensim.parsing.preprocessing import strip_tags
>>> s = "<i>Hello</i> <b>World</b>!"
>>> strip_tags(s)
u'Hello World!'

"""

s = utils.to_unicode(s)
return RE_TAGS.sub("", s)


def strip_short(s, minsize=3):
"""Takes string and removes words with length lesser than minsize (default = 3).

Parameters
----------
s : str
minsize : int, optional

Returns
-------
str
Unicode string without words with length lesser than minsize.

Contributor:
Redundant newline.


Examples
--------
>>> from gensim.parsing.preprocessing import strip_short
>>> s = "salut les amis du 59"
>>> strip_short(s)
u'salut les amis'

>>> from gensim.parsing.preprocessing import strip_short
>>> s = "one two three four five six seven eight nine ten"
>>> strip_short(s,5)
u'three seven eight'

"""

s = utils.to_unicode(s)
return " ".join(e for e in s.split() if len(e) >= minsize)



def strip_numeric(s):
"""Takes string and removes digits from it.
Owner (@piskvorky):
Coding style: docstrings in Python should be in the imperative mood: "Do X", not "Does X".


Parameters
----------
s : str

Returns
-------
str
Unicode string without digits.

Examples
--------
>>> from gensim.parsing.preprocessing import strip_numeric
>>> s = "0text24gensim365test"
>>> strip_numeric(s)
u'textgensimtest'

"""

s = utils.to_unicode(s)
return RE_NUMERIC.sub("", s)



def strip_non_alphanum(s):
"""Takes string and removes not a word characters from it.
(Word characters - alphanumeric & underscore)

Parameters
----------
s : str

Returns
-------
str
Unicode string without not a word characters.
Contributor:
with word characters only?


Examples
--------
>>> from gensim.parsing.preprocessing import strip_non_alphanum
>>> s = "if-you#can%read$this&then@this#method^works"
>>> strip_non_alphanum(s)
u'if you can read this then this method works'

"""

s = utils.to_unicode(s)
return RE_NONALPHA.sub(" ", s)



def strip_multiple_whitespaces(s):
r"""Takes string, removes repeating in a row whitespace characters (spaces, tabs, line breaks) from it
and turns tabs & line breaks into spaces.

Owner (@piskvorky):
Why is this docstring r"""?

Contributor (@menshikh-iv, Nov 5, 2017):
This is a special case: we use \n and \r in the Examples section, and Sphinx goes mad on them. The solution is to use a raw string here. (See the short sketch after this function.)

Parameters
----------
s : str

Returns
-------
str
Unicode string without repeating in a row whitespace characters.

Examples
--------
>>> from gensim.parsing.preprocessing import strip_multiple_whitespaces
>>> s = "salut" + '\r' + " les" + '\n' + " loulous!"
>>> strip_multiple_whitespaces(s)
u'salut les loulous!'

"""

s = utils.to_unicode(s)
return RE_WHITESPACE.sub(" ", s)
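For illustration (not part of the diff), a tiny sketch of why the raw docstring matters: without the r prefix, the \n and \r escapes written in the Examples section would turn into real control characters inside the docstring, which is what confuses Sphinx.

plain = "newline: '\n', carriage return: '\r'"   # escapes become real control characters
raw = r"newline: '\n', carriage return: '\r'"    # backslashes stay literal, as intended for the docs
print(len(raw) - len(plain))  # 2: the two backslashes survive only in the raw string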



def split_alphanum(s):
"""Takes string, adds spaces between digits & letters.

Parameters
----------
s : str

Returns
-------
str
Unicode string with spaces between digits & letters.

Examples
--------
>>> from gensim.parsing.preprocessing import split_alphanum
>>> s = "24.0hours7 days365 a1b2c3"
>>> split_alphanum(s)
u'24.0 hours 7 days 365 a 1 b 2 c 3'

"""

s = utils.to_unicode(s)
s = RE_AL_NUM.sub(r"\1 \2", s)
return RE_NUM_AL.sub(r"\1 \2", s)


def stem_text(text):
"""Takes string, transforms it into lowercase and (porter-)stemmed version.

Parameters
----------
text : str

Returns
-------
str
Lowercase and (porter-)stemmed version of string `text`.

Examples
--------
>>> from gensim.parsing.preprocessing import stem_text
>>> text = "While it is quite useful to be able to search a large collection of documents almost instantly for a joint occurrence of a collection of exact words, for many searching purposes, a little fuzziness would help. "
>>> stem_text(text)
u'while it is quit us to be abl to search a larg collect of document almost instantli for a joint occurr of a collect of exact words, for mani search purposes, a littl fuzzi would help.'

"""

text = utils.to_unicode(text)
p = PorterStemmer()
return ' '.join(p.stem(word) for word in text.split())


stem = stem_text



DEFAULT_FILTERS = [
lambda x: x.lower(), strip_tags, strip_punctuation,
strip_multiple_whitespaces, strip_numeric,


def preprocess_string(s, filters=DEFAULT_FILTERS):
"""Takes string, applies list of chosen filters to it, where filters are methods from this module. Default list of filters consists of: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text. <function <lambda>> in signature means that we use lambda function for applying methods to filters.
Owner (@piskvorky):
Coding style: line way too long.

Contributor:
Use references instead of raw text, i.e.

:func:`~gensim.parsing.preprocessing.strip_tags`

instead of strip_tags (here and everywhere else). (A sketch with these references follows the function body below.)


Parameters
----------
s : str
filters: list, optional

Returns
-------
list
List of unicode strings.

Examples
--------
>>> from gensim.parsing.preprocessing import preprocess_string
>>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?"
>>> preprocess_string(s)
[u'hel', u'rld', u'weather', u'todai', u'isn']

>>> from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation
>>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?"
>>> CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
>>> preprocess_string(s,CUSTOM_FILTERS)
Owner (@piskvorky):
Coding style: space after comma.

[u'hel', u'9lo', u'wo9', u'rld', u'th3', u'weather', u'is', u'really', u'g00d', u'today', u'isn', u't', u'it']

"""

s = utils.to_unicode(s)
for f in filters:
s = f(s)
return s.split()
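For illustration only, roughly how the summary of this docstring could read once the suggested cross-references are used (a sketch, not the merged wording):

def preprocess_string(s, filters=DEFAULT_FILTERS):
    """Apply the list of chosen filters to `s`.

    The default list of filters is:

    * :func:`~gensim.parsing.preprocessing.strip_tags`,
    * :func:`~gensim.parsing.preprocessing.strip_punctuation`,
    * :func:`~gensim.parsing.preprocessing.strip_multiple_whitespaces`,
    * :func:`~gensim.parsing.preprocessing.strip_numeric`,
    * :func:`~gensim.parsing.preprocessing.remove_stopwords`,
    * :func:`~gensim.parsing.preprocessing.strip_short`,
    * :func:`~gensim.parsing.preprocessing.stem_text`.

    """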


def preprocess_documents(docs):
"""Takes list of strings, splits it into sentences, then applies default filters to every sentence.
Owner (@piskvorky):
I don't see any splitting into sentences, where does that come from?


Parameters
----------
docs : list
Contributor (@anotherbugmaster, Nov 9, 2017):
Add the description of docs here though.


Returns
-------
list
Contributor:
Also, you could write list of (list of str) to specify the exact type.

List of lists, filled by unicode strings.
Contributor:
Processed documents split by whitespace


Examples
--------
>>> from gensim.parsing.preprocessing import preprocess_documents
>>> s = ["<i>Hel 9lo</i> <b>Wo9 rld</b>!", "Th3 weather_is really g00d today, isn't it?"]
>>> preprocess_documents(s)
[[u'hel', u'rld'], [u'weather', u'todai', u'isn']]

"""

return [preprocess_string(d) for d in docs]


def read_file(path):
r"""Reads file in specified directory.
Owner (@piskvorky):
This entire function should be removed, it's too trivial.


Parameters
----------
path : str

Returns
-------
list
List of unicode strings.
Owner (@piskvorky):
Doesn't match the example.


Examples
--------
>>> from gensim.parsing.preprocessing import read_file
>>> path = "/media/work/october_2017/gensim/gensim/test/test_data/mihalcea_tarau.summ.txt"
Contributor:
This path will work only on your filesystem; utils to retrieve paths to test files will be ready very soon. (A hypothetical sketch follows the function body below.)

>>> read_file(path)
"Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas.\nThe National Hurricane Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.\nThe National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a ``broad area of cloudiness and heavy weather'' rotating around the center of the storm.\nStrong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet feet to Puerto Rico's south coast."

"""

with utils.smart_open(path) as fin:
return fin.read()
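As an illustration of what the comment above anticipates, assuming the test-data helper ends up exposed as gensim.test.utils.datapath (a hypothetical name at the time of this review), the example could be made portable:

>>> from gensim.test.utils import datapath
>>> from gensim.parsing.preprocessing import read_file
>>> path = datapath("mihalcea_tarau.summ.txt")  # resolves inside gensim/test/test_data
>>> read_file(path)[:40]
'Hurricane Gilbert swept toward the Domin'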
