[WIP] Refactor documentation API Reference for gensim.parsing #1684
@@ -40,6 +40,26 @@
def remove_stopwords(s):
    """Takes string, removes all words those are among stopwords.

Review comment: Coding style: docstrings in imperative mode: "Do X", not "Does X".
Review comment: Can you start the docstring on its own line, not continue after the opening quotes?
Review comment: It's numpy-style convention.
Review comment: Hmm, ok.

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string without stopwords.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import remove_stopwords
    >>> s = "Better late than never, but better never late."
    >>> remove_stopwords(s)
    u'Better late never, better late.'

    """

Review comment: Coding style (PEP257): no blank line before or after docstring.

    s = utils.to_unicode(s)
    return " ".join(w for w in s.split() if w not in STOPWORDS)
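To make the style points above concrete, here is a sketch of how the summary line could read in imperative mood while keeping the numpy-style convention that was settled on (summary starting right after the opening quotes); the body is a stub, not gensim's implementation:

```python
def remove_stopwords(s):
    """Remove all words that are among stopwords from `s`.

    Parameters
    ----------
    s : str
        Input string.

    Returns
    -------
    str
        Unicode string without stopwords.

    """
    raise NotImplementedError  # stub: only the docstring shape matters here
```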
@@ -48,12 +68,36 @@ def remove_stopwords(s):


def strip_punctuation(s):
    """Takes string, replaces all punctuation characters with spaces.

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string without punctuation characters.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_punctuation
    >>> s = "A semicolon is a stronger break than a comma, but not as much as a full stop!"
    >>> strip_punctuation(s)
    u'A semicolon is a stronger break than a comma but not as much as a full stop '

    """

    s = utils.to_unicode(s)
    return RE_PUNCT.sub(" ", s)


# unicode.translate cannot delete characters like str can
strip_punctuation2 = strip_punctuation
"""
Same as strip_punctuation
"""

Review comment: That won't work, this is not how docstrings work.
Review comment: No need for a docstring here, this will be removed in refactoring.

# def strip_punctuation2(s):
#     s = utils.to_unicode(s)
#     return s.translate(None, string.punctuation)
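The commented-out strip_punctuation2 relies on Python 2's `str.translate(None, deletechars)` form, which unicode objects don't support (hence the "unicode.translate cannot delete characters like str can" comment). A rough Python 3 equivalent, shown here only as an illustration, passes a mapping-to-None table instead; note it deletes punctuation outright rather than replacing it with spaces as strip_punctuation does:

```python
import string

def strip_punctuation2(s):
    # In Python 3, str.translate deletes every character mapped to None,
    # so no separate "deletechars" argument is needed.
    table = {ord(c): None for c in string.punctuation}
    return s.translate(table)

print(strip_punctuation2("Hello, World!"))  # -> Hello World
```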
@@ -63,11 +107,58 @@ def strip_punctuation(s):


def strip_tags(s):
    """Takes string and removes tags.

    Parameters
    ----------
    s : str

Review comment: Please add a description to the argument (for example …).

    Returns
    -------
    str
        Unicode string without tags.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_tags
    >>> s = "<i>Hello</i> <b>World</b>!"
    >>> strip_tags(s)
    u'Hello World!'

    """

    s = utils.to_unicode(s)
    return RE_TAGS.sub("", s)


def strip_short(s, minsize=3):
    """Takes string and removes words with length lesser than minsize (default = 3).

    Parameters
    ----------
    s : str
    minsize : int, optional

    Returns
    -------
    str
        Unicode string without words with length lesser than minsize.

Review comment: Redundant newline.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_short
    >>> s = "salut les amis du 59"
    >>> strip_short(s)
    u'salut les amis'

    >>> from gensim.parsing.preprocessing import strip_short
    >>> s = "one two three four five six seven eight nine ten"
    >>> strip_short(s,5)
    u'three seven eight'

    """

    s = utils.to_unicode(s)
    return " ".join(e for e in s.split() if len(e) >= minsize)
@@ -76,6 +167,26 @@ def strip_short(s, minsize=3):


def strip_numeric(s):
    """Takes string and removes digits from it.

Review comment: Coding style: docstrings in Python should be in imperative mode: "Do X", not "Does X".

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string without digits.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_numeric
    >>> s = "0text24gensim365test"
    >>> strip_numeric(s)
    u'textgensimtest'

    """

    s = utils.to_unicode(s)
    return RE_NUMERIC.sub("", s)
@@ -84,6 +195,27 @@ def strip_numeric(s):


def strip_non_alphanum(s):
    """Takes string and removes not a word characters from it.
    (Word characters - alphanumeric & underscore)

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string without not a word characters.

Review comment: "with word characters only"?

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_non_alphanum
    >>> s = "if-you#can%read$this&then@this#method^works"
    >>> strip_non_alphanum(s)
    u'if you can read this then this method works'

    """

    s = utils.to_unicode(s)
    return RE_NONALPHA.sub(" ", s)
@@ -92,6 +224,27 @@ def strip_non_alphanum(s):


def strip_multiple_whitespaces(s):
    r"""Takes string, removes repeating in a row whitespace characters (spaces, tabs, line breaks) from it
    and turns tabs & line breaks into spaces.

Review comment: Why is this docstring an r-string?
Review comment: This is the special case, because we used escape sequences in the example below.

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string without repeating in a row whitespace characters.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_multiple_whitespaces
    >>> s = "salut" + '\r' + " les" + '\n' + " loulous!"
    >>> strip_multiple_whitespaces(s)
    u'salut les loulous!'

    """

    s = utils.to_unicode(s)
    return RE_WHITESPACE.sub(" ", s)
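A small standalone illustration (not gensim's code) of why an r-prefix matters for a docstring like the one above: without it, the `\r` in the example would become a literal carriage-return character inside the doc text, breaking the doctest.

```python
normal = "line with \r escape"   # '\r' is a real carriage-return character here (1 char)
raw = r"line with \r escape"     # r-prefix keeps two characters: backslash + 'r'

# The raw version is exactly one character longer, and keeps a visible backslash.
print(len(normal), len(raw))
assert "\\r" in raw and "\\r" not in normal
```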
@@ -101,22 +254,61 @@ def strip_multiple_whitespaces(s):


def split_alphanum(s):
    """Takes string, adds spaces between digits & letters.

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string with spaces between digits & letters.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import split_alphanum
    >>> s = "24.0hours7 days365 a1b2c3"
    >>> split_alphanum(s)
    u'24.0 hours 7 days 365 a 1 b 2 c 3'

    """

    s = utils.to_unicode(s)
    s = RE_AL_NUM.sub(r"\1 \2", s)
    return RE_NUM_AL.sub(r"\1 \2", s)


def stem_text(text):
    """Takes string, tranforms it into lowercase and (porter-)stemmed version.

    Parameters
    ----------
    text : str

    Returns
    -------
    str
        Lowercase and (porter-)stemmed version of string `text`.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import stem_text
    >>> text = "While it is quite useful to be able to search a large collection of documents almost instantly for a joint occurrence of a collection of exact words, for many searching purposes, a little fuzziness would help. "
    >>> stem_text(text)
    u'while it is quit us to be abl to search a larg collect of document almost instantli for a joint occurr of a collect of exact words, for mani search purposes, a littl fuzzi would help.'

    """
    Return lowercase and (porter-)stemmed version of string `text`.
    """

    text = utils.to_unicode(text)
    p = PorterStemmer()
    return ' '.join(p.stem(word) for word in text.split())


stem = stem_text


DEFAULT_FILTERS = [
    lambda x: x.lower(), strip_tags, strip_punctuation,
    strip_multiple_whitespaces, strip_numeric,
@@ -125,17 +317,84 @@ def stem_text(text):


def preprocess_string(s, filters=DEFAULT_FILTERS):
    """Takes string, applies list of chosen filters to it, where filters are methods from this module. Default list of filters consists of: strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short, stem_text. <function <lambda>> in signature means that we use lambda function for applying methods to filters.

Review comment: Coding style: line way too long.
Review comment: Use references to these, not raw text, i.e. cross-reference markup instead of plain function names.

    Parameters
    ----------
    s : str
    filters : list, optional

    Returns
    -------
    list
        List of unicode strings.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import preprocess_string
    >>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?"
    >>> preprocess_string(s)
    [u'hel', u'rld', u'weather', u'todai', u'isn']

    >>> from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation
    >>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?"
    >>> CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
    >>> preprocess_string(s,CUSTOM_FILTERS)

Review comment: Coding style: space after comma.

    [u'hel', u'9lo', u'wo9', u'rld', u'th3', u'weather', u'is', u'really', u'g00d', u'today', u'isn', u't', u'it']

    """

    s = utils.to_unicode(s)
    for f in filters:
        s = f(s)
    return s.split()
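The filter chain itself is just sequential function application followed by whitespace tokenization; a minimal standalone sketch, using simple stand-in filters rather than gensim's RE_PUNCT-based ones:

```python
import re

def preprocess_string(s, filters):
    # Apply each filter to the string in order, then tokenize on whitespace.
    for f in filters:
        s = f(s)
    return s.split()

# Stand-in filters: lowercase, then replace non-word/non-space characters with spaces.
CUSTOM_FILTERS = [str.lower, lambda x: re.sub(r"[^\w\s]", " ", x)]
print(preprocess_string("Hello, World!", CUSTOM_FILTERS))  # -> ['hello', 'world']
```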
def preprocess_documents(docs):
    """Takes list of strings, splits it into sentences, then applies default filters to every sentence.

Review comment: I don't see any splitting into sentences, where does that come from?

    Parameters
    ----------
    docs : list

Review comment: Add the description of `docs` here.

    Returns
    -------
    list
        List of lists, filled by unicode strings.

Review comment: Also, you could write …
Review comment: "Processed documents split by whitespace"?

    Examples
    --------
    >>> from gensim.parsing.preprocessing import preprocess_documents
    >>> s = ["<i>Hel 9lo</i> <b>Wo9 rld</b>!", "Th3 weather_is really g00d today, isn't it?"]
    >>> preprocess_documents(s)
    [[u'hel', u'rld'], [u'weather', u'todai', u'isn']]

    """

    return [preprocess_string(d) for d in docs]
|
||
def read_file(path): | ||
r"""Reads file in specified directory. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This entire function should be removed, it's too trivial. |
||
|
||
Parameters | ||
---------- | ||
path : str | ||
|
||
Returns | ||
------- | ||
list | ||
List of unicode strings. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doesn't match the example. |
||
|
||
Examples | ||
-------- | ||
>>> from gensim.parsing.preprocessing import read_file | ||
>>> path = "/media/work/october_2017/gensim/gensim/test/test_data/mihalcea_tarau.summ.txt" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This path will works only on your filesystem, utils to retrieve path to test files will be ready very soon |
||
>>> read_file(path) | ||
"Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas.\nThe National Hurricane Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.\nThe National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a ``broad area of cloudiness and heavy weather'' rotating around the center of the storm.\nStrong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet feet to Puerto Rico's south coast." | ||
|
||
""" | ||
|
||
with utils.smart_open(path) as fin: | ||
return fin.read() | ||
|
||
Review comment: Since we're refactoring, the set of stopwords should be a parameter: remove_stopwords(s, stopwords=STOPWORDS). IIRC @menshikh-iv had plans to remove this entire package, so not sure if this is relevant.
Review comment: @piskvorky as I investigated, this package is actively used; for this reason it will be moved (and slightly refactored).
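A sketch of the parametrized form suggested above; the default set here is a tiny stand-in for illustration, whereas the real default would be gensim's STOPWORDS frozenset:

```python
STOPWORDS = frozenset({"than", "but", "never"})  # stand-in subset for illustration only

def remove_stopwords(s, stopwords=STOPWORDS):
    # Callers can now pass their own stopword set instead of being locked
    # to the module-level default.
    return " ".join(w for w in s.split() if w not in stopwords)

print(remove_stopwords("Better late than never, but better never late."))
# -> Better late never, better late.
```

Note that "never," (with the trailing comma) survives, matching the doctest in the diff: tokens are compared verbatim after a whitespace split, so punctuation stripping must happen in an earlier filter.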