Document that preprocessing.strip_punctuation is limited to ASCII (#2964)

* Clarifying strip_punctuation limited to ASCII

Add "ASCII" as a qualification in the `strip_punctuation` docstring.
This is the "option 1" fix for issue #2962.

* Added code comment pointing to issue 2962

Code comment added linking to issue #2962 as a reminder of enhancement possibilities.

* update CHANGELOG.md

Co-authored-by: Michael Penkov <misha.penkov@gmail.com>
sciatro and mpenkov committed Jun 29, 2021
1 parent d59a241 commit dab0369
Showing 2 changed files with 3 additions and 2 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -26,7 +26,7 @@ Changes
 * [#3141](https://github.com/RaRe-Technologies/gensim/pull/3141): Update link for online LDA paper, by [@dymil](https://github.com/dymil)
 * [#3148](https://github.com/RaRe-Technologies/gensim/pull/3148): Fix broken link in documentation, by [@rohit901](https://github.com/rohit901)
 * [#3155](https://github.com/RaRe-Technologies/gensim/pull/3155): Correct parameter name in documentation of fasttext.py, by [@bizzyvinci](https://github.com/bizzyvinci)
-
+* [#2964](https://github.com/RaRe-Technologies/gensim/pull/2964): Document that preprocessing.strip_punctuation is limited to ASCII, by [@sciatro](https://github.com/sciatro)
 ## 4.0.1, 2021-04-01

 Bugfix release to address issues with Wheels on Windows:
3 changes: 2 additions & 1 deletion gensim/parsing/preprocessing.py
@@ -94,7 +94,7 @@ def remove_stopwords(s):


 def strip_punctuation(s):
-    """Replace punctuation characters with spaces in `s` using :const:`~gensim.parsing.preprocessing.RE_PUNCT`.
+    """Replace ASCII punctuation characters with spaces in `s` using :const:`~gensim.parsing.preprocessing.RE_PUNCT`.
     Parameters
     ----------
@@ -115,6 +115,7 @@ def strip_punctuation(s):
     """
     s = utils.to_unicode(s)
+    # For unicode enhancement options see https://github.com/RaRe-Technologies/gensim/issues/2962
     return RE_PUNCT.sub(" ", s)
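
The limitation documented by this change can be illustrated with a standalone sketch. It does not import gensim. `ASCII_PUNCT` is an assumption that `RE_PUNCT` is a character class built from `string.punctuation`, and `strip_unicode_punctuation` is a hypothetical helper showing one possible enhancement, not necessarily any of the options discussed in issue #2962.

```python
import re
import string
import unicodedata

# Assumption: mirrors how RE_PUNCT is believed to be defined (ASCII characters only).
ASCII_PUNCT = re.compile(r"([%s])+" % re.escape(string.punctuation), re.UNICODE)


def strip_ascii_punctuation(text):
    """Replace each run of ASCII punctuation with a single space."""
    return ASCII_PUNCT.sub(" ", text)


def strip_unicode_punctuation(text):
    """Hypothetical variant: replace every character whose Unicode
    general category starts with 'P' (punctuation) with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


sample = "Hello, world! «Bonjour» “quoted”"

# The comma and exclamation mark are replaced; the guillemets and curly quotes remain.
ascii_result = strip_ascii_punctuation(sample)

# Guillemets and curly quotes are replaced as well.
unicode_result = strip_unicode_punctuation(sample)
```

The two approaches are not interchangeable: characters such as `$`, `+`, and `~` appear in `string.punctuation` but belong to Unicode symbol ("S") categories, so a category-based variant like the one sketched here would leave them in place.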

