str.replace('.','') should replace every character? (fix) #24809

alarangeiras · 2019-01-16T21:42:14Z

[x] closes str.replace('.','') should replace every character? #24804
[x] tests added / passed
[x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
[x] fix replace pattern problem

pep8speaks · 2019-01-16T21:42:19Z

Hello @alarangeiras! Thanks for updating the PR.

In the file pandas/core/strings.py, following are the PEP8 issues :

Line 580:17: E225 missing whitespace around operator
Line 580:17: E711 comparison to None should be 'if cond is None:'
Line 581:80: E501 line too long (92 > 79 characters)
Line 581:93: W291 trailing whitespace
Line 582:29: E127 continuation line over-indented for visual indent
Line 582:80: E501 line too long (98 > 79 characters)

In the file pandas/tests/test_strings.py, following are the PEP8 issues :

Line 1024:49: W291 trailing whitespace

Comment last updated on January 17, 2019 at 17:34 Hours UTC

WillAyd

Looks generally good. Can you add a whatsnew note as well?

WillAyd · 2019-01-16T21:45:02Z

pandas/tests/test_strings.py

+        values = Series(['abc','123'])
+
+        result = values.str.replace('.', 'foo')
+        exp = Series(['foofoofoo', 'foofoofoo'])


Can you name this expected?

Sure, should I make another PR?

No just add as a commit and push on the same branch.

Looks like CI failed too. I haven't checked, but make sure tests pass locally before pushing.

Yes, I've seen that.
Actually, I know what the problem is.
The problem is the test test_pipe_failures. It was built to test a character replacement: pipe to whitespace.
But the pipe is also a regex metacharacter.
When I fixed the replace behavior, this test broke.
My proposal is to change this test to pass the regex=False parameter, like below:

    def test_pipe_failures(self):
        # #2119
        s = Series(['A|B|C'])

        result = s.str.split('|')
        exp = Series([['A', 'B', 'C']])

        tm.assert_series_equal(result, exp)

        result = s.str.replace('|', ' ', regex=False)
        exp = Series(['A B C'])

        tm.assert_series_equal(result, exp)

What do you think about that?
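
For reference, a minimal standard-library sketch of why the unmodified test breaks once single-character patterns go through the regex path. It assumes the regex branch behaves like Python's re.sub and the literal branch like str.replace; the variable names are illustrative only.

```python
import re

text = "A|B|C"

# As a regex, a bare "|" is an alternation of two empty patterns, so it
# matches the empty string at every position and leaves the pipes in place.
as_regex = re.sub("|", " ", text)

# Literal replacement, which is what regex=False asks for.
as_literal = text.replace("|", " ")

# Escaping the pattern is the other way to get literal matching.
as_escaped = re.sub(re.escape("|"), " ", text)

print(repr(as_regex))
print(repr(as_literal))
print(repr(as_escaped))
```

Only the literal and escaped forms produce 'A B C', which is the value the test expects.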

WillAyd · 2019-01-16T21:45:24Z

pandas/tests/test_strings.py

@@ -1008,6 +1008,13 @@ def test_replace(self):
                    values = klass(data)
                    pytest.raises(TypeError, values.str.replace, 'a', repl)

+    def test_replace_single_pattern(self):
+        values = Series(['abc','123'])


Can you add a comment for the issue (# GH 24804)

WillAyd · 2019-01-16T21:46:04Z

pandas/core/strings.py

@@ -564,7 +564,7 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
            # add case flag, if provided
            if case is False:
                flags |= re.IGNORECASE
-        if is_compiled_re or len(pat) > 1 or flags or callable(repl):
+        if is_compiled_re or len(pat) > 0 or flags or callable(repl):


Do we even need the len(pat) condition? Can it just be pat instead?

It can be just pat; the only issue is the case where pat is empty.

Wouldn’t that be False in either case? If so shouldn’t need the len expression
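
A quick check of the truthiness point, as a sketch: for a str pattern, bool(pat) and len(pat) > 0 agree, so the condition can test pat directly. The helper name takes_regex_path is hypothetical and only mirrors the shape of the condition in the diff above.

```python
def takes_regex_path(pat, repl="", flags=0, is_compiled_re=False):
    # Hypothetical helper; same shape as the condition in the diff,
    # with len(pat) > 0 replaced by plain truthiness of pat.
    return bool(is_compiled_re or pat or flags or callable(repl))


for pat in ["", ".", "ab"]:
    # An empty string is falsy, any non-empty string is truthy.
    assert bool(pat) == (len(pat) > 0)

print(takes_regex_path(""))    # empty pattern, nothing else set
print(takes_regex_path("."))   # single-character pattern
```

With an empty pattern the other operands (flags, a callable repl, a compiled pattern) still decide the branch, exactly as they would with the len() form.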

- fixing test_pipe_failures (it's not a regex test, it's a char test)

codecov · 2019-01-16T23:34:57Z

Codecov Report

Merging #24809 into master will decrease coverage by 49.46%.
The diff coverage is 0%.

@@             Coverage Diff             @@
##           master   #24809       +/-   ##
===========================================
- Coverage   92.38%   42.92%   -49.47%     
===========================================
  Files         166      166               
  Lines       52382    52382               
===========================================
- Hits        48395    22485    -25910     
- Misses       3987    29897    +25910

Flag	Coverage Δ
#multiple	`?`
#single	`42.92% <0%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/strings.py	`33% <0%> (-65.59%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.15%)`	⬇️
... and 124 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 17a6bc5...340ae89.

codecov · 2019-01-16T23:34:57Z

Codecov Report

Merging #24809 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24809      +/-   ##
==========================================
- Coverage   92.38%   92.38%   -0.01%     
==========================================
  Files         166      166              
  Lines       52382    52382              
==========================================
- Hits        48395    48394       -1     
- Misses       3987     3988       +1

Flag	Coverage Δ
#multiple	`90.8% <100%> (-0.01%)`	⬇️
#single	`42.92% <0%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.44% <100%> (-0.15%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 17a6bc5...93a0715.

WillAyd

Can you add a whatsnew entry? Think this is ok for 0.24 but cc @jreback

WillAyd · 2019-01-17T05:49:10Z

pandas/tests/test_strings.py

+        values = Series(['abc', '123'])
+
+        result = values.str.replace('.', 'foo')
+        chars_replaced_expected = Series(['foofoofoo', 'foofoofoo'])


Can just call this expected

WillAyd · 2019-01-17T05:51:52Z

pandas/tests/test_strings.py

@@ -2924,7 +2932,7 @@ def test_pipe_failures(self):

        tm.assert_series_equal(result, exp)

-        result = s.str.replace('|', ' ')
+        result = s.str.replace('|', ' ', regex=False)


Hmm Ok. I think this is correct but arguably an API breaking change so make sure we make note of that in the whatsnew

how is this an api change? regex=True is the default

Was just thinking of this particular instance and anywhere a user was passing in a single character that may have special meaning in a regex. This previously would directly replace a pipe but now requires regex=False in user code, so it could cause some breakage.

Just being extra conservative; I'm not tied to the request if you feel it's over-communicating.
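
To gauge how many single characters are affected, here is a rough sketch using only the standard library. It assumes the regex path behaves like re.sub and the literal path like str.replace; the classify helper and the sample string are made up for illustration.

```python
import re
import string


def classify(ch, sample):
    """Compare literal and regex replacement of a one-character pattern."""
    literal = sample.replace(ch, "_")
    try:
        as_regex = re.sub(ch, "_", sample)
    except re.error:
        # Characters such as "*" or "(" are not valid patterns on their own.
        return "invalid regex"
    return "same" if literal == as_regex else "different"


sample = "a.b|c$d"
report = {ch: classify(ch, sample) for ch in string.punctuation}

for ch, outcome in report.items():
    if outcome != "same":
        print(repr(ch), outcome)
```

Characters reported as "different" would silently change results under the fix, while those reported as "invalid regex" would start raising instead.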

Exactly, but if you think this documentation is not necessary, let me know and I can change.

Hmm. This is a potentially disruptive change...

Can we:

Change the default regex=None for .str.replace

Detect when a length-1 character is a regex symbol

Warn that it'll change in the future to interpret that character as a regex, not a literal

set regex=False for now to preserve the old (buggy) behavior?

OK I agree - that's probably the best go-forward path, save the first point which I don't understand.

@alarangeiras can you raise a FutureWarning here instead?

> save the first point which I don't understand.

Changing the default regex=None? That's so we can detect if we need a warning or not.

If the user passes .str.replace('.', 'b', regex=True), we know to interpret the . as re.compile('.'), so the output would be 'bbb'.

If the user passes .str.replace('.', 'b', regex=False), we know that they want a literal ., so the output is 'abc'.

We'll use regex=None to see if the user is explicit or not.
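
A minimal sketch of that idea on plain strings, assuming a regex=None sentinel means "the caller did not choose". The function replace_one and the REGEX_SPECIAL set are hypothetical stand-ins for illustration, not the pandas implementation.

```python
import re
import warnings

# Characters with special meaning in Python's re syntax (assumed list).
REGEX_SPECIAL = set(".^$*+?{}[]\\|()")


def replace_one(text, pat, repl, regex=None):
    """Hypothetical scalar stand-in for Series.str.replace."""
    if regex is None:
        if len(pat) == 1 and pat in REGEX_SPECIAL:
            # Caller was not explicit: keep the old literal behavior for
            # now, but warn that the interpretation will change.
            warnings.warn(
                f"Interpreting {pat!r} as a literal, not a regex. "
                "The default will change in the future.",
                FutureWarning,
                stacklevel=2,
            )
            regex = False
        else:
            regex = True
    if regex:
        return re.sub(pat, repl, text)
    return text.replace(pat, repl)


print(replace_one("a.c", ".", "b", regex=True))   # explicit regex, no warning
print(replace_one("a.c", ".", "b", regex=False))  # explicit literal, no warning
```

Calling replace_one("a.c", ".", "b") without the keyword takes the literal path and emits the FutureWarning, which is the one case a plain True/False default cannot distinguish.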

Why not just warn when the length of the pattern is 1 and regex=True? Whether or not the user explicitly typed that or relied on the default argument they'd hit the same bug at the end of the day. Don't see value in introducing a None value into a True/False field currently

> Why not just warn when the length of the pattern is 1 and regex=True?

Then I think there would be no way to have

    In [5]: pd.Series(['a.c']).str.replace('.', 'b')
    # Warning: Interpreting '.' as a literal, not a regex... The default will change in the future.
    Out[5]:
    0    abc
    dtype: object

    # no warning
    In [5]: pd.Series(['a.c']).str.replace('.', 'b', regex=True)
    Out[5]:
    0    bbb
    dtype: object

unless I'm missing something.

Following this line of reasoning, from what I understand, every bug found should issue a warning of future adjustment?

- adding whatsnew entry and a note for API breaking change

alarangeiras · 2019-01-17T10:54:26Z

I think now it's ok.

WillAyd · 2019-01-17T11:03:16Z

doc/source/whatsnew/v0.24.0.rst

@@ -795,6 +795,9 @@ Now, the return type is consistently a :class:`DataFrame`.
   and a :class:`DataFrame` with sparse values. The memory usage will
   be the same as in the previous version of pandas.

+   Be sure to perform a replace of literal strings by passing the


Can you move this to the section on breaking changes and show a before / after of the behavior?

- adding before and after example

jreback · 2019-01-17T12:30:57Z

doc/source/whatsnew/v0.24.0.rst

+
+Be sure to perform a replace of literal strings by passing the
+regex=False parameter to func:`str.replace`. Mainly when the 
+pattern is 1 size string (:issue:`24809`)


this is not needed, this is a simple bug fix

jreback · 2019-01-17T12:32:08Z

pandas/tests/test_strings.py

@@ -2924,7 +2932,7 @@ def test_pipe_failures(self):

        tm.assert_series_equal(result, exp)

-        result = s.str.replace('|', ' ')
+        result = s.str.replace('|', ' ', regex=False)


how is this an api change? regex=True is the default

alarangeiras · 2019-01-17T13:32:36Z

@WillAyd, is there a consensus about how to document this issue?

TomAugspurger · 2019-01-17T13:29:52Z

doc/source/whatsnew/v0.24.0.rst

@@ -1645,6 +1669,7 @@ Strings
 - Bug in :meth:`Index.str.split` was not nan-safe (:issue:`23677`).
 - Bug :func:`Series.str.contains` not respecting the ``na`` argument for a ``Categorical`` dtype ``Series`` (:issue:`22158`)
 - Bug in :meth:`Index.str.cat` when the result contained only ``NaN`` (:issue:`24044`)
+- Bug in :func:`Series.str.replace` not applying regex in patterns of len size = 1 (:issue:`24809`)


"len size = 1" -> "length 1".

TomAugspurger · 2019-01-17T13:33:06Z

pandas/tests/test_strings.py

@@ -2924,7 +2932,7 @@ def test_pipe_failures(self):

        tm.assert_series_equal(result, exp)

-        result = s.str.replace('|', ' ')
+        result = s.str.replace('|', ' ', regex=False)


Hmm. This is a potentially disruptive change...

Can we:

Change the default regex=None for .str.replace

Detect when a length-1 character is a regex symbol

Warn that it'll change in the future to interpret that character as a regex, not a literal

set regex=False for now to preserve the old (buggy) behavior?

WillAyd · 2019-01-17T15:54:52Z

Ah I see your point now - you’d essentially be doing that on top of the change here. I was assuming we would hold off on this change in lieu of the warning.


TomAugspurger · 2019-01-17T16:02:37Z

No. Just ones as potentially disruptive as this.


alarangeiras · 2019-01-17T17:59:01Z

Sorry, I don't agree with this solution. I think it's best if someone else makes this fix.

Liam3851 · 2019-01-17T18:06:03Z

Just noting the potential for this confusion was brought up when we added the regex parameter, though it didn't generate much discussion: #16808 (comment)

At the time I noted that changing this behavior would break back-compat (since the undocumented behavior that had been there since the beginning was literal replacement for 1-character strings and regex replacement for >1 character strings).

I'm totally on board with changing either the documentation or the behavior to be more consistent, but it definitely needs a deprecation cycle as suggested by @TomAugspurger. The behavior of .str.replace('.', '') without regex specified, replacing periods rather than every character, has been constant since at least 0.16.
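
The historical dispatch described here can be modeled on plain strings as follows. This is a sketch under the assumption that only the pattern length decides the branch (flags, compiled patterns and callable replacements are ignored); legacy_replace is a hypothetical name.

```python
import re


def legacy_replace(text, pat, repl):
    # Assumed model of the long-standing behavior: patterns longer than
    # one character are treated as regexes, one-character patterns as literals.
    if len(pat) > 1:
        return re.sub(pat, repl, text)
    return text.replace(pat, repl)


print(legacy_replace("a.c", ".", ""))    # literal: only the period is removed
print(legacy_replace("a.c", "..", ""))   # regex: any two characters match
print(re.sub(".", "", "a.c"))            # what a pure regex reading of "." gives
```

Code written against the first behavior would see different output if "." suddenly followed the third line, which is why a deprecation cycle matters here.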

TomAugspurger · 2019-01-17T20:41:47Z

Thanks for the context @Liam3851, that's valuable.

@alarangeiras, does that make sense? Or are you done working on this?

alarangeiras added 2 commits January 16, 2019 16:34

string replace pattern size fix

c984582

adding test case to replace pattern problem

67b0870

WillAyd requested changes Jan 16, 2019

WillAyd added the Strings String extension data type and string data label Jan 16, 2019

- adding comments and refactoring

340ae89

- fixing test_pipe_failures (it's not a regex test, it's a char test)

WillAyd requested changes Jan 17, 2019

alarangeiras added 3 commits January 17, 2019 05:07

removing len check from replace pattern

5a4e131

- refactoring test case variable name

cf8dc79

- adding whatsnew entry and a note for API breaking change

removing whitespace in the end of the line

924ecc8

WillAyd requested changes Jan 17, 2019

alarangeiras added 2 commits January 17, 2019 09:45

- changing the position of the API breaking note

97bf73a

- adding before and after example

removing whitespace from the documentation

da16172

jreback requested changes Jan 17, 2019

TomAugspurger reviewed Jan 17, 2019

- making the changes requested by the project members

93a0715

alarangeiras closed this Jan 24, 2019

TomAugspurger mentioned this pull request Jan 24, 2019

str.replace('.','') should replace every character? #24804

Closed

charlesdong1991 mentioned this pull request Jan 26, 2019

BUG: fix str.replace('.','') should replace every character #24935

Closed

4 tasks

dsaxton mentioned this pull request Sep 28, 2020

API: Deprecate regex=True default in Series.str.replace #36695

Merged
