Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str.replace('.','') should replace every character? (fix) #24809

Closed
wants to merge 9 commits into from
Closed

str.replace('.','') should replace every character? (fix) #24809

wants to merge 9 commits into from

Conversation

alarangeiras
Copy link

@pep8speaks
Copy link

pep8speaks commented Jan 16, 2019

Hello @alarangeiras! Thanks for updating the PR.

Line 580:17: E225 missing whitespace around operator
Line 580:17: E711 comparison to None should be 'if cond is None:'
Line 581:80: E501 line too long (92 > 79 characters)
Line 581:93: W291 trailing whitespace
Line 582:29: E127 continuation line over-indented for visual indent
Line 582:80: E501 line too long (98 > 79 characters)

Line 1024:49: W291 trailing whitespace

Comment last updated on January 17, 2019 at 17:34 Hours UTC

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally good. Can you add a whatsnew note as well?

values = Series(['abc','123'])

result = values.str.replace('.', 'foo')
exp = Series(['foofoofoo', 'foofoofoo'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you name this expected?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, should I make another PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No just add as a commit and push on the same branch.

Looks like CI failed too -haven’t checked but make sure tests pass locally before pushing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, i've seem that.
Actually, i know what is the problem.
The problem is the test test_pipe_failures. It was built to test a char replacement: pipe to white space.
But, pipe is a regex code too.
When i fixed the replace behavior, this test was broken.
My proposal is change this test to pass the regex=False parameter. Like below:

    def test_pipe_failures(self):
        # #2119
        s = Series(['A|B|C'])

        result = s.str.split('|')
        exp = Series([['A', 'B', 'C']])

        tm.assert_series_equal(result, exp)

        result = s.str.replace('|', ' ', regex=False)
        exp = Series(['A B C'])

        tm.assert_series_equal(result, exp)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What to think about that?

@@ -1008,6 +1008,13 @@ def test_replace(self):
values = klass(data)
pytest.raises(TypeError, values.str.replace, 'a', repl)

def test_replace_single_pattern(self):
values = Series(['abc','123'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment for the issue (# GH 24804)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@@ -564,7 +564,7 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
# add case flag, if provided
if case is False:
flags |= re.IGNORECASE
if is_compiled_re or len(pat) > 1 or flags or callable(repl):
if is_compiled_re or len(pat) > 0 or flags or callable(repl):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need the len(pat) condition? Can it just be pat instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be just pat, the only issue is case pat is empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn’t that be False in either case? If so shouldn’t need the len expression

@WillAyd WillAyd added the Strings String extension data type and string data label Jan 16, 2019
- fixing test_pipe_failures (it's not a regex test, it's a char test)
@codecov
Copy link

codecov bot commented Jan 16, 2019

Codecov Report

Merging #24809 into master will decrease coverage by 49.46%.
The diff coverage is 0%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24809       +/-   ##
===========================================
- Coverage   92.38%   42.92%   -49.47%     
===========================================
  Files         166      166               
  Lines       52382    52382               
===========================================
- Hits        48395    22485    -25910     
- Misses       3987    29897    +25910
Flag Coverage Δ
#multiple ?
#single 42.92% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/strings.py 33% <0%> (-65.59%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17a6bc5...340ae89. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 16, 2019

Codecov Report

Merging #24809 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24809      +/-   ##
==========================================
- Coverage   92.38%   92.38%   -0.01%     
==========================================
  Files         166      166              
  Lines       52382    52382              
==========================================
- Hits        48395    48394       -1     
- Misses       3987     3988       +1
Flag Coverage Δ
#multiple 90.8% <100%> (-0.01%) ⬇️
#single 42.92% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/strings.py 98.44% <100%> (-0.15%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17a6bc5...93a0715. Read the comment docs.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a whatsnew entry? Think this is ok for 0.24 but cc @jreback

values = Series(['abc', '123'])

result = values.str.replace('.', 'foo')
chars_replaced_expected = Series(['foofoofoo', 'foofoofoo'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just called this expected

@@ -2924,7 +2932,7 @@ def test_pipe_failures(self):

tm.assert_series_equal(result, exp)

result = s.str.replace('|', ' ')
result = s.str.replace('|', ' ', regex=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm Ok. I think this is correct but arguably an API breaking change so make sure we make note of that in the whatsnew

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this an api change? regex=True is the default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was just thinking of this particular instance and any where a user was passing in a single character that may have special meaning with a regex. This previously would directly replace a pipe but now requires regex=False in user code, so it could cause some breakage.

Being extra conservative but not tied to the request then if you feel its over communicating.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, but if you think this documentation is not necessary, let me know and I can change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This is a potentially disruptive change...

Can we:

  1. Change the default regex=None for .str.replace
  2. Detect when a length-1 character is a regex symbol
  3. Warn that it'll change in the future to interpret that character as a regex, not a literal
  4. set regex=False for now to preserve the old (buggy) behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I agree - that's probably the best go-forward path, save the first point which I don't understand.

@alarangeiras can you raise a FutureWarning here instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save the first point which I don't understand.

Changing the default regex=None? That's so we can detect if we need a warning or not.

If the user passes .str.replace('.', 'b', regex=True), we know to interpret the . as re.compile('.'), so the output would be 'bbb'.

If the user passes .str.replace('.', 'b', regex=False), we know that they want a literal ., so the output is 'abc'.

We'll use regex=None to see if the user is explicit or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just warn when the length of the pattern is 1 and regex=True? Whether or not the user explicitly typed that or relied on the default argument they'd hit the same bug at the end of the day. Don't see value in introducing a None value into a True/False field currently

Copy link
Contributor

@TomAugspurger TomAugspurger Jan 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just warn when the length of the pattern is 1 and regex=True?

Then I think there would be no way to have

In [5]: pd.Series(['a.c']).str.replace('.', 'b')
# Warning: Interpreting '.' as a literal, not a regex... The default will change in the future.
Out[5]:
0    abc
dtype: object
# no warning
In [5]: pd.Series(['a.c']).str.replace('.', 'b', regex=True)
Out[5]:
0    bbb
dtype: object

unless I"m missing something.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following this line of reasoning, from what I understand, every bug found should issue a warning of future adjustment?

@alarangeiras
Copy link
Author

I think now it's ok.

@@ -795,6 +795,9 @@ Now, the return type is consistently a :class:`DataFrame`.
and a :class:`DataFrame` with sparse values. The memory usage will
be the same as in the previous version of pandas.

Be sure to perform a replace of literal strings by passing the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the section on breaking changes and show a before / after of the behavior?


Be sure to perform a replace of literal strings by passing the
regex=False parameter to func:`str.replace`. Mainly when the
pattern is 1 size string (:issue:`24809`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed, this is a simple bug fix

@@ -2924,7 +2932,7 @@ def test_pipe_failures(self):

tm.assert_series_equal(result, exp)

result = s.str.replace('|', ' ')
result = s.str.replace('|', ' ', regex=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this an api change? regex=True is the default

@alarangeiras
Copy link
Author

@WillAyd, is there a consensus about how document this issue?

@@ -1645,6 +1669,7 @@ Strings
- Bug in :meth:`Index.str.split` was not nan-safe (:issue:`23677`).
- Bug :func:`Series.str.contains` not respecting the ``na`` argument for a ``Categorical`` dtype ``Series`` (:issue:`22158`)
- Bug in :meth:`Index.str.cat` when the result contained only ``NaN`` (:issue:`24044`)
- Bug in :func:`Series.str.replace` not applying regex in patterns of len size = 1 (:issue:`24809`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"len size = 1" -> "length 1".

@@ -2924,7 +2932,7 @@ def test_pipe_failures(self):

tm.assert_series_equal(result, exp)

result = s.str.replace('|', ' ')
result = s.str.replace('|', ' ', regex=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This is a potentially disruptive change...

Can we:

  1. Change the default regex=None for .str.replace
  2. Detect when a length-1 character is a regex symbol
  3. Warn that it'll change in the future to interpret that character as a regex, not a literal
  4. set regex=False for now to preserve the old (buggy) behavior?

@WillAyd
Copy link
Member

WillAyd commented Jan 17, 2019 via email

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 17, 2019 via email

@alarangeiras
Copy link
Author

Sorry, I don't agree with this solution. I think it's the best if someone else make this fix.

@Liam3851
Copy link
Contributor

Liam3851 commented Jan 17, 2019

Just noting the potential for this confusion was brought up when we added the regex parameter, though it didn't generate much discussion: #16808 (comment)

At the time I noted that changing this behavior would break back-compat (since the undocumented behavior that had been there since the beginning was literal replacement for 1-character strings and regex replacement for >1 character strings).

I'm totally on board with changing either the documentation or the behavior to be more consistent, but it definitely needs a deprecation cycle as suggested by @TomAugspurger. The behavior of .str.replace('.', '') without regex specified to replace periods, rather than every character, has been constant since at least <=0.16.

@TomAugspurger
Copy link
Contributor

Thanks for the context @Liam3851, that's valuable.

@alarangeiras, does that make sense? Or are you done working on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

str.replace('.','') should replace every character?
6 participants