Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: inconsistency between replace dict using integers and using strings (#20656) #21477

Merged
merged 8 commits into from
Aug 9, 2018

Conversation

peterpanmj
Copy link
Contributor

@peterpanmj peterpanmj commented Jun 14, 2018

@codecov
Copy link

codecov bot commented Jun 14, 2018

Codecov Report

Merging #21477 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21477      +/-   ##
==========================================
+ Coverage   92.07%   92.08%   +<.01%     
==========================================
  Files         169      169              
  Lines       50684    50703      +19     
==========================================
+ Hits        46668    46689      +21     
+ Misses       4016     4014       -2
Flag Coverage Δ
#multiple 90.49% <100%> (ø) ⬆️
#single 42.33% <12.12%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 94.63% <100%> (+0.17%) ⬆️
pandas/core/internals/managers.py 96.48% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb6116f...4729fc5. Read the comment docs.

@gfyoung gfyoung added Bug Strings String extension data type and string data Compat pandas objects compatability with Numpy or Python functions labels Jun 15, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 15, 2018

@peterpanmj : Good start. Need a whatsnew entry in 0.23.2.

@peterpanmj
Copy link
Contributor Author

@gfyoung : Which subsection under Bug fixes should I add my entry ?

@gfyoung
Copy link
Member

gfyoung commented Jun 15, 2018

@peterpanmj : v0.23.2.txt - Other section

# result will be ['b', b'] after searching for pattern r'a'
# and then changed to ['a', 'a'] for pattern r'b*'
if regex:
if b.dtype == np.object_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use is_object_dtype here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change this

result = b.replace(s, d, inplace=inplace,
regex=regex,
mgr=mgr, convert=convert)
new_rb = _extend_blocks(result, new_rb)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this, can you add a new (private) method on the Block itself (and then override for object dtype). It will be much cleaner code and this part becomes really generic.

@@ -79,4 +79,4 @@ Bug Fixes

**Other**

-
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.24.0

@@ -1690,6 +1690,13 @@ def _nanpercentile(values, q, axis, **kw):
placement=np.arange(len(result)),
ndim=ndim)

def _coerce_replace(self, mask=None, dst=None, convert=False):
if mask.any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string

call this: _replace_coerce

if mask.any():
self = self.coerce_to_target_dtype(dst)
return self.putmask(mask, dst, inplace=True)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the else (just return self)

block = [b.convert(by_item=True, numeric=False, copy=True)
for b in block]
return block
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

return block

def _coerce_replace(self, mask=None, dst=None, convert=False):
if mask.any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a doc-string

regex=regex,
mgr=mgr, convert=convert)
new_rb = _extend_blocks(result, new_rb)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are not using the newly defined _replace_coerce here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the logic in master untouched for regex mode. This means regex replace still behave incorrectly (just like how dict replace is behaving now)
_maybe_compare can only do equality compare for now. That is the cause of it. I haven't figure out how to add regex support in _maybe_compare without breaking any existing test. Once _maybe_compare is fixed, this part can be removed. I think @Licht-T is working on this part #20656.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well what I would do is add a regex= param to _replace_coerce and push the logic to the block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to regex= param, it also needs src= param for the regex pattern to match. I think the _replace_coerce should only take a boolean array (mask=) indicating where to put the new value. Meanwhile, regex matching should be done at generating the mask before passing it to the _replace_coerce. Adding regex logic to it might makes it less clear for others and probably let someone else override _replace_coerce or reuse it for regex replace. What I want is keep _replace_coerce just like putmask and handle the regex comparison at _maybe_compare(values, s, operator.eq)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, this is crazy code here and needs simplification. all of the real work should be done in the blocks themselves, e .g. object is different than the other blocks.

Here should be a simple

rb = [b._replace(.....) for b in rb]

you can pass whatever args you want, but the top level logic is just too complicated here

@@ -1690,6 +1690,17 @@ def _nanpercentile(values, q, axis, **kw):
placement=np.arange(len(result)),
ndim=ndim)

def _replace_coerce(self, mask=None, dst=None, convert=False):
"""replace value to dst where mask is true, value is coerce to target
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a full Parameters section here

Copy link
Contributor Author

@peterpanmj peterpanmj Jun 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Do you mean add **kwargs and *args ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Or do you mean add a full docsstring with all parameters ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes full doc string

return block

def _replace_coerce(self, mask=None, dst=None, convert=False):
if mask.any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

# result will be ['b', b'] after searching for pattern r'a'
# and then changed to ['a', 'a'] for pattern r'b*'
if regex:
if b.dtype == np.object_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change this

regex=regex,
mgr=mgr, convert=convert)
new_rb = _extend_blocks(result, new_rb)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well what I would do is add a regex= param to _replace_coerce and push the logic to the block

@peterpanmj
Copy link
Contributor Author

Wondering am I working in the right direction ?

def _maybe_compare(a, b, regex=False):
if not regex:
op = lambda x: operator.eq(x, b)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to add regex support in _maybe_compare. The comparing behavior is decided by param regex ( whether to use regex match or equality comparison) . The result will be a mask that will be passed on to _replace_coerce. This can avoid the result of previous round of comparing overwritten in the succeeding ones, e.g in, {"a":"b", "b":"a"} .

@@ -5155,9 +5240,8 @@ def _maybe_compare(a, b, op):
# numpy deprecation warning if comparing numeric vs string-like
elif is_numeric_v_string_like(a, b):
result = False

else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what changed here?

Copy link
Contributor Author

@peterpanmj peterpanmj Jul 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed a blank line. Should I keep it ?

@jreback
Copy link
Contributor

jreback commented Jul 26, 2018

@peterpanmj yes going in good direction! need to rebase as we moved internals around a bit.

@jreback
Copy link
Contributor

jreback commented Jul 28, 2018

can you rebase

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel can you have a look

if hasattr(s, 'asm8'):
return _maybe_compare(maybe_convert_objects(values),
getattr(s, 'asm8'), reg)
if reg and is_re_compilable(s):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake. At first, I wanted to raise a ValueError when regex is True but replacer is not reg compilable. It might not be a good idea to do it here. I will delete the if condition.

@@ -1890,7 +1894,12 @@ def _consolidate(blocks):
return new_blocks


def _maybe_compare(a, b, op):
def _maybe_compare(a, b, regex=False):
if not regex:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string here

@@ -2464,7 +2502,7 @@ def replace(self, to_replace, value, inplace=False, filter=None,
regex=regex, mgr=mgr)

def _replace_single(self, to_replace, value, inplace=False, filter=None,
regex=False, convert=True, mgr=None):
regex=False, convert=True, mgr=None, mask=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string here

@@ -573,6 +573,5 @@ Other
- :meth: `~pandas.io.formats.style.Styler.background_gradient` now takes a ``text_color_threshold`` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range. (:issue:`21258`)
- Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`)
- :meth: `~pandas.io.formats.style.Styler.background_gradient` now also supports tablewise application (in addition to rowwise and columnwise) with ``axis=None`` (:issue:`15204`)
-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to reshaping

@jbrockmendel
Copy link
Member

Looks like a nice bit of cleanup in Manager. For Block I wonder if it could share code with Index,
(Or an object EA?) but that’s a question for another day.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. some cosmetic things


Parameters
----------
mask : array_like of bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add optional to all of these args

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you reorder these args to match as much as possible _replace_single

dst : object
The value to be replaced with.
convert : bool
If true, try to coerce any object types to better types.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the inplace arg

value.

Parameters
----------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above (you can make a shared doc-string if you want here)

@@ -571,12 +573,15 @@ def replace_list(self, src_list, dest_list, inplace=False, regex=False,
# figure out our mask a-priori to avoid repeated replacements
values = self.as_array()

def comp(s):
def comp(s, reg=False):
if isna(s):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string, rename reg -> regex

@@ -1890,7 +1891,28 @@ def _consolidate(blocks):
return new_blocks


def _maybe_compare(a, b, op):
def _maybe_compare(a, b, regex=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this slightly more verbose name, can you move to pandas/core/ops.py (cc @jbrockmendel good location)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we anticipate using it elsewhere? If not I'd leave it here at least for now.

Copy link
Contributor Author

@peterpanmj peterpanmj Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like name it _compare_or_regex_match ? @jreback

@@ -243,6 +243,13 @@ def test_replace_string_with_number(self):
expected = pd.Series([1, 2, 3])
tm.assert_series_equal(expected, result)

def test_repace_intertwined_key_value_dict(self):
# GH 20656
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a 1-liner explaining the test in a bit more detail

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo repace --> replace (?)

@jreback jreback added this to the 0.24.0 milestone Jul 31, 2018
@jreback jreback merged commit 720d263 into pandas-dev:master Aug 9, 2018
@jreback
Copy link
Contributor

jreback commented Aug 9, 2018

thanks @peterpanmj nice patch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Compat pandas objects compatability with Numpy or Python functions Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.replace with dict behaves inconsistently for integers and strings
4 participants