BUG: inconsistency between replace dict using integers and using strings (#20656) #21477
Codecov Report
@@ Coverage Diff @@
## master #21477 +/- ##
==========================================
+ Coverage 92.07% 92.08% +<.01%
==========================================
Files 169 169
Lines 50684 50703 +19
==========================================
+ Hits 46668 46689 +21
+ Misses 4016 4014 -2
@peterpanmj : Good start. Need a |
@gfyoung : Which subsection under Bug fixes should I add my entry ? |
@peterpanmj : |
pandas/core/internals.py
Outdated
# result will be ['b', b'] after searching for pattern r'a' | ||
# and then changed to ['a', 'a'] for pattern r'b*' | ||
if regex: | ||
if b.dtype == np.object_: |
can you use is_object_dtype here
can you change this
pandas/core/internals.py
Outdated
result = b.replace(s, d, inplace=inplace, | ||
regex=regex, | ||
mgr=mgr, convert=convert) | ||
new_rb = _extend_blocks(result, new_rb) |
instead of this, can you add a new (private) method on the Block itself (and then override for object dtype). It will be much cleaner code and this part becomes really generic.
doc/source/whatsnew/v0.23.2.txt
Outdated
@@ -79,4 +79,4 @@ Bug Fixes | |||
|
|||
**Other** | |||
|
|||
- | |||
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`) |
move to 0.24.0
pandas/core/internals.py
Outdated
@@ -1690,6 +1690,13 @@ def _nanpercentile(values, q, axis, **kw): | |||
placement=np.arange(len(result)), | |||
ndim=ndim) | |||
|
|||
def _coerce_replace(self, mask=None, dst=None, convert=False): | |||
if mask.any(): |
can you add a doc-string
call this: _replace_coerce
pandas/core/internals.py
Outdated
if mask.any(): | ||
self = self.coerce_to_target_dtype(dst) | ||
return self.putmask(mask, dst, inplace=True) | ||
else: |
you don't need the else (just return self)
pandas/core/internals.py
Outdated
block = [b.convert(by_item=True, numeric=False, copy=True) | ||
for b in block] | ||
return block | ||
else: |
same
pandas/core/internals.py
Outdated
return block | ||
|
||
def _coerce_replace(self, mask=None, dst=None, convert=False): | ||
if mask.any(): |
add a doc-string
pandas/core/internals.py
Outdated
regex=regex, | ||
mgr=mgr, convert=convert) | ||
new_rb = _extend_blocks(result, new_rb) | ||
else: |
is there a reason you are not using the newly defined _replace_coerce here?
I kept the logic in master untouched for regex mode. This means regex replace still behaves incorrectly (just like dict replace does now). _maybe_compare can only do equality comparison for now, which is the cause. I haven't figured out how to add regex support to _maybe_compare without breaking any existing test. Once _maybe_compare is fixed, this part can be removed. I think @Licht-T is working on this part in #20656.
well what I would do is add a regex= param to _replace_coerce and push the logic to the block
In addition to the regex= param, it also needs a src= param for the regex pattern to match. I think _replace_coerce should only take a boolean array (mask=) indicating where to put the new value, and regex matching should be done while generating the mask, before passing it to _replace_coerce. Adding regex logic to it might make it less clear for others, and someone might later override _replace_coerce or reuse it for regex replace. What I want is to keep _replace_coerce just like putmask and handle the regex comparison in _maybe_compare(values, s, operator.eq).
I disagree, this is crazy code here and needs simplification. All of the real work should be done in the blocks themselves, e.g. object is different than the other blocks.
Here should be a simple
rb = [b._replace(.....) for b in rb]
you can pass whatever args you want, but the top level logic is just too complicated here
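A minimal sketch of the shape suggested above: the manager-level loop shrinks to one call per block, and the dtype-specific behaviour lives on the block classes. `Block`, `ObjectBlock`, and the `_replace_coerce` signature used here are simplified, hypothetical stand-ins over plain lists, not the actual pandas internals.

```python
import re


class Block:
    """Hypothetical stand-in for a block holding a flat list of values."""

    def __init__(self, values):
        self.values = list(values)

    def _replace_coerce(self, to_replace, value, mask, regex=False):
        # Generic blocks ignore `regex`: write `value` wherever `mask` is true.
        if any(mask):
            self.values = [value if m else v
                           for v, m in zip(self.values, mask)]
        return self


class ObjectBlock(Block):
    """Only the object block knows how to apply a regex pattern."""

    def _replace_coerce(self, to_replace, value, mask, regex=False):
        if regex:
            # Substitute inside each masked string instead of overwriting it.
            pattern = re.compile(to_replace)
            self.values = [pattern.sub(value, v) if m else v
                           for v, m in zip(self.values, mask)]
            return self
        return super()._replace_coerce(to_replace, value, mask)


# The top-level logic collapses to a single comprehension.
blocks = [Block([1, 2, 3]), ObjectBlock(["foo", "bar", "baz"])]
masks = [[False, True, False], [False, True, True]]
blocks = [b._replace_coerce("ba", "BA", mask=m, regex=True)
          for b, m in zip(blocks, masks)]
```

The caller passes the same arguments to every block; whether `regex` means anything is decided by the block type.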
pandas/core/internals.py
Outdated
@@ -1690,6 +1690,17 @@ def _nanpercentile(values, q, axis, **kw): | |||
placement=np.arange(len(result)), | |||
ndim=ndim) | |||
|
|||
def _replace_coerce(self, mask=None, dst=None, convert=False): | |||
"""replace value to dst where mask is true, value is coerce to target |
can you add a full Parameters section here
@jreback Do you mean add **kwargs and *args ?
@jreback Or do you mean add a full docstring with all parameters?
yes full doc string
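For illustration, a "full doc string" in the numpydoc layout might look roughly like the sketch below. `replace_coerce` here is a hypothetical standalone function over plain lists, not the `_replace_coerce` method under review; it also shows the early-return form without a trailing `else` that was requested earlier in the thread.

```python
def replace_coerce(values, mask, dst):
    """
    Write ``dst`` into ``values`` wherever ``mask`` is true.

    Parameters
    ----------
    values : list
        The original values.
    mask : list of bool
        True at the positions to overwrite; same length as ``values``.
    dst : object
        The replacement value.

    Returns
    -------
    list
        A new list, or ``values`` itself when nothing is masked.
    """
    if any(mask):
        return [dst if m else v for v, m in zip(values, mask)]
    # No ``else`` branch needed: fall through and return the input unchanged.
    return values
```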
pandas/core/internals.py
Outdated
return block | ||
|
||
def _replace_coerce(self, mask=None, dst=None, convert=False): | ||
if mask.any(): |
same here
pandas/core/internals.py
Outdated
# result will be ['b', b'] after searching for pattern r'a' | ||
# and then changed to ['a', 'a'] for pattern r'b*' | ||
if regex: | ||
if b.dtype == np.object_: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change this
pandas/core/internals.py
Outdated
regex=regex, | ||
mgr=mgr, convert=convert) | ||
new_rb = _extend_blocks(result, new_rb) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well what I would do is add a regex=
param to _replace_coerce
and push the logic to the block
I am wondering whether I am working in the right direction?
pandas/core/internals.py
Outdated
def _maybe_compare(a, b, regex=False): | ||
if not regex: | ||
op = lambda x: operator.eq(x, b) | ||
else: |
why did this change?
I want to add regex support in _maybe_compare. The comparison behavior is decided by the regex param (whether to use regex matching or equality comparison). The result is a mask that is passed on to _replace_coerce. This prevents the result of an earlier round of comparison from being overwritten by a later one, e.g. in {"a": "b", "b": "a"}.
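The overwrite problem described here can be reproduced without pandas. The sketch below uses two hypothetical helper functions over plain lists: one applies each key/value pair to the running result, the other computes every mask against the original values first and only then writes the replacements.

```python
def replace_sequential(values, mapping):
    """Apply each pair in turn to the running result (the problematic pattern)."""
    result = list(values)
    for src, dst in mapping.items():
        result = [dst if v == src else v for v in result]
    return result


def replace_with_masks(values, mapping):
    """Compute every mask against the original values, then apply them."""
    masks = [(dst, [v == src for v in values]) for src, dst in mapping.items()]
    result = list(values)
    for dst, mask in masks:
        result = [dst if m else v for v, m in zip(result, mask)]
    return result


mapping = {"a": "b", "b": "a"}
print(replace_sequential(["a", "b"], mapping))  # ['a', 'a']
print(replace_with_masks(["a", "b"], mapping))  # ['b', 'a']
```

Because the masks are fixed before any value is written, a value produced by one pair can never be picked up by a later pair.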
pandas/core/internals.py
Outdated
@@ -5155,9 +5240,8 @@ def _maybe_compare(a, b, op): | |||
# numpy deprecation warning if comparing numeric vs string-like | |||
elif is_numeric_v_string_like(a, b): | |||
result = False | |||
|
|||
else: |
what changed here?
I removed a blank line. Should I keep it ?
@peterpanmj yes going in good direction! need to rebase as we moved internals around a bit. |
can you rebase |
818fec4
to
d8f2d70
Compare
@jbrockmendel can you have a look
pandas/core/internals/managers.py
Outdated
if hasattr(s, 'asm8'): | ||
return _maybe_compare(maybe_convert_objects(values), | ||
getattr(s, 'asm8'), reg) | ||
if reg and is_re_compilable(s): |
these are the same?
My mistake. At first, I wanted to raise a ValueError when regex is True but the replacer is not regex-compilable. It might not be a good idea to do it here. I will delete the if condition.
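For reference, the stricter check the author describes (and decided not to place here) could look something like the following. `is_compilable` is a stdlib-only stand-in that mirrors the idea behind the `is_re_compilable` call in the diff, and `check_pattern` is a hypothetical helper; neither is code from the PR.

```python
import re


def is_compilable(obj):
    """Return True if ``re.compile`` accepts ``obj``."""
    try:
        re.compile(obj)
    except (TypeError, re.error):
        return False
    return True


def check_pattern(to_replace, regex):
    # Hypothetical strict variant: fail early on an unusable pattern.
    if regex and not is_compilable(to_replace):
        raise ValueError(
            "to_replace is not a valid regular expression: %r" % (to_replace,))
    return to_replace
```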
pandas/core/internals/managers.py
Outdated
@@ -1890,7 +1894,12 @@ def _consolidate(blocks): | |||
return new_blocks | |||
|
|||
|
|||
def _maybe_compare(a, b, op): | |||
def _maybe_compare(a, b, regex=False): | |||
if not regex: |
can you add a doc-string here
@@ -2464,7 +2502,7 @@ def replace(self, to_replace, value, inplace=False, filter=None, | |||
regex=regex, mgr=mgr) | |||
|
|||
def _replace_single(self, to_replace, value, inplace=False, filter=None, | |||
regex=False, convert=True, mgr=None): | |||
regex=False, convert=True, mgr=None, mask=None): |
can you add a doc-string here
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -573,6 +573,5 @@ Other | |||
- :meth: `~pandas.io.formats.style.Styler.background_gradient` now takes a ``text_color_threshold`` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range. (:issue:`21258`) | |||
- Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`) | |||
- :meth: `~pandas.io.formats.style.Styler.background_gradient` now also supports tablewise application (in addition to rowwise and columnwise) with ``axis=None`` (:issue:`15204`) | |||
- |
move to reshaping
Looks like a nice bit of cleanup in Manager. For Block I wonder if it could share code with Index, |
lgtm. some cosmetic things
pandas/core/internals/blocks.py
Outdated
|
||
Parameters | ||
---------- | ||
mask : array_like of bool |
can you add optional to all of these args
can you reorder these args to match as much as possible _replace_single
pandas/core/internals/blocks.py
Outdated
dst : object | ||
The value to be replaced with. | ||
convert : bool | ||
If true, try to coerce any object types to better types. |
add the inplace arg
value. | ||
|
||
Parameters | ||
---------- |
same as above (you can make a shared doc-string if you want here)
pandas/core/internals/managers.py
Outdated
@@ -571,12 +573,15 @@ def replace_list(self, src_list, dest_list, inplace=False, regex=False, | |||
# figure out our mask a-priori to avoid repeated replacements | |||
values = self.as_array() | |||
|
|||
def comp(s): | |||
def comp(s, reg=False): | |||
if isna(s): |
can you add a doc-string, rename reg -> regex
pandas/core/internals/managers.py
Outdated
@@ -1890,7 +1891,28 @@ def _consolidate(blocks): | |||
return new_blocks | |||
|
|||
|
|||
def _maybe_compare(a, b, op): | |||
def _maybe_compare(a, b, regex=False): |
can you make this slightly more verbose name, can you move to pandas/core/ops.py (cc @jbrockmendel good location)?
Do we anticipate using it elsewhere? If not I'd leave it here at least for now.
like name it _compare_or_regex_match? @jreback
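As a rough picture of what a helper with that name would do, here is a stdlib-only sketch over plain lists. The signature and the choice of `re.search` are assumptions made for illustration; the function in the PR operates on arrays and handles further cases (such as numeric versus string comparison) that are left out here.

```python
import operator
import re


def compare_or_regex_match(values, pattern, regex=False):
    """Return a boolean mask over ``values``.

    With ``regex=False`` the mask marks elements equal to ``pattern``.
    With ``regex=True`` it marks strings in which ``pattern`` is found;
    non-string elements never match.
    """
    if not regex:
        return [operator.eq(v, pattern) for v in values]
    compiled = re.compile(pattern)
    return [isinstance(v, str) and compiled.search(v) is not None
            for v in values]
```

Either way the result is only a mask, so the step that writes the replacement values does not need to know which kind of comparison produced it.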
pandas/tests/series/test_replace.py
Outdated
@@ -243,6 +243,13 @@ def test_replace_string_with_number(self): | |||
expected = pd.Series([1, 2, 3]) | |||
tm.assert_series_equal(expected, result) | |||
|
|||
def test_repace_intertwined_key_value_dict(self): | |||
# GH 20656 |
can you add a 1-liner explaining the test in a bit more detail
typo repace --> replace (?)
thanks @peterpanmj nice patch! |
git diff upstream/master -u -- "*.py" | flake8 --diff