BUG: Fix value setting in case of merging via index on one side and column on other side. #34496

phofl · 2020-05-31T13:52:12Z

- xref BUG? merging on column of empty frame with index of right frame #15692
- xref Right/outer merge behaviour on left column and right index is unexpected #17257
- xref pd.merge regression when doing a left-join with missing data on the right. Result has a Float64Index #28220
- xref BUG: Left join on index and column gives incorrect output #28243
- xref merge() outer with left_on column and right_index=True produces unexpected results #33232
- tests added / passed
- passes `black pandas`
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry

Until now, the `_maybe_add_join_keys` function overwrote the target column in the result whenever the join was done over the index on one side and a column on the other side. It took the values from the index and set them as the values of the target column, which explains the weird behavior in the referenced issues. Together with #34468 this will fix the issues: combined, the two PRs will transfer the merged index in these cases and the merged columns with the correct name. On its own, this fix only transfers the values, so the index will still be wrong here.

We cannot skip the method completely: if the column is in both DataFrames, we have to run through these steps to ensure that the target values are right. So I check whether the name is contained in only one side.
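
For readers who want to see the kind of merge the referenced issues are about, here is a minimal sketch. The frames `left` and `right` and their contents are made up for illustration and are not taken from any of the issues. It joins a column on the left against the index on the right with `how="outer"`. The values that end up in `key` and in the result index for the right-only row depend on the pandas version, so the sketch only prints them for inspection.

```python
import pandas as pd

# Hypothetical data: "key" is a regular column on the left,
# while the matching values live in the index on the right.
left = pd.DataFrame({"key": [1, 2, 3], "lval": ["a", "b", "c"]})
right = pd.DataFrame({"rval": ["x", "y"]}, index=[2, 4])

# Column on one side, index on the other side, non-inner join.
result = pd.merge(left, right, left_on="key", right_index=True, how="outer")

# Inspect both things the issues complain about: the values in the
# join column and the index of the result.
print(result)
print(result.index)
```

Only key 2 matches, so the outer join has four rows: three from `left` and one that exists only in `right`.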

MarcoGorelli

Wow, that's a lot of issues :)

Just making a small comment ahead of core devs' reviews

MarcoGorelli · 2020-05-31T15:34:43Z

pandas/core/reshape/merge.py

+                if right_in and left_in or array_like or self.how == "asof":

-                if left_indexer is not None and right_indexer is not None:
-                    if name in self.left:
+                    if left_indexer is not None and right_indexer is not None:
+                        if name in self.left:

-                        if left_has_missing is None:
-                            left_has_missing = (left_indexer == -1).any()
+                            if left_has_missing is None:
+                                left_has_missing = (left_indexer == -1).any()

-                        if left_has_missing:
-                            take_right = self.right_join_keys[i]
+                            if left_has_missing:
+                                take_right = self.right_join_keys[i]

-                            if not is_dtype_equal(
-                                result[name].dtype, self.left[name].dtype
-                            ):
-                                take_left = self.left[name]._values
+                                if not is_dtype_equal(
+                                    result[name].dtype, self.left[name].dtype
+                                ):
+                                    take_left = self.left[name]._values

-                    elif name in self.right:
+                        elif name in self.right:

-                        if right_has_missing is None:
-                            right_has_missing = (right_indexer == -1).any()
+                            if right_has_missing is None:
+                                right_has_missing = (right_indexer == -1).any()

-                        if right_has_missing:
-                            take_left = self.left_join_keys[i]
+                            if right_has_missing:
+                                take_left = self.left_join_keys[i]

-                            if not is_dtype_equal(
-                                result[name].dtype, self.right[name].dtype
-                            ):
-                                take_right = self.right[name]._values
+                                if not is_dtype_equal(
+                                    result[name].dtype, self.right[name].dtype
+                                ):
+                                    take_right = self.right[name]._values
+                else:
+                    continue


Is it possible to write this as

if <condition>: continue <keep existing code here as it is without having to indent it further>

?
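
The suggestion above is the usual guard-clause pattern. As a rough sketch of the idea, with made-up function names and data that have nothing to do with the pandas code itself, the two loops below behave identically, but the second one keeps the main body at its original indentation level:

```python
def nested_version(items):
    """Process items with the work nested inside the `if` branch."""
    out = []
    for name, needs_update in items:
        if needs_update:
            out.append(name.upper())
        else:
            continue
    return out


def guard_version(items):
    """Same behavior, but bail out early and keep the body unindented."""
    out = []
    for name, needs_update in items:
        if not needs_update:
            continue
        out.append(name.upper())
    return out


# Hypothetical input: (name, needs_update) pairs.
sample = [("a", True), ("b", False), ("c", True)]
print(nested_version(sample))
print(guard_version(sample))
```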

@MarcoGorelli Yeah, changed it around. I think this is a bit more elegant too. I did it the other way round because imo the if condition is a bit more readable, but was split. Maybe you could have a look again with the changed condition?

thx :)

phofl · 2020-05-31T16:17:25Z

> Wow, that's a lot of issues :)
>
> Just making a small comment ahead of core devs' reviews

Yeah, they all had the same underlying issue: the index thing and this one caused these errors. Hope we can find a solution to merge this :)

jreback · 2020-05-31T21:58:06Z

doc/source/whatsnew/v1.1.0.rst

@@ -936,6 +936,7 @@ Reshaping
 - Bug in :meth:`DataFrame.replace` casts columns to ``object`` dtype if items in ``to_replace`` not in values (:issue:`32988`)
 - Ensure only named functions can be used in :func:`eval()` (:issue:`32460`)
 - Fixed bug in :func:`melt` where melting MultiIndex columns with ``col_level`` > 0 would raise a ``KeyError`` on ``id_vars`` (:issue:`34129`)
+- Fixed bug setting wrong values in result when joining one side over index and other side over column in case of join type not equal to inner (:issue:`17257`, :issue:`28220`, :issue:`28243` and :issue:`33232`)


I have no idea what you are trying to fix by reading this. You are xrefing lots of issues, are you actually closing them?

The issues have two different problems.

1: The index is not handled right
2: The values of the join columns are wrong.

This PR fixes 2, #34468 should fix the other part. That is why I am only referencing them.

jreback · 2020-05-31T21:59:04Z

pandas/core/reshape/merge.py

@@ -787,7 +787,25 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
            take_left, take_right = None, None

            if name in result:
-
+                array_like = is_array_like(rname) or is_array_like(lname)


what are you trying to do here?

@jreback

If you mean only line 790: if an array is given as the join key, it should be handled like a column, so we have to set the values in the result.

Generally this block should handle the following:

If we join an index with a column, this part sets the index values as the column values in the result DataFrame. To avoid this, I check whether we should run in there.

We should only run in there if both DataFrames are joined via the same columns, or if an array was given as the join condition.

Actually, I realised that the condition with the index is crap. It would keep the initial problem if the index happens to be named like the join column. This will break a few more tests, unfortunately.
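
To make the condition from the diff easier to reason about, here is a small standalone sketch. The helper `should_adjust_key` and its plain boolean parameters are hypothetical. In the PR, `left_in`, `right_in` and `array_like` are computed inside `_maybe_add_join_keys`, and I am assuming here that `left_in` and `right_in` mean "the name is a column on that side". The point is only that `and` binds tighter than `or`, so `right_in and left_in or array_like or self.how == "asof"` reads as written out below.

```python
def should_adjust_key(left_in, right_in, array_like, how):
    """Hypothetical stand-in for the condition discussed in the review.

    Returns True when the key values in the result should be adjusted,
    i.e. when the key is a column on both sides, or an array was passed
    as the join key, or the merge is an asof merge.
    """
    # Explicit parentheses: `a and b or c or d` parses as `(a and b) or c or d`.
    return (left_in and right_in) or array_like or how == "asof"


# Column on the left only (index used on the right): leave the values alone.
print(should_adjust_key(True, False, False, "outer"))
# Column present on both sides: run through the existing logic.
print(should_adjust_key(True, True, False, "outer"))
# Array given as join key: treated like a column.
print(should_adjust_key(False, False, True, "left"))
```

Written as a guard clause, the loop body would start with `if not should_adjust_key(...): continue`.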

phofl · 2020-05-31T22:20:02Z

@jreback

Do you think it is preferable to do #34468 and this PR together?

jreback · 2021-01-01T22:04:22Z

closing this. @phofl I am pretty sure you actually broke up the fixes into new PRs. if that is not the case, can you address a specific issue here.

phofl added 3 commits May 31, 2020 15:41

BUG: Fix wrong values when joining via index and column for left, rig…

ef2d11f

…ht, outer

Add whats new entry

f1f5647

Fix whats new typo

3a5a292

MarcoGorelli reviewed May 31, 2020

Change if condition

a3e6623

Run black pandas

e759612

jreback requested changes May 31, 2020

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 31, 2020

TomAugspurger added Needs Review and removed Waiting for author labels Sep 4, 2020

jreback closed this Jan 1, 2021

phofl deleted the 28243_values branch April 26, 2021 22:00
