Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix value setting in case of merging via index on one side and column on other side. #34496

Closed
wants to merge 5 commits into from

Conversation

phofl
Copy link
Member

@phofl phofl commented May 31, 2020

Until now the _maybe_add_join_keys function changed the target column in the result, if the join was done over index on one side and the column on the other side. This resulted in taking values from the index and setting them for the target column, which explained the weird behavior in the issues referenced. Together with #34468 this will fix the issues. Both combined will transfer the merged index in these cases and the merged columns with the correct name. As of now this fix will only transfer the values, so the index will be wrong here.

We can not skip the method completly, because this would cause issues in case of the column is in both DataFrames we have to run through this steps to ensure that the target values are right. So I check if this is only contained in one side.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that's a lot of issues :)

Just making a small comment ahead of core devs' reviews

Comment on lines 803 to 832
if right_in and left_in or array_like or self.how == "asof":

if left_indexer is not None and right_indexer is not None:
if name in self.left:
if left_indexer is not None and right_indexer is not None:
if name in self.left:

if left_has_missing is None:
left_has_missing = (left_indexer == -1).any()
if left_has_missing is None:
left_has_missing = (left_indexer == -1).any()

if left_has_missing:
take_right = self.right_join_keys[i]
if left_has_missing:
take_right = self.right_join_keys[i]

if not is_dtype_equal(
result[name].dtype, self.left[name].dtype
):
take_left = self.left[name]._values
if not is_dtype_equal(
result[name].dtype, self.left[name].dtype
):
take_left = self.left[name]._values

elif name in self.right:
elif name in self.right:

if right_has_missing is None:
right_has_missing = (right_indexer == -1).any()
if right_has_missing is None:
right_has_missing = (right_indexer == -1).any()

if right_has_missing:
take_left = self.left_join_keys[i]
if right_has_missing:
take_left = self.left_join_keys[i]

if not is_dtype_equal(
result[name].dtype, self.right[name].dtype
):
take_right = self.right[name]._values
if not is_dtype_equal(
result[name].dtype, self.right[name].dtype
):
take_right = self.right[name]._values
else:
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to write this as

if <condition>:
    continue
<keep existing code here as it is without having to indent it futher>

?

Copy link
Member Author

@phofl phofl May 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli Yeah, changed it around. I think this is a bit more elegant too. I did it the other way round because imo the if condition is a bit more readable, but was split.. Maybe you could have a look again with the changed condition?

thx :)

@phofl
Copy link
Member Author

phofl commented May 31, 2020

Wow, that's a lot of issues :)

Just making a small comment ahead of core devs' reviews

Yeah they all had the same underlying issue. The index think and this caused these errors. Hope we can find a solution to merge this :)

@@ -936,6 +936,7 @@ Reshaping
- Bug in :meth:`DataFrame.replace` casts columns to ``object`` dtype if items in ``to_replace`` not in values (:issue:`32988`)
- Ensure only named functions can be used in :func:`eval()` (:issue:`32460`)
- Fixed bug in :func:`melt` where melting MultiIndex columns with ``col_level`` > 0 would raise a ``KeyError`` on ``id_vars`` (:issue:`34129`)
- Fixed bug setting wrong values in result when joining one side over index and other side over column in case of join type not equal to inner (:issue:`17257`, :issue:`28220`, :issue:`28243` and :issue:`33232`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what you are trying to fix by reading this. You are xrefing lots of issues, are you actually closing them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issues have to different problems.

1: The index is not handled right
2: The values of the join columns are wrong.

This PR fixes 2, #34468 should fix the other part. That is why I am only referencing them.

@@ -787,7 +787,25 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
take_left, take_right = None, None

if name in result:

array_like = is_array_like(rname) or is_array_like(lname)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you trying to do here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback

If you mean only line 790: If an array is given as join key, then this should be handled like a column. So we have to set the values into the result.

Generally this block should handle the following:

If we join an index with an column this part sets the index values as the column values in the result DataFrame. To avoid this I check if we should run in there.

We should only run in there if both DataFrames are joined via the same columns, if a array was given as join condition.

Actually I realised that the condition with the index is crap. That would keep the initial problem if the index is accidentally named as the join column. This will break a few more tests unfortunately.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 31, 2020
@phofl
Copy link
Member Author

phofl commented May 31, 2020

@jreback

Do you think it is preferably to do #34468 and this PR together?

@jreback
Copy link
Contributor

jreback commented Jan 1, 2021

closing this. @phofl I am pretty sure you actually broke up the fixes into new PRs. if that is not the case, can you address a specific issue here.

@jreback jreback closed this Jan 1, 2021
@phofl phofl deleted the 28243_values branch April 26, 2021 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants