-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix value setting in case of merging via index on one side and column on other side. #34496
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, that's a lot of issues :)
Just making a small comment ahead of core devs' reviews
pandas/core/reshape/merge.py
Outdated
if right_in and left_in or array_like or self.how == "asof": | ||
|
||
if left_indexer is not None and right_indexer is not None: | ||
if name in self.left: | ||
if left_indexer is not None and right_indexer is not None: | ||
if name in self.left: | ||
|
||
if left_has_missing is None: | ||
left_has_missing = (left_indexer == -1).any() | ||
if left_has_missing is None: | ||
left_has_missing = (left_indexer == -1).any() | ||
|
||
if left_has_missing: | ||
take_right = self.right_join_keys[i] | ||
if left_has_missing: | ||
take_right = self.right_join_keys[i] | ||
|
||
if not is_dtype_equal( | ||
result[name].dtype, self.left[name].dtype | ||
): | ||
take_left = self.left[name]._values | ||
if not is_dtype_equal( | ||
result[name].dtype, self.left[name].dtype | ||
): | ||
take_left = self.left[name]._values | ||
|
||
elif name in self.right: | ||
elif name in self.right: | ||
|
||
if right_has_missing is None: | ||
right_has_missing = (right_indexer == -1).any() | ||
if right_has_missing is None: | ||
right_has_missing = (right_indexer == -1).any() | ||
|
||
if right_has_missing: | ||
take_left = self.left_join_keys[i] | ||
if right_has_missing: | ||
take_left = self.left_join_keys[i] | ||
|
||
if not is_dtype_equal( | ||
result[name].dtype, self.right[name].dtype | ||
): | ||
take_right = self.right[name]._values | ||
if not is_dtype_equal( | ||
result[name].dtype, self.right[name].dtype | ||
): | ||
take_right = self.right[name]._values | ||
else: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to write this as
if <condition>:
continue
<keep existing code here as it is without having to indent it futher>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli Yeah, changed it around. I think this is a bit more elegant too. I did it the other way round because imo the if condition is a bit more readable, but was split.. Maybe you could have a look again with the changed condition?
thx :)
Yeah they all had the same underlying issue. The index think and this caused these errors. Hope we can find a solution to merge this :) |
@@ -936,6 +936,7 @@ Reshaping | |||
- Bug in :meth:`DataFrame.replace` casts columns to ``object`` dtype if items in ``to_replace`` not in values (:issue:`32988`) | |||
- Ensure only named functions can be used in :func:`eval()` (:issue:`32460`) | |||
- Fixed bug in :func:`melt` where melting MultiIndex columns with ``col_level`` > 0 would raise a ``KeyError`` on ``id_vars`` (:issue:`34129`) | |||
- Fixed bug setting wrong values in result when joining one side over index and other side over column in case of join type not equal to inner (:issue:`17257`, :issue:`28220`, :issue:`28243` and :issue:`33232`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what you are trying to fix by reading this. You are xrefing lots of issues, are you actually closing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issues have to different problems.
1: The index is not handled right
2: The values of the join columns are wrong.
This PR fixes 2, #34468 should fix the other part. That is why I am only referencing them.
@@ -787,7 +787,25 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer): | |||
take_left, take_right = None, None | |||
|
|||
if name in result: | |||
|
|||
array_like = is_array_like(rname) or is_array_like(lname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you trying to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean only line 790: If an array is given as join key, then this should be handled like a column. So we have to set the values into the result.
Generally this block should handle the following:
If we join an index with an column this part sets the index values as the column values in the result DataFrame. To avoid this I check if we should run in there.
We should only run in there if both DataFrames are joined via the same columns, if a array was given as join condition.
Actually I realised that the condition with the index is crap. That would keep the initial problem if the index is accidentally named as the join column. This will break a few more tests unfortunately.
closing this. @phofl I am pretty sure you actually broke up the fixes into new PRs. if that is not the case, can you address a specific issue here. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Until now the
_maybe_add_join_keys
function changed the target column in the result, if the join was done over index on one side and the column on the other side. This resulted in taking values from the index and setting them for the target column, which explained the weird behavior in the issues referenced. Together with #34468 this will fix the issues. Both combined will transfer the merged index in these cases and the merged columns with the correct name. As of now this fix will only transfer the values, so the index will be wrong here.We can not skip the method completly, because this would cause issues in case of the column is in both
DataFrames
we have to run through this steps to ensure that the target values are right. So I check if this is only contained in one side.