-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index} #28189 #28296
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
pandas/tests/test_join.py
Outdated
def test_left_index_and_right_index_true(): | ||
# From issue 28189 | ||
|
||
pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Construction here is easier if you don't use a dict. should just be able to do:
pd.DataFrame(range(4), columns=["value"], index=pd.Index(pd.Categorical(["1"] * 4), name="idx"))
(I changed 1, 2, 3, 4 to 0, 1, 2, 3 in this example but otherwise equivalent)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
pandas/tests/test_join.py
Outdated
|
||
pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]}) | ||
pdf = pdf.set_index("idx") | ||
agg = pdf.groupby("idx").agg(np.sum)["value"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just construct this directly - the less moving parts in a test the better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the problems you pointed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agg = pdf.groupby("idx").agg(np.sum)["value"] | |
df2 = pd.DataFrame([[6]], columns=["value"], index=pd.Index(pd.Categorical([1]), name="idx")) |
Minor but the less moving parts the better, so would be good to get rid of the groupby stuff here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -89,7 +89,7 @@ Categorical | |||
^^^^^^^^^^^ | |||
|
|||
- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`) | |||
- | |||
- Bug in merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index} (:issue:`28189`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put code like left_index=True in double backticks.
What does on={index}
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the case that work on the example of the issue! #28189
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I still don't understand the {index}
part. It looks like you're passing a set with a single item, which I don't think is the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could maybe also be changed to df.index.name
pandas/tests/test_join.py
Outdated
def test_left_index_and_right_index_true(): | ||
# From issue 28189 | ||
|
||
pdf = DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you name this df
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
pandas/tests/test_join.py
Outdated
|
||
pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]}) | ||
pdf = pdf.set_index("idx") | ||
agg = pdf.groupby("idx").agg(np.sum)["value"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agg = pdf.groupby("idx").agg(np.sum)["value"] | |
df2 = pd.DataFrame([[6]], columns=["value"], index=pd.Index(pd.Categorical([1]), name="idx")) |
Minor but the less moving parts the better, so would be good to get rid of the groupby stuff here
agg = pdf.groupby("idx").agg(np.sum)["value"] | ||
|
||
result = merge(pdf, agg, how="left", left_index=True, right_index=True) | ||
result = result.reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reset_index required or can the index be included as part of expected
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it out of the test because of the issues @hugoecarl mentioned on the PR. If the index are included, the test will break once they're fixed, but if you want, it could be included and I would need to change the expected
df.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "the test will break once they're fixed"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What @guipleite is trying to say is that when you use the left-join, there is a problem with the index values output. The bug is related with the issues @hugoecarl pointed here in the PR. With that been said, if we include index as a part of the expected df, we would have to force the wrong output to prove that the bug pointed on this PR is solved, then when those other issues are corrected, this test would be outdated. Does it make sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC you are saying there are multiple bugs and this solves one of them, but the other would ned to be solved separately right? If so is there an open issue for that? Might have missed it in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm but don't those issues have to do with missing data on merge? Maybe missing the point but don't see how applicable here. Isn't the expected index still just Index(Categorical(["1"] * 4), name="idx")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe I'm missing the point here, so I'll take a step back. We worked on issue #28189 and the problem was that merged = pd.merge(pdf, agg, how="left", left_index=True, right_index=True)
with groupby and categorical index was returning an error when it should do the same thing as pd.merge(pdf, agg, how="left", on="idx")
which worked. We understood that in a Cython file there was no support ford type int8 and int16, so we added. After that we discovered the for some unknown reason the index values output was int64 where it should be categorical. After a lot of debugging we concluded that we could not find the problem but we decided do make this pull request for the int8 and int16 support. This other issues mentioned show some bugs related to merge function and index output. We thought that maybe this is relatable with this bugs we are having, that were actually hidden by the int8 and int16 non-support.
At this point we think we reached our limit and to continue this, help would be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK - I think I understand now. So this change doesn't raise an error but left_index=True
and right_index=True
still would not yield the same output as on="idx"
since the former would drop the index, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly that! We just added the support to int8 and int16 type that uncovered this output bug.
agg = pdf.groupby("idx").agg(np.sum)["value"] | ||
|
||
result = merge(pdf, agg, how="left", left_index=True, right_index=True) | ||
result = result.reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm but don't those issues have to do with missing data on merge? Maybe missing the point but don't see how applicable here. Isn't the expected index still just Index(Categorical(["1"] * 4), name="idx")
?
@hugoecarl can you merge master and we'll have another look. |
By merging master you mean updating our branch with upstream? |
yes
|
@hugoecarl @gabriellm1 is this still active? Can you resolve merge conflicts? |
Closing as stale but ping if you'd like to pick back up |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This modification resolves the error in issue #28189 but still not working as expected. It seems that there is a bug related with left-join, as you can see in issues #28220 and #28243.
I'm making this pull request for the #28189 in case you want to resolve this bug separately from the left join problem. On the other hand, I can work on this and help is welcome.