Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index} #28189 #28296

Closed
wants to merge 9 commits into from

Conversation

hugoecarl
Copy link

This modification resolves the error in issue #28189 but still not working as expected. It seems that there is a bug related with left-join, as you can see in issues #28220 and #28243.

I'm making this pull request for the #28189 in case you want to resolve this bug separately from the left join problem. On the other hand, I can work on this and help is welcome.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

pandas/tests/test_join.py Outdated Show resolved Hide resolved
def test_left_index_and_right_index_true():
# From issue 28189

pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Construction here is easier if you don't use a dict. should just be able to do:

pd.DataFrame(range(4), columns=["value"], index=pd.Index(pd.Categorical(["1"] * 4), name="idx"))

(I changed 1, 2, 3, 4 to 0, 1, 2, 3 in this example but otherwise equivalent)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]})
pdf = pdf.set_index("idx")
agg = pdf.groupby("idx").agg(np.sum)["value"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just construct this directly - the less moving parts in a test the better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the problems you pointed out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
agg = pdf.groupby("idx").agg(np.sum)["value"]
df2 = pd.DataFrame([[6]], columns=["value"], index=pd.Index(pd.Categorical([1]), name="idx"))

Minor but the less moving parts the better, so would be good to get rid of the groupby stuff here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@WillAyd WillAyd added Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 5, 2019
pandas/_libs/join.pyx Show resolved Hide resolved
@@ -89,7 +89,7 @@ Categorical
^^^^^^^^^^^

- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
-
- Bug in merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index} (:issue:`28189`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put code like left_index=True in double backticks.

What does on={index} mean?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the case that work on the example of the issue! #28189

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I still don't understand the {index} part. It looks like you're passing a set with a single item, which I don't think is the case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could maybe also be changed to df.index.name

def test_left_index_and_right_index_true():
# From issue 28189

pdf = DataFrame(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you name this df?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!


pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]})
pdf = pdf.set_index("idx")
agg = pdf.groupby("idx").agg(np.sum)["value"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
agg = pdf.groupby("idx").agg(np.sum)["value"]
df2 = pd.DataFrame([[6]], columns=["value"], index=pd.Index(pd.Categorical([1]), name="idx"))

Minor but the less moving parts the better, so would be good to get rid of the groupby stuff here

agg = pdf.groupby("idx").agg(np.sum)["value"]

result = merge(pdf, agg, how="left", left_index=True, right_index=True)
result = result.reset_index(drop=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reset_index required or can the index be included as part of expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it out of the test because of the issues @hugoecarl mentioned on the PR. If the index are included, the test will break once they're fixed, but if you want, it could be included and I would need to change the expected df.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "the test will break once they're fixed"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @guipleite is trying to say is that when you use the left-join, there is a problem with the index values output. The bug is related with the issues @hugoecarl pointed here in the PR. With that been said, if we include index as a part of the expected df, we would have to force the wrong output to prove that the bug pointed on this PR is solved, then when those other issues are corrected, this test would be outdated. Does it make sense to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC you are saying there are multiple bugs and this solves one of them, but the other would ned to be solved separately right? If so is there an open issue for that? Might have missed it in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This is particularly solving issue #28189 . This other bug that we accidentally found have already been reported in issues #28220 and #28243 .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but don't those issues have to do with missing data on merge? Maybe missing the point but don't see how applicable here. Isn't the expected index still just Index(Categorical(["1"] * 4), name="idx")?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe I'm missing the point here, so I'll take a step back. We worked on issue #28189 and the problem was that merged = pd.merge(pdf, agg, how="left", left_index=True, right_index=True) with groupby and categorical index was returning an error when it should do the same thing as pd.merge(pdf, agg, how="left", on="idx") which worked. We understood that in a Cython file there was no support ford type int8 and int16, so we added. After that we discovered the for some unknown reason the index values output was int64 where it should be categorical. After a lot of debugging we concluded that we could not find the problem but we decided do make this pull request for the int8 and int16 support. This other issues mentioned show some bugs related to merge function and index output. We thought that maybe this is relatable with this bugs we are having, that were actually hidden by the int8 and int16 non-support.
At this point we think we reached our limit and to continue this, help would be needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK - I think I understand now. So this change doesn't raise an error but left_index=True and right_index=True still would not yield the same output as on="idx" since the former would drop the index, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly that! We just added the support to int8 and int16 type that uncovered this output bug.

agg = pdf.groupby("idx").agg(np.sum)["value"]

result = merge(pdf, agg, how="left", left_index=True, right_index=True)
result = result.reset_index(drop=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but don't those issues have to do with missing data on merge? Maybe missing the point but don't see how applicable here. Isn't the expected index still just Index(Categorical(["1"] * 4), name="idx")?

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

@hugoecarl can you merge master and we'll have another look.

@gabriellm1
Copy link
Contributor

By merging master you mean updating our branch with upstream?

@jreback
Copy link
Contributor

jreback commented Oct 8, 2019

By merging master you mean updating our branch with upstream?

yes

git merge upstream/master

@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2019

@hugoecarl @gabriellm1 is this still active? Can you resolve merge conflicts?

@WillAyd
Copy link
Member

WillAyd commented Dec 17, 2019

Closing as stale but ping if you'd like to pick back up

@WillAyd WillAyd closed this Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index}
6 participants