BUG: Merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index} #28189 #28296

hugoecarl · 2019-09-05T14:01:21Z

closes Merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index} #28189
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This change resolves the error reported in issue #28189, but the merge still does not work as expected. There seems to be a separate bug related to left joins, as described in issues #28220 and #28243.

I'm opening this pull request for #28189 in case you want to resolve that bug separately from the left-join problem. I can also keep working on the left-join problem, and help is welcome.

WillAyd

Thanks for the PR!

WillAyd · 2019-09-05T15:28:41Z

pandas/tests/test_join.py

+def test_left_index_and_right_index_true():
+    # From issue 28189
+
+    pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]})


Construction here is easier if you don't use a dict. You should just be able to do:

pd.DataFrame(range(4), columns=["value"], index=pd.Index(pd.Categorical(["1"] * 4), name="idx"))

(I changed 1, 2, 3, 4 to 0, 1, 2, 3 in this example but otherwise equivalent)
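To make the suggestion concrete, here is a minimal sketch comparing the dict-plus-`set_index` construction from the test with the direct construction proposed above. The variable names `via_dict` and `direct` are illustrative and not part of the PR.

```python
import pandas as pd

# Construction used in the test: build from a dict, then move "idx" into the index.
via_dict = pd.DataFrame(
    {"idx": pd.Categorical(["1"] * 4), "value": [0, 1, 2, 3]}
).set_index("idx")

# Direct construction, as suggested in the review comment.
direct = pd.DataFrame(
    range(4),
    columns=["value"],
    index=pd.Index(pd.Categorical(["1"] * 4), name="idx"),
)

# Both frames carry a categorical index named "idx" and the same values.
print(type(direct.index).__name__, direct.index.name)
print(via_dict.equals(direct))
```

The direct form avoids the intermediate `set_index` call, so the test has one less step that could fail for unrelated reasons.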

WillAyd · 2019-09-05T15:29:27Z

pandas/tests/test_join.py

+
+    pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]})
+    pdf = pdf.set_index("idx")
+    agg = pdf.groupby("idx").agg(np.sum)["value"]


Just construct this directly - the fewer moving parts in a test the better

Fixed the problems you pointed out.

Suggested change

-    agg = pdf.groupby("idx").agg(np.sum)["value"]
+    df2 = pd.DataFrame([[6]], columns=["value"], index=pd.Index(pd.Categorical([1]), name="idx"))

Minor, but the fewer moving parts the better, so it would be good to get rid of the groupby stuff here.

pandas/_libs/join.pyx

TomAugspurger · 2019-09-05T15:35:59Z

doc/source/whatsnew/v1.0.0.rst

@@ -89,7 +89,7 @@ Categorical
 ^^^^^^^^^^^

 - Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
-
+- Bug in merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index} (:issue:`28189`)


Put code like left_index=True in double backticks.

What does on={index} mean?

It's the case that works in the example from issue #28189!

Sorry, I still don't understand the {index} part. It looks like you're passing a set with a single item, which I don't think is the case.

This could maybe also be changed to df.index.name

WillAyd · 2019-09-05T17:36:56Z

pandas/tests/test_join.py

+def test_left_index_and_right_index_true():
+    # From issue 28189
+
+    pdf = DataFrame(


Can you name this df?

WillAyd · 2019-09-05T17:38:42Z

pandas/tests/test_join.py

+
+    pdf = DataFrame({"idx": Categorical(["1"] * 4), "value": [1, 2, 3, 4]})
+    pdf = pdf.set_index("idx")
+    agg = pdf.groupby("idx").agg(np.sum)["value"]


Suggested change

-    agg = pdf.groupby("idx").agg(np.sum)["value"]
+    df2 = pd.DataFrame([[6]], columns=["value"], index=pd.Index(pd.Categorical([1]), name="idx"))

Minor, but the fewer moving parts the better, so it would be good to get rid of the groupby stuff here.

WillAyd · 2019-09-05T17:41:21Z

pandas/tests/test_join.py

+    agg = pdf.groupby("idx").agg(np.sum)["value"]
+
+    result = merge(pdf, agg, how="left", left_index=True, right_index=True)
+    result = result.reset_index(drop=True)


Is the reset_index required or can the index be included as part of expected?

I left it out of the test because of the issues @hugoecarl mentioned in the PR. If the index is included, the test will break once those issues are fixed. If you want, it could be included, but I would need to change the expected df.

What do you mean by "the test will break once they're fixed"?

What @guipleite is trying to say is that when you use a left join, the index values in the output are wrong. That bug is related to the issues @hugoecarl pointed to in this PR. That being said, if we include the index as part of the expected df, we would have to assert the wrong output to show that the bug addressed by this PR is solved. Then, once those other issues are fixed, this test would be outdated. Does that make sense to you?

IIUC you are saying there are multiple bugs and this solves one of them, but the other would need to be solved separately, right? If so, is there an open issue for that? Might have missed it in this PR

Yes! This PR specifically solves issue #28189. The other bug, which we found by accident, has already been reported in issues #28220 and #28243.

Hmm but don't those issues have to do with missing data on merge? Maybe missing the point but don't see how applicable here. Isn't the expected index still just Index(Categorical(["1"] * 4), name="idx")?

I think maybe I'm missing the point here, so I'll take a step back. We worked on issue #28189, where the problem was that merged = pd.merge(pdf, agg, how="left", left_index=True, right_index=True) with a groupby and a categorical index raised an error when it should do the same thing as pd.merge(pdf, agg, how="left", on="idx"), which worked. We found that a Cython file had no support for the int8 and int16 dtypes, so we added it. After that we discovered that, for some unknown reason, the index values in the output were int64 when they should be categorical. After a lot of debugging we could not find the cause, but we decided to make this pull request for the int8 and int16 support. The other issues we mentioned show some bugs related to the merge function and its index output. We thought they might be related to the bugs we are seeing, which were previously hidden by the missing int8 and int16 support.
At this point we think we have reached our limit, and we would need help to continue.
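For context on why int8 and int16 come up at all, the sketch below shows that a `Categorical` stores its values as integer codes whose dtype depends on the number of categories. The variable names are illustrative; how those codes reach the routines in pandas/_libs/join.pyx is not shown here and is taken from the description above.

```python
import numpy as np
import pandas as pd

# A categorical with a single category, as in the test data.
small = pd.Categorical(["1"] * 4)
print(small.codes.dtype)

# With 200 distinct categories the codes no longer fit in int8.
wide = pd.Categorical([str(i) for i in range(200)])
print(wide.codes.dtype)

# The codes are positions into the categories, not the labels themselves.
print(small.codes.tolist(), list(small.categories))
```

Because the codes are small integers rather than the labels, a join routine that only handles wider integer dtypes would not accept them without a cast.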

Hmm OK - I think I understand now. So this change doesn't raise an error but left_index=True and right_index=True still would not yield the same output as on="idx" since the former would drop the index, right?

Exactly that! We just added support for the int8 and int16 types, which uncovered this output bug.
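A minimal sketch of the comparison discussed in this thread, assuming a directly constructed right-hand frame in place of the groupby result. The names `df`, `df2`, `by_index` and `by_name` and the value 10 are illustrative. The index of each result depends on the pandas version, so the script prints it instead of asserting it. In the version discussed here, the first merge call is the one reported to raise.

```python
import pandas as pd

# Left frame: four rows sharing one categorical key.
df = pd.DataFrame(
    [1, 2, 3, 4],
    columns=["value"],
    index=pd.Index(pd.Categorical(["1"] * 4), name="idx"),
)
# Right frame: one row per key, standing in for the groupby sum (1 + 2 + 3 + 4).
df2 = pd.DataFrame(
    [[10]],
    columns=["value"],
    index=pd.Index(pd.Categorical(["1"]), name="idx"),
)

# The call reported as failing in issue #28189.
by_index = pd.merge(df, df2, how="left", left_index=True, right_index=True)
# The call reported as working.
by_name = pd.merge(df, df2, how="left", on="idx")

# Inspect the resulting index of each merge instead of assuming it.
print(type(by_index.index).__name__, by_index.index.dtype)
print(type(by_name.index).__name__, by_name.index.dtype)

# Compare only the data columns, ignoring the index.
print(by_index.reset_index(drop=True).equals(by_name.reset_index(drop=True)))
```

Printing the index type for both results is a quick way to check whether the two call styles agree on a given pandas version before deciding what the test's expected frame should contain.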

WillAyd · 2019-09-11T01:38:21Z

pandas/tests/test_join.py

+    agg = pdf.groupby("idx").agg(np.sum)["value"]
+
+    result = merge(pdf, agg, how="left", left_index=True, right_index=True)
+    result = result.reset_index(drop=True)


Hmm but don't those issues have to do with missing data on merge? Maybe missing the point but don't see how applicable here. Isn't the expected index still just Index(Categorical(["1"] * 4), name="idx")?

jreback · 2019-10-06T23:51:23Z

@hugoecarl can you merge master and we'll have another look.

gabriellm1 · 2019-10-07T12:29:21Z

By merging master you mean updating our branch with upstream?

jreback · 2019-10-08T12:48:13Z

By merging master you mean updating our branch with upstream?

yes

git merge upstream/master

WillAyd · 2019-11-07T23:27:06Z

@hugoecarl @gabriellm1 is this still active? Can you resolve merge conflicts?

WillAyd · 2019-12-17T17:42:54Z

Closing as stale but ping if you'd like to pick back up

gabriellm1 and others added 3 commits August 29, 2019 11:51

cython fix

11cb1f8

added unit test

95c5cf8

cython fix

fd1d3f1

WillAyd requested changes Sep 5, 2019

WillAyd added the Categorical (Categorical Data Type) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Sep 5, 2019

TomAugspurger reviewed Sep 5, 2019

Fixing reviwed unit test

8321075

WillAyd requested changes Sep 5, 2019

guipleite and others added 3 commits September 5, 2019 15:30

Futher fixing reviwed unit test

d938b1b

int16fix

2bbc491

whatsnew format

4e8bf1f

WillAyd requested changes Sep 11, 2019

gabriellm1 added 2 commits October 10, 2019 07:48

merge

be85da6

merge conflits

9915105

AbhijeetKrishnan mentioned this pull request Oct 20, 2019

Fix mypy errors tests.series.test_constructors #29108

Merged

5 tasks

WillAyd closed this Dec 17, 2019

MarcoGorelli mentioned this pull request Feb 18, 2020

BUG: fix in categorical merges #32079

Merged

5 tasks


