BUG: Fix regression for groupby.indices in case of unused categories #38649

Merged
merged 7 commits into from
Dec 29, 2020

phofl
Member

@phofl phofl commented Dec 22, 2020

To show the unused categories, we can dispatch back as we did before #36842. Since Categorical does not support nan, the reason for the switch does not exist in this case. I think handling unused categories in get_indexer_dict is not desirable, because it would introduce a lot of complexity for one corner case that we can handle here pretty easily.
Since we use indices in the grouper only for categoricals, we do not have to worry about sorting, as the test shows.
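
As a rough illustration of the behavior being restored (a pure-Python sketch, not the pandas implementation; `indices_from_codes` is a hypothetical helper invented for this example), a categorical's integer codes can be mapped back onto every category, so unused categories still appear with an empty list of positions:

```python
def indices_from_codes(codes, categories):
    """Map each category to the positions where it occurs.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    # Start from every category, so unused ones keep an empty entry.
    result = {category: [] for category in categories}
    for position, code in enumerate(codes):
        if code == -1:  # -1 marks a missing value, which has no category
            continue
        result[categories[code]].append(position)
    return result


# Values ["b", "b", "a"] with categories ["a", "b", "c"]; "c" is unused.
example = indices_from_codes([1, 1, 0], ["a", "b", "c"])
print(example)  # {'a': [2], 'b': [0, 1], 'c': []}
```

Building the result from the categories rather than from the observed values is what keeps "c" in the output.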

cc @jorisvandenbossche @mroeschke

@phofl phofl added the Categorical (Categorical Data Type), Groupby, and Regression (Functionality that used to work in a prior pandas version) labels Dec 22, 2020
@@ -241,6 +242,11 @@ def apply(self, f: F, data: FrameOrSeries, axis: int = 0):
    @cache_readonly
    def indices(self):
        """ dict {group name -> group indices} """
        if len(self.groupings) == 1 and isinstance(
Contributor

woa, this is so special casing here. why is this added?

Member

See the explanation in the top post

Member Author

thx

Contributor

I read it and don't buy it. Why are we supporting this case at all? I would rather simply fix this breaking or not.

Member Author

You mean not showing unused categories for indices?

Contributor

I see, we did have this special case before, but it was only for len(self.groupings) == 1. It's the special isinstance I object to.

@simonjayhawkins simonjayhawkins mentioned this pull request Dec 23, 2020
@jorisvandenbossche
Member

Since Categorical does not support nan, the reason for the switch does not exist in this case.

@phofl Categorical does actually support NaNs, but maybe this is not covered by the tests?

@jorisvandenbossche
Member

It seems this might actually also be broken for Categorical (or not fixed by #36842). This is what I see on master:

In [8]: df = pd.DataFrame({'key': pd.Categorical(['a', 'b', 'a', 'b', np.nan]), 'col': range(5)})

In [9]: df.groupby('key', dropna=False).indices
Out[9]: {'b': array([1, 3]), 'a': array([0, 2])}

In [10]: df.astype({"key": object}).groupby('key', dropna=False).indices
Out[10]: {'a': array([0, 2]), 'b': array([1, 3]), nan: array([4])}

@phofl
Member Author

phofl commented Dec 23, 2020

Hm, I misunderstood #35646 (comment) then. I'll have to look into this.

@jorisvandenbossche
Member

jorisvandenbossche commented Dec 23, 2020

Ah, no, I was not aware that categorical with dropna=False might knowingly be unsupported at the moment.
cc @mroeschke is there a reason that we wouldn't want to support that?

BTW, if this is the case, the docs of the dropna keyword should mention this.

@mroeschke
Member

My comment was assuming that adding a new dropna keyword argument to Categorical, indicating that nan shouldn't be a missing value but a category itself, wouldn't be a popular change. That assumption might be incorrect, though.

@jreback
Contributor

jreback commented Dec 23, 2020

My comment was assuming that adding a new dropna keyword argument to Categorical, indicating that nan shouldn't be a missing value but a category itself, wouldn't be a popular change. That assumption might be incorrect, though.

Not happening. We already removed this case through a deprecation cycle, and there is no reason to add it back.

@jorisvandenbossche
Member

Categorical indeed does not support NaN as a category, but does that mean it could not be supported in groupby(dropna=False)? A categorical can have missing values, so the resulting index of groupby(dropna=False) would have NaN not as a category, but as a missing value (as -1 in the codes).
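
To make the distinction concrete, here is a minimal pure-Python sketch (an assumption-laden illustration, not how pandas implements groupby; `group_positions` and `MISSING` are hypothetical names). Missing values are carried as code -1, and with dropna=False they form their own group while the list of categories stays untouched:

```python
# Single NaN object reused as the dict key, so lookups by identity work.
MISSING = float("nan")


def group_positions(codes, categories, dropna=True):
    """Group row positions by category code (hypothetical helper)."""
    groups = {category: [] for category in categories}
    missing = []
    for position, code in enumerate(codes):
        if code == -1:  # missing value: not one of the categories
            missing.append(position)
        else:
            groups[categories[code]].append(position)
    if not dropna and missing:
        # Missing rows form their own group; `categories` is left unchanged.
        groups[MISSING] = missing
    return groups


# Mirrors ['a', 'b', 'a', 'b', nan] with categories ['a', 'b'].
codes, categories = [0, 1, 0, 1, -1], ["a", "b"]
dropped = group_positions(codes, categories)
kept = group_positions(codes, categories, dropna=False)
print(dropped)        # {'a': [0, 2], 'b': [1, 3]}
print(kept[MISSING])  # [4]
```

In this sketch the missing rows get a group key without NaN ever being added to the categories.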

@mroeschke
Member

I imagine it should be supported in groupby(dropna=False).

At the time I was probably focusing too much on Grouper.indices, unaware that there was also BaseGrouper.indices (and of the relationship between them), because Grouper.indices was relevant to a groupby rolling bug I was hunting down.

@simonjayhawkins simonjayhawkins added this to the 1.2.1 milestone Dec 29, 2020
@jreback
Contributor

jreback commented Dec 29, 2020

@phofl can you merge master and add a note for 1.2.1

@phofl
Member Author

phofl commented Dec 29, 2020

Done

@jreback jreback merged commit 2ae017b into pandas-dev:master Dec 29, 2020
@jreback
Contributor

jreback commented Dec 29, 2020

thanks @phofl very nice

@jreback
Contributor

jreback commented Dec 29, 2020

@meeseeksdev backport 1.2.x

@phofl phofl deleted the 38642 branch December 30, 2020 09:10
simonjayhawkins pushed a commit that referenced this pull request Dec 30, 2020
…f unused categories (#38790)

Co-authored-by: patrick <61934744+phofl@users.noreply.github.com>
Labels
Categorical Categorical Data Type Groupby Regression Functionality that used to work in a prior pandas version
Development

Successfully merging this pull request may close these issues.

REGR: GroupBy.indices no longer includes unobserved categories
5 participants