BUG: Fix regression for groupby.indices in case of unused categories #38649

phofl · 2020-12-22T23:58:34Z

closes REGR: GroupBy.indices no longer includes unobserved categories #38642
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

To show the unused categories, we can dispatch back as we did before #36842. Since Categorical does not support nan, the reason for the switch does not exist in this case. I think handling unused categories in get_indexer_dict is not desirable, because it would introduce a lot of complexity for one corner case that we can handle here pretty easily.
Since we use indices in the grouper only for categoricals, we do not have to worry about sorting, as the test shows.
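
For readers unfamiliar with the regression, the following is a minimal sketch of what an indices mapping that keeps unused categories looks like. The helper `indices_with_unused` and the sample data are hypothetical and exist only for illustration; this is not the code changed in this PR. The final `groupby` call is there for comparison, and its output depends on the installed pandas version, so no expected output is given for it.

```python
import numpy as np
import pandas as pd


def indices_with_unused(cat: pd.Categorical) -> dict:
    """Map every category, observed or not, to the positions where it occurs.

    Hypothetical helper for illustration only; not part of pandas.
    """
    codes = np.asarray(cat.codes)
    # Category i is stored as code i; missing values use code -1 and are
    # therefore never matched by this loop.
    return {
        category: np.flatnonzero(codes == code)
        for code, category in enumerate(cat.categories)
    }


# "c" is declared as a category but never appears in the data.
key = pd.Categorical(["b", "b", "a"], categories=["a", "b", "c"])
result = indices_with_unused(key)
print(result)

# For comparison: the mapping produced by groupby itself. Whether "c"
# appears here depends on the pandas version in use.
df = pd.DataFrame({"key": key, "col": range(3)})
print(df.groupby("key", observed=False).indices)
```

The helper returns an empty position array for "c", which is the shape of result that issue #38642 reports as missing from `indices`.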

cc @jorisvandenbossche @mroeschke

jreback · 2020-12-23T15:56:01Z

pandas/core/groupby/ops.py

@@ -241,6 +242,11 @@ def apply(self, f: F, data: FrameOrSeries, axis: int = 0):
    @cache_readonly
    def indices(self):
        """ dict {group name -> group indices} """
+        if len(self.groupings) == 1 and isinstance(


Whoa, this is a lot of special casing here. Why is this added?

See the explanation in the top post

I read it and don't buy it. Why are we supporting this case at all? I would rather simply fix this breaking or not.

You mean not showing unused categories for indices?

I see, we did have this special case before, but only for len(self.groupings) == 1. It's the special isinstance I object to.

jorisvandenbossche · 2020-12-23T18:40:12Z

Since Categorical does not support nan, the reason for the switch does not exist in this case.

@phofl Categorical does actually support NaNs, but maybe this is not covered by the tests?

jorisvandenbossche · 2020-12-23T18:41:53Z

It seems this might actually also be broken for Categorical (or not fixed by #36842). This is what I see on master:

In [8]: df = pd.DataFrame({'key': pd.Categorical(['a', 'b', 'a', 'b', np.nan]), 'col': range(5)})

In [9]: df.groupby('key', dropna=False).indices
Out[9]: {'b': array([1, 3]), 'a': array([0, 2])}

In [10]: df.astype({"key": object}).groupby('key', dropna=False).indices
Out[10]: {'a': array([0, 2]), 'b': array([1, 3]), nan: array([4])}

phofl · 2020-12-23T18:43:08Z

Hm, I misunderstood #35646 (comment) then. I will have to look into this.

jorisvandenbossche · 2020-12-23T18:45:17Z

Ah, no, I was not aware that categorical with dropna=False might knowingly not be supported at the moment.
cc @mroeschke is there a reason that we wouldn't want to support that?

BTW, if this is the case, the docs of the dropna keyword should mention this.

jreback · 2020-12-23T18:52:15Z

pandas/core/groupby/ops.py

@@ -241,6 +242,11 @@ def apply(self, f: F, data: FrameOrSeries, axis: int = 0):
    @cache_readonly
    def indices(self):
        """ dict {group name -> group indices} """
+        if len(self.groupings) == 1 and isinstance(


I see, we did have this special case before, but only for len(self.groupings) == 1. It's the special isinstance I object to.

mroeschke · 2020-12-23T18:53:53Z

My comment was assuming that adding a new dropna keyword argument to Categorical, indicating that nan shouldn't be a missing value but a category itself, wouldn't be a popular change. That assumption might be incorrect though.

jreback · 2020-12-23T18:55:06Z

My comment was assuming that adding a new dropna keyword argument to Categorical, indicating that nan shouldn't be a missing value but a category itself, wouldn't be a popular change. That assumption might be incorrect though.

not happening, we already removed this case thru a deprecation cycle and no reason to add it back

jorisvandenbossche · 2020-12-23T19:07:34Z

Categorical indeed does not support NaN as a category, but does that mean it could not be supported in groupby(dropna=False)? A categorical can have missing values, so the resulting index of groupby(dropna=False) would not have NaN as category, but as missing value (as -1 in the codes)
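
The distinction between NaN as a category and NaN as a missing value can be checked directly on a Categorical. This is a small sketch using the same key values as the example earlier in the thread; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "b", "a", "b", np.nan])

# NaN is not stored as a category ...
categories = list(cat.categories)
# ... it is stored as the sentinel code -1 in the integer codes.
codes = cat.codes.tolist()
# Positions of the missing values, recovered from the codes.
missing_positions = np.flatnonzero(cat.codes == -1).tolist()

print(categories)         # ['a', 'b']
print(codes)              # [0, 1, 0, 1, -1]
print(missing_positions)  # [4]
```

So a groupby(dropna=False) result could, in principle, carry a NaN group without NaN ever becoming a category.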

mroeschke · 2020-12-23T19:30:14Z

I imagine it should be supported in groupby(dropna=False).

At the time I was probably focusing too much on Grouper.indices unaware that there was also BaseGrouper.indices (and the relationship between them) because Grouper.indices was relevant for a groupby rolling bug I was hunting down.

jreback · 2020-12-29T18:57:50Z

@phofl can you merge master and add a note for 1.2.1

phofl · 2020-12-29T20:58:02Z

Done

jreback · 2020-12-29T23:16:34Z

thanks @phofl very nice

jreback · 2020-12-29T23:16:44Z

@meeseeksdev backport 1.2.x


phofl added 2 commits December 23, 2020 00:54

BUG: Fix regression for groupby.indices in case of unused categories

523a862

Add comment

80952f8

phofl added the Categorical, Groupby, and Regression labels Dec 22, 2020

Remove pd

530361f

jreback requested changes Dec 23, 2020


simonjayhawkins mentioned this pull request Dec 23, 2020

RLS: 1.2 #37784

Closed

jreback requested changes Dec 23, 2020


simonjayhawkins added this to the 1.2.1 milestone Dec 29, 2020

phofl added 4 commits December 29, 2020 21:55

Change test

c689f63

Merge branch 'master' of https://github.com/pandas-dev/pandas into 38642

e1c9aa8

Add whatsnew

fac8985

Change gh reference

86df640

jreback approved these changes Dec 29, 2020


jreback merged commit 2ae017b into pandas-dev:master Dec 29, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 29, 2020

Backport PR pandas-dev#38649: BUG: Fix regression for groupby.indices…

e10a87d

… in case of unused categories

meeseeksmachine mentioned this pull request Dec 29, 2020

Backport PR #38649 on branch 1.2.x (BUG: Fix regression for groupby.indices in case of unused categories) #38790

Merged

phofl deleted the 38642 branch December 30, 2020 09:10

simonjayhawkins pushed a commit that referenced this pull request Dec 30, 2020

Backport PR #38649: BUG: Fix regression for groupby.indices in case o…

7550eed

…f unused categories (#38790) Co-authored-by: patrick <61934744+phofl@users.noreply.github.com>

phofl mentioned this pull request Jan 1, 2021

Major Performance regression of df.groupby(..).indices #38495

Closed

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

BUG: Fix regression for groupby.indices in case of unused categories (p…

34e4aea

…andas-dev#38649)
