REGR: GroupBy.indices no longer includes unobserved categories #38642

jorisvandenbossche · 2020-12-22T21:32:49Z

Does anybody know if this was an intentional change? (I don't directly find something about it in the whatsnew)

In [9]: pd.__version__ 
Out[9]: '1.0.5'

In [10]: df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)}) 

In [11]: gb = df.groupby("key")   

In [12]: list(gb.indices)  
Out[12]: ['a', 'b', 'c', 'd']

vs

In [1]: pd.__version__
Out[1]: '1.3.0.dev0+92.ga2d10ba88a'

In [2]: df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)})

In [3]: gb = df.groupby("key")

In [4]: list(gb.indices)
Out[4]: ['b']

This already changed in pandas 1.1, so not a recent change.

The consequence of this is that iterating over gb vs iterating over gb.indices is not consistent anymore.

cc @mroeschke @rhshadrach


jorisvandenbossche · 2020-12-22T21:40:00Z

So for example, those two APIs still return all values (both for pandas 1.0 and master):

In [6]: gb.groups
Out[6]: {'a': [], 'b': [0, 1, 2, 3, 4], 'c': [], 'd': []}

In [7]: [key for key, group in gb]
Out[7]: ['a', 'b', 'c', 'd']

So it seems .indices should be consistent with it?
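A minimal sketch for comparing the three views on whichever pandas version is installed. The helper name `group_key_views` is made up for illustration, and `observed=False` is passed explicitly as an assumption so the result does not depend on the default. No expected output is shown, since it differs between the versions discussed in this issue.

```python
import pandas as pd


def group_key_views(df, by, observed=False):
    """Collect the group keys reported by .groups, .indices and iteration.

    Hypothetical helper, not part of pandas.
    """
    gb = df.groupby(by, observed=observed)
    return {
        "groups": list(gb.groups),          # keys of the label -> index mapping
        "indices": list(gb.indices),        # keys of the label -> positions mapping
        "iter": [key for key, _ in gb],     # keys yielded by GroupBy.__iter__
    }


df = pd.DataFrame(
    {
        "key": pd.Categorical(["b"] * 5, categories=["a", "b", "c", "d"]),
        "col": range(5),
    }
)

views = group_key_views(df, "key")
for name, keys in views.items():
    print(name, keys)

# True only if all three views report the same set of keys.
consistent = set(views["groups"]) == set(views["indices"]) == set(views["iter"])
print("consistent:", consistent)
```

If `consistent` prints `False`, the installed version shows the mismatch described above.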

mroeschke · 2020-12-22T22:40:30Z

This may have been unintentionally changed by me in https://github.com/pandas-dev/pandas/pull/36911/files

phofl · 2020-12-22T22:40:34Z

I looked into this, I think the new case is more consistent maybe?
on 1.0.5 and master we get:

df = pd.DataFrame({"key": pd.Categorical(["b"]*5 + ["c"], categories=["a", "b", "c", "d"]), "key1": [1,1,1,2,2,3], "col": range(6)})

gb = df.groupby(["key", "key1"])
gb.groups
gb.indices

returned

{('b', 1): Int64Index([0, 1, 2], dtype='int64'), ('b', 2): Int64Index([3, 4], dtype='int64'), ('c', 3): Int64Index([5], dtype='int64')}
{('b', 1): array([0, 1, 2]), ('b', 2): array([3, 4]), ('c', 3): array([5])}

While a one dimensional group key returned what you showed above. The missing categories case would be tricky to handle with multidimensional keys. Maybe it would be better to remove unused categories from groups too? Or should the one-dimensional case be special here?
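To see which key combinations are left out in the multi-key case, something along these lines could be used. It is only a sketch: the variable names are illustrative, `observed=False` is set explicitly as an assumption, and the result may differ between pandas versions, so no output is shown.

```python
import itertools

import pandas as pd

df = pd.DataFrame(
    {
        "key": pd.Categorical(["b"] * 5 + ["c"], categories=["a", "b", "c", "d"]),
        "key1": [1, 1, 1, 2, 2, 3],
        "col": range(6),
    }
)

gb = df.groupby(["key", "key1"], observed=False)

# Every (category, key1) pair that could exist, observed or not.
all_combos = {
    (cat, int(k1))
    for cat, k1 in itertools.product(df["key"].cat.categories, df["key1"].unique())
}

# Pairs that the two mappings do not report.
missing_from_indices = sorted(all_combos - set(gb.indices))
missing_from_groups = sorted(all_combos - set(gb.groups))

print("missing from .indices:", missing_from_indices)
print("missing from .groups: ", missing_from_groups)
```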

phofl · 2020-12-22T22:44:12Z

@mroeschke no that was not the reason. I think this was caused by c4226d4

mroeschke · 2020-12-22T22:45:14Z

Thanks for confirming @phofl

phofl · 2020-12-22T22:45:44Z

Addition: We are no longer running through there since #36842

jorisvandenbossche · 2020-12-22T22:50:36Z

@phofl thanks for looking at it!

> I looked into this, I think the new case is more consistent maybe?

Indeed, for multiple keys we do not seem to include unobserved categories. But in that case .indices, .groups and GroupBy.__iter__ all behave the same way, so they are consistent with each other.
For a single key, whether or not that is consistent with the multiple-key case, .indices is now inconsistent with .groups.

So to make it fully consistent, .groups and GroupBy.__iter__ would also have to change for the single-key case. But that would be a breaking change as well ...
And I am not sure that would necessarily be a good change: we do include the unobserved categories in the output of e.g. an aggregation on the groupby object, so I would expect those empty groups to be included when iterating over the groupby object as well?
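The aggregation behaviour referred to here can be checked with a short sketch like the following. `size()` is just one convenient aggregation to use, and `observed=False` is passed explicitly as an assumption rather than relying on the default.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": pd.Categorical(["b"] * 5, categories=["a", "b", "c", "d"]),
        "col": range(5),
    }
)

# Aggregate over the categorical key while keeping unobserved categories.
sizes = df.groupby("key", observed=False).size()
print(sizes)

# Labels of the aggregation result, to compare with list(gb.indices).
agg_keys = list(sizes.index)
print(agg_keys)
```

With `observed=False` the result has one row per category, with a size of 0 for the unobserved ones, which is the behaviour that iteration and `.indices` would be expected to match.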

jsignell · 2020-12-22T22:57:21Z

Passing observed has no impact on .indices, but maybe it should? I think this behavior would not be surprising:

>>> df = pd.DataFrame({"key": pd.Categorical(["b"]*5, categories=["a", "b", "c", "d"]), "col": range(5)})
>>> gb = df.groupby("key", observed=True)
>>> list(gb.indices)
['b']
>>> gb = df.groupby("key", observed=False)
>>> list(gb.indices)
['a', 'b', 'c', 'd']

phofl · 2020-12-22T22:57:42Z

Have to correct myself, this was changed by #36842

@jorisvandenbossche When testing this on 1.1.0 and 1.1.5 I get

{'a': [], 'b': [0, 1, 2, 3, 4], 'c': [5], 'd': []}
{'b': array([0, 1, 2, 3, 4]), 'c': array([5])}

both times.

Edit: Changed the example a bit.

df = pd.DataFrame({"key": pd.Categorical(["b"]*5 + ["c"], categories=["a", "b", "c", "d"]), "col": range(6)})

jorisvandenbossche · 2020-12-23T12:09:31Z

I think the pointer of @mroeschke to #36911 might be more correct, since that was a PR for 1.1.4, while #36842 only for 1.2.0.

And unlike what I said earlier (I thought it was only working on 1.0, and not in 1.1.x), this actually only changed from 1.1.3 to 1.1.4.

simonjayhawkins · 2020-12-23T13:22:12Z

> I think the pointer of @mroeschke to #36911 might be more correct, since that was a PR for 1.1.4, while #36842 only for 1.2.0.

can confirm, first bad commit: [345efdd] BUG: RollingGroupby not respecting sort=False (#36911)

phofl · 2020-12-23T14:24:32Z

Hm, I just looked at the PR numbers, not at when they were merged.

Nevertheless, we have to change both commits to get the original result, because the code path from #36911 is currently not used on master.

jorisvandenbossche added the Groupby and Regression (functionality that used to work in a prior pandas version) labels Dec 22, 2020

jorisvandenbossche mentioned this issue Dec 22, 2020

Test pandas 1.1.x / 1.2.0 releases and pandas nightly dask/dask#6996

Merged

phofl mentioned this issue Dec 22, 2020

BUG: Fix regression for groupby.indices in case of unused categories #38649

Merged


simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 23, 2020

code sample for pandas-dev#38642

bfc2da2

jorisvandenbossche added this to the 1.2.1 milestone Dec 28, 2020

jreback closed this as completed in #38649 Dec 29, 2020

ggold7046 mentioned this issue Aug 10, 2023

Modified doc/make.py to run sphinx-build -b linkcheck #54265

Merged
