Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix regression for groupby.indices in case of unused categories #38649

Merged
merged 7 commits into from
Dec 29, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.2.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ Fixed regressions
- :meth:`to_csv` created corrupted zip files when there were more rows than ``chunksize`` (issue:`38714`)
- Fixed a regression in ``groupby().rolling()`` where :class:`MultiIndex` levels were dropped (:issue:`38523`)
- Bug in repr of float-like strings of an ``object`` dtype having trailing 0's truncated after the decimal (:issue:`38708`)
-
- Fixed regression in :meth:`DataFrame.groupby()` with :class:`Categorical` grouping column not showing unused categories for ``grouped.indices`` (:issue:`38642`)

.. ---------------------------------------------------------------------------

Expand Down
9 changes: 2 additions & 7 deletions pandas/core/groupby/grouper.py
Original file line number Diff line number Diff line change
Expand Up @@ -556,13 +556,8 @@ def indices(self):
if isinstance(self.grouper, ops.BaseGrouper):
return self.grouper.indices

# Return a dictionary of {group label: [indices belonging to the group label]}
# respecting whether sort was specified
codes, uniques = algorithms.factorize(self.grouper, sort=self.sort)
return {
category: np.flatnonzero(codes == i)
for i, category in enumerate(Index(uniques))
}
values = Categorical(self.grouper)
return values._reverse_indexer()

@property
def codes(self) -> np.ndarray:
Expand Down
6 changes: 6 additions & 0 deletions pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@
is_timedelta64_dtype,
needs_i8_conversion,
)
from pandas.core.dtypes.generic import ABCCategoricalIndex
from pandas.core.dtypes.missing import isna, maybe_fill

import pandas.core.algorithms as algorithms
Expand Down Expand Up @@ -241,6 +242,11 @@ def apply(self, f: F, data: FrameOrSeries, axis: int = 0):
@cache_readonly
def indices(self):
""" dict {group name -> group indices} """
if len(self.groupings) == 1 and isinstance(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa, this is so special casing here. why is this added?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the explanation in the top post

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read it and don't buy it. Why are we supporting this case at all? I would rather simply fix this breaking or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean not showing unused categories for indices?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i c we did thave this special case before, but onl was for len(self.groupins) == 1. its the special isinstance i object.

self.result_index, ABCCategoricalIndex
):
# This shows unused categories in indices GH#38642
return self.groupings[0].indices
codes_list = [ping.codes for ping in self.groupings]
keys = [ping.group_index for ping in self.groupings]
return get_indexer_dict(codes_list, keys)
Expand Down
20 changes: 20 additions & 0 deletions pandas/tests/groupby/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -1678,3 +1678,23 @@ def test_df_groupby_first_on_categorical_col_grouped_on_2_categoricals(
df_grp = df.groupby(["a", "b"], observed=observed)
result = getattr(df_grp, func)()
tm.assert_frame_equal(result, expected)


def test_groupby_categorical_indices_unused_categories():
# GH#38642
df = DataFrame(
{
"key": Categorical(["b", "b", "a"], categories=["a", "b", "c"]),
"col": range(3),
}
)
grouped = df.groupby("key", sort=False)
result = grouped.indices
expected = {
"b": np.array([0, 1], dtype="int64"),
"a": np.array([2], dtype="int64"),
"c": np.array([], dtype="int64"),
}
assert result.keys() == expected.keys()
for key in result.keys():
tm.assert_numpy_array_equal(result[key], expected[key])