Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Cannot find maxima of a categorical series #15641

Closed
rjzamora opened this issue May 2, 2024 · 5 comments · Fixed by #15701
Closed

[BUG] Cannot find maxima of a categorical series #15641

rjzamora opened this issue May 2, 2024 · 5 comments · Fixed by #15701
Labels
bug Something isn't working

Comments

@rjzamora
Copy link
Member

rjzamora commented May 2, 2024

Describe the bug
While debugging some strange dask-expr + cudf sorting behavior, I realized that I cannot call ser.min() when ser is a categorical Series. This is a problem for the sort_values logic used in dask.

Steps/Code to reproduce bug

import cudf

ser = cudf.Series(range(10), dtype="category")
ser.min()
# Same problem with `ser.cat.as_ordered().min()`
...
TypeError: Cannot interpret 'CategoricalDtype(categories=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ordered=False, categories_dtype=int64)' as a data type

Expected behavior
I'd expect for min/max to work (assuming the dtype is "ordered").

@wence-
Copy link
Contributor

wence- commented May 3, 2024

Hmmm. Can you try:

diff --git a/python/cudf/cudf/core/column/categorical.py b/python/cudf/cudf/core/column/categorical.py
index e3e7303504..fc996e6b6a 100644
--- a/python/cudf/cudf/core/column/categorical.py
+++ b/python/cudf/cudf/core/column/categorical.py
@@ -515,6 +515,10 @@ class CategoricalColumn(column.ColumnBase):
     dtype: cudf.core.dtypes.CategoricalDtype
     _codes: Optional[NumericalColumn]
     _children: Tuple[NumericalColumn]
+    _VALID_REDUCTIONS = {
+        "max",
+        "min",
+    }
     _VALID_BINARY_OPERATIONS = {
         "__eq__",
         "__ne__",
@@ -699,6 +703,27 @@ class CategoricalColumn(column.ColumnBase):
             ),
         )
 
+    def _reduce(
+        self,
+        op: str,
+        skipna: Optional[bool] = None,
+        min_count: int = 0,
+        *args,
+        **kwargs,
+    ) -> ScalarLike:
+        # Only valid reductions are min and max
+        if not self.ordered:
+            raise TypeError(
+                "Categorical is not ordered for operation min "
+                "you can use .as_ordered() to change the Categorical "
+                "to an ordered one."
+            )
+        return self._encode(
+            self.codes._reduce(
+                op=op, skipna=skipna, min_count=min_count, *args, **kwargs
+            )
+        )
+
     def _binaryop(self, other: ColumnBinaryOperand, op: str) -> ColumnBase:
         other = self._wrap_binop_normalization(other)
         # TODO: This is currently just here to make mypy happy, but eventually

Aside, it is mind-boggling to me that unordered_cat.sort_values() is allowed, but unordered_cat.min() is not. How can you sort if you can't produce a minimum element?

@vyasr
Copy link
Contributor

vyasr commented May 7, 2024

@rjzamora any chance you tried out @wence- 's snippet above?

Aside, it is mind-boggling to me that unordered_cat.sort_values() is allowed, but unordered_cat.min() is not. How can you sort if you can't produce a minimum element?

A weak ordering and we don't like ties??? I don't know, I agree that this seems nonsensical on its face.

@rjzamora
Copy link
Member Author

rjzamora commented May 8, 2024

any chance you tried out @wence- 's snippet above?

Not yet - Thanks for suggestion @wence-! I will test this out soon.

@wence-
Copy link
Contributor

wence- commented May 8, 2024

A weak ordering and we don't like ties??? I don't know, I agree that this seems nonsensical on its face.

I did some digging, it's a combination of matching R's factor API and implementation leaking into semantics. See discussions pandas-dev/pandas#9611, pandas-dev/pandas#9622, and pandas-dev/pandas#12785

Effectively by having sort-based groupby be a promise, you're backed into a corner by having unordered categoricals as groupby keys. Of course, the right answer is to say "no can do", but ...

@rjzamora
Copy link
Member Author

rjzamora commented May 8, 2024

Added a test to the patch suggested by @wence- and submitted #15701

rapids-bot bot pushed a commit that referenced this issue May 10, 2024
Closes #15641

Applies patch suggested by @wence-

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15701
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants