ENH: Generalize groupby to better support ExtensionArray #53904

MichaelTiemannOSC · 2023-06-28T11:11:09Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

I have written changes to add uncertainties to Pint (hgrecco/pint#1615) and Pint-Pandas (hgrecco/pint-pandas#140). New developments in Pint and Pint-Pandas now deeply embrace the ExtensionArray API (which I also encouraged), but it's now causing my changes grief.

The uncertainties package uses wrapping functions to interoperate with floats and NumPy (https://pythonhosted.org/uncertainties/index.html). The uncertainty datatype is <class 'uncertainties.core.AffineScalarFunc'>, which is not hashable. I have largely been able to work around this within the EA framework, but I'm stuck on how to make them work with groupby and related. I wonder whether the groupby functionality can be generalized to work better with unhashable EA types.

Feature Description

Here's an example of a small change that allows my EA type to interoperate with groupby. Specifically, it does not force the assumption that a NaN value is np.nan, but is whatever value isna says is a NaN value. In the case of uncertainties, it's typically ufloat(np.nan, 0), but it could be a UFloat with either a np.nan nominal value or np.nan error value, or both.

diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index 1a17fef071..98e9c53c37 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -3080,7 +3080,10 @@ class GroupBy(BaseGroupBy[NDFrameT]):
                 """Helper function for first item that isn't NA."""
                 arr = x.array[notna(x.array)]
                 if not len(arr):
-                    return np.nan
+                    nan_arr = x.array[isna(x.array)]
+                    if not len(nan_arr):
+                        return np.nan
+                    return nan_arr[0]
                 return arr[0]
 
             if isinstance(obj, DataFrame):

But here's the really sticky problem:

diff --git a/pandas/core/groupby/ops.py b/pandas/core/groupby/ops.py
index f0e4484f69..8b7f8e1aee 100644
--- a/pandas/core/groupby/ops.py
+++ b/pandas/core/groupby/ops.py
@@ -587,7 +587,7 @@ class BaseGrouper:
 
     def get_iterator(
         self, data: NDFrameT, axis: AxisInt = 0
-    ) -> Iterator[tuple[Hashable, NDFrameT]]:
+    ) -> Iterator[tuple[Hashable, NDFrameT]]:  # Does not work with non-hashable EA types
         """
         Groupby iterator

In the PintArray world (the ExtensionArray implemented in PintPandas) I've been able to make factorize functionality work independently of any Pandas changes, but the factorized results don't survive subsequent groupby actions (that come from splitting). And that's where I'm stuck.

@andrewgsavage @rhshadrach @lebigot @hgrecco

Alternative Solutions

If the Pandas test framework could xfail unhashable EA types for groupby tests, that might be a workaround acceptable workaround (need to check with Pint and Pint-Pandas maintainers).

Additional Context

No response

The text was updated successfully, but these errors were encountered:

MichaelTiemannOSC · 2023-06-28T13:51:22Z

Adding @jbrockmendel, who seems to know a lot about the Pandas EA world.

lebigot · 2023-06-28T16:22:43Z

If this can help, I'm wondering if uncertainties.core.AffineScalarFunc cannot be made hashable.

I would have to go back to the code I wrote 14 years ago, so I cannot answer immediately. I just had a superficial look and I'm wondering how much LinearCombination.expand() might make an AffineScalarFunc un-hashable (from memory, this is used for optimization purposes and can effectively modify the AffineScalarFunc).

jbrockmendel · 2023-06-28T16:24:05Z

@MichaelTiemannOSC can you give an example with traceback of what goes wrong?

MichaelTiemannOSC · 2023-06-28T17:09:59Z

Here's a traceback of the first failure:

=================================================================================================================================== FAILURES ===================================================================================================================================
_____________________________________________________________________________________________________________ TestGroupby.test_groupby_extension_transform[float] ______________________________________________________________________________________________________________

self = <pint_pandas.testsuite.test_pandas_extensiontests.TestGroupby object at 0x7fbae0080490>
data_for_grouping = <PintArray>
[4294967297.0+/-0, 4294967297.0+/-0,          nan+/-0,          nan+/-0,
          1.0+/-0,          1.0+/-0, 4294967297.0+/-0, 4294967306.0+/-0]
Length: 8, dtype: pint[meter]

    def test_groupby_extension_transform(self, data_for_grouping):
        valid = data_for_grouping[~data_for_grouping.isna()]
        df = pd.DataFrame({"A": [1, 1, 3, 3, 1, 4], "B": valid})
    
>       result = df.groupby("B").A.transform(len)

../pandas-dev/pandas/tests/extension/base/groupby.py:97: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../pandas-dev/pandas/core/groupby/generic.py:507: in transform
    return self._transform(
../pandas-dev/pandas/core/groupby/groupby.py:1823: in _transform
    return self._transform_general(func, engine, engine_kwargs, *args, **kwargs)
../pandas-dev/pandas/core/groupby/generic.py:546: in _transform_general
    object.__setattr__(group, "name", name)
../pandas-dev/pandas/core/series.py:701: in name
    validate_all_hashable(value, error_name=f"{type(self).__name__}.name")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

error_name = 'Series.name', args = (<Quantity(   1.0+/-0, 'meter')>,)

    def validate_all_hashable(*args, error_name: str | None = None) -> None:
        """
        Return None if all args are hashable, else raise a TypeError.
    
        Parameters
        ----------
        *args
            Arguments to validate.
        error_name : str, optional
            The name to use if error
    
        Raises
        ------
        TypeError : If an argument is not hashable
    
        Returns
        -------
        None
        """
        if not all(is_hashable(arg) for arg in args):
            if error_name:
>               raise TypeError(f"{error_name} must be a hashable type")
E               TypeError: Series.name must be a hashable type

../pandas-dev/pandas/core/dtypes/common.py:1587: TypeError
=============================================================================================================================== warnings summary ===============================================================================================================================
pint_pandas/testsuite/test_pandas_extensiontests.py: 111 warnings
  /Users/michael/Documents/GitHub/MichaelTiemannOSC/pint-pandas/pint_pandas/pint_array.py:492: FutureWarning: pd.api.extensions.take accepting non-standard inputs is deprecated and will raise in a future version. Pass either a numpy.ndarray, ExtensionArray, Index, or Series instead.
    result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)

pint_pandas/testsuite/test_pandas_extensiontests.py::TestGetitem::test_getitem_series_integer_with_missing_raises[float-integer-array]
pint_pandas/testsuite/test_pandas_extensiontests.py::TestGetitem::test_getitem_series_integer_with_missing_raises[int-integer-array]
pint_pandas/testsuite/test_pandas_extensiontests.py::TestGetitem::test_getitem_series_integer_with_missing_raises[complex128-integer-array]
  /Users/michael/Documents/GitHub/MichaelTiemannOSC/pandas-dev/pandas/tests/extension/base/getitem.py:277: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
    ser[idx]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================================================================================== short test summary info ============================================================================================================================
FAILED pint_pandas/testsuite/test_pandas_extensiontests.py::TestGroupby::test_groupby_extension_transform[float] - TypeError: Series.name must be a hashable type
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! _pytest.outcomes.Exit: Quitting debugger !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=========================================================================================================== 1 failed, 220 passed, 6 xfailed, 114 warnings in 17.73s ============================================================================================================

jbrockmendel · 2023-06-28T17:17:47Z

Thanks. Well the good news is that the line object.__setattr__(group, "name", name) in _transform_general is one I've wanted to get rid of for a while (xref #41090). The bad news is that something similar is going to happen any time you put non-hashable items in a pd.Index, and it is the same for numpy dtypes, not specific to EA. We might be able to play whack-a-mole with them, and I'll try to help with that if its a path you want to go down. But if there's any way to make these hashable, I'd really encourage that.

Fixes include: * Factorization * NaN handling (several more issues still need to be resolved) * Proper unit declarations in test_offset_concat * Integration of new `numeric_dtype` parameter A major outstanding issue (presently being discussed as pandas-dev/pandas#53904) concerns whether we can make AffineScalarFunc hashable and/or whether other legacy Pandas code (which has been deprecated) can be further removed. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>

MichaelTiemannOSC added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 28, 2023

lithomas1 added Groupby Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 28, 2023

mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Info Clarification about behavior needed to assess issue labels Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Generalize groupby to better support ExtensionArray #53904

ENH: Generalize groupby to better support ExtensionArray #53904

MichaelTiemannOSC commented Jun 28, 2023 •

edited

Loading

MichaelTiemannOSC commented Jun 28, 2023

lebigot commented Jun 28, 2023

jbrockmendel commented Jun 28, 2023

MichaelTiemannOSC commented Jun 28, 2023

jbrockmendel commented Jun 28, 2023

ENH: Generalize groupby to better support ExtensionArray #53904

ENH: Generalize groupby to better support ExtensionArray #53904

Comments

MichaelTiemannOSC commented Jun 28, 2023 • edited Loading

Feature Type

Problem Description

Feature Description

Alternative Solutions

Additional Context

MichaelTiemannOSC commented Jun 28, 2023

lebigot commented Jun 28, 2023

jbrockmendel commented Jun 28, 2023

MichaelTiemannOSC commented Jun 28, 2023

jbrockmendel commented Jun 28, 2023

MichaelTiemannOSC commented Jun 28, 2023 •

edited

Loading