Series[categorical] median raises, but DataFrame doesn't #21020

Closed
TomAugspurger opened this issue May 13, 2018 · 3 comments · Fixed by #37827
Labels
API - Consistency, API Design, Bug, Categorical, Nuisance Columns, Reduction Operations
Milestone: 1.2

Comments

@TomAugspurger
Contributor

This is a bit strange.

In [3]: pd.DataFrame({"A": pd.Categorical([1, 2, 2, 2, 3])})
Out[3]:
   A
0  1
1  2
2  2
3  2
4  3

In [4]: df = pd.DataFrame({"A": pd.Categorical([1, 2, 2, 2, 3])})

In [5]: df.median()
Out[5]:
A    2.0
dtype: float64

In [6]: df.A.median()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-15196eaacc89> in <module>()
----> 1 df.A.median()

~/sandbox/pandas/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
   9587                                       skipna=skipna)
   9588         return self._reduce(f, name, axis=axis, skipna=skipna,
-> 9589                             numeric_only=numeric_only)
   9590
   9591     return set_function_name(stat_func, name, cls)

~/sandbox/pandas/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   3220         return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
   3221                                 numeric_only=numeric_only,
-> 3222                                 filter_type=filter_type, **kwds)
   3223
   3224     def _reindex_indexer(self, new_index, indexer, copy):

~/sandbox/pandas/pandas/core/arrays/categorical.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   2065         if func is None:
   2066             msg = 'Categorical cannot perform the operation {op}'
-> 2067             raise TypeError(msg.format(op=name))
   2068         return func(numeric_only=numeric_only, **kwds)
   2069

TypeError: Categorical cannot perform the operation median

Does anyone know whether that's intentional?

@TomAugspurger added the API Design, Numeric Operations, and Categorical labels on May 13, 2018
@jreback
Contributor

jreback commented May 13, 2018

median is like min/max: it can only work on an ordered categorical.

DataFrame is probably coercing to object?
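
The min/max comparison can be checked directly. The snippet below is a minimal sketch, not taken from the thread: it assumes a reasonably recent pandas, and the variable names are illustrative. It shows that `min`/`max` raise on an unordered categorical but follow the declared category order on an ordered one.

```python
import pandas as pd

values = ["a", "b", "c", "a"]

# Same data, once without an order and once with an explicit category order.
unordered = pd.Series(pd.Categorical(values))
ordered = pd.Series(
    pd.Categorical(values, categories=["b", "c", "a"], ordered=True)
)

# min/max are rejected when the categories carry no order.
try:
    unordered.min()
    unordered_raised = False
except TypeError:
    unordered_raised = True

# With an order, min/max follow the category order, not the lexical order.
smallest = ordered.min()
largest = ordered.max()
```

Here `smallest` and `largest` come from the position of each value in `categories`, which is why "b" sorts before "a".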

@JarnoRFB
Contributor

JarnoRFB commented Jun 1, 2018

Apparently there is some coercion going on. If I do the same with object categories, df.median() just returns an empty Series.

import pandas as pd

cat_ordered_series = pd.Series(
    pd.Categorical(['a', 'b', 'c', 'a'], categories=['b', 'c', 'a'],
                   ordered=True)
)
# cat_ordered_series.median()  # raises TypeError
df = pd.DataFrame({'a': cat_ordered_series})
df.median()

Out: Series([], dtype: float64)

This really seems to be a little inconsistent. I think median calculations on unordered categories should always be omitted (when in DataFrames) or raise TypeError (when in Series). For ordered categories I am actually not sure, as the median can be computed, but is often not very meaningful. However, I think it is a bit confusing if categories are coerced to numbers.
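
To illustrate what "the median can be computed" could mean for ordered categories, here is a hedged sketch that works on the integer codes. `ordinal_median` is a hypothetical helper, not a pandas API, and taking the lower of the two middle values for even-length data is an assumption made here to avoid averaging categories.

```python
import numpy as np
import pandas as pd


def ordinal_median(series):
    """Hypothetical helper: lower median of an ordered categorical Series."""
    cat = series.cat
    if not cat.ordered:
        raise TypeError("median is undefined for unordered categories")
    codes = cat.codes.to_numpy()
    # Code -1 marks missing values; drop those, then sort by category order.
    codes = np.sort(codes[codes >= 0])
    if codes.size == 0:
        return np.nan
    # Pick the lower middle element so the result is always a real category.
    return cat.categories[codes[(codes.size - 1) // 2]]


cat_ordered_series = pd.Series(
    pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "a"],
                   ordered=True)
)
middle = ordinal_median(cat_ordered_series)
```

Returning an existing category keeps the result inside the categorical's own domain, unlike the float produced when categories are coerced to numbers.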

@mroeschke added the Bug label on Jun 28, 2020
@jbrockmendel added the API - Consistency, Reduction Operations, and Nuisance Columns labels on Sep 22, 2020
@jbrockmendel
Member

This is, surprisingly, the "correct" behavior with numeric_only=None. In DataFrame._reduce this goes through the frame_apply path and operates column-wise. When it operates on that column it ignores the TypeError that is raised, and you end up getting an empty result back.

You'll get the expected behavior (it also raises) if you call df.median(numeric_only=False).

xref #28900 on getting rid of this footgun.
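
A minimal sketch of the two explicit settings, assuming a pandas version that accepts `numeric_only` as a boolean on `DataFrame.median`. The behavior of the default (`numeric_only=None` at the time of this issue) differs between releases, so it is deliberately left out.

```python
import pandas as pd

df = pd.DataFrame({"A": pd.Categorical([1, 2, 2, 2, 3])})

# numeric_only=False: the categorical column is not silently dropped,
# so the TypeError from the column-level reduction propagates.
try:
    df.median(numeric_only=False)
    frame_raised = False
except TypeError:
    frame_raised = True

# numeric_only=True: the categorical column is excluded up front,
# which leaves nothing to reduce and yields an empty result.
numeric_part = df.median(numeric_only=True)
```

Passing the flag explicitly makes the choice between "raise" and "drop the column" visible at the call site.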

@jorisvandenbossche removed the Numeric Operations label on Sep 26, 2020
@jreback added this to the 1.2 milestone on Nov 11, 2020