API: What is the rationale for numeric_only of Categorical reductions? #25303

jorisvandenbossche · 2019-02-13T13:36:51Z

Consider an ordered Categorical with missing values:

In [32]: cat = pd.Categorical(['a', np.nan, 'b', 'a'], ordered=True)

In [33]: cat.min()
Out[33]: nan

In [34]: cat.max()
Out[34]: 'b'

In [35]: cat.min(numeric_only=True)
Out[35]: 'a'

In [36]: cat.max(numeric_only=True)
Out[36]: 'b'

In [37]: cat.min(numeric_only=False)
Out[37]: nan

In [38]: cat.max(numeric_only=False)
Out[38]: 'b'

So from the observation above, and from the code (pandas/pandas/core/arrays/categorical.py, line 2199 in a89e19d):

good = self._codes != -1

it seems that numeric_only means that only the actual categories should be considered, and not the missing values (so only codes that are not -1).
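To make the role of the -1 code concrete, here is a minimal sketch that reduces over the integer codes by hand. It is a simplified model of the behaviour described above, not the actual pandas implementation; the `decode` helper is hypothetical and exists only for illustration.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", np.nan, "b", "a"], ordered=True)

# Each value is stored as an integer position into cat.categories;
# missing values are stored as the code -1.
codes = cat.codes
categories = list(cat.categories)


def decode(code):
    """Hypothetical helper: map a code back to its category, or NaN for -1."""
    return np.nan if code == -1 else categories[code]


# Reducing over the raw codes: -1 sorts below every real category,
# so it always wins min() and never wins max().
raw_min = decode(codes.min())
raw_max = decode(codes.max())

# Masking out -1 first (the `good = self._codes != -1` step) restricts
# the reduction to actual categories.
good = codes != -1
masked_min = decode(codes[good].min())
masked_max = decode(codes[good].max())
```

Here `raw_min` is NaN while `masked_min` is 'a'; both maxima are 'b', which matches the asymmetry in the session above.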

This struck me as strange, for the following reasons:

The fact that -1 is used as the code for missing data is an implementation detail, yet it now determines the min/max behaviour: a missing value is always the minimum, but never the maximum (unless there are only missing values).

This behaviour differs from the default for other data types in pandas, which skip missing values by default:

In [1]: s = pd.Series([1, np.nan, 2, 1])  

In [2]: s.min()
Out[2]: 1.0

In [3]: s.astype(pd.CategoricalDtype(ordered=True)).min()
Out[3]: nan

In [5]: s.min(skipna=False)
Out[5]: nan

The keyword in pandas to determine whether NaNs should be skipped or not for reductions is skipna=True/False, not numeric_only (this also means the skipna keyword for categorical series is broken / has no effect).
Apart from that, the name "numeric_only" is a strange choice for this meaning (and it is not documented).

The numeric_only keyword in the reduction methods of DataFrame actually means something entirely different: whether full columns should be excluded from the result based on their dtype.

In [63]: cat = pd.Categorical(['a', np.nan, 'b', 'a'], ordered=True)

In [64]: pd.Series(cat).min(numeric_only=True)
Out[64]: 'a'

In [65]: pd.DataFrame({'cat': cat}).min(numeric_only=True)
Out[65]: Series([], dtype: float64)
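The column-dropping meaning of numeric_only on DataFrame is easier to see with a frame that mixes dtypes. A small sketch (the column names "n" and "s" are made up for this example):

```python
import pandas as pd

# One integer column and one string (object) column.
df = pd.DataFrame({"n": [3, 1, 2], "s": ["x", "y", "z"]})

# numeric_only=True drops the non-numeric column "s" entirely and
# reduces only what is left; it does nothing about missing values.
numeric_result = df.min(numeric_only=True)
kept_columns = list(numeric_result.index)
```

Only "n" survives in `kept_columns`, so here the keyword selects columns rather than deciding how NaNs are treated.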

From the above list, I don't see a good reason for having numeric_only=False 1) as the default behaviour and 2) as an option at all (instead of skipna). But it seems this has been the behaviour more or less since Categoricals were first introduced.

Am I missing something?
Is there a reason we don't skip NaNs by default for Categorical?

Would it be an idea to deprecate numeric_only in favor of skipna and deprecate the default?

cc @jreback @jankatins


WillAyd · 2019-02-14T06:24:27Z

To be honest I've never fully understood what numeric_only is supposed to represent, even in other parts of the codebase (e.g. groupby ops). That is, I understand it conceptually, but I'm not sure all of its impacts and nuances are well communicated.

I haven't run into the issue described personally so not sure how much weight my opinion should carry here, but I would be OK with the deprecation you mention

arnov · 2019-02-14T08:50:53Z

All valid points!

I think the difference in behavior of min and max when NaNs are present is arbitrary. I would expect both min and max to return NaN, not only min. I don't know what the behavior is for other Series / DataFrames, so let's check that first and keep it consistent.
My thinking was: since numeric_only seems to be broken in 0.24.0, why not move to skipna right away, without first fixing and deprecating numeric_only?
Another option would be to support both temporarily (with one taking precedence over the other) and deprecate numeric_only; however, their defaults differ (skipna defaults to True, while numeric_only defaults to None).

jorisvandenbossche · 2019-02-14T09:00:27Z

I would expect both min and max to return nan, not only min

To be consistent with the rest of pandas, I think they should both skip NaNs by default, and both return NaN if you pass skipna=False.
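The behaviour proposed here can be checked directly. This sketch assumes a pandas version that already includes the change that closed this issue (#27929, milestone 1.0 according to the timeline); on older versions the results differ, as shown at the top of the thread.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", np.nan, "b", "a"], ordered=True)

# Default: missing values are skipped, as for other dtypes.
skip_min = cat.min()
skip_max = cat.max()

# skipna=False: any missing value makes both reductions return NaN.
keep_min = cat.min(skipna=False)
keep_max = cat.max(skipna=False)
```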

I think the practical deprecation path could look like:

if somebody specifies numeric_only (for the categorical case), raise a deprecation warning saying they need to specify skipna instead.
in the case of min and if there are NaNs present (the case where the default behaviour would change), raise a warning that this will change, saying that they can specify skipna to silence the warning and already get the future behaviour.

So that would indeed mean supporting both keywords temporarily.
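The two-step path above could be sketched roughly as follows. This is a standalone illustration, not the code from the PR: `categorical_min` is a hypothetical function operating on plain integer codes (with -1 for missing), and the warning messages and the choice that skipna takes precedence over numeric_only are assumptions.

```python
import math
import warnings


def categorical_min(codes, categories, skipna=None, numeric_only=None):
    """Hypothetical min() over integer codes, where -1 marks a missing value."""
    if numeric_only is not None:
        # Step 1: the old keyword keeps working but points users to skipna.
        warnings.warn(
            "numeric_only is deprecated; use skipna instead",
            FutureWarning,
            stacklevel=2,
        )
        if skipna is None:  # assumption: an explicit skipna takes precedence
            skipna = bool(numeric_only)

    has_missing = any(code == -1 for code in codes)
    if skipna is None:
        # Step 2: nothing was specified. Keep the old result for now, but
        # warn in the one case where the future default would change it.
        if has_missing:
            warnings.warn(
                "the default will change to skip missing values; "
                "pass skipna explicitly to silence this warning",
                FutureWarning,
                stacklevel=2,
            )
        skipna = False

    if has_missing and not skipna:
        return math.nan
    valid = [code for code in codes if code != -1]
    return categories[min(valid)] if valid else math.nan


codes, categories = [0, -1, 1, 0], ["a", "b"]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_result = categorical_min(codes, categories)               # warns
    explicit_result = categorical_min(codes, categories, skipna=True)  # silent
    legacy_result = categorical_min(codes, categories, numeric_only=True)  # warns
```

Note that this sketch relies on skipna defaulting to None so that "not specified" can be told apart from an explicit True.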

arnov · 2019-02-14T11:14:19Z

I've updated the PR.

This will keep the old behavior if numeric_only is specified, but it will change the default behavior if it is not specified. This is a bit harder to fix, since the default for skipna is True, so it's impossible to distinguish if it was set or not. See https://github.com/pandas-dev/pandas/blob/master/pandas/core/series.py#L3664 (it's already converted from None to True here https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L10950)

jorisvandenbossche · 2019-02-14T23:05:23Z

This is a bit harder to fix, since the default for skipna is True, so it's impossible to distinguish if it was set or not

Ah, yes, that's a complication. We might be able to push the conversion from None to True a bit further down into _reduce, where we could special-case the categorical one (as you are already doing in the PR to deal with numeric_only/skipna).
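Why the location of the None-to-True conversion matters can be shown with a toy model. All three functions below are hypothetical stand-ins, not the pandas internals linked above; `_inner` simply reports what it is still able to tell about the caller's intent.

```python
def _inner(skipna):
    """Hypothetical inner reduction: returns (effective skipna, explicitly set?)."""
    explicitly_set = skipna is not None
    effective = True if skipna is None else skipna
    return effective, explicitly_set


def reduce_early(skipna=None):
    """Outer layer converts None to True before the inner layer sees it."""
    if skipna is None:
        skipna = True
    return _inner(skipna)


def reduce_late(skipna=None):
    """Outer layer forwards skipna untouched; the inner layer applies the default."""
    return _inner(skipna)


early = reduce_early()  # the inner layer cannot tell the default from an explicit True
late = reduce_late()    # the inner layer still knows nothing was passed
```

With the late conversion, a categorical-specific branch would still be able to decide whether to emit the "default will change" warning.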

jreback · 2019-02-16T16:36:54Z

It is probably ok to remove / deprecate this for Categorical only; the PR #25304 is not acceptable regardless of this change.

jorisvandenbossche · 2019-02-16T17:57:13Z

ok to remove / deprecate this for only Categorical

It is only implemented for Categorical, no other array has this keyword, so no problem to only deprecate it there. The numeric_only keyword for DataFrame reductions is basically unrelated (except for the fact that it has the same name).

jorisvandenbossche added the API Design, Categorical (Categorical Data Type), and Needs Discussion (Requires discussion from core team before further action) labels Feb 13, 2019

jorisvandenbossche added this to the 0.25.0 milestone Feb 13, 2019

This was referenced Feb 13, 2019

Minimum of ordered categorical data in Panda DataFrames #25299

Closed

BUG: Fix passing of numeric_only argument for categorical reduce #25304

Merged

jreback modified the milestones: 0.25.0, 2.0 Apr 20, 2019

makbigc mentioned this issue Aug 15, 2019

API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max #27929

Merged

jreback modified the milestones: 2.0, 1.0 Aug 15, 2019

jorisvandenbossche closed this as completed in #27929 Dec 2, 2019
