Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: What is the rationale for numeric_only of Categorical reductions? #25303

Closed
jorisvandenbossche opened this issue Feb 13, 2019 · 7 comments · Fixed by #27929
Closed

API: What is the rationale for numeric_only of Categorical reductions? #25303

jorisvandenbossche opened this issue Feb 13, 2019 · 7 comments · Fixed by #27929
Labels
API Design Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Consider an ordered Categorical with missing values:

In [32]: cat = pd.Categorical(['a', np.nan, 'b', 'a'], ordered=True)

In [33]: cat.min()
Out[33]: nan

In [34]: cat.max()
Out[34]: 'b'

In [35]: cat.min(numeric_only=True)
Out[35]: 'a'

In [36]: cat.max(numeric_only=True)
Out[36]: 'b'

In [37]: cat.min(numeric_only=False)
Out[37]: nan

In [38]: cat.max(numeric_only=False)
Out[38]: 'b'

So from the observation above (and from the code:

good = self._codes != -1
), it seems that numeric_only means that only the actual categories should be considered, and not the missing values (so codes that are not -1).

This struck me as strange, for the following reasons:

  • The fact that -1 is used as the code for missing data is rather an implementation detail, but now actually determines min/max behaviour (missing value is always the minimum, but never the maximum, unless there are only missing values)

  • This behaviour is different than the default for other data types in pandas, which is skipping missing values by default:

    In [1]: s = pd.Series([1, np.nan, 2, 1])  
    
    In [2]: s.min()
    Out[2]: 1.0
    
    In [3]: s.astype(pd.CategoricalDtype(ordered=True)).min()
    Out[3]: nan
    
    In [5]: s.min(skipna=False)
    Out[5]: nan
    
  • The keyword in pandas to determine whether NaNs should be skipped or not for reductions is skipna=True/False, not numeric_only (this also means the skipna keyword for categorical series is broken / has no effect).
    Apart from that, the name "numeric_only" is also strange to me to mean this (and is also not documented).

  • The numeric_only keyword in reductions methods of DataFrame actually means something entirely different: should full columns be excluded from the result based on their dtype.

    In [63]: cat = pd.Categorical(['a', np.nan, 'b', 'a'], ordered=True)
    
    In [64]: pd.Series(cat).min(numeric_only=True)
    Out[64]: 'a'
    
    In [65]: pd.DataFrame({'cat': cat}).min(numeric_only=True)
    Out[65]: Series([], dtype: float64)
    

From the above list, I don't see a good reason for having numeric_only=False as 1) the default behaviour and 2) altogether as an option (instead of skipna). But it seems this was implemented rather from the beginning that Categoricals were introduced.

Am I missing something?
Is there a reason we don't skip NaNs by default for Categorical?

Would it be an idea to deprecate numeric_only in favor of skipna and deprecate the default?

cc @jreback @jankatins

@jorisvandenbossche jorisvandenbossche added API Design Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action labels Feb 13, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone Feb 13, 2019
@WillAyd
Copy link
Member

WillAyd commented Feb 14, 2019

To be honest I've never fully understood what numeric_only is supposed to represent even in other parts of the codebase (ex: groupby ops). That is, conceptually I understand it but not sure all the impacts and nuances to it are well communicated.

I haven't run into the issue described personally so not sure how much weight my opinion should carry here, but I would be OK with the deprecation you mention

@arnov
Copy link
Contributor

arnov commented Feb 14, 2019

All valid points!

  • I think the difference in behavior of min and max when nans are present is arbitrary. I would expect both min and max to return nan, not only min. I don't know what the behavior is for other Series / Dataframes, so let's check that first and keep it consistent
  • My thinking was, since numeric_only seems to be broken in 0.24.0, why not move already to skipna without fixing and deprecating numeric_only
  • Another option would be to support them both temporarily (and make one have precedence over the other), and deprecate numric_only, however the defaults behavior is different (skipna defaults to True, while numeric_only defaults to None)

@jorisvandenbossche
Copy link
Member Author

I would expect both min and max to return nan, not only min

To be consistent with the rest of pandas, I think they both should skip NaNs by default, but if you say skipna=False, both to return NaN.

I think the the practical deprecation path could look like:

  • if somebody specifies numeric_only (for the categorical case), raise a deprecation warning to say they need to specify skipna instead.
  • in case of min ànd if there are NaNs present (the case that would change default behaviour), raise a warning that this will change and say that they can specify skipna to silence the warning and already have the future behaviour.

So that would indeed mean supporting both keywords temporarily.

@arnov
Copy link
Contributor

arnov commented Feb 14, 2019

I've updated the PR.

This will keep the old behavior if numeric_only is specified, but it will change the default behavior if it is not specified. This is a bit harder to fix, since the default for skipna is True, so it's impossible to distinguish if it was set or not. See https://github.com/pandas-dev/pandas/blob/master/pandas/core/series.py#L3664 (it's already converted from None to True here https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L10950)

@jorisvandenbossche
Copy link
Member Author

This is a bit harder to fix, since the default for skipna is True, so it's impossible to distinguish if it was set or not

Ah, yes, that's a complication. We might be able to push the conversion from None to True a bit down into _reduce, where we could special case the categorical one (as you are already doing in the PR to deal with the numeric/only/skipna)

@jreback
Copy link
Contributor

jreback commented Feb 16, 2019

This is probably ok to remove / deprecate this for only Categorical; the PR #25304 is not acceptable regardless of this change.

@jorisvandenbossche
Copy link
Member Author

ok to remove / deprecate this for only Categorical

It is only implemented for Categorical, no other array has this keyword, so no problem to only deprecate it there. The numeric_only keyword for DataFrame reductions is basically unrelated (except for the fact that it has the same name).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action
Projects
None yet
4 participants