API: What is the rationale for numeric_only of Categorical reductions? #25303
Comments
To be honest I've never fully understood what I haven't run into the issue described personally so not sure how much weight my opinion should carry here, but I would be OK with the deprecation you mention
All valid points!
To be consistent with the rest of pandas, I think they both should skip NaNs by default, but if you say I think the practical deprecation path could look like:
So that would indeed mean supporting both keywords temporarily.
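As a rough illustration of what "supporting both keywords temporarily" might involve, here is a minimal sketch in plain Python. It is not the pandas implementation: the helper name `resolve_skipna`, the warning messages, and the choice to keep the old default when neither keyword is given are all assumptions made for the example.

```python
import warnings


def resolve_skipna(skipna=None, numeric_only=None):
    """Hypothetical helper: fold the old `numeric_only` keyword into `skipna`.

    Returns True if missing values should be skipped, False otherwise.
    """
    if numeric_only is not None:
        # Old keyword still accepted during the transition, but it warns.
        warnings.warn(
            "'numeric_only' is deprecated; use 'skipna' instead",
            FutureWarning,
            stacklevel=2,
        )
        if skipna is not None:
            raise TypeError("pass either 'skipna' or 'numeric_only', not both")
        return bool(numeric_only)
    if skipna is None:
        # Assumption: keep the old default (do not skip) for now, and warn
        # that the default is going to change.
        warnings.warn(
            "the default will change to skipping missing values; "
            "pass 'skipna' explicitly to silence this warning",
            FutureWarning,
            stacklevel=2,
        )
        return False
    return bool(skipna)
```

With a helper like this, callers who pass `skipna` explicitly see no warning, while callers who rely on the old keyword or the old default are told what will change.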
I've updated the PR. This will keep the old behavior if
Ah, yes, that's a complication. We might be able to push the conversion from None to True a bit down into
This is probably ok to remove / deprecate this for only
It is only implemented for Categorical, no other array has this keyword, so no problem to only deprecate it there. The
Consider an ordered Categorical with missing values:
So from the observation above (and from the code: pandas/pandas/core/arrays/categorical.py, line 2199 in a89e19d), `numeric_only` means that only the actual categories should be considered, and not the missing values (so codes that are not -1).

This struck me as strange, for the following reasons:
- The fact that -1 is used as the code for missing data is rather an implementation detail, but now actually determines min/max behaviour (the missing value is always the minimum, but never the maximum, unless there are only missing values).
- This behaviour is different from the default for other data types in pandas, which is skipping missing values by default:
- The keyword in pandas to determine whether NaNs should be skipped or not for reductions is `skipna=True/False`, not `numeric_only` (this also means the `skipna` keyword for categorical series is broken / has no effect).
- Apart from that, the name "numeric_only" is also strange to me to mean this (and is also not documented).
- The `numeric_only` keyword in reduction methods of DataFrame actually means something entirely different: should full columns be excluded from the result based on their dtype.

From the above list, I don't see a good reason for having `numeric_only=False` as 1) the default behaviour and 2) an option at all (instead of `skipna`). But it seems this has been the implementation more or less since Categoricals were first introduced.

Am I missing something?
- Is there a reason we don't skip NaNs by default for Categorical?
- Would it be an idea to deprecate `numeric_only` in favor of `skipna` and deprecate the default?

cc @jreback @jankatins
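
To make the first reason above concrete, here is a small sketch in plain Python of a min/max reduction that operates on integer codes with -1 as the missing-value sentinel. It only models the behaviour described in this issue and is not the pandas code; the helper names `encode` and `reduce_codes` and the `skip_missing` flag are made up for the illustration.

```python
import math

MISSING = -1  # sentinel code for a missing value, as described above


def encode(values, categories):
    """Map each value to its position in `categories`; None becomes MISSING."""
    lookup = {cat: i for i, cat in enumerate(categories)}
    return [MISSING if v is None else lookup[v] for v in values]


def reduce_codes(codes, categories, func, skip_missing):
    """Apply `func` (min or max) to the codes and decode the result."""
    if skip_missing:
        codes = [c for c in codes if c != MISSING]
    if not codes:
        return math.nan
    best = func(codes)
    # A result of MISSING decodes to NaN rather than to a category.
    return math.nan if best == MISSING else categories[best]


categories = ["a", "b", "c"]  # ordered: a < b < c
codes = encode(["b", None, "a", "c"], categories)

# Without skipping, the sentinel -1 sorts below every real code.
min_incl = reduce_codes(codes, categories, min, skip_missing=False)  # NaN
max_incl = reduce_codes(codes, categories, max, skip_missing=False)  # "c"

# Skipping the sentinel gives the smallest actual category.
min_skip = reduce_codes(codes, categories, min, skip_missing=True)   # "a"
```

Because the sentinel is smaller than any valid code, the missing value wins every `min` but loses every `max` unless all values are missing, which is the asymmetry the first bullet points out.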