[BUG] don't mangle null-objects in value_counts #42743
asv shows what the cost of mangling null-like values would be:
@@ -133,12 +133,24 @@ class ValueCounts:
    param_names = ["N", "dtype"]

    def setup(self, N, dtype):
        self.s = Series(np.random.randint(0, N, size=10 * N)).astype(dtype)
        self.s = Series(np.random.randint(0, N, size=10 * N)).astype("object")
umm, isn't this defeating the purpose, e.g. why did this change
Thanks, it was an unintentional change...
@@ -774,8 +774,8 @@ def test_label_indexing_on_nan(self):
        # GH 32431
        df = Series([1, "{1,2}", 1, None])
        vc = df.value_counts(dropna=False)
        result1 = vc.loc[np.nan]
        result2 = vc[np.nan]
        result1 = vc.loc[None]
can you parameterize this test to also include np.nan for indexing
Done, see c8f76c7
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
doc/source/whatsnew/v1.4.0.rst
Outdated
@@ -504,6 +504,7 @@ Missing
^^^^^^^
- Bug in :meth:`DataFrame.fillna` with limit and no method ignores axis='columns' or ``axis = 1`` (:issue:`40989`)
- Bug in :meth:`DataFrame.fillna` not replacing missing values when using a dict-like ``value`` and duplicate column names (:issue:`43476`)
- :meth:`Series.value_counts` and :meth:`Series.mode` no longer coerce ``None``, ``NaT`` and other null-values to a NaN-value for ``np.object``-dtype. This behavior is now consistent with ``unique``, ``isin`` and others (:issue:`42688`)
Probably better in the notable bug fix section since the prior behavior was intentional at one point
pandas/tests/libs/test_hashtable.py
Outdated
# pd.Na and np.nan will have the same representative: np.nan
# thus we have 2 nans and 1 True
# GH42688, nans aren't mangled
values = np.array([True, pd.NA, np.nan, pd.NA, np.nan], dtype=np.object_)
Could you add pd.NaT?
Just a few comments but overall looks good. Sorry this went unreviewed for a bit, so if you could merge master that'd be great.
@mroeschke some tests are failing, but it looks as if it is not due to this PR.
thanks @realead very nice!
Null-like values are no longer mangled in `value_counts`. This was overlooked in #22296, when the decision was made not to mangle null-like values (np.nan, None, pd.NaT and so on).
It also has an impact on `mode`, as it uses `value_counts` under the hood.
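
To make "mangling" concrete, here is a minimal pure-Python sketch of the two counting strategies. It is only an illustration of the idea, not how pandas implements `value_counts`: the helper names `is_null_like`, `count_mangled` and `count_unmangled` are hypothetical, and only `None` and float NaN are treated as null-like (assumption; `pd.NaT` and `pd.NA` are left out to keep the sketch dependency-free).

```python
import math
from collections import Counter

# One shared NaN object: NaN != NaN, so dict lookups only match it by identity.
NAN = float("nan")


def is_null_like(value):
    """Hypothetical helper: treat None and float NaN as null-like."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_mangled(values):
    """Old idea: collapse every null-like value onto a single NaN key."""
    return Counter(NAN if is_null_like(v) else v for v in values)


def count_unmangled(values):
    """New idea: count values as they are, so None and NaN keep separate keys."""
    return Counter(values)


values = [1, None, NAN, None, 1, "a"]
mangled = count_mangled(values)
unmangled = count_unmangled(values)

print(mangled[NAN])                      # None and NaN share one bucket
print(unmangled[None], unmangled[NAN])   # None and NaN are counted separately
```

In the mangled variant the information that two of the three null-likes were `None` is lost, which is the inconsistency with `unique`, `isin` and others that this PR addresses.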