Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] don't mangle null-objects in value_counts #42743

Merged
merged 11 commits into from
Nov 12, 2021

Conversation

realead
Copy link
Contributor

@realead realead commented Jul 27, 2021

Null-like values are no longer mangled in value_counts.

This was overlooked in #22296, as decision was made not to mangle null-like values (np.nan, None, pd.NaT and so on).

It also has impact on mode, as it uses value_counts under the hood.

@realead
Copy link
Contributor Author

realead commented Jul 27, 2021

asv shows, what cost mangling of null-like values would have:

       before           after         ratio
     [9731fd07]       [d5454938]
-         174±2ms          138±2ms     0.79  series_methods.ValueCountsObjectDropNAFalse.time_value_counts(100000)
-     1.16±0.01ms         916±20μs     0.79  series_methods.ValueCountsObjectDropNAFalse.time_value_counts(1000)
-         144±2ms        111±0.6ms     0.77  series_methods.ModeObjectDropNAFalse.time_mode(100000)
-      7.03±0.1ms      4.40±0.05ms     0.63  series_methods.ModeObjectDropNAFalse.time_mode(10000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@jreback jreback added this to the 1.4 milestone Jul 28, 2021
@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Jul 28, 2021
@@ -133,12 +133,24 @@ class ValueCounts:
param_names = ["N", "dtype"]

def setup(self, N, dtype):
self.s = Series(np.random.randint(0, N, size=10 * N)).astype(dtype)
self.s = Series(np.random.randint(0, N, size=10 * N)).astype("object")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm, isn't this defeating the purpose, e.g. why did this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it was an unintentional change...

asv_bench/benchmarks/series_methods.py Show resolved Hide resolved
@@ -774,8 +774,8 @@ def test_label_indexing_on_nan(self):
# GH 32431
df = Series([1, "{1,2}", 1, None])
vc = df.value_counts(dropna=False)
result1 = vc.loc[np.nan]
result2 = vc[np.nan]
result1 = vc.loc[None]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this test to also include np.nan for indexing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, see c8f76c7

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@@ -504,6 +504,7 @@ Missing
^^^^^^^
- Bug in :meth:`DataFrame.fillna` with limit and no method ignores axis='columns' or ``axis = 1`` (:issue:`40989`)
- Bug in :meth:`DataFrame.fillna` not replacing missing values when using a dict-like ``value`` and duplicate column names (:issue:`43476`)
- :meth:`Series.value_counts` and :meth:`Series.mode` no longer coerce ``None``, ``NaT`` and other null-values to a NaN-value for ``np.object``-dtype. This behavior is now consistent with ``unique``, ``isin`` and others (:issue:`42688`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better in the notable bug fix section since the prior behavior was intentional at one point

# pd.Na and np.nan will have the same representative: np.nan
# thus we have 2 nans and 1 True
# GH42688, nans aren't mangled
values = np.array([True, pd.NA, np.nan, pd.NA, np.nan], dtype=np.object_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add pd.NaT?

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments but overall looks good. Sorry this went unreviewed for a but so if you could merge master that'd be great.

@realead
Copy link
Contributor Author

realead commented Nov 10, 2021

@mroeschke some tests are failing, but it looks as if it is not due to this PR.

@mroeschke mroeschke removed the Stale label Nov 12, 2021
@jreback jreback merged commit 2d3644c into pandas-dev:master Nov 12, 2021
@jreback
Copy link
Contributor

jreback commented Nov 12, 2021

thanks @realead very nice!

nickleus27 pushed a commit to nickleus27/pandas that referenced this pull request Nov 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: value_counts and nunique behave differently for NaN and None with dropna=False
3 participants