[BUG] don't mangle null-objects in value_counts #42743

realead · 2021-07-27T05:38:16Z

closes BUG: value_counts and nunique behave differently for NaN and None with dropna=False #42688
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Null-like values are no longer mangled in value_counts.

This was overlooked in #22296, as the decision was made not to mangle null-like values (np.nan, None, pd.NaT and so on).

It also affects mode, which uses value_counts under the hood.
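As a rough illustration of the behavior described above, the following sketch counts an object-dtype Series that holds both `None` and `np.nan`. It assumes pandas 1.4 or later (the milestone of this PR); the sample data and the variable names are made up for the example and are not taken from the PR.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: an object-dtype Series with two distinct null-like objects.
s = pd.Series([None, None, np.nan, "a"], dtype=object)

# dropna=False keeps null-like values in the result.
counts = s.value_counts(dropna=False)

# Tally the counts per kind of null label; with this PR, None and NaN
# are expected to remain separate index labels instead of being merged.
none_count = sum(c for label, c in counts.items() if label is None)
nan_count = sum(
    c for label, c in counts.items() if isinstance(label, float) and np.isnan(label)
)

# mode builds on the same counting, so it sees the same distinct labels.
modes = s.mode(dropna=False)

print(counts)
print(modes)
```

On older pandas versions the two null-like values would be expected to collapse into a single NaN entry, which is the mangling this PR removes.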

realead · 2021-07-27T05:40:19Z

asv shows how much the mangling of null-like values was costing:

       before           after         ratio
     [9731fd07]       [d5454938]
-         174±2ms          138±2ms     0.79  series_methods.ValueCountsObjectDropNAFalse.time_value_counts(100000)
-     1.16±0.01ms         916±20μs     0.79  series_methods.ValueCountsObjectDropNAFalse.time_value_counts(1000)
-         144±2ms        111±0.6ms     0.77  series_methods.ModeObjectDropNAFalse.time_mode(100000)
-      7.03±0.1ms      4.40±0.05ms     0.63  series_methods.ModeObjectDropNAFalse.time_mode(10000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
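For readers unfamiliar with asv, a benchmark like the ones listed above might look roughly like the sketch below. The class and method names come from the asv output; the body is an assumption modeled on the `ValueCounts.setup` line in the review diff, not the code actually added in this PR.

```python
import numpy as np
import pandas as pd


class ValueCountsObjectDropNAFalse:
    # asv calls setup() and then each time_* method once per entry in `params`.
    params = [10**3, 10**4, 10**5]
    param_names = ["N"]

    def setup(self, N):
        # Assumed data: object-dtype Series with many repeated integer values.
        self.s = pd.Series(np.random.randint(0, N, size=10 * N)).astype("object")

    def time_value_counts(self, N):
        # The timed operation: counting with null-like values kept.
        self.s.value_counts(dropna=False)


# Manual smoke run outside of asv, using a small N.
bench = ValueCountsObjectDropNAFalse()
bench.setup(100)
bench.time_value_counts(100)
```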

jreback · 2021-07-28T01:24:45Z

asv_bench/benchmarks/series_methods.py

@@ -133,12 +133,24 @@ class ValueCounts:
    param_names = ["N", "dtype"]

    def setup(self, N, dtype):
-        self.s = Series(np.random.randint(0, N, size=10 * N)).astype(dtype)
+        self.s = Series(np.random.randint(0, N, size=10 * N)).astype("object")


umm, isn't this defeating the purpose? e.g. why did this change?

Thanks, it was an unintentional change...

jreback · 2021-07-28T01:26:00Z

pandas/tests/indexing/test_indexing.py

@@ -774,8 +774,8 @@ def test_label_indexing_on_nan(self):
        # GH 32431
        df = Series([1, "{1,2}", 1, None])
        vc = df.value_counts(dropna=False)
-        result1 = vc.loc[np.nan]
-        result2 = vc[np.nan]
+        result1 = vc.loc[None]


can you parameterize this test to also include np.nan for indexing

Done, see c8f76c7
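The idea behind the requested parametrization can be sketched without pytest as a plain loop over the two null-like labels. This is a simplified stand-in for the real test, assuming pandas 1.4 or later; the `lookups` dictionary is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

lookups = {}
for null in (None, np.nan):
    # Same shape of data as in test_label_indexing_on_nan, with the null swapped in.
    s = pd.Series([1, "{1,2}", 1, null])
    vc = s.value_counts(dropna=False)

    # The null-like value that went in should be usable as the label coming out.
    lookups[repr(null)] = int(vc.loc[null])

print(lookups)
```

Because the values are no longer mangled, each null-like object is looked up under its own label rather than always under `np.nan`.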

github-actions · 2021-08-30T00:02:56Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

mroeschke · 2021-11-07T18:31:49Z

doc/source/whatsnew/v1.4.0.rst

@@ -504,6 +504,7 @@ Missing
 ^^^^^^^
 - Bug in :meth:`DataFrame.fillna` with limit and no method ignores axis='columns' or ``axis = 1`` (:issue:`40989`)
 - Bug in :meth:`DataFrame.fillna` not replacing missing values when using a dict-like ``value`` and duplicate column names (:issue:`43476`)
+- :meth:`Series.value_counts` and :meth:`Series.mode` no longer coerce ``None``, ``NaT`` and other null-values to a NaN-value for ``np.object``-dtype. This behavior is now consistent with ``unique``, ``isin`` and others (:issue:`42688`)


Probably better in the notable bug fix section since the prior behavior was intentional at one point

mroeschke · 2021-11-07T18:32:35Z

pandas/tests/libs/test_hashtable.py

-    # pd.Na and np.nan will have the same representative: np.nan
-    # thus we have 2 nans and 1 True
+    # GH42688, nans aren't mangled
+    values = np.array([True, pd.NA, np.nan, pd.NA, np.nan], dtype=np.object_)


Could you add pd.NaT?

mroeschke

Just a few comments but overall looks good. Sorry this went unreviewed for a bit, so if you could merge master that'd be great.

realead · 2021-11-10T12:22:44Z

@mroeschke some tests are failing, but it looks as if it is not due to this PR.

jreback · 2021-11-12T14:57:18Z

thanks @realead very nice!

jreback added this to the 1.4 milestone Jul 28, 2021

jreback added the Missing-data label (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) Jul 28, 2021

jreback requested changes Jul 28, 2021

realead force-pushed the fix_42688 branch from 5547497 to 9e3ecd2 Compare July 30, 2021 15:06

github-actions bot added the Stale label Aug 30, 2021

realead force-pushed the fix_42688 branch from 9e3ecd2 to c8f76c7 Compare October 18, 2021 18:10

mroeschke reviewed Nov 7, 2021

mroeschke requested changes Nov 7, 2021

realead added 10 commits November 9, 2021 20:22

do not mangle nans in value_count

c97b677

fix test cases which assumed mangled nans

34b960b

is_null only if really needed

ead1da4

add asv tests

d53c64c

fix overlooked test case

5b02c52

fixing reverting wrong change of asv-test

94d3378

adding whatsnew note

55d6f6e

parametrize test

e3fd597

moving whatsnew note to notable fixes

088b494

adding pd.NaT to tested null-objects

792cb7b

realead force-pushed the fix_42688 branch from c8f76c7 to 792cb7b Compare November 9, 2021 20:11

fixing title line

89ecece

mroeschke approved these changes Nov 10, 2021

mroeschke removed the Stale label Nov 12, 2021

jreback approved these changes Nov 12, 2021

jreback merged commit 2d3644c into pandas-dev:master Nov 12, 2021

nickleus27 pushed a commit to nickleus27/pandas that referenced this pull request Nov 28, 2021

[BUG] don't mangle null-objects in value_counts (pandas-dev#42743)

a0b00b8

jreback mentioned this pull request Jan 6, 2022

BUG: value counts with mixed NaNs is oddly rendering #45222

Closed
