BUG: don't mangle NaN-float-values and pd.NaT (GH 22295) #22296

Merged
merged 15 commits into from
Sep 18, 2018

Conversation

@realead realead commented Aug 12, 2018

This is more or less the clean-up after PR #21904 and PR #22207: the underlying hash map handles all cases correctly out of the box, and thus no special handling is needed.

As a side effect, the mangling of NaNs and NaT is fixed.

realead commented Aug 12, 2018

Performance tests:

asv continuous -f 1.01 upstream/master HEAD -b ^series_methods -b ^algorithms -b ^categorial

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 66 total benchmarks (2 commits * 1 environments * 33 benchmarks)
[  0.00%] · For pandas commit hash e14fd992:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ··· Running benchmarks............................
[  0.00%] ··· Setting up algorithms.py:95
[  0.00%] ··· Running benchmarks.........
[  1.52%] ··· algorithms.Duplicated.time_duplicated_float                                                    19.1±1ms;...
[  3.03%] ··· algorithms.Duplicated.time_duplicated_int                                                    10.5±0.9ms;...
[  4.55%] ··· algorithms.Duplicated.time_duplicated_string                                                 11.0±0.2μs;...
[  6.06%] ··· algorithms.DuplicatedUniqueIndex.time_duplicated_unique_int                                      31.5±0.2μs
[  7.58%] ··· algorithms.Factorize.time_factorize_float                                                      67.7±8ms;...
[  9.09%] ··· algorithms.Factorize.time_factorize_int                                                        23.1±4ms;...
[ 10.61%] ··· algorithms.Factorize.time_factorize_string                                                     162±20ms;...
[ 12.12%] ··· algorithms.Match.time_match_string                                                                  376±4μs
[ 13.64%] ··· series_methods.Clip.time_clip                                                                       126±3μs
[ 15.15%] ··· series_methods.Dir.time_dir_strings                                                             2.06±0.03ms
[ 16.67%] ··· series_methods.Dropna.time_dropna                                                           2.85±0.08ms;...
[ 18.18%] ··· series_methods.IsIn.time_isin                                                               1.67±0.05ms;...
[ 19.70%] ··· series_methods.IsInFloat64.time_isin_few_different                                                 39.5±2ms
[ 21.21%] ··· series_methods.IsInFloat64.time_isin_many_different                                                48.9±9ms
[ 22.73%] ··· series_methods.IsInFloat64.time_isin_nan_values                                                  38.8±0.3ms
[ 24.24%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values                                  3.62±0.1ms
[ 25.76%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values_floats                          6.46±0.09ms
[ 27.27%] ··· series_methods.IsInForObjects.time_isin_long_series_short_values                                2.37±0.05ms
[ 28.79%] ··· series_methods.IsInForObjects.time_isin_nans                                                       717±40μs
[ 30.30%] ··· series_methods.IsInForObjects.time_isin_short_series_long_values                                 1.41±0.1ms
[ 31.82%] ··· series_methods.Map.time_map                                                                 1.10±0.05ms;...
[ 33.33%] ··· series_methods.NSort.time_nlargest                                                           2.89±0.1ms;...
[ 34.85%] ··· series_methods.NSort.time_nsmallest                                                         2.42±0.07ms;...
[ 36.36%] ··· series_methods.SeriesConstructor.time_constructor                                               163±3μs;...
[ 37.88%] ··· series_methods.SeriesGetattr.time_series_datetimeindex_repr                                     3.19±0.09μs
[ 39.39%] ··· series_methods.ValueCounts.time_value_counts                                                 2.19±0.3ms;...
[ 40.91%] ··· algorithms.Hashing.time_frame                                                                    22.5±0.6ms
[ 42.42%] ··· algorithms.Hashing.time_series_categorical                                                       5.35±0.1ms
[ 43.94%] ··· algorithms.Hashing.time_series_dates                                                            3.56±0.03ms
[ 45.45%] ··· algorithms.Hashing.time_series_float                                                            3.59±0.05ms
[ 46.97%] ··· algorithms.Hashing.time_series_int                                                               3.37±0.2ms
[ 48.48%] ··· algorithms.Hashing.time_series_string                                                            14.0±0.9ms
[ 50.00%] ··· algorithms.Hashing.time_series_timedeltas                                                       3.42±0.06ms
[ 50.00%] · For pandas commit hash 03707400:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running benchmarks............................
[ 50.00%] ··· Setting up algorithms.py:95
[ 50.00%] ··· Running benchmarks.........
[ 51.52%] ··· algorithms.Duplicated.time_duplicated_float                                                  15.5±0.3ms;...
[ 53.03%] ··· algorithms.Duplicated.time_duplicated_int                                                      11.3±2ms;...
[ 54.55%] ··· algorithms.Duplicated.time_duplicated_string                                                 10.6±0.2μs;...
[ 56.06%] ··· algorithms.DuplicatedUniqueIndex.time_duplicated_unique_int                                      31.3±0.3μs
[ 57.58%] ··· algorithms.Factorize.time_factorize_float                                                      63.8±3ms;...
[ 59.09%] ··· algorithms.Factorize.time_factorize_int                                                        22.6±3ms;...
[ 60.61%] ··· algorithms.Factorize.time_factorize_string                                                      145±6ms;...
[ 62.12%] ··· algorithms.Match.time_match_string                                                                  367±3μs
[ 63.64%] ··· series_methods.Clip.time_clip                                                                       120±2μs
[ 65.15%] ··· series_methods.Dir.time_dir_strings                                                              2.03±0.1ms
[ 66.67%] ··· series_methods.Dropna.time_dropna                                                            3.03±0.1ms;...
[ 68.18%] ··· series_methods.IsIn.time_isin                                                               1.68±0.01ms;...
[ 69.70%] ··· series_methods.IsInFloat64.time_isin_few_different                                               39.0±0.5ms
[ 71.21%] ··· series_methods.IsInFloat64.time_isin_many_different                                                43.6±7ms
[ 72.73%] ··· series_methods.IsInFloat64.time_isin_nan_values                                                  38.9±0.1ms
[ 74.24%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values                                 3.55±0.05ms
[ 75.76%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values_floats                           6.12±0.1ms
[ 77.27%] ··· series_methods.IsInForObjects.time_isin_long_series_short_values                                2.25±0.03ms
[ 78.79%] ··· series_methods.IsInForObjects.time_isin_nans                                                       680±10μs
[ 80.30%] ··· series_methods.IsInForObjects.time_isin_short_series_long_values                                 1.28±0.1ms
[ 81.82%] ··· series_methods.Map.time_map                                                                 1.11±0.04ms;...
[ 83.33%] ··· series_methods.NSort.time_nlargest                                                          2.94±0.06ms;...
[ 84.85%] ··· series_methods.NSort.time_nsmallest                                                         2.45±0.09ms;...
[ 86.36%] ··· series_methods.SeriesConstructor.time_constructor                                              169±10μs;...
[ 87.88%] ··· series_methods.SeriesGetattr.time_series_datetimeindex_repr                                      3.53±0.2μs
[ 89.39%] ··· series_methods.ValueCounts.time_value_counts                                                 2.17±0.2ms;...
[ 90.91%] ··· algorithms.Hashing.time_frame                                                                    21.4±0.4ms
[ 92.42%] ··· algorithms.Hashing.time_series_categorical                                                      5.31±0.09ms
[ 93.94%] ··· algorithms.Hashing.time_series_dates                                                             3.51±0.1ms
[ 95.45%] ··· algorithms.Hashing.time_series_float                                                            3.35±0.03ms
[ 96.97%] ··· algorithms.Hashing.time_series_int                                                              3.31±0.06ms
[ 98.48%] ··· algorithms.Hashing.time_series_string                                                            15.2±0.4ms
[100.00%] ··· algorithms.Hashing.time_series_timedeltas                                                       3.44±0.07ms
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@jreback jreback left a comment

can you run perf to make sure this is ok?

@gfyoung gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 13, 2018

realead commented Aug 15, 2018

@jreback, @jorisvandenbossche, @mroeschke, @WillAyd: I wanted to ask for your advice concerning this PR.

There is currently a problem: pd.unique() mangles pd.NaT and np.nan but not None:

 if not checknull(val) or val is None:
    ....

On the other hand, other functions of the hash table mangle all three of them, for example map_locations:

if val != val or val is None:
     val = na_sentinel
....

Now this leads to problems as soon as these two definitions of "unique" meet, for example in #22332, where mangling/not mangling None and np.nan is a problem.

This PR would not change the situation much: None + np.nan stays a problem, but the PR also demangles np.nan and pd.NaT in pd.unique(), and thus the pair np.nan + pd.NaT also becomes a problem in situation #22332.
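
To make the difference concrete, here is a minimal sketch of the two behaviors being discussed (my own illustration, not code from this PR; the comments describe the 0.23 behavior versus the behavior targeted here, so the exact output may differ by version):

import numpy as np
import pandas as pd

vals = np.array([np.nan, pd.NaT, None], dtype=object)

# pd.unique today (0.23-style): np.nan and pd.NaT are collapsed into one
# missing value, while None survives separately -> roughly [nan, None]
# pd.unique with this PR: all three stay distinct -> [nan, NaT, None]
print(pd.unique(vals))

# map_locations-style handling would instead collapse all three of them
# into a single NA sentinel.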

I see the following options:

  1. Leave it as it is / do nothing.
  2. Mangle None together with the others in pd.unique(), i.e. roll back the fix for "Change in behaviour of unique method regarding None values" #20866.
  3. Go on with this PR and accept that [np.nan, pd.NaT] will become a problem similar to [None, np.nan] for pd.to_datetime(), disabling/changing the one failing unit test (see the related #22305; right now [None, pd.NaT] is also a problem, but not tested in the test suite).
  4. Unmangle None, np.nan and pd.NaT everywhere. But this mangling is almost everywhere...

What would you suggest?


WillAyd commented Aug 15, 2018

No idea how much effort this is signing up for, but option 4 IMO seems to be the best and cleanest.

@mroeschke

Option 4 would be very nice, but as you mention, it would take a lot of untangling. Additionally, there's already some institutional mangling of None and np.nan when it comes to converting None to np.nan: https://pandas.pydata.org/pandas-docs/stable/missing_data.html#inserting-missing-data


codecov bot commented Aug 17, 2018

Codecov Report

Merging #22296 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #22296      +/-   ##
==========================================
+ Coverage   92.16%   92.16%   +<.01%     
==========================================
  Files         169      169              
  Lines       50708    50711       +3     
==========================================
+ Hits        46734    46737       +3     
  Misses       3974     3974
Flag Coverage Δ
#multiple 90.57% <100%> (ø) ⬆️
#single 42.35% <80%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 96.44% <ø> (-0.01%) ⬇️
pandas/core/indexes/numeric.py 97.3% <100%> (+0.06%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 996f361...b6459f3.


realead commented Aug 17, 2018

@jreback After running asv for the whole suite and rerunning the flagged test cases, this is what is left:

      before           after         ratio
     [03707400]       [a6eb063c]
-        29.7±1ms       25.6±0.5ms     0.86  index_object.SetOperations.time_operation('strings', 'symmetric_difference')
-        41.0±1ms       35.2±0.5ms     0.86  index_object.SetOperations.time_operation('strings', 'intersection')
-         179±3ms          150±2ms     0.84  index_object.Indexing.time_get_loc_non_unique('String')
-      32.0±0.5ms       26.4±0.5ms     0.82  index_object.SetOperations.time_operation('date_string', 'symmetric_difference')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

The patch is slightly bigger than I would like, but one has to change everything at once in order to avoid inconsistencies. In the end it was not as much work as expected to unmangle NA-values for object dtype:

  1. Don't mangle NA-values in the hash map and in the base Index.
  2. Coerce NA-values to NaN in the numeric Index (to keep the current behavior), which probably makes sense.

There are probably still some hiccups, but at least the test suite doesn't show anything.
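
For illustration, a minimal sketch of the user-visible behavior these two points aim at (my own example, not taken from the PR; exact reprs/dtypes may differ by pandas version):

import numpy as np
import pandas as pd

# 1. Object-dtype Index: None, np.nan and pd.NaT are three distinct keys
obj_idx = pd.Index(["a", None, np.nan, pd.NaT], dtype=object)
print(obj_idx.get_indexer([None, np.nan, pd.NaT]))  # expected: [1 2 3]

# 2. Numeric Index: an inserted NA value is coerced to np.nan
num_idx = pd.Index([1.5, 2.5]).insert(1, pd.NaT)
print(num_idx)  # expected: Float64Index([1.5, nan, 2.5], dtype='float64')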

@realead realead force-pushed the cleanup_nans_in_unique branch 2 times, most recently from 8227280 to a608644 on August 23, 2018 20:26

realead commented Aug 26, 2018

@jreback pinging you to ask whether you are OK with the result of the performance test suite, or maybe I didn't understand your request correctly?

Btw, do I have to retrigger the CI? There are some errors, but they are unrelated to my changes.

@realead realead closed this Sep 5, 2018
@realead realead reopened this Sep 5, 2018
@pep8speaks

Hello @realead! Thanks for updating the PR.

@jreback jreback left a comment

please rebase as well

@@ -664,13 +664,16 @@ Indexing
- Bug when indexing :class:`DatetimeIndex` with nanosecond resolution dates and timezones (:issue:`11679`)
- Bug where indexing with a Numpy array containing negative values would mutate the indexer (:issue:`21867`)
- ``Float64Index.get_loc`` now raises ``KeyError`` when boolean key passed. (:issue:`19087`)
- :class:`Index` no longer mangles None, NaN and NaT (:issue:`22332`)
Contributor

can you be a little more explicit, use double back-ticks around None, NaN, NaT.


Missing
^^^^^^^

- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`)
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`)
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`)
- :func:`unique` no longer manges NaN-float-values and `pd.NaT` (:issue:`22295`)
Contributor

use double back-ticks around NaT (no pd. prefix required)

    def insert(self, loc, item):
        # treat NA values as nans:
        if is_scalar(item) and isna(item):
            item = np.nan
Contributor

use self._na_value, hmm isna not already imported?

# GH 22332
# check pairwise, that no pair of na values
# is mangled
na_values = [None, np.nan, pd.NaT]
Contributor

use the nulls_fixture

struct.pack("d", result[0]))[0]
assert result_nan_bits == bits_for_nan1

def test_do_not_mangle_na_values(self):
Contributor

use the fixture

@@ -1245,6 +1245,20 @@ def test_get_indexer(self):
        e1 = np.array([1, 3, -1], dtype=np.intp)
        assert_almost_equal(r1, e1)

    def test_get_indexer_with_NA_values(self):
Contributor

can you move near the other get_loc tests

@jreback jreback added this to the 0.24.0 milestone Sep 5, 2018
@realead realead force-pushed the cleanup_nans_in_unique branch from a608644 to 0e80f70 on September 5, 2018 21:04

realead commented Sep 7, 2018

@jreback done & green.

@jreback jreback left a comment

minor comments, ping on green.


Missing
^^^^^^^

- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`)
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`)
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`)
- :func:`unique` no longer manges NaN-float-values and ``NaT`` (:issue:`22295`)
Contributor

manges -> mangles

Contributor

can you add the same sentence as in the note above (the i.e. part)?

@@ -278,6 +278,18 @@ def nulls_fixture(request):
nulls_fixture2 = nulls_fixture # Generate cartesian product of nulls_fixture


@pytest.fixture(params=[None, np.nan, pd.NaT])
def unique_nulls_fixture(request):
"""
Contributor

can't you use the existing nulls_fixture? (ah, or that has the float NaN, which is the same as np.nan?)

Contributor Author

@jreback I could have used nulls_fixture, but then, as you have already pointed out, I would have to filter out [np.nan, float('nan')] & Co. (which are not lists of unique elements) in several places in the tests, so I assumed adding unique_nulls_fixture would be cleaner.

Contributor

Yes, this is ok (I see how you are using it below).
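
For context, a hypothetical sketch of the kind of pairwise test such a fixture enables (fixture and test bodies are illustrative, not the PR's actual code; unique_nulls_fixture2 is assumed to mirror the nulls_fixture2 alias shown above, and the assertion assumes the post-PR behavior):

import numpy as np
import pandas as pd
import pytest

@pytest.fixture(params=[None, np.nan, pd.NaT])
def unique_nulls_fixture(request):
    # each NA flavor exactly once (nulls_fixture also carries float('nan'))
    return request.param

unique_nulls_fixture2 = unique_nulls_fixture  # cartesian product of NA values

def test_no_na_mangling(unique_nulls_fixture, unique_nulls_fixture2):
    # GH 22332: two different NA objects must remain two distinct keys
    if unique_nulls_fixture is unique_nulls_fixture2:
        return
    idx = pd.Index([unique_nulls_fixture, unique_nulls_fixture2], dtype=object)
    assert len(idx.unique()) == 2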

@@ -560,8 +560,8 @@ def test_insert(self):
        tm.assert_index_equal(Index(['a']), null_index.insert(0, 'a'))

    def test_insert_missing(self, nulls_fixture):
        # GH 18295 (test missing)
        expected = Index(['a', np.nan, 'b', 'c'])
        # test there is no mangling of NA values
Contributor

can you add the issue number as a comment

@@ -669,13 +669,16 @@ Indexing
- Bug where indexing with a Numpy array containing negative values would mutate the indexer (:issue:`21867`)
- Bug where mixed indexes wouldn't allow integers for ``.at`` (:issue:`19860`)
- ``Float64Index.get_loc`` now raises ``KeyError`` when boolean key passed. (:issue:`19087`)
- :class:`Index` no longer mangles ``None``, ``NaN`` and ``NaT``, i.e. they are treated as three different keys. However, for :class:`NumericIndex` all three are still coarsed to a ``NaN`` (:issue:`22332`)
Member

Can you use "numeric Index" instead of "NumericIndex" ? (such a class is not public so not known to users)

@@ -114,6 +115,12 @@ def is_all_dates(self):
        """
        return False

    def insert(self, loc, item):
Member

This needs a docstring. I think you can add @Appender(Index.insert.__doc__) as decorator to do it (but best check in an interactive session that it indeed works)
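
A minimal sketch of the suggested decorator pattern (Appender is a pandas-internal helper; the import path and usage below are my assumption from 0.23-era sources, so worth double-checking interactively as suggested):

from pandas import Index
from pandas.util._decorators import Appender

@Appender(Index.insert.__doc__)
def insert(self, loc, item):
    # subclass-specific NA coercion would go here before delegating
    return Index.insert(self, loc, item)

assert insert.__doc__  # the wrapper now carries Index.insert's docstring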


Missing
^^^^^^^

- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`)
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`)
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`)
- :func:`unique` no longer manges NaN-float-values and ``NaT`` (:issue:`22295`)
Member

This new addition (unique not mangling different NAs) seems in contrast with isin actually treating them as equal (the line above)?

Contributor Author

The isin fix is about isin treating different NaN float objects as the same entity, i.e.

algos.isin([float('nan')], [float('nan')])  # different nan objects

should yield [True] (and not [False] as in 0.23).

The unique fix is about treating np.nan and pd.NaT as two different entities, i.e.

pd.unique([np.nan, pd.NaT])

should yield [np.nan, pd.NaT] (and not [np.nan] as in 0.23). It still treats all NaN objects as the same, i.e.:

pd.unique([np.nan, float('nan'), -float('nan')])  # three different nan objects

results in [np.nan].

I hope it is clearer with the updated, more detailed description of the unique change.

@realead realead force-pushed the cleanup_nans_in_unique branch from 6e02bd4 to b6459f3 on September 8, 2018 05:14

realead commented Sep 8, 2018

@jreback @jorisvandenbossche your remarks have been addressed, PTAL.

@realead realead force-pushed the cleanup_nans_in_unique branch from b6459f3 to 8760066 on September 12, 2018 20:13

jreback commented Sep 15, 2018

can you rebase

@realead realead force-pushed the cleanup_nans_in_unique branch from 8760066 to a258997 on September 15, 2018 18:51
@realead realead force-pushed the cleanup_nans_in_unique branch from a258997 to 356b8aa on September 17, 2018 19:01

realead commented Sep 18, 2018

@jreback rebased & green.

@jreback jreback merged commit d39249a into pandas-dev:master Sep 18, 2018

jreback commented Sep 18, 2018

thanks @realead nice!
