-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: don't mangle NaN-float-values and pd.NaT (GH 22295) #22296
Conversation
Performance tests:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u run perf to make sure this is ok
@jreback, @jorisvandenbossche, @mroeschke, @WillAyd wanted to ask for your advice, concerning this PR. There is currently a problem, that
on the other hand other functions of hash_table mangle all three of them, for example
Now, this leads to problems as soon as these two definition of unique meet, for example #22332, where mangling/not mangling This PR would not change the situation much: I see the following options:
What would you suggest? |
No idea how much effort this is signing up for but option 4 imo seems to be the best and cleanest |
Option 4 would be very nice but as you mention would take a lot of untangling. Additionally, there's already some institutional mangling of |
Codecov Report
@@ Coverage Diff @@
## master #22296 +/- ##
==========================================
+ Coverage 92.16% 92.16% +<.01%
==========================================
Files 169 169
Lines 50708 50711 +3
==========================================
+ Hits 46734 46737 +3
Misses 3974 3974
Continue to review full report at Codecov.
|
@jreback After running asv for the whole suite and rerunning for flashing test cases, that is what is left:
The patch is slightly bigger than I would like, but one has to change everything at once in order not to get inconsistencies. In the end it was not as much work as expected to unmangle NA-values for object-type:
There are probably still some hiccups, but at least the test suite doesn't show anything. |
8227280
to
a608644
Compare
@jreback pinging you to ask, whether you are Ok with the result of the performance test suite, or maybe I didn't understand your request correctly? Btw. do I have to retrigger the CI, because there are some errors but unrelated to my changes? |
Hello @realead! Thanks for updating the PR.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls rebase as well
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -664,13 +664,16 @@ Indexing | |||
- Bug when indexing :class:`DatetimeIndex` with nanosecond resolution dates and timezones (:issue:`11679`) | |||
- Bug where indexing with a Numpy array containing negative values would mutate the indexer (:issue:`21867`) | |||
- ``Float64Index.get_loc`` now raises ``KeyError`` when boolean key passed. (:issue:`19087`) | |||
- :class:`Index` no longer mangles None, NaN and NaT (:issue:`22332`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you be a little more explicit, use double back-ticks around None, NaN, NaT.
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
Missing | ||
^^^^^^^ | ||
|
||
- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`) | ||
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`) | ||
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`) | ||
- :func:`unique` no longer manges NaN-float-values and `pd.NaT` (:issue:`22295`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double back-ticks around NaT (no pd.) required
pandas/core/indexes/numeric.py
Outdated
def insert(self, loc, item): | ||
# treat NA values as nans: | ||
if is_scalar(item) and isna(item): | ||
item = np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use self._na_value
, hmm isna not already imported?
pandas/tests/indexes/test_base.py
Outdated
# GH 22332 | ||
# check pairwise, that no pair of na values | ||
# is mangled | ||
na_values = [None, np.nan, pd.NaT] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the nulls_fixture
pandas/tests/test_algos.py
Outdated
struct.pack("d", result[0]))[0] | ||
assert result_nan_bits == bits_for_nan1 | ||
|
||
def test_do_not_mangle_na_values(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the fixture
pandas/tests/indexes/test_base.py
Outdated
@@ -1245,6 +1245,20 @@ def test_get_indexer(self): | |||
e1 = np.array([1, 3, -1], dtype=np.intp) | |||
assert_almost_equal(r1, e1) | |||
|
|||
def test_get_indexer_with_NA_values(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move near the other get_loc tests
a608644
to
0e80f70
Compare
@jreback done & green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, ping on green.
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
Missing | ||
^^^^^^^ | ||
|
||
- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`) | ||
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`) | ||
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`) | ||
- :func:`unique` no longer manges NaN-float-values and ``NaT`` (:issue:`22295`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
manges -> mangles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the same sentence as above note (the i.e. part)
@@ -278,6 +278,18 @@ def nulls_fixture(request): | |||
nulls_fixture2 = nulls_fixture # Generate cartesian product of nulls_fixture | |||
|
|||
|
|||
@pytest.fixture(params=[None, np.nan, pd.NaT]) | |||
def unique_nulls_fixture(request): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't you use the existing nulls_fixture? (ahh or that has the float NaN which are the same as np.nan)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I could have used nulls_fixture
, but then as you already have pointed out, I would have to filter out [np.nan, float('nan')]
& Co. (which are not list of unique elements) at several places in tests, so I assumed adding unique_nulls_fixture
would be cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is ok (i see how you are using it below).
pandas/tests/indexes/test_base.py
Outdated
@@ -560,8 +560,8 @@ def test_insert(self): | |||
tm.assert_index_equal(Index(['a']), null_index.insert(0, 'a')) | |||
|
|||
def test_insert_missing(self, nulls_fixture): | |||
# GH 18295 (test missing) | |||
expected = Index(['a', np.nan, 'b', 'c']) | |||
# test there is no mangling of NA values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number as a comment
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -669,13 +669,16 @@ Indexing | |||
- Bug where indexing with a Numpy array containing negative values would mutate the indexer (:issue:`21867`) | |||
- Bug where mixed indexes wouldn't allow integers for ``.at`` (:issue:`19860`) | |||
- ``Float64Index.get_loc`` now raises ``KeyError`` when boolean key passed. (:issue:`19087`) | |||
- :class:`Index` no longer mangles ``None``, ``NaN`` and ``NaT``, i.e. they are treated as three different keys. However, for :class:`NumericIndex` all three are still coarsed to a ``NaN`` (:issue:`22332`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use "numeric Index" instead of "NumericIndex" ? (such a class is not public so not known to users)
pandas/core/indexes/numeric.py
Outdated
@@ -114,6 +115,12 @@ def is_all_dates(self): | |||
""" | |||
return False | |||
|
|||
def insert(self, loc, item): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a docstring. I think you can add @Appender(Index.insert.__doc__)
as decorator to do it (but best check in an interactive session that it indeed works)
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
Missing | ||
^^^^^^^ | ||
|
||
- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`) | ||
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`) | ||
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`) | ||
- :func:`unique` no longer manges NaN-float-values and ``NaT`` (:issue:`22295`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new addition (unique not mangling different NAs) seems in constrast with isin
actually treating them equal (the line above) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The isin-fix is about isin
treating different nan-float-object as the same entity, i.e.
algos.isin([float('nan')], [float('nan')]) # different nan-objects
should yield [True]
(and not [False] as it is in 0.23).
The unique-fix is about treating np.nan
and pd.NaT
as two different entities, ie.
pd.unique([np.nan, pd.NaT])
should yield [np.nan, pd.NaT]
(and not [np.nan]
as in 0.23). It still treats all nan-object as the same, i.e.:
pd.unique([np.nan, float('nan'), -float('nan')] ) # three different nan-objects
results in [np.nan]
I hope it is clearer with the updated more detailed description for the unique-change.
6e02bd4
to
b6459f3
Compare
@jreback @jorisvandenbossche your remarks are adopted, PTAL. |
b6459f3
to
8760066
Compare
can you rebase |
8760066
to
a258997
Compare
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
…nulls_fixture, because otherwise it is just too troublesome to filter out all types of nans
…ssion that it works)
a258997
to
356b8aa
Compare
@jreback rebased & green. |
thankks @realead nice! |
it is more or less the clean-up after PR #21904 and PR #22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
As a collateral, the mangling of NaNs and NaT is fixed.
git diff upstream/master -u -- "*.py" | flake8 --diff