BUG: Fix IntervalIndex.insert to allow inserting NaN #18300
    right_insert = item.right
elif is_scalar(item) and isna(item):
    # GH 18295
    left_insert = right_insert = item
Yes, you need to use a NaN compatible with left; in other words, this could be a NaT.
i think we have a left.na_value iirc (might be spelled differently)
or maybe the underlying already handles this in the insert
Updated. Procedure I'm following is "check if any type of NA value is passed -> raise if the wrong type of NA is passed". I suppose I could just bypass this and only check if the right type of NA is passed, if that would be preferred.
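The two-step procedure described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `validate_na_for_insert` is a hypothetical helper, and the rule that NaT matches only a NaT `na_value` is an assumption made for the sketch.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_scalar


def validate_na_for_insert(item, na_value):
    """Hypothetical helper: return na_value if item is the matching kind of NA."""
    # Step 1: is any kind of scalar NA being passed at all?
    if not (is_scalar(item) and pd.isna(item)):
        raise ValueError("item is not a scalar NA value")
    # Step 2: is it the wrong kind? (Assumption: NaT pairs only with NaT.)
    if (item is pd.NaT) != (na_value is pd.NaT):
        raise ValueError("NA value does not match the index's NA value")
    return na_value
```

Bypassing step 1, as suggested in the comment, would amount to keeping only the second check.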
@@ -255,6 +255,11 @@ def test_insert(self):
        pytest.raises(ValueError, self.index.insert, 0,
                      Interval(2, 3, closed='left'))

        # GH 18295
        expected = self.index_with_nan
Can you add a fixture that hits multiple kinds of left (float, int, datetime)? You might be able to do this more generally in the interval tests.
done
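A rough idea of what such a fixture would exercise, written as a plain loop rather than the actual pytest fixture added in this PR. The pairing of each kind of `left` with an NA value is an assumption, and the behaviour shown presumes a pandas release that includes a fix along the lines of this PR.

```python
import numpy as np
import pandas as pd

# One IntervalIndex per kind of `left` (int, float, datetime), each paired
# with an NA value assumed to be appropriate for that kind.
cases = [
    (pd.IntervalIndex.from_breaks([0, 1, 2]), np.nan),
    (pd.IntervalIndex.from_breaks([0.0, 1.0, 2.0]), np.nan),
    (pd.IntervalIndex.from_breaks(pd.date_range("2017-01-01", periods=3)),
     pd.NaT),
]

# Insert the NA value in the middle of each index and record the NA mask.
results = [index.insert(1, na) for index, na in cases]
na_masks = [result.isna().tolist() for result in results]
```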
Codecov Report
@@ Coverage Diff @@
## master #18300 +/- ##
==========================================
- Coverage 91.4% 91.38% -0.02%
==========================================
Files 164 164
Lines 49880 49884 +4
==========================================
- Hits 45592 45587 -5
- Misses 4288 4297 +9
Codecov Report
@@ Coverage Diff @@
## master #18300 +/- ##
==========================================
- Coverage 91.34% 91.32% -0.02%
==========================================
Files 163 163
Lines 49717 49727 +10
==========================================
+ Hits 45413 45414 +1
- Misses 4304 4313 +9
fbfe3ab to 237766c
You can do a precursor PR to fix .insert, or do it here.
pandas/core/indexes/interval.py
Outdated
    right_insert = item.right
elif is_scalar(item) and isna(item):
    # GH 18295
    if item is not self.left._na_value:
So this needs a little work in Base.insert (which works, though the values should be inserted as _na_value), and it doesn't work with any datetimelikes (e.g. DatetimeIndex). Can you make the fix for this there, rather than here?
@@ -255,6 +255,30 @@ def test_insert(self):
        pytest.raises(ValueError, self.index.insert, 0,
                      Interval(2, 3, closed='left'))

    @pytest.mark.parametrize('data', [
Probably need some more tests for .insert generally, with None, np.nan and pd.NaT as the inserted value.
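As one illustration of the kind of test being asked for, the sketch below inserts each NA flavour into a DatetimeIndex. It is not one of the tests added in this PR, and it assumes a pandas release in which `insert` converts any scalar NA to the index's own NA value.

```python
import numpy as np
import pandas as pd

index = pd.DatetimeIndex(["2017-01-01", "2017-01-02"])

# Insert each NA flavour and record what ends up at position 1.
inserted = [index.insert(1, na)[1] for na in (None, np.nan, pd.NaT)]

# With the conversion in place, every entry is expected to be pd.NaT.
all_nat = all(value is pd.NaT for value in inserted)
```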
doc/source/whatsnew/v0.21.1.txt
Outdated
@@ -62,6 +62,7 @@ Bug Fixes
- Bug in ``pd.Series.rolling.skew()`` and ``rolling.kurt()`` with all equal values has floating issue (:issue:`18044`)
- Bug in ``pd.DataFrameGroupBy.count()`` when counting over a datetimelike column (:issue:`13393`)
- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
- Bug in ``IntervalIndex.insert`` when attempting to insert ``NaN`` (:issue:`18295`)
move to 0.22, this is a bit more invasive.
ab4fd51 to 8986439
Made the changes at |
nice fixes and tests. some comments.
pandas/core/indexes/base.py
Outdated
@@ -3728,6 +3728,10 @@ def insert(self, loc, item):
        -------
        new_index : Index
        """
        if lib.checknull(item):
you can just use isna()
The reason I'm using checknull over isna is to guard against non-scalar item.
Using isna by itself in an if statement would fail for non-scalar item with an unhelpful error message, whereas checknull returns False for any non-scalar:
In [8]: checknull(['a', 'b'])
Out[8]: False
In [9]: bool(isna(['a', 'b']))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-358fb1312857> in <module>()
----> 1 bool(isna(['a', 'b']))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The way around this with isna would be to do something like: is_scalar(item) and isna(item). However, looking at the implementation of isna, this just forces it down a path to hit checknull:
pandas/pandas/core/dtypes/missing.py
Lines 51 to 53 in cfad581
def _isna_new(obj):
    if is_scalar(obj):
        return lib.checknull(obj)
So is_scalar(item) and isna(item) is equivalent to checknull(item), but with the additional overhead of two calls to is_scalar.
Seems more efficient to just use checknull, but I can switch to isna if that's preferred stylistically. Or maybe I'm overlooking a better way of using isna?
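For reference, the explicit form discussed here can be written with public pandas API only. This is a small sketch; `is_scalar_na` is a hypothetical name, not something defined in pandas or in this PR.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_scalar


def is_scalar_na(item):
    """Hypothetical helper: True only for scalar NA values; never raises."""
    # `and` short-circuits, so isna never sees a list or array here and the
    # ambiguous-truth-value error shown above cannot occur.
    return bool(is_scalar(item) and pd.isna(item))


checks = {
    "nan": is_scalar_na(np.nan),
    "None": is_scalar_na(None),
    "NaT": is_scalar_na(pd.NaT),
    "list": is_scalar_na(["a", "b"]),
}
```

The cost is the extra is_scalar call mentioned in the comment; the benefit is that no internal function is referenced.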
As an aside, might be nice to do a checknull -> checkna renaming in the spirit of isnull -> isna.
Yeah, should rename that as well.
Usually I would just do is_scalar(...) and isna(...) to make this explicit.
lib.checknull is pretty internal, and I don't want to use it in this code.
pandas/core/indexes/category.py
Outdated
@@ -688,7 +688,7 @@ def insert(self, loc, item):

        """
        code = self.categories.get_indexer([item])
-        if (code == -1):
+        if (code == -1) and not lib.checknull(item):
use isna() (or notna())
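The effect of guarding the `code == -1` check can be seen from the outside with a short sketch. It assumes a pandas release that includes a fix of this kind; the index contents are made up for illustration.

```python
import numpy as np
import pandas as pd

index = pd.CategoricalIndex(["a", "b"])

# NaN is not one of the categories, yet it is a legitimate missing value,
# so the insert is expected to succeed rather than be rejected.
result = index.insert(1, np.nan)
na_mask = result.isna().tolist()
categories = list(result.categories)
```

The categories themselves should be unchanged, since a missing value is not a category.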
pandas/core/indexes/datetimes.py
Outdated
@@ -1751,6 +1751,9 @@ def insert(self, loc, item):
        -------
        new_index : Index
        """
        if lib.checknull(item):
same
pandas/core/indexes/interval.py
Outdated
                             'side as the index')
        left_insert = item.left
        right_insert = item.right
    elif lib.checknull(item):
same
pandas/core/indexes/timedeltas.py
Outdated
        freq = None
-        if isinstance(item, Timedelta) or item is NaT:
+        if isinstance(item, Timedelta) or (item is self._na_value):
use isna() here as well
pandas/tests/indexes/test_numeric.py
Outdated
    def test_insert(self):
        # GH 18295 (test missing)
        expected = UInt64Index([0, 0, 1, 2, 3, 4])
        for na in (np.nan, pd.NaT, None):
hmm, I think this should coerce to a Float64Index, cc @gfyoung
That was my intuition as well, and how I originally wrote the test (flowed through from Numeric class). The logic behind expected is based on the _na_value of UInt64Index being 0:
In [2]: pd.UInt64Index._na_value
Out[2]: 0
So, it'd be a matter of altering that to np.nan if we want it to coerce to Float64Index.
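The coercion being discussed looks roughly like this from the user's side. The sketch uses the generic `pd.Index` constructor with `dtype="uint64"` so that it does not depend on the `UInt64Index` class, and it assumes a pandas release in which the NA value for unsigned integer indexes is NaN rather than 0.

```python
import numpy as np
import pandas as pd

index = pd.Index([0, 1, 2], dtype="uint64")

# uint64 cannot represent NaN, so the result is expected to use float64
# instead of silently storing 0 in place of the missing value.
result = index.insert(1, np.nan)
result_dtype = result.dtype
```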
hmm, this should mimic Int64Index, in that it cannot hold NA, so yes, please change this logic (possibly this might break other things, not sure). May need to do this as a precursor (or you can do it afterwards as well, let me know).
Tested out changing the default value to np.nan, and it produced the expected result of coercing to Float64Index. It broke two tests though, both involving Index.where, so I didn't include it in the most recent commit. The tests that broke are:
- test_where_array_like fails due to expected raising a ValueError:
  pandas/pandas/tests/indexes/common.py
  Lines 545 to 551 in 3b05a60
      def test_where_array_like(self):
          i = self.create_index()
          _nan = i._na_value
          cond = [False] + [True] * (len(i) - 1)
          klasses = [list, tuple, np.array, pd.Series]
          expected = pd.Index([_nan] + i[1:].tolist(), dtype=i.dtype)
  The change makes _nan -> np.nan, so expected essentially becomes pd.Index([np.nan, 1, 2, 3, 4], dtype='uint64'), which raises ValueError: cannot convert float NaN to integer. I think the ValueError is fine, and it's just the construction of expected that needs to be modified. My thoughts are that expected should be a Float64Index in this case, as the introduction of np.nan should cause UInt64Index to coerce.
- test_where fails because result gets coerced to Int64Index:
  pandas/pandas/tests/indexes/common.py
  Lines 532 to 536 in 3b05a60
      def test_where(self):
          i = self.create_index()
          result = i.where(notna(i))
          expected = i
          tm.assert_index_equal(result, expected)
  This actually happens in 0.21.0 too, and is only now apparent due to testing the change. Essentially, if any non-uint64 value gets passed as other (in this case np.nan due to the change), the index gets coerced, even if the mask results in no changes. Reproducing this error on 0.21.0:
      In [2]: idx = pd.UInt64Index(range(5))
         ...: idx
         ...:
      Out[2]: UInt64Index([0, 1, 2, 3, 4], dtype='uint64')
      In [3]: idx.where(idx < 100, np.nan)
      Out[3]: Int64Index([0, 1, 2, 3, 4], dtype='int64')
  Seems to be caused by _try_convert_to_int_index automatically trying Int64Index first, so should be fairly straightforward to fix by short-circuiting based on dtype.
is this fixed here? or the other issue fix should be merged first?
Will open a separate PR that should be merged first since there are a couple unrelated fixes. My bad, said that in the original comment above, but then edited over it!
np; just reference this pr from new one
8986439 to 71a09eb
@@ -57,6 +57,12 @@ def test_insert(self):
        assert result.name == expected.name
        assert result.freq == expected.freq

        # GH 18295 (test missing)
        expected = TimedeltaIndex(['1day', pd.NaT, '2day', '3day'])
side issue, we have lots of duplication on these test_insert/delete and such in the hierarchy and most can simply be tested via fixture / in superclass (but will require some refactoring in the tests). I think we have an issue about this.
c1a0901 to e401032
@@ -697,3 +697,11 @@ def test_join_self(self, how):
        index = period_range('1/1/2000', periods=10)
        joined = index.join(index, how=how)
        assert index is joined

    def test_insert(self):
in another PR, can come back and move test_insert to datetimelike.py where this can be tested generically (instead of having code inside each datetimelike index type)
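A generic version of such a test might look something like the sketch below, which runs one check over each datetime-like index type instead of repeating it per class. The helper name `check_insert_nat` and the sample data are hypothetical; the real refactoring would live in the shared test classes.

```python
import pandas as pd

# Hypothetical shared test data: one index per datetime-like type.
indexes = [
    pd.date_range("2017-01-01", periods=3),
    pd.timedelta_range("1 day", periods=3),
    pd.period_range("2017-01-01", periods=3, freq="D"),
]


def check_insert_nat(index):
    """Insert NaT at position 1 and report whether it landed there."""
    result = index.insert(1, pd.NaT)
    return len(result) == len(index) + 1 and result[1] is pd.NaT


outcomes = [check_insert_nat(index) for index in indexes]
```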
@@ -457,6 +457,12 @@ def test_insert(self):
        null_index = Index([])
        tm.assert_index_equal(Index(['a']), null_index.insert(0, 'a'))

        # GH 18295 (test missing)
can also be done generically
thanks @jschendel
git diff upstream/master -u -- "*.py" | flake8 --diff