-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG, TST: Patch handling of uint64 in algorithms #15162
BUG, TST: Patch handling of uint64 in algorithms #15162
Conversation
@@ -144,6 +144,41 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject): | |||
def __new__(cls, data=None, dtype=None, copy=False, name=None, | |||
fastpath=False, tupleize_cols=True, **kwargs): | |||
|
|||
def convert_to_int_index(_data, _copy, _name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a class method instead (nested functions are harder to read); name it _maybe_convert_to_integer_index
use data, copy, name as arg names
return UInt64Index(res, copy=_copy, | ||
name=_name) | ||
except (TypeError, ValueError): | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass
return None
if out is not None: | ||
return out | ||
else: | ||
return Index(subarr, copy=copy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need the else here
@@ -919,6 +919,10 @@ def test_constructor(self): | |||
res = Index([1, 2**63]) | |||
tm.assert_index_equal(res, idx) | |||
|
|||
idx = Index([-1, 2**63], dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it reads easier to hard code the expected here e.g. use UInt64Index or whatever as the constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add test when dtype is specified (and it fails)
e.g. uint64 with neg and also with nan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I specified dtype=object
. There is no other sensible constructor I can use here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With regards to the specifying uint64
with negative or NaN
, the current behavior is this:
>>> Index([-1], dtype='uint64')
UInt64Index([18446744073709551615], dtype='uint64')
>>>
>>> Index([np.nan], dtype='uint64')
Float64Index([nan], dtype='float64')
uint64
does not raise on either case (which I imagine is where we are / should be going is because) we return np.array([-1],..., dtype='uint64')
, and numpy
is too kind that it casts -1
to uint64
. For np.nan
, when it detects floating data, it tries converting to an integer index before giving up and returning Float64Index
.
I should point out though that int64
has similar behavior for NaN
(same __new__
):
>>> Index([np.nan], dtype='int64')
Float64Index([nan], dtype='float64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those examples should both raise an error. @jreback : thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [1]: Series([np.nan],dtype='int64')
ValueError: cannot convert float NaN to integer
In [2]: Index([np.nan],dtype='int64')
Out[2]: Float64Index([nan], dtype='float64')
yeah this seems different that what we do in Series. I agree I think this should raise, IOW, IF you specify a dtype and its not compatible it should raise, rather than fall back. (of course if you don't specify a dtype it will just figure it out).
Note that we should raise if we have a compat type though, e.g. dtype=int32
-> int64
is fine (same with float32
-> float64
).
I think we may be a bit relaxed about conversions when int is specified though. See what breaks if you change this to be more strict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, Series
is very lenient against negative numbers in uint64
too:
>>> Series([-1], dtype='uint64')
0 18446744073709551615
dtype: uint64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : Raising on NaN
causes test breakages. Two of them are in test_where
for Index
because the construction of the test involves constructing an Index
with integer dtype but we are passing in a NaN
as one of the elements. The other one is a little more concerning in test_analytics.test_count
for Series
:
>>> mi = MultiIndex.from_arrays([list('aabbcc'), [1, 2, 2, nan, 1, 2]])
>>> ts = Series(np.arange(len(mi)), index=mi)
>>> ts.count(level=1)
...
ValueError: cannot convert float NaN to integer
cc55688
to
1fb256b
Compare
Current coverage is 85.68% (diff: 100%)@@ master #15162 diff @@
==========================================
Files 145 145
Lines 51419 51427 +8
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 44057 44066 +9
+ Misses 7362 7361 -1
Partials 0 0
|
if out is not None: | ||
return out | ||
|
||
return Index(subarr, copy=copy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a change here? IOW, above we will then try to convert to a Float64Index
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there is no change here. Just a refactoring.
@@ -352,6 +339,42 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |||
""" | |||
|
|||
@classmethod | |||
def _convert_to_int_index(cls, data, copy, name): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any case where the dtype is specified when you are calling this? (IOW should there be)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there should not be. This was a refactoring because I had duplicate logic.
@@ -287,6 +287,23 @@ def test_complex_sorting(self): | |||
|
|||
self.assertRaises(TypeError, algos.factorize, x17[::-1], sort=True) | |||
|
|||
def test_uint64_factorize(self): | |||
data = np.array([2**63, 1, 2**63], dtype=np.uint64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can add an example with a nan & separately with a -1 in the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do once we can figure what to do about the situation with passing in NaN
and specifying an integer index.
this looks pretty good. prob some cases to care about after though. ping when updated tests as above. |
thanks @gfyoung ! |
1) Patches handling of `uint64` in `value_counts` 2) Adds tests for handling of `uint64` in `factorize` xref pandas-dev#14934 Author: gfyoung <gfyoung17@gmail.com> Closes pandas-dev#15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits: 1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize
uint64
invalue_counts
uint64
infactorize
xref #14934