Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG, TST: Patch handling of uint64 in algorithms #15162

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jan 19, 2017

  1. Patches handling of uint64 in value_counts
  2. Adds tests for handling of uint64 in factorize

xref #14934

@@ -144,6 +144,41 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
def __new__(cls, data=None, dtype=None, copy=False, name=None,
fastpath=False, tupleize_cols=True, **kwargs):

def convert_to_int_index(_data, _copy, _name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a class method instead (nested functions are harder to read); name it _maybe_convert_to_integer_index

use data, copy, name as arg names

return UInt64Index(res, copy=_copy,
name=_name)
except (TypeError, ValueError):
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass

return None

if out is not None:
return out
else:
return Index(subarr, copy=copy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need the else here

@@ -919,6 +919,10 @@ def test_constructor(self):
res = Index([1, 2**63])
tm.assert_index_equal(res, idx)

idx = Index([-1, 2**63], dtype=object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it reads easier to hard code the expected here e.g. use UInt64Index or whatever as the constructor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add test when dtype is specified (and it fails)
e.g. uint64 with neg and also with nan

Copy link
Member Author

@gfyoung gfyoung Jan 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I specified dtype=object. There is no other sensible constructor I can use here.

Copy link
Member Author

@gfyoung gfyoung Jan 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With regards to the specifying uint64 with negative or NaN, the current behavior is this:

>>> Index([-1], dtype='uint64')
UInt64Index([18446744073709551615], dtype='uint64')
>>>
>>> Index([np.nan], dtype='uint64')
Float64Index([nan], dtype='float64')

uint64 does not raise on either case (which I imagine is where we are / should be going is because) we return np.array([-1],..., dtype='uint64'), and numpy is too kind that it casts -1 to uint64. For np.nan, when it detects floating data, it tries converting to an integer index before giving up and returning Float64Index.

I should point out though that int64 has similar behavior for NaN (same __new__):

>>> Index([np.nan], dtype='int64')
Float64Index([nan], dtype='float64')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those examples should both raise an error. @jreback : thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [1]: Series([np.nan],dtype='int64')
ValueError: cannot convert float NaN to integer

In [2]: Index([np.nan],dtype='int64')
Out[2]: Float64Index([nan], dtype='float64')

yeah this seems different that what we do in Series. I agree I think this should raise, IOW, IF you specify a dtype and its not compatible it should raise, rather than fall back. (of course if you don't specify a dtype it will just figure it out).

Note that we should raise if we have a compat type though, e.g. dtype=int32 -> int64 is fine (same with float32 -> float64).

I think we may be a bit relaxed about conversions when int is specified though. See what breaks if you change this to be more strict.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, Series is very lenient against negative numbers in uint64 too:

>>> Series([-1], dtype='uint64')
0    18446744073709551615
dtype: uint64

Copy link
Member Author

@gfyoung gfyoung Jan 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : Raising on NaN causes test breakages. Two of them are in test_where for Index because the construction of the test involves constructing an Indexwith integer dtype but we are passing in a NaN as one of the elements. The other one is a little more concerning in test_analytics.test_count for Series:

>>> mi = MultiIndex.from_arrays([list('aabbcc'), [1, 2, 2, nan, 1, 2]])
>>> ts = Series(np.arange(len(mi)), index=mi)
>>> ts.count(level=1)
...
ValueError: cannot convert float NaN to integer

@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions labels Jan 19, 2017
@gfyoung gfyoung force-pushed the core-algorithms-uint64-three branch from cc55688 to 1fb256b Compare January 21, 2017 10:13
@codecov-io
Copy link

codecov-io commented Jan 21, 2017

Current coverage is 85.68% (diff: 100%)

Merging #15162 into master will increase coverage by <.01%

@@             master     #15162   diff @@
==========================================
  Files           145        145          
  Lines         51419      51427     +8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          44057      44066     +9   
+ Misses         7362       7361     -1   
  Partials          0          0          

Powered by Codecov. Last update 97fd744...1fb256b

if out is not None:
return out

return Index(subarr, copy=copy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a change here? IOW, above we will then try to convert to a Float64Index.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there is no change here. Just a refactoring.

@@ -352,6 +339,42 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
"""

@classmethod
def _convert_to_int_index(cls, data, copy, name):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any case where the dtype is specified when you are calling this? (IOW should there be)

Copy link
Member Author

@gfyoung gfyoung Jan 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there should not be. This was a refactoring because I had duplicate logic.

@@ -287,6 +287,23 @@ def test_complex_sorting(self):

self.assertRaises(TypeError, algos.factorize, x17[::-1], sort=True)

def test_uint64_factorize(self):
data = np.array([2**63, 1, 2**63], dtype=np.uint64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can add an example with a nan & separately with a -1 in the data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do once we can figure what to do about the situation with passing in NaN and specifying an integer index.

@jreback
Copy link
Contributor

jreback commented Jan 21, 2017

this looks pretty good. prob some cases to care about after though. ping when updated tests as above.

@jreback
Copy link
Contributor

jreback commented Jan 22, 2017

thanks @gfyoung !

@gfyoung gfyoung deleted the core-algorithms-uint64-three branch January 22, 2017 19:42
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
1) Patches handling of `uint64` in `value_counts`
2) Adds tests for handling of `uint64` in `factorize`

xref pandas-dev#14934

Author: gfyoung <gfyoung17@gmail.com>

Closes pandas-dev#15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits:

1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants