BUG, TST: Patch handling of uint64 in algorithms #15162

gfyoung · 2017-01-19T10:45:04Z

Patches handling of uint64 in value_counts
Adds tests for handling of uint64 in factorize

jreback · 2017-01-19T12:06:17Z

pandas/indexes/base.py

@@ -144,6 +144,41 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
    def __new__(cls, data=None, dtype=None, copy=False, name=None,
                fastpath=False, tupleize_cols=True, **kwargs):

+        def convert_to_int_index(_data, _copy, _name):


make this a class method instead (nested functions are harder to read); name it _maybe_convert_to_integer_index

use data, copy, name as arg names

jreback · 2017-01-19T12:06:45Z

pandas/indexes/base.py

+                    return UInt64Index(res, copy=_copy,
+                                       name=_name)
+            except (TypeError, ValueError):
+                return None


pass

return None

jreback · 2017-01-19T12:07:42Z

pandas/indexes/base.py

+                    if out is not None:
+                        return out
+                    else:
+                        return Index(subarr, copy=copy,


don't need the else here

jreback · 2017-01-19T12:10:42Z

pandas/tests/indexes/test_numeric.py

@@ -919,6 +919,10 @@ def test_constructor(self):
        res = Index([1, 2**63])
        tm.assert_index_equal(res, idx)

+        idx = Index([-1, 2**63], dtype=object)


i think it reads easier to hard code the expected here e.g. use UInt64Index or whatever as the constructor

add test when dtype is specified (and it fails)
e.g. uint64 with neg and also with nan

I specified dtype=object. There is no other sensible constructor I can use here.

With regards to the specifying uint64 with negative or NaN, the current behavior is this:

>>> Index([-1], dtype='uint64') UInt64Index([18446744073709551615], dtype='uint64') >>> >>> Index([np.nan], dtype='uint64') Float64Index([nan], dtype='float64')

uint64 does not raise on either case (which I imagine is where we are / should be going is because) we return np.array([-1],..., dtype='uint64'), and numpy is too kind that it casts -1 to uint64. For np.nan, when it detects floating data, it tries converting to an integer index before giving up and returning Float64Index.

I should point out though that int64 has similar behavior for NaN (same __new__):

>>> Index([np.nan], dtype='int64') Float64Index([nan], dtype='float64')

Those examples should both raise an error. @jreback : thoughts?

In [1]: Series([np.nan],dtype='int64') ValueError: cannot convert float NaN to integer In [2]: Index([np.nan],dtype='int64') Out[2]: Float64Index([nan], dtype='float64')

yeah this seems different that what we do in Series. I agree I think this should raise, IOW, IF you specify a dtype and its not compatible it should raise, rather than fall back. (of course if you don't specify a dtype it will just figure it out).

Note that we should raise if we have a compat type though, e.g. dtype=int32 -> int64 is fine (same with float32 -> float64).

I think we may be a bit relaxed about conversions when int is specified though. See what breaks if you change this to be more strict.

FYI, Series is very lenient against negative numbers in uint64 too:

>>> Series([-1], dtype='uint64') 0 18446744073709551615 dtype: uint64

@jreback : Raising on NaN causes test breakages. Two of them are in test_where for Index because the construction of the test involves constructing an Indexwith integer dtype but we are passing in a NaN as one of the elements. The other one is a little more concerning in test_analytics.test_count for Series:

>>> mi = MultiIndex.from_arrays([list('aabbcc'), [1, 2, 2, nan, 1, 2]]) >>> ts = Series(np.arange(len(mi)), index=mi) >>> ts.count(level=1) ... ValueError: cannot convert float NaN to integer

codecov-io · 2017-01-21T10:59:04Z

Current coverage is 85.68% (diff: 100%)

Merging #15162 into master will increase coverage by <.01%

@@             master     #15162   diff @@
==========================================
  Files           145        145          
  Lines         51419      51427     +8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          44057      44066     +9   
+ Misses         7362       7361     -1   
  Partials          0          0

Powered by Codecov. Last update 97fd744...1fb256b

jreback · 2017-01-21T15:14:08Z

pandas/indexes/base.py

+                    if out is not None:
+                        return out
+
+                    return Index(subarr, copy=copy,


is this a change here? IOW, above we will then try to convert to a Float64Index.

No, there is no change here. Just a refactoring.

jreback · 2017-01-21T15:15:24Z

pandas/indexes/base.py

@@ -352,6 +339,42 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
    """

    @classmethod
+    def _convert_to_int_index(cls, data, copy, name):
+        """


is there any case where the dtype is specified when you are calling this? (IOW should there be)

No, there should not be. This was a refactoring because I had duplicate logic.

jreback · 2017-01-21T15:17:21Z

pandas/tests/test_algos.py

@@ -287,6 +287,23 @@ def test_complex_sorting(self):

        self.assertRaises(TypeError, algos.factorize, x17[::-1], sort=True)

+    def test_uint64_factorize(self):
+        data = np.array([2**63, 1, 2**63], dtype=np.uint64)


can add an example with a nan & separately with a -1 in the data

Will do once we can figure what to do about the situation with passing in NaN and specifying an integer index.

jreback · 2017-01-21T21:13:04Z

this looks pretty good. prob some cases to care about after though. ping when updated tests as above.

jreback · 2017-01-22T15:49:54Z

thanks @gfyoung !

1) Patches handling of `uint64` in `value_counts` 2) Adds tests for handling of `uint64` in `factorize` xref pandas-dev#14934 Author: gfyoung <gfyoung17@gmail.com> Closes pandas-dev#15162 from gfyoung/core-algorithms-uint64-three and squashes the following commits: 1fb256b [gfyoung] BUG, TST: Patch uint64 behavior in value_counts and factorize

jreback requested changes Jan 19, 2017

View reviewed changes

jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions labels Jan 19, 2017

BUG, TST: Patch uint64 behavior in value_counts and factorize

1fb256b

gfyoung force-pushed the core-algorithms-uint64-three branch from cc55688 to 1fb256b Compare January 21, 2017 10:13

jreback reviewed Jan 21, 2017

View reviewed changes

jreback added this to the 0.20.0 milestone Jan 22, 2017

jreback mentioned this pull request Jan 22, 2017

COMPAT: clarify Index integer conversions when dtype is specified in construction #15187

Closed

jreback closed this in ce2f421 Jan 22, 2017

gfyoung deleted the core-algorithms-uint64-three branch January 22, 2017 19:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG, TST: Patch handling of uint64 in algorithms #15162

BUG, TST: Patch handling of uint64 in algorithms #15162

gfyoung commented Jan 19, 2017

jreback Jan 19, 2017

jreback Jan 19, 2017

jreback Jan 19, 2017

jreback Jan 19, 2017

jreback Jan 19, 2017

gfyoung Jan 21, 2017 •

edited

Loading

gfyoung Jan 21, 2017 •

edited

Loading

gfyoung Jan 21, 2017

jreback Jan 21, 2017

gfyoung Jan 21, 2017

gfyoung Jan 22, 2017 •

edited

Loading

codecov-io commented Jan 21, 2017 •

edited

Loading

jreback Jan 21, 2017

gfyoung Jan 21, 2017

jreback Jan 21, 2017

gfyoung Jan 21, 2017 •

edited

Loading

jreback Jan 21, 2017

gfyoung Jan 22, 2017

jreback commented Jan 21, 2017

jreback commented Jan 22, 2017

BUG, TST: Patch handling of uint64 in algorithms #15162

BUG, TST: Patch handling of uint64 in algorithms #15162

Conversation

gfyoung commented Jan 19, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Jan 21, 2017 • edited Loading

Choose a reason for hiding this comment

gfyoung Jan 21, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Jan 22, 2017 • edited Loading

Choose a reason for hiding this comment

codecov-io commented Jan 21, 2017 • edited Loading

Current coverage is 85.68% (diff: 100%)

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung Jan 21, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 21, 2017

jreback commented Jan 22, 2017

gfyoung Jan 21, 2017 •

edited

Loading

gfyoung Jan 21, 2017 •

edited

Loading

gfyoung Jan 22, 2017 •

edited

Loading

codecov-io commented Jan 21, 2017 •

edited

Loading

gfyoung Jan 21, 2017 •

edited

Loading