COMPAT: clarify Index integer conversions when dtype is specified in construction #15187

Closed
jreback opened this issue Jan 22, 2017 · 29 comments

Labels: Compat (pandas objects compatibility with Numpy or Python functions), Dtype Conversions (unexpected or buggy dtype conversions)
Milestone: 0.20.0

Comments

@jreback
Contributor

jreback commented Jan 22, 2017

xref #15162

So [8] and [9] are our current model; IOW, we make an effort to convert to the specified dtype, but will coerce to an available type if the data is not valid for that dtype.

Ideally we would also be consistent w.r.t. #15145, IOW Series construction with a specified dtype (note that we upcast to the available Index types but don't do this for Series, e.g. [18]).

So we should be consistent on this.

In [8]: Index([np.nan],dtype='int64')
Out[8]: Float64Index([nan], dtype='float64')

In [9]: Index([np.nan],dtype='uint64')
Out[9]: Float64Index([nan], dtype='float64')

In [10]: Index([np.iinfo(np.int64).max-1],dtype='int64')
Out[10]: Int64Index([9223372036854775806], dtype='int64')

In [11]: Index([np.iinfo(np.int64).max-1],dtype='uint64')
Out[11]: UInt64Index([9223372036854775806], dtype='uint64')

# I guess this should convert to Float64Index?
In [12]: Index([np.iinfo(np.uint64).max-1],dtype='int64')
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-12-f2ec7d3c38a4> in <module>()
----> 1 Index([np.iinfo(np.uint64).max-1],dtype='int64')

/Users/jreback/pandas/pandas/indexes/base.py in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    318             # other iterable of some kind
    319             subarr = _asarray_tuplesafe(data, dtype=object)
--> 320             return Index(subarr, dtype=dtype, copy=copy, name=name, **kwargs)
    321 
    322     """

/Users/jreback/pandas/pandas/indexes/base.py in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    199                         inferred = lib.infer_dtype(data)
    200                         if inferred == 'integer':
--> 201                             data = np.array(data, copy=copy, dtype=dtype)
    202                         elif inferred in ['floating', 'mixed-integer-float']:
    203 

OverflowError: Python int too large to convert to C long

In [13]: Index([np.iinfo(np.uint64).max-1],dtype='uint64')
Out[13]: UInt64Index([18446744073709551614], dtype='uint64')

In [14]: Index([-1], dtype='int64')
Out[14]: Int64Index([-1], dtype='int64')

# this looks wrong
In [15]: Index([-1], dtype='uint64')
Out[15]: UInt64Index([18446744073709551615], dtype='uint64')

# we already do this type of same-dtype upcasting (this is correct / a good thing)
In [18]: Index([np.iinfo(np.int32).max+1], dtype='int64')
Out[18]: Int64Index([2147483648], dtype='int64')
@jreback added the Compat, Difficulty Intermediate, and Dtype Conversions labels on Jan 22, 2017
@jreback added this to the 0.20.0 milestone on Jan 22, 2017
@jreback
Contributor Author

jreback commented Jan 22, 2017

cc @gfyoung

@gfyoung
Member

gfyoung commented Jan 30, 2017

For reference, here are the same inputs with Series:

>>> Series([np.nan],dtype='int64')  # NO MATCH
...
ValueError: cannot convert float NaN to integer
>>>
>>> Series([np.nan],dtype='uint64')  # NO MATCH
...
ValueError: cannot convert float NaN to integer
>>>
>>> Series([np.iinfo(np.int64).max-1],dtype='int64')  # MATCH
0    9223372036854775806
dtype: int64
>>>
>>> Series([np.iinfo(np.int64).max-1],dtype='uint64')  # MATCH
0    9223372036854775806
dtype: uint64
>>>
>>> Series([np.iinfo(np.uint64).max-1],dtype='int64')  # MATCH
...
OverflowError: Python int too large to convert to C long
>>>
>>> Series([-1], dtype='int64')  # MATCH
0   -1
dtype: int64
>>>
>>> Series([-1], dtype='uint64')  # MATCH
0    18446744073709551615
dtype: uint64
>>>
>>> Series([np.iinfo(np.int32).max+1], dtype='int64')  # MATCH
0    2147483648
dtype: int64

If we are going to resolve this, the behavior should be consistent across the board. Also, I'm not entirely sure now if I want to change the behavior when we pass in negative numbers and specify uint64 (or any uint, for that matter). On the one hand, the value is destroyed, but on the other hand, it is the correct value modulo 2**64.
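
(For concreteness, the "wrap-around" being discussed is ordinary reduction modulo 2**64; a minimal plain-Python illustration, not pandas code:

>>> (-1) % 2**64
18446744073709551615
>>> (-1) % 2**64 == np.iinfo(np.uint64).max
True

which is exactly the value shown in [15] above.)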

@jorisvandenbossche
Member

On the NaN case: I would be in favor of raising an error when passing np.nan to integer dtype (so for pd.Index([np.nan], dtype='int64')), so it is consistent with Series and with numpy as well.
In general, I think if you pass a specific dtype and the data cannot be converted to that dtype, we should raise.

@gfyoung
Member

gfyoung commented Jan 30, 2017

@jorisvandenbossche: What do you mean by "converted"? Casting modulo the maximum value of the integer data type could be argued to be a conversion. Or are you saying that we just go with numpy conventions?

@jorisvandenbossche
Member

Yes, I know, the exact interpretation of that is the debatable part.
For NaNs I think this is quite clear; for integers outside their 'natural' range, not so. Personally I would find it more informative if e.g. pd.Series([-1], dtype='uint64') would raise (as this feels like a user error), but I can also see the argument for 'converting' it and not raising (which would indeed be consistent with numpy).

@gfyoung
Member

gfyoung commented Jan 30, 2017

@jorisvandenbossche : Yes, I agree with you regarding NaN. That is also consistent with numpy, hence my last question in my previous comment above.

@jreback
Contributor Author

jreback commented Jan 30, 2017

I think the general philosophy is to raise when presented with a dtype that is incompatible with the input; essentially this is going to do an .astype. So I'm in favor of having Index agree with Series (for np.nan with an int/uint dtype specified).

pd.Series([-1], dtype='uint64') should also raise (with an informative message).

>>> Series([np.iinfo(np.uint64).max-1],dtype='int64')  # MATCH
...
OverflowError: Python int too large to convert to C long

I also find this error message not very informative (we are passing through the numpy message). OverflowError is fine, but it's not useful to a user. So I would be in favor of cleaning that up for both Index and Series.
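
One possible shape for that cleanup, as a sketch only with a hypothetical helper name (this is not existing pandas code):

import numpy as np

def _to_int_array(values, dtype):
    # Hypothetical helper: convert ``values`` to an integer ndarray of the
    # requested dtype, re-raising numpy's terse OverflowError with a message
    # that names the offending dtype.
    try:
        return np.array(values, dtype=dtype)
    except OverflowError:
        raise OverflowError("values cannot be represented losslessly "
                            "with dtype %r" % dtype)

Both Index and Series construction could funnel through something like this so the message reads the same in both places.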

@gfyoung
Member

gfyoung commented Jan 31, 2017

Yes, we can agree on raising when NaN is present and an integer dtype is specified. Regarding cleaning up the error message for int64 overflow, that is perfectly reasonable.

Again, I'm not too sure right now about raising on negative integers specified with an unsigned integer dtype. Maybe I'm just being lazy, but consistency with numpy might be preferable. 😂

@jorisvandenbossche
Member

>>> Series([np.iinfo(np.uint64).max-1],dtype='int64')  # MATCH
...
OverflowError: Python int too large to convert to C long

It's true that the above raises, but this is not very consistent:

In [174]: pd.Series(np.array([np.iinfo(np.int64).max+1]), dtype='int64')   # array will already have a dtype, uint64
Out[174]: 
0   -9223372036854775808
dtype: int64

In [169]: pd.Series([np.iinfo(np.int32).max+1], dtype='int32')
Out[169]: 
0   -2147483648
dtype: int32

At least, we could make Series([..], dtype='..') consistent with Series([..]).astype('..'), and that means raising for NaN cast to integer (indeed, we can agree on that one)
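
A quick illustration of what that consistency looks like (illustrative only; exact error messages vary by version):

>>> pd.Series([np.nan]).astype('int64')    # raises ValueError
>>> pd.Series([np.nan], dtype='int64')     # raises ValueError (already consistent)
>>> pd.Index([np.nan], dtype='int64')      # the odd one out: silently returns Float64Index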

@gfyoung
Member

gfyoung commented Jan 31, 2017

Maybe we can start with a PR to raise with NaN and integer dtype specified? It seems we haven't fully reached consensus on the other cases.

@jreback
Contributor Author

jreback commented Jan 31, 2017

@gfyoung sure, partially closing this would be great.

@jreback
Contributor Author

jreback commented Jan 31, 2017

Maybe I'm just being lazy, but consistency with numpy might be preferable. 😂

Sure, when numpy makes sense. But a wrap-around with negative numbers for uint is quite surprising. These should be treated exactly like NaN with uint (when a dtype is specified): raise an error.

@gfyoung
Member

gfyoung commented Jan 31, 2017

@jreback: That wouldn't be the only "wrap-around" case you would have to consider, BTW. Look at all of the examples @jorisvandenbossche provided above.

Essentially, casting to an ndarray would be forbidden without explicit checks for overflow and underflow all over the place. Perhaps "wrapping" the np.array call would be the only way to go if we raise on such instances and enforce it across the code base.

@jreback
Contributor Author

jreback commented Jan 31, 2017

@gfyoung I don't think this is very complicated. This only applies when the dtype is actually specified. You check for negative values, then cast to uint64 (which catches the NaNs). I don't think this is a big deal.

This is the point of having a UInt64 block: it can do these validations and will catch anything else, e.g. trying to fillna with an invalid value (which actually works, but you cast it first), and things like setitem.

Everything is already divorced from numpy and wrapped in the blocks, which have a convenient API for values (and casting). We do exactly these types of things, for example, when trying to set NaN on an integer block.

The only part of this which is not wrapped up is some of the construction validation routines (which happen well before block creation). We do lots of inference to figure out what the user is passing; see https://github.com/pandas-dev/pandas/blob/master/pandas/core/series.py#L2843
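
A minimal sketch of that validation (hypothetical helper, not the actual constructor/block code):

import numpy as np

def _validate_uint64(values):
    # Hypothetical sketch: raise instead of silently wrapping when a value
    # cannot be represented as uint64. NaN is caught by the ``v != v`` test
    # (NaN is never equal to itself); negatives by the ``v < 0`` test.
    for v in values:
        if v != v:
            raise ValueError("cannot convert float NaN to unsigned integer")
        if v < 0:
            raise ValueError("cannot convert negative value %r to uint64" % v)
    return np.array(values, dtype='uint64')

# _validate_uint64([1, 2, 3])  -> array([1, 2, 3], dtype=uint64)
# _validate_uint64([-1])       -> ValueError instead of wrapping to 2**64 - 1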

@gfyoung
Member

gfyoung commented Feb 6, 2017

diff --git a/pandas/indexes/base.py b/pandas/indexes/base.py
index dcd565e..4cccd96 100644
--- a/pandas/indexes/base.py
+++ b/pandas/indexes/base.py
@@ -200,6 +200,9 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
                         if inferred == 'integer':
                             data = np.array(data, copy=copy, dtype=dtype)
                         elif inferred in ['floating', 'mixed-integer-float']:
+                            if isnull(data).any():
+                                raise ValueError('cannot convert float '
+                                                 'NaN to integer')
 
                             # If we are actually all equal to integers,
                             # then coerce to integer.
@@ -227,8 +230,10 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
                     else:
                         data = np.array(data, dtype=dtype, copy=copy)
 
-                except (TypeError, ValueError):
-                    pass
+                except (TypeError, ValueError) as e:
+                    msg = str(e)
+                    if 'cannot convert float' in msg:
+                        raise
 
             # maybe coerce to a sub-class
             from pandas.tseries.period import (PeriodIndex,

Here's a possible way to patch the case of NaN with an integer dtype specified. Unfortunately, that causes tests to break (with the error I added) as follows:

  1. indexes/common.py for Int64Index and RangeIndex
  2. test_count in test_analytics.py

All of these tests incorporate NaN somehow into the test setup. The first failures are rather trivial to patch, but I'm not so sure about the second. Thoughts?
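
For reference, the message added in the diff mirrors what plain Python raises for the same conversion, so the constructor stays consistent with the builtin behaviour:

>>> int(float('nan'))
...
ValueError: cannot convert float NaN to integer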

@jreback
Contributor Author

jreback commented Feb 6, 2017

I think you need to do the 2nd part first. IOW, if the floats == integers when cast it is ok, but then we need to raise the ValueError (rather than converting to a float index)

+                            if isnull(data).any():
+                                raise ValueError('cannot convert float '
+                                                 'NaN to integer')
 
                             # If we are actually all equal to integers,
                             # then coerce to integer.
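
That is, something like this (illustrative, based on the coercion path quoted above; not output from an actual session):

>>> Index([1.0, 2.0], dtype='int64')      # floats exactly equal to integers: still coerce
Int64Index([1, 2], dtype='int64')
>>> Index([np.nan, 2.0], dtype='int64')   # should raise, not fall back to Float64Index
...
ValueError: cannot convert float NaN to integer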

can you show the test that is problematic? IOW an example of it.

@gfyoung
Member

gfyoung commented Feb 6, 2017

I'm not sure I 100% followed what you said there. Also, all of the tests I mentioned fail with my ValueError being raised.

@jreback
Contributor Author

jreback commented Feb 6, 2017

I said that your change is actually not necessary; rather, see https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L260

I think all you need to do is remove the try/except and let the ValueError raise (rather than coercing to an Index of object dtype)

@gfyoung
Member

gfyoung commented Feb 7, 2017

No, that's definitely the wrong place. Note that dtype is None in that if block, so the user didn't specify that they wanted an integer dtype.

@gfyoung
Member

gfyoung commented Feb 10, 2017

This is getting more difficult to change, as I realized that .where for indexes just breaks, period, with this change of mine:

>>> from pandas import Index
>>> Index([1, 2, 3]).where([False, True, True])
...
ValueError: cannot convert float NaN to integer
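
The reason is that Index.where fills the unselected positions with NaN before reconstructing with the original dtype, so the intermediate array is float and now trips the new check. A rough sketch of the mechanism (not the actual implementation):

>>> import numpy as np
>>> from pandas import Index
>>> idx = Index([1, 2, 3])
>>> filled = np.where([False, True, True], idx.values, np.nan)   # float array: [nan, 2.0, 3.0]
>>> Index(filled, dtype=idx.dtype)   # with the change applied, this raises
...
ValueError: cannot convert float NaN to integer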

gfyoung added commits to forking-repos/pandas that referenced this issue on Mar 8-9, 2017
jreback pushed a commit that referenced this issue Mar 10, 2017
xref #15187.

Author: gfyoung <gfyoung17@gmail.com>

Closes #15616 from gfyoung/nan-int-index and squashes the following commits:

195b830 [gfyoung] BUG: Error when specifying int index containing NaN
@jreback
Contributor Author

jreback commented Mar 10, 2017

thanks!

we can make a checklist in the issue if that helps? (please put it up and I'll copy it to the top if you can)

@gfyoung
Member

gfyoung commented Mar 10, 2017

@jreback : Can you edit the original issue to add this PR to the checklist?

@jreback
Contributor Author

jreback commented Mar 10, 2017

yes, can you enumerate the open issues (as I see them) and I will make checkboxes (even better, post them below and I'll update the top)

@gfyoung
Member

gfyoung commented Mar 10, 2017

@jreback: Fair enough. Which issues would we pull in besides my PR, though? The only one I can think of is #15145, as you mentioned already.

@jreback
Contributor Author

jreback commented Mar 10, 2017

so I think that fixed [10] and [11], but [12] and [15] remain

@gfyoung
Member

gfyoung commented Mar 10, 2017

@jreback : So the behavior of Index matches that of Series now (I think you meant [8] and [9] are patched). Thus, any questions regarding conversions are now beyond the scope of Index.

@jreback
Contributor Author

jreback commented Mar 10, 2017

@gfyoung yes you are correct. Ok then. I will close this one. Let's however open a new one for Index/Series fixes for [12], [15] (with checkboxes).

@jreback closed this as completed on Mar 10, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
@jreback
Contributor Author

jreback commented Mar 29, 2017

@gfyoung can you open an issue as above:

@gfyoung yes you are correct. Ok then. I will close this one. Let's however open a new one for Index/Series fixes for [12], [15] (with checkboxes).

@gfyoung
Member

gfyoung commented Mar 29, 2017

Yep, done: #15832
