COMPAT: clarify Index integer conversions when dtype is specified in construction #15187

Closed
jreback opened this issue Jan 22, 2017 · 29 comments

Labels: Compat (pandas objects compatibility with Numpy or Python functions), Dtype Conversions (unexpected or buggy dtype conversions)
Milestone: 0.20.0

Comments

@jreback
Contributor

jreback commented Jan 22, 2017

xref #15162

So [8] and [9] are our current model; IOW, we make an effort to convert to the specified dtype, but will coerce to an available type if the data is not valid for that dtype.

Ideally we would also be consistent w.r.t. #15145, IOW Series construction with a specified dtype (note that we upcast to the available Index types but don't do this for Series, e.g. [18]).

So we should be consistent on this.

In [8]: Index([np.nan],dtype='int64')
Out[8]: Float64Index([nan], dtype='float64')

In [9]: Index([np.nan],dtype='uint64')
Out[9]: Float64Index([nan], dtype='float64')

In [10]: Index([np.iinfo(np.int64).max-1],dtype='int64')
Out[10]: Int64Index([9223372036854775806], dtype='int64')

In [11]: Index([np.iinfo(np.int64).max-1],dtype='uint64')
Out[11]: UInt64Index([9223372036854775806], dtype='uint64')

# I guess this should convert to Float64Index?
In [12]: Index([np.iinfo(np.uint64).max-1],dtype='int64')
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-12-f2ec7d3c38a4> in <module>()
----> 1 Index([np.iinfo(np.uint64).max-1],dtype='int64')

/Users/jreback/pandas/pandas/indexes/base.py in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    318             # other iterable of some kind
    319             subarr = _asarray_tuplesafe(data, dtype=object)
--> 320             return Index(subarr, dtype=dtype, copy=copy, name=name, **kwargs)
    321 
    322     """

/Users/jreback/pandas/pandas/indexes/base.py in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    199                         inferred = lib.infer_dtype(data)
    200                         if inferred == 'integer':
--> 201                             data = np.array(data, copy=copy, dtype=dtype)
    202                         elif inferred in ['floating', 'mixed-integer-float']:
    203 

OverflowError: Python int too large to convert to C long

In [13]: Index([np.iinfo(np.uint64).max-1],dtype='uint64')
Out[13]: UInt64Index([18446744073709551614], dtype='uint64')

In [14]: Index([-1], dtype='int64')
Out[14]: Int64Index([-1], dtype='int64')

# this looks wrong
In [15]: Index([-1], dtype='uint64')
Out[15]: UInt64Index([18446744073709551615], dtype='uint64')

# we already do this type of same-dtype upcasting (this is correct / a good thing)
In [18]: Index([np.iinfo(np.int32).max+1], dtype='int64')
Out[18]: Int64Index([2147483648], dtype='int64')
@jreback added the Compat, Difficulty Intermediate, and Dtype Conversions labels on Jan 22, 2017
@jreback added this to the 0.20.0 milestone on Jan 22, 2017
@jreback
Contributor Author

jreback commented Jan 22, 2017

cc @gfyoung

@gfyoung
Member

gfyoung commented Jan 30, 2017

For reference, here are the same inputs with Series:

>>> Series([np.nan],dtype='int64')  # NO MATCH
...
ValueError: cannot convert float NaN to integer
>>>
>>> Series([np.nan],dtype='uint64')  # NO MATCH
...
ValueError: cannot convert float NaN to integer
>>>
>>> Series([np.iinfo(np.int64).max-1],dtype='int64')  # MATCH
0    9223372036854775806
dtype: int64
>>>
>>> Series([np.iinfo(np.int64).max-1],dtype='uint64')  # MATCH
0    9223372036854775806
dtype: uint64
>>>
>>> Series([np.iinfo(np.uint64).max-1],dtype='int64')  # MATCH
...
OverflowError: Python int too large to convert to C long
>>>
>>> Series([-1], dtype='int64')  # MATCH
0   -1
dtype: int64
>>>
>>> Series([-1], dtype='uint64')  # MATCH
0    18446744073709551615
dtype: uint64
>>>
>>> Series([np.iinfo(np.int32).max+1], dtype='int64')  # MATCH
0    2147483648
dtype: int64

If we are going to resolve this, the behavior should be consistent across the board. Also, I'm not entirely sure now if I want to change the behavior when we pass in negative numbers and specify uint64 (or any uint, for that matter). On the one hand, the value is destroyed, but on the other hand, it is the correct value modulo 2**64.
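
(For concreteness, the "wrap-around" being discussed is ordinary reduction modulo 2**64; a minimal plain-Python illustration, not pandas code:

>>> (-1) % 2**64
18446744073709551615
>>> (-1) % 2**64 == np.iinfo(np.uint64).max
True

which is exactly the value shown in [15] above.)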

@jorisvandenbossche
Member

On the NaN case: I would be in favor of raising an error when passing np.nan to integer dtype (so for pd.Index([np.nan], dtype='int64')), so it is consistent with Series and with numpy as well.
In general, I think if you pass a specific dtype and the data cannot be converted to that dtype, we should raise.

@gfyoung
Member

gfyoung commented Jan 30, 2017

@jorisvandenbossche: What do you mean by "converted"? Casting modulo the maximum value of the integer data type could be argued to be a conversion. Or are you saying that we just go with numpy conventions?

@jorisvandenbossche
Member

Yes, I know, the exact interpretation of that is the debatable part.
For NaNs I think this is quite clear; for integers outside their 'natural' range, not so. Personally I would find it more informative if e.g. pd.Series([-1], dtype='uint64') would raise (as this feels like a user error), but I can also see the argument for 'converting' it and not raising (which would indeed be consistent with numpy).

@gfyoung
Member

gfyoung commented Jan 30, 2017

@jorisvandenbossche : Yes, I agree with you regarding NaN. That is also consistent with numpy, hence my last question in my previous comment above.

@jreback
Contributor Author

jreback commented Jan 30, 2017

I think the general philosophy is to raise when presented with a dtype that is incompatible with the input; essentially this is going to do an .astype. So I'm in favor of having Index agree with Series (for np.nan with an int/uint dtype specified).

pd.Series([-1], dtype='uint64') should also raise (with an informative message).

>>> Series([np.iinfo(np.uint64).max-1],dtype='int64')  # MATCH
...
OverflowError: Python int too large to convert to C long

I also find this error message not very informative (we are passing through the numpy message). OverflowError is fine, but it's not useful to a user. So I would be in favor of cleaning that up for both Index and Series.
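
One possible shape for that cleanup, as a sketch only with a hypothetical helper name (this is not existing pandas code):

import numpy as np

def _to_int_array(values, dtype):
    # Hypothetical helper: convert ``values`` to an integer ndarray of the
    # requested dtype, re-raising numpy's terse OverflowError with a message
    # that names the offending dtype.
    try:
        return np.array(values, dtype=dtype)
    except OverflowError:
        raise OverflowError("values cannot be represented losslessly "
                            "with dtype %r" % dtype)

Both Index and Series construction could funnel through something like this so the message reads the same in both places.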

@gfyoung
Member

gfyoung commented Jan 31, 2017

Yes, we can agree on raising when NaN is present and an integer dtype is specified. Regarding cleaning up the error message for int64 overflow, that is perfectly reasonable.

Again, I'm not too sure right now about raising on negative integers specified with an unsigned integer dtype. Maybe I'm just being lazy, but consistency with numpy might be preferable. 😂

@jorisvandenbossche
Member

>>> Series([np.iinfo(np.uint64).max-1],dtype='int64')  # MATCH
...
OverflowError: Python int too large to convert to C long

It's true that the above raises, but this is not very consistent:

In [174]: pd.Series(np.array([np.iinfo(np.int64).max+1]), dtype='int64')   # array will already have a dtype, uint64
Out[174]: 
0   -9223372036854775808
dtype: int64

In [169]: pd.Series([np.iinfo(np.int32).max+1], dtype='int32')
Out[169]: 
0   -2147483648
dtype: int32

At least, we could make Series([..], dtype='..') consistent with Series([..]).astype('..'), and that means raising for NaN cast to integer (indeed, we can agree on that one)
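
A quick illustration of what that consistency looks like (illustrative only; exact error messages vary by version):

>>> pd.Series([np.nan]).astype('int64')    # raises ValueError
>>> pd.Series([np.nan], dtype='int64')     # raises ValueError (already consistent)
>>> pd.Index([np.nan], dtype='int64')      # the odd one out: silently returns Float64Index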

@gfyoung
Member

gfyoung commented Jan 31, 2017

Maybe we can start with a PR to raise with NaN and integer dtype specified? It seems we haven't fully reached consensus on the other cases.

@jreback
Contributor Author

jreback commented Jan 31, 2017

@gfyoung sure, partially closing this would be great.

@jreback
Contributor Author

jreback commented Jan 31, 2017

Maybe I'm just being lazy, but consistency with numpy might be preferable. 😂

Sure, when numpy makes sense. But a wrap-around with negative numbers for uint is quite surprising. These should be treated exactly like NaN with uint (when a dtype is specified): raise an error.

@gfyoung
Member

gfyoung commented Jan 31, 2017

@jreback: That wouldn't be the only "wrap-around" case you would have to consider, BTW. Look at all of the examples @jorisvandenbossche provided above.

Essentially, casting to an ndarray would be forbidden without explicit checks for overflow and underflow all over the place. Perhaps "wrapping" the np.array call would be the only way to go if we raise on such instances and enforce it across the code base.

@jreback
Contributor Author

jreback commented Jan 31, 2017

@gfyoung I don't think this is very complicated. This only applies when the dtype is actually specified. You check for negative values, then cast to uint64 (which catches the NaNs). I don't think this is a big deal.

This is the point of having a UInt64 block: it can do these validations and will catch anything else, e.g. trying to fillna with an invalid value (which actually works, but you cast it first), and things like setitem.

Everything is already divorced from numpy and wrapped in the blocks, which have a convenient API for values (and casting). We do exactly these types of things, for example, when trying to set NaN on an integer block.

The only part of this which is not wrapped up is some of the construction validation routines (which happen well before block creation). We do lots of inference to figure out what the user is passing; see https://github.com/pandas-dev/pandas/blob/master/pandas/core/series.py#L2843
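
A minimal sketch of that validation (hypothetical helper, not the actual constructor/block code):

import numpy as np

def _validate_uint64(values):
    # Hypothetical sketch: raise instead of silently wrapping when a value
    # cannot be represented as uint64. NaN is caught by the ``v != v`` test
    # (NaN is never equal to itself); negatives by the ``v < 0`` test.
    for v in values:
        if v != v:
            raise ValueError("cannot convert float NaN to unsigned integer")
        if v < 0:
            raise ValueError("cannot convert negative value %r to uint64" % v)
    return np.array(values, dtype='uint64')

# _validate_uint64([1, 2, 3])  -> array([1, 2, 3], dtype=uint64)
# _validate_uint64([-1])       -> ValueError instead of wrapping to 2**64 - 1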

@gfyoung
Member

gfyoung commented Feb 6, 2017

diff --git a/pandas/indexes/base.py b/pandas/indexes/base.py
index dcd565e..4cccd96 100644
--- a/pandas/indexes/base.py
+++ b/pandas/indexes/base.py
@@ -200,6 +200,9 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
                         if inferred == 'integer':
                             data = np.array(data, copy=copy, dtype=dtype)
                         elif inferred in ['floating', 'mixed-integer-float']:
+                            if isnull(data).any():
+                                raise ValueError('cannot convert float '
+                                                 'NaN to integer')
 
                             # If we are actually all equal to integers,
                             # then coerce to integer.
@@ -227,8 +230,10 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
                     else:
                         data = np.array(data, dtype=dtype, copy=copy)
 
-                except (TypeError, ValueError):
-                    pass
+                except (TypeError, ValueError) as e:
+                    msg = str(e)
+                    if 'cannot convert float' in msg:
+                        raise
 
             # maybe coerce to a sub-class
             from pandas.tseries.period import (PeriodIndex,

Here's a possible way to patch the case of NaN with an integer dtype specified. Unfortunately, that causes tests to break (with the error I added) as follows:

  1. indexes/common.py for Int64Index and RangeIndex
  2. test_count in test_analytics.py

All of these tests incorporate NaN somehow into the test setup. The first failures are rather trivial to patch, but I'm not so sure about the second. Thoughts?
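
For reference, the message added in the diff mirrors what plain Python raises for the same conversion, so the constructor stays consistent with the builtin behaviour:

>>> int(float('nan'))
...
ValueError: cannot convert float NaN to integer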

@jreback
Contributor Author

jreback commented Feb 6, 2017

I think you need to do the 2nd part first. IOW, if the floats == integers when cast it is ok, but then we need to raise the ValueError (rather than converting to a float index)

+                            if isnull(data).any():
+                                raise ValueError('cannot convert float '
+                                                 'NaN to integer')
 
                             # If we are actually all equal to integers,
                             # then coerce to integer.
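
That is, something like this (illustrative, based on the coercion path quoted above; not output from an actual session):

>>> Index([1.0, 2.0], dtype='int64')      # floats exactly equal to integers: still coerce
Int64Index([1, 2], dtype='int64')
>>> Index([np.nan, 2.0], dtype='int64')   # should raise, not fall back to Float64Index
...
ValueError: cannot convert float NaN to integer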

can you show the test that is problematic? IOW an example of it.

@gfyoung
Member

gfyoung commented Feb 6, 2017

I'm not sure I 100% followed what you said there. Also, all of the tests I mentioned fail with my ValueError being raised.

@jreback
Contributor Author

jreback commented Feb 6, 2017

I said that your change is actually not necessary; rather, see https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L260

I think all you need to do is remove the try/except and let the ValueError raise (rather than coercing to an Index of object dtype)

@gfyoung
Member

gfyoung commented Feb 7, 2017

No, that's definitely the wrong place. Note that dtype is None in that if block, so the user didn't specify that they wanted an integer dtype.

@gfyoung
Member

gfyoung commented Feb 10, 2017

This is getting more difficult to change, as I realized that .where for indexes just breaks, period, with this change of mine:

>>> from pandas import Index
>>> Index([1, 2, 3]).where([False, True, True])
...
ValueError: cannot convert float NaN to integer
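
The reason is that Index.where fills the unselected positions with NaN before reconstructing with the original dtype, so the intermediate array is float and now trips the new check. A rough sketch of the mechanism (not the actual implementation):

>>> import numpy as np
>>> from pandas import Index
>>> idx = Index([1, 2, 3])
>>> filled = np.where([False, True, True], idx.values, np.nan)   # float array: [nan, 2.0, 3.0]
>>> Index(filled, dtype=idx.dtype)   # with the change applied, this raises
...
ValueError: cannot convert float NaN to integer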

gfyoung added commits to forking-repos/pandas that referenced this issue on Mar 8-9, 2017
jreback pushed a commit that referenced this issue Mar 10, 2017
xref #15187.

Author: gfyoung <gfyoung17@gmail.com>

Closes #15616 from gfyoung/nan-int-index and squashes the following commits:

195b830 [gfyoung] BUG: Error when specifying int index containing NaN
@jreback
Contributor Author

jreback commented Mar 10, 2017

thanks!

we can make a checklist in the issue if that helps? (please put it up and I'll copy it to the top if you can)

@gfyoung
Member

gfyoung commented Mar 10, 2017

@jreback : Can you edit the original issue to add this PR to the checklist?

@jreback
Contributor Author

jreback commented Mar 10, 2017

yes, can you enumerate the open issues (as I see them) and I will make checkboxes (even better, post them below and I'll update the top)

@gfyoung
Member

gfyoung commented Mar 10, 2017

@jreback: Fair enough. Which issues would we pull in besides my PR, though? The only one I can think of is #15145, as you mentioned already.

@jreback
Contributor Author

jreback commented Mar 10, 2017

so I think that fixed [10] and [11], but [12] and [15] remain

@gfyoung
Member

gfyoung commented Mar 10, 2017

@jreback : So the behavior of Index matches that of Series now (I think you meant [8] and [9] are patched). Thus, any questions regarding conversions are now beyond the scope of Index.

@jreback
Contributor Author

jreback commented Mar 10, 2017

@gfyoung yes you are correct. Ok then. I will close this one. Let's however open a new one for Index/Series fixes for [12], [15] (with checkboxes).

@jreback closed this as completed on Mar 10, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
@jreback
Contributor Author

jreback commented Mar 29, 2017

@gfyoung can you open an issue as above:

@gfyoung yes you are correct. Ok then. I will close this one. Let's however open a new one for Index/Series fixes for [12], [15] (with checkboxes).

@gfyoung
Member

gfyoung commented Mar 29, 2017

Yep, done: #15832
