0.17 regression: convert_objects coercion with multiple dtypes #11116

Closed
AmbroseKeith opened this issue Sep 15, 2015 · 6 comments · Fixed by #11173
Labels
Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version

@AmbroseKeith

I upgraded from 0.16.2 to 0.17/master and it broke working code.

0.16.2:

In [2]: import pandas as pd
   ...: from StringIO import StringIO
   ...: x="""foo,bar
   ...: 2015-09-14,True
   ...: 2015-09-15,
   ...: """
   ...: df=pd.read_csv(StringIO(x),sep=',').convert_objects('coerce')
   ...: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
foo    2 non-null datetime64[ns]
bar    1 non-null object
dtypes: datetime64[ns](1), object(1)
memory usage: 48.0+ bytes

0.17/master:

import pandas as pd
from StringIO import StringIO
x="""foo,bar
2015-09-14,True
2015-09-15,
"""
df=pd.read_csv(StringIO(x),sep=',').convert_objects('coerce')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
foo    2 non-null datetime64[ns]
bar    0 non-null datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 48.0 bytes
/home/ambk/work/pandas/pandas/core/generic.py:2584: FutureWarning: The use of 'coerce' as an input is deprecated. Instead set coerce=True.
  FutureWarning)

booleans get cast to datetimes now?! Usually deprecation means "avoid in new code" and not that your working code will break, otherwise it wouldn't be a deprecation but a breaking change. So that's not good, but ok, let's follow the helpful hint:

In [3]: import pandas as pd
   ...: from StringIO import StringIO
   ...: x="""foo,bar
   ...: 2015-09-14,True
   ...: 2015-09-15,
   ...: """
   ...: df=pd.read_csv(StringIO(x),sep=',').convert_objects(coerce=True)
   ...: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
foo    2 non-null object
bar    1 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes

wut?, booleans become floats and almost-datetimes aren't converted at all anymore?

@TomAugspurger TomAugspurger changed the title 0.17 regression: coercion utterly broken 0.17 regression: convert_objects coercion with multiple dtypes Sep 15, 2015
@TomAugspurger
Contributor

@AmbroseKeith thanks for the report. I changed the title to be more helpful than "utterly broken".

@jreback I think there are two bugs here. The first is the multiple dtypes, where the bools are getting mixed in with the datetime block and coerced to NaN. At least that's my guess.

The last example, .convert_objects(coerce=True), shouldn't do anything, right? Since datetime, numeric and timedelta are all False? Btw @AmbroseKeith, I don't think that does what you want (once we fix the bug). You want .convert_objects(datetime=True, coerce=True).

@jorisvandenbossche jorisvandenbossche added this to the 0.17.0 milestone Sep 15, 2015
@jorisvandenbossche
Member

cc @bashtage

The original change occurred in PR #10265

@jorisvandenbossche jorisvandenbossche added Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version labels Sep 15, 2015
@jreback
Contributor

jreback commented Sep 16, 2015

This looks correct to me. You asked it to take an object column and force convert. Not sure what else you would return here. Bools are nothing special.

In [17]: df.convert_objects(coerce=True)
Out[17]: 
          foo  bar
0  2015-09-14    1
1  2015-09-15  NaN

df.convert_objects('coerce') could be caught, as 'coerce' to datetime is invalid (though IIRC this was accepted to provide back-compat).

In general, expecting forced conversion to always work is simply user error. It is impossible to guess things reliably (and this is why this was changed). If you have dates, use pd.to_datetime. .convert_objects really has a simple purpose, and that is to force-convert mixed values to a known type.

Applying this on a DataFrame is quite dangerous and requires much guessing. Not sure what we could do about this.
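
A minimal sketch of the explicit, per-column approach suggested above. It assumes a current pandas on Python 3 (so `io.StringIO` rather than the `StringIO` module used in the report) and converts only `foo`, leaving `bar` untouched:

```python
import io

import pandas as pd

# Same CSV as in the original report.
csv_text = """foo,bar
2015-09-14,True
2015-09-15,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Convert only the column known to hold dates; values that cannot be
# parsed become NaT instead of raising.
df["foo"] = pd.to_datetime(df["foo"], errors="coerce")

print(df.dtypes)
```

Because the conversion is requested per column, nothing has to guess what `bar` should become.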

@jreback
Contributor

jreback commented Sep 16, 2015

Though I agree with @TomAugspurger

df.convert_objects(coerce=True) should raise a ValueError (and not just return the original frame), as this is a specification error.

jreback added a commit to jreback/pandas that referenced this issue Sep 17, 2015
…e a ValueError on old-style input

- raise a ValueError for df.convert_objects('coerce')
- raise a ValueError for df.convert_objects(convert_dates='coerce') (and convert_numeric,convert_timedelta)
@jorisvandenbossche
Member

In any case, the whatsnew entry will have to be much more explicit about these changes, as this has a big impact on the result if you use convert_objects on whole frames.

I think we should consider whether it is possible to implement the new behaviour without breaking the existing usage.

@bashtage
Contributor

Trying to handle the legacy usage was a bad fit given the substantial change in the behavior of convert_objects - it isn't just a keyword renaming, but a removal of the magic of the old version.

The examples in the original post are all correct as intended, although I agree that coerce=True while datetime == timedelta == numeric == False should raise, since it doesn't make any sense.
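
The check being agreed on here could look roughly like the sketch below. This is not the pandas implementation: `check_conversion_flags` is a hypothetical helper, and its parameter names simply mirror the keywords discussed in this thread.

```python
def check_conversion_flags(datetime=False, numeric=False,
                           timedelta=False, coerce=False):
    """Hypothetical validation: reject coerce=True with no target type."""
    if coerce and not (datetime or numeric or timedelta):
        raise ValueError(
            "coerce=True requires at least one of datetime, numeric "
            "or timedelta to be True"
        )


# A sensible combination passes silently.
check_conversion_flags(datetime=True, coerce=True)

# The combination from the original post is rejected.
try:
    check_conversion_flags(coerce=True)
except ValueError as exc:
    print("rejected:", exc)
```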

bashtage pushed a commit to bashtage/pandas that referenced this issue Oct 1, 2015
Restores the v0.16 behavior of convert_objects and moves the new
version of _convert
Adds to_numeric for directly converting numeric data

closes pandas-dev#11116
closes pandas-dev#11133
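
The `to_numeric` function mentioned in this commit message is available as `pd.to_numeric` in released pandas. A short sketch of using it for explicit numeric coercion, with made-up sample values and assuming a current pandas:

```python
import pandas as pd

# Object column with one entry that is not a number (sample data only).
raw = pd.Series(["1", "2.5", "not a number"], dtype=object)

# errors="coerce" turns unparseable entries into NaN instead of raising.
numbers = pd.to_numeric(raw, errors="coerce")

print(numbers)
```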