[ENH] Need a best dtype method #14400

Closed
dragonator4 opened this issue Oct 12, 2016 · 12 comments
Labels: Dtype Conversions, Enhancement

@dragonator4

Introduction

Basically, the pandas.read_* methods and constructors are great at assigning the most specific dtype that can hold all values of a column. But such functionality is lacking for DataFrames created by other methods (stack and unstack are prime examples).

There has been a lot of discussion about dtypes here (ref. #9216, #5902, and especially #9589), and I understand it is a well-rehearsed topic with no general consensus. An unfortunate result of those discussions was the deprecation of the .convert_objects method for being too forceful. However, the undercurrent in those discussions (IMHO) points to, and my needs often require, a DataFrame and Series method which will intelligently assign the least general dtype based on the data.

The method may optionally take a list of dtypes, or a dictionary mapping column names to dtypes, to assign user-specified dtypes. Note that I am proposing this in addition to the existing to_* methods. The following example will help illustrate:

In [1]: df = pd.DataFrame({'c1' : list('AAABBBCCC'),
                           'c2' : list('abcdefghi'),
                           'c3' : np.random.randn(9),
                           'c4' : np.arange(9)})
        df.dtypes
Out[1]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [2]: df = df.stack().unstack()
        df.dtypes
Out[2]: c1     object
        c2     object
        c3     object
        c4     object
        dtype: object

Expected Output

Define a method .set_dtypes which does the following:

  1. Either takes a boolean keyword argument infer to infer and reset each column's dtype to the least general dtype such that no values are lost.
  2. Or takes a list or dictionary of dtypes to force each column into a user-specified dtype, with an optional errors keyword argument to handle casting errors.

As illustrated below:

In [3]: df.set_dtypes(infer=True).dtypes
Out[3]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [4]: df.set_dtypes(types=[np.int64]*4, errors='coerce').dtypes
Out[4]: c1     int64
        c2     int64
        c3     int64
        c4     int64
        dtype: object

In [5]: df.set_dtypes(types=[np.int64]*4, errors='coerce') # Note loss of data
Out[5]:     c1  c2  c3  c4
        0   NaN NaN 1   0
        1   NaN NaN 1   1
        2   NaN NaN 0   2
        3   NaN NaN 0   3
        4   NaN NaN 0   4
        5   NaN NaN 0   5
        6   NaN NaN 2   6
        7   NaN NaN 0   7
        8   NaN NaN 1   8

In [6]: df.set_dtypes(types=[np.int64]*4, errors='ignore').dtypes
Out[6]: c1     object
        c2     object
        c3     object
        c4      int64
        dtype: object

Additional Notes

I understand that date and time types will be a little difficult to infer. However, following the logic of pandas.read_*, date and time types need not be inferred automatically; the user passes them explicitly.

It would be a one-size-fits-all solution if users were allowed to pass True or False, in addition to a dtype, when specifying dtypes per column. True would mean infer automatically (set the best dtype), while False would mean exclude the column from conversion. A rough sketch follows.
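For illustration, here is a sketch of the proposed semantics expressed with the existing API (set_dtypes itself is hypothetical, and only the dict form of types is shown; the real methods used are pd.to_numeric and .astype):

import pandas as pd

def set_dtypes(df, types=None, infer=False):
    """Hypothetical helper mirroring the proposed API.

    types maps each column to a dtype, or to True (infer the best
    dtype automatically) or False (leave the column untouched).
    """
    out = df.copy()
    for col in out.columns:
        spec = True if infer else (types or {}).get(col, False)
        if spec is True:
            # best-effort inference; returns the column unchanged on failure
            out[col] = pd.to_numeric(out[col], errors='ignore')
        elif spec is not False:
            out[col] = out[col].astype(spec)
    return out

# e.g. set_dtypes(df, types={'c2': True, 'c3': False, 'c4': 'int8'})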

@jreback (Contributor) commented Oct 12, 2016

In [8]: df.dtypes
Out[8]: 
c1     object
c2     object
c3    float64
c4      int64
dtype: object

In [9]: df.stack().unstack().dtypes
Out[9]: 
c1    object
c2    object
c3    object
c4    object
dtype: object

In [10]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='ignore')).dtypes
Out[10]: 
c1     object
c2     object
c3    float64
c4      int64
dtype: object

In [11]: pd.to_numeric?

# you can even request downcasting
In [12]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer')).dtypes
Out[12]: 
c1     object
c2     object
c3    float64
c4       int8
dtype: object

@jreback (Contributor) commented Oct 12, 2016

if you really want to

In [13]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='coerce')).dtypes
Out[13]: 
c1    float64
c2    float64
c3    float64
c4      int64
dtype: object

jreback added the Dtype Conversions label Oct 12, 2016
@jreback (Contributor) commented Oct 12, 2016

note that the downcast kw is new in 0.19.0; .to_numeric has been around for a while.

jreback closed this as completed Oct 12, 2016
jreback added this to the No action milestone Oct 12, 2016
@jreback (Contributor) commented Oct 12, 2016

Explicit astyping is also available in 0.19.0:

In [9]: df.stack().unstack().astype({'c3' : 'float', 'c4' :'int8'}).dtypes
Out[9]: 
c1     object
c2     object
c3    float64
c4       int8
dtype: object

All of the above said, if you think the docs could use an enhancement, then please submit a doc PR.

@jreback (Contributor) commented Oct 12, 2016

Further note that the stack/unstack idiom does not generally play nice with mixed object and other dtypes, so a more typical pattern is:

In [15]: df.set_index(['c1', 'c2']).stack().unstack().reset_index().dtypes
Out[15]: 
c1     object
c2     object
c3    float64
c4    float64
dtype: object

@dragonator4 (Author) commented Oct 12, 2016

@jreback, thank you for all the replies. However, how would you explain the following behavior?

In [1]: df = pd.DataFrame({'c1' : list('ABC'),
                           'c2' : list('123'),
                           'c3' : np.random.randn(3),
                           'c4' : np.arange(3)})
        df.dtypes
Out[1]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [2]: df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=0).dtypes
Out[2]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

The expected output (using the .set_dtypes method I proposed):

In [2]: df.set_dtypes(infer=True).dtypes
Out[2]: c1     object
        c2      int64
        c3    float64
        c4      int64
        dtype: object

Note that this cannot be done by ignoring errors in pd.to_numeric. Errors would have to be coerced to get the desired output, which would lose data in the object column while upcasting the int64 columns to float64. downcast='integer' does nothing in this case.

In my present use case, I am fitting models to data from thousands of sensors, and returning a Series with model result information (see this SO question). The resulting DataFrame is then unstacked (see answer to that question) and I get a frame with many columns, and hundreds of thousands of rows.

Unfortunately, not all rows in a given column have the same type. For example, if the model fits one parameter, I get a single number; if it fits multiple parameters, I get a list. Since I cannot be expected to manually examine all values in the frame, and I cannot afford to lose data through typecasting, I need a better method to cast intelligently so that I can build further analysis on the model results.

Even the following doesn't help:

df.apply(lambda x: pd.to_numeric(x, errors='coerce'), axis=0).fillna(df)
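The behavior I want amounts to a lossless per-column conversion. A rough sketch with existing tools (to_numeric_safe is not a pandas function, just an illustration):

import pandas as pd

def to_numeric_safe(col):
    """Convert a column to numeric only when no values would be lost."""
    try:
        coerced = pd.to_numeric(col, errors='coerce')
    except (TypeError, ValueError):
        return col                # e.g. list-valued cells
    if coerced.isnull().sum() > col.isnull().sum():
        return col                # coercion destroyed values; keep original
    return coerced

df.apply(to_numeric_safe).dtypes  # c1 object, c2 int64, c3 float64, c4 int64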

@TomAugspurger (Contributor) commented Oct 12, 2016

Your call here

df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=1).dtypes

is applying pd.to_numeric to each row. Are you sure you want axis=1?

@dragonator4 (Author) commented Oct 12, 2016

Sorry @TomAugspurger, that was a typo, out of habit. I want axis=0. I have edited accordingly.

Actually, correcting that typo results in this behavior:

In [3]: df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(df).dtypes
Out[3]: c1    object
        c2     int64
        c3    object
        c4     int64
        dtype: object

Why is column c3 converted to object from float64?

@jreback (Contributor) commented Oct 12, 2016

In [45]: df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(df).apply(lambda x: pd.to_numeric(x, errors='ignore')).dtypes
Out[45]: 
c1     object
c2      int64
c3    float64
c4      int64
dtype: object

Filling a formerly float column (e.g. c1) with objects works, but it also coerces c3 to object because c3 is in the same block. This is an implementation detail ATM, but I'll open an issue for it.
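One way to sidestep the block effect is to fill column-by-column, so the object fill never touches the shared float block (a sketch on the frame above):

converted = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

# operate on one Series at a time; only c1 becomes object
restored = converted.copy()
for col in restored.columns:
    restored[col] = converted[col].fillna(df[col])

restored.dtypes  # c1 object, c2 int64, c3 float64, c4 int64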

Having another method for this operation is really not worth it. We already have .astype() and explicit conversions.

@dragonator4 (Author) commented Oct 12, 2016

I agree, having another method is not worth it if the method does not do anything new. But the proposed method can, in principle, replace .astype, the pd.to_* functions, and the erstwhile .convert_objects method. It could look like pd.set_dtypes(<frame or series>, infer=<boolean>, types=<list or dict>, errors=<'ignore', 'coerce', 'raise'>). Basically, it is a one-size-fits-all solution.

The current solution of applying to_numeric twice and filling NaNs is a bit too roundabout and clunky. It will also mess up datetime columns. In my use case, datetime is part of the index (the raw data is grouped on datetime), but if it weren't (as in, if I were returning datetimes), I can foresee it causing problems. After converting with to_numeric (applied twice), one would also have to apply to_datetime to handle any datetime columns.

Pandas already does a great job of converting datetime and timedelta types by inferring the format. Basically, what I am driving at is a universal method to handle all generic forms of typecasting, with the ability to infer types automatically if needed and to take user input as desired. I agree the algorithm for such a method might get a bit hairy with datetime types, but pandas is known for its flexible simplicity. A sketch of one possible ordering follows.
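For instance, the inference could try converters in order (a sketch; infer_column is hypothetical, and numeric must come before datetime because to_datetime will happily parse plain integers as epochs):

import pandas as pd

def infer_column(col):
    """Try each converter in order; fall back to the original column."""
    for convert in (pd.to_numeric, pd.to_datetime):
        try:
            return convert(col, errors='raise')
        except (TypeError, ValueError):
            continue
    return col

df.apply(infer_column)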

A method that handles all simple cases of typecasting does not exist. I rest my case here and leave the decision to you. But a universal type setting and resetting method is worth considering.

@jreback (Contributor) commented Oct 12, 2016

@dragonator4 methods with lots of potentially overlapping keyword arguments are confusing and non-pythonic.

You should re-examine your pipeline to see why in fact you are doing .stack().unstack(); as I pointed out above, this is not a normal idiom.

Further, the reason .convert_objects() was removed is ambiguity in type detection, especially w.r.t. datetimes/timedeltas: multiple answers can be correct, and it is an underspecified problem.

If you want to suggest a small API addition that fills a need that is not currently satisfied, please do so. But, for example, removing .astype() is a non-starter; it is a long-standing API method across many ecosystems.

As cross-referenced above, the convert-fill-convert idiom is currently buggy in this specific case. Pull requests to fix it are also welcome.
