[ENH] Need a best dtype method #14400
Comments
if you really want to
note that the
Explicit `astype`-ing is also available in 0.19.0.
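For example, explicit casting with a per-column mapping (supported by `DataFrame.astype` since 0.19.0) might look like this; the frame and column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "2"], "b": ["1.5", "2.5"]})

# explicit, per-column casting via a dict of column -> dtype
typed = df.astype({"a": "int64", "b": "float64"})
print(typed.dtypes)
```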
All of the above said, if you think the docs could use an enhancement, then please submit a doc PR.
Further note that the stack/unstack idiom does not generally play nice with mixed object and other dtypes, so a more typical pattern is:
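For instance, a sketch of the round-trip problem and the usual per-column fix (the frame and column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# stack/unstack funnels everything through a single object-dtype
# Series, so the original column dtypes are lost:
restacked = df.stack().unstack()
print(restacked.dtypes)  # both columns come back as object

# the typical follow-up: re-convert column by column
restacked["a"] = pd.to_numeric(restacked["a"])
print(restacked.dtypes)
```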
@jreback, Thank you for all the replies. However, how will you explain the following behavior?
The expected output (using
Note that this cannot be done by ignoring errors in the conversion.

In my present use case, I am fitting models to data from thousands of sensors, and returning a Series with model result information (see this SO question). The resulting DataFrame is then unstacked (see the answer to that question), and I get a frame with many columns and hundreds of thousands of rows. Unfortunately, not all rows in a given column have the same type. For example, if the model fits one parameter, I get a single number; if it fits multiple parameters, I get a list. Since I cannot be expected to manually examine all values in the frame, and I cannot afford to lose data from typecasting, I need a better method to cast intelligently so that I can build further analysis on the model results. Even the following doesn't help:
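A sketch of the kind of lossless, per-column conversion this use case needs (the column names and the `convert_if_lossless` helper are illustrative, not pandas API):

```python
import pandas as pd

# columns produced by unstacking mixed model results:
df = pd.DataFrame({
    "n_params": ["1", "2"],           # numeric, but stored as object
    "params":   [[0.5], [0.5, 0.2]],  # lists -- must not be coerced
})

def convert_if_lossless(s):
    # convert a column only when every value survives the cast;
    # otherwise return it untouched so no data is lost
    try:
        return pd.to_numeric(s)
    except (ValueError, TypeError):
        return s

converted = df.apply(convert_if_lossless)
print(converted.dtypes)  # n_params becomes int64, params stays object
```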
Your call here, `df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=1).dtypes`, is applying `to_numeric` row-by-row (`axis=1`) rather than column-by-column.
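The difference can be seen in a small sketch (the `maybe_numeric` wrapper is illustrative; it sidesteps `errors='ignore'`, which has since been deprecated, by returning the column unchanged when conversion fails):

```python
import pandas as pd

df = pd.DataFrame({"c1": ["1", "2"], "c2": ["x", "y"]})

def maybe_numeric(s):
    # convert if possible, otherwise leave the data unchanged
    try:
        return pd.to_numeric(s)
    except (ValueError, TypeError):
        return s

# axis=1 hands each *row* to the function, so a single
# non-numeric cell blocks conversion of the whole row:
rowwise = df.apply(maybe_numeric, axis=1)
print(rowwise.dtypes)  # both columns remain object

# the default axis=0 works column by column:
colwise = df.apply(maybe_numeric)
print(colwise.dtypes)  # c1 converts, c2 stays object
```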
Sorry @TomAugspurger, that was a typo, out of habit; I want the column-wise version. Actually, correcting that typo results in this behavior:
Why is column
Filling a formerly float column (e.g. c1) with objects works, but also coerces c3 to object because it's in the same block. This is an implementation detail ATM, but I'll open an issue for it. Having another method for this operation is really not worth it. We already have
I agree, having another method is not worth it if the method is not doing anything new. But the proposed method can, in principle, replace the current solution of applying conversions column by column.

Pandas already does a great job with converting datetime and timedelta types by inferring the format. Basically, what I am driving at is a universal method to handle all generic forms of type casting, with the ability to automatically infer types if needed, and to take user inputs as desired. I agree the algorithm for such a method might get a bit hairy with datetime types, but Pandas is known for its simple flexibility. A method to handle all simple conditions of typecasting does not exist.

I rest my case here, and leave the decision to you. But it is worth considering a universal type setting and resetting method.
@dragonator4 Methods with lots of keyword args that are potentially overlapping are confusing and non-pythonic. You should re-examine your pipeline to see why in fact you are doing this. Further, the reason

If you want to suggest a small API addition that fills out a need that is not currently satisfied, please do so. But, for example, removing

As I xrefed, the convert-fill-convert idiom is buggy ATM in this specific case. Pull-requests to fix are also welcome.
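The convert-fill-convert idiom referenced here might be sketched as follows, using `infer_objects` (which was added to pandas after this thread) as the final re-conversion step; the frame and column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"c1": [1.0, 2.0], "c3": [3.0, 4.0]})

# convert: widen everything to object so any value can be filled in
widened = df.astype(object)
# fill: put an arbitrary non-float value into c1
widened.loc[0, "c1"] = "missing"
# convert back: re-infer dtypes for columns that still look numeric
recovered = widened.infer_objects()
print(recovered.dtypes)  # c1 stays object, c3 returns to float64
```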
Introduction
Basically, the `pandas.read_*` methods and constructors are awesome at assigning the highest-level dtype that can include all values of a column. But such functionality is lacking for DataFrames created by other methods (stack and unstack are prime examples).

There has been a lot of discussion about dtypes here (ref. #9216, #5902, and especially #9589), and I understand it is a well-rehearsed topic, but with no general consensus. An unfortunate result of those discussions was the deprecation of the `.convert_objects` method for being too forceful. However, the undercurrent in those discussions (IMHO) points to, and my needs often require, a (DataFrame and Series) method which will intelligently assign the lowest generic dtype based on the data. The method may optionally take a list of dtypes, or a dictionary of column names and dtypes, to assign user-specified dtypes. Note that I am proposing this in addition to the existing `to_*` methods. The following example will help illustrate:

Expected Output
Define a method `.set_dtypes` which does the following:

- `infer` to infer and reset the column dtype to the least general dtype such that values are not lost.
- `errors` keyword argument to handle casting errors.

As illustrated below:
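A minimal sketch of what such a method might look like, assuming the `infer` and `errors` keywords from the proposal and restricting inference to numeric types for brevity (`set_dtypes` is hypothetical, not an existing pandas method):

```python
import pandas as pd

def set_dtypes(df, infer=True, errors="raise"):
    """Hypothetical sketch of the proposed method.

    infer=True resets each column to the least general dtype that
    holds every value; errors='ignore' leaves unconvertible columns
    untouched instead of raising.
    """
    out = df.copy()
    if not infer:
        return out
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            if errors == "raise":
                raise
            # errors == 'ignore': keep the column as-is
    return out

df = pd.DataFrame({"a": ["1", "2"], "b": ["x", "y"]})
print(set_dtypes(df, errors="ignore").dtypes)  # a int64, b object
```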
Additional Notes

I understand that date and time types will be a little difficult to infer. However, following the logic powering `pandas.read_*`, date and time types would not be automatically inferred, but explicitly passed by the user.

It would be a one-size-fits-all solution if users were allowed to pass `True` and `False`, in addition to a dtype to force, when specifying dtypes per column. `True` in this case would indicate inferring automatically (set the best dtype), while `False` would indicate that the column is excluded from conversion.
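The per-column `True`/`False` convention described above could be sketched as follows, assuming `True` means infer, `False` means skip, and anything else is an explicit dtype (illustrative only; no such pandas API exists):

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "2"], "b": ["x", "y"], "c": ["3.5", "4.5"]})

# True -> infer the best dtype, False -> leave the column alone,
# anything else -> an explicit dtype to cast to
spec = {"a": True, "b": False, "c": "float64"}

out = df.copy()
for col, rule in spec.items():
    if rule is False:
        continue  # column excluded from conversion
    if rule is True:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # not losslessly convertible; keep as-is
    else:
        out[col] = out[col].astype(rule)

print(out.dtypes)  # a int64, b object, c float64
```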