BUG: Ensure 'coerce' actually coerces datatypes
Changes behavior of convert objects so that passing 'coerce' will
ensure that data of the correct type is returned, even if all
values are null-types (NaN or NaT).

closes pandas-dev#9589
bashtage authored and Kevin Sheppard committed Jul 13, 2015
1 parent 35c0863 commit 0727803
Showing 19 changed files with 301 additions and 153 deletions.
25 changes: 17 additions & 8 deletions doc/source/basics.rst
@@ -1522,23 +1522,29 @@ then the more *general* one will be used as the result of the operation.
object conversion
~~~~~~~~~~~~~~~~~

:meth:`~DataFrame.convert_objects` is a method to try to force conversion of types from the ``object`` dtype to other types.
To force conversion of specific types that are *number like*, e.g. could be a string that represents a number,
pass ``convert_numeric=True``. This will force strings and numbers alike to be numbers if possible, otherwise
they will be set to ``np.nan``.
.. note::

The syntax of :meth:`~DataFrame.convert_objects` changed in 0.17.0.

:meth:`~DataFrame.convert_objects` is a method to try to force conversion of
types from the ``object`` dtype to other types. To try converting specific
types that are *number like*, e.g. a string that represents a number,
pass ``numeric=True``. To force the conversion, add the keyword argument
``coerce=True``. This will force strings and numbers alike to be numbers if
possible, otherwise they will be set to ``np.nan``.

.. ipython:: python
df3['D'] = '1.'
df3['E'] = '1'
df3.convert_objects(convert_numeric=True).dtypes
df3.convert_objects(numeric=True).dtypes
# same, but specific dtype conversion
df3['D'] = df3['D'].astype('float16')
df3['E'] = df3['E'].astype('int32')
df3.dtypes
To force conversion to ``datetime64[ns]``, pass ``convert_dates='coerce'``.
To force conversion to ``datetime64[ns]``, pass ``datetime=True`` and ``coerce=True``.
This will convert any datetime-like object to dates, forcing other values to ``NaT``.
This might be useful if you are reading in data which is mostly dates,
but occasionally has non-dates intermixed and you want to represent as missing.
@@ -1550,10 +1556,13 @@ but occasionally has non-dates intermixed and you want to represent as missing.
'foo', 1.0, 1, pd.Timestamp('20010104'),
'20010105'], dtype='O')
s
s.convert_objects(convert_dates='coerce')
s.convert_objects(datetime=True, coerce=True)
In addition, :meth:`~DataFrame.convert_objects` will attempt the *soft* conversion of any *object* dtypes, meaning that if all
Without passing ``coerce=True``, :meth:`~DataFrame.convert_objects` will attempt
the *soft* conversion of any *object* dtypes, meaning that if all
the objects in a Series are of the same type, the Series will have that dtype.
Setting ``coerce=True`` skips this soft conversion and performs only the single
requested (hard) conversion - for example, with ``numeric=True, coerce=True`` a
series of string dates will not be converted to a series of datetimes; the
values become ``np.nan`` instead.
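
A minimal sketch of the difference (non-executed; the dtype comments are
indicative of the expected results):

.. code-block:: python

   import datetime
   import pandas as pd

   # homogeneous datetime objects: the soft conversion infers datetime64[ns]
   s_dt = pd.Series([datetime.datetime(2001, 1, 1),
                     datetime.datetime(2001, 1, 2)], dtype='O')
   s_dt.convert_objects(datetime=True).dtype                 # datetime64[ns]

   # string dates are not parsed by the soft conversion ...
   s_str = pd.Series(['2001-01-01', '2001-01-02'], dtype='O')
   s_str.convert_objects(datetime=True).dtype                # still object

   # ... but are parsed when coerce=True is combined with datetime=True
   s_str.convert_objects(datetime=True, coerce=True).dtype   # datetime64[ns]

   # with numeric=True, coerce=True the same strings become NaN, not dates
   s_str.convert_objects(numeric=True, coerce=True)          # NaN, NaN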

gotchas
~~~~~~~
39 changes: 39 additions & 0 deletions doc/source/whatsnew/v0.17.0.txt
@@ -48,13 +48,52 @@ Backwards incompatible API changes

.. _whatsnew_0170.api_breaking.other:

Changes to convert_objects
^^^^^^^^^^^^^^^^^^^^^^^^^^
- ``DataFrame.convert_objects`` keyword arguments have been shortened. (:issue:`10265`)

===================== =============
Old New
===================== =============
``convert_dates`` ``datetime``
``convert_numeric`` ``numeric``
``convert_timedelta`` ``timedelta``
===================== =============

- Coercing types with ``DataFrame.convert_objects`` is now implemented using the
keyword argument ``coerce=True``. Previously types were coerced by setting a
keyword argument to ``'coerce'`` instead of ``True``, as in ``convert_dates='coerce'``.

.. ipython:: python

df = pd.DataFrame({'i': ['1','2'], 'f': ['apple', '4.2']})
df

The old usage of ``DataFrame.convert_objects`` used ``'coerce'`` along with the
type.

.. code-block:: python

In [2]: df.convert_objects(convert_numeric='coerce')

Now the ``coerce`` keyword must be explicitly used.

.. ipython:: python

df.convert_objects(numeric=True, coerce=True)

- The new default behavior for ``DataFrame.convert_objects`` is to do nothing,
and so it is now necessary to pass at least one conversion target when calling,
as the sketch below illustrates.
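
Calling with no conversion target now simply returns the input unchanged and
emits a warning (an indicative sketch using the ``df`` defined above; the
exact rendering of the warning may differ):

.. code-block:: python

   In [3]: df.convert_objects()
   RuntimeWarning: Must explicitly pass type for conversion. Original value returned.

   In [4]: df.dtypes
   Out[4]:
   f    object
   i    object
   dtype: object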


Other API Changes
^^^^^^^^^^^^^^^^^
- Enable writing Excel files in :ref:`memory <_io.excel_writing_buffer>` using StringIO/BytesIO (:issue:`7074`)
- Enable serialization of lists and dicts to strings in ExcelWriter (:issue:`8188`)
- Allow passing `kwargs` to the interpolation methods (:issue:`10378`).
- Serialize metadata properties of subclasses of pandas objects (:issue:`10553`).


.. _whatsnew_0170.deprecations:

Deprecations
106 changes: 52 additions & 54 deletions pandas/core/common.py
@@ -1887,65 +1887,63 @@ def _maybe_box_datetimelike(value):

_values_from_object = lib.values_from_object

def _possibly_convert_objects(values, convert_dates=True,
convert_numeric=True,
convert_timedeltas=True):

def _possibly_convert_objects(values,
datetime=True,
numeric=True,
timedelta=True,
coerce=False):
""" if we have an object dtype, try to coerce dates and/or numbers """

# if we have passed in a list or scalar
conversion_count = sum((datetime, numeric, timedelta))
if conversion_count == 0:
import warnings
warnings.warn('Must explicitly pass type for conversion. Original '
'value returned.', RuntimeWarning)
return values

if isinstance(values, (list, tuple)):
# List or scalar
values = np.array(values, dtype=np.object_)
if not hasattr(values, 'dtype'):
elif not hasattr(values, 'dtype'):
values = np.array([values], dtype=np.object_)

# convert dates
if convert_dates and values.dtype == np.object_:

# we take an aggressive stance and convert to datetime64[ns]
if convert_dates == 'coerce':
new_values = _possibly_cast_to_datetime(
values, 'M8[ns]', coerce=True)

# if we are all nans then leave me alone
if not isnull(new_values).all():
values = new_values

else:
values = lib.maybe_convert_objects(
values, convert_datetime=convert_dates)

# convert timedeltas
if convert_timedeltas and values.dtype == np.object_:

if convert_timedeltas == 'coerce':
from pandas.tseries.timedeltas import to_timedelta
values = to_timedelta(values, coerce=True)

# if we are all nans then leave me alone
if not isnull(new_values).all():
values = new_values

else:
values = lib.maybe_convert_objects(
values, convert_timedelta=convert_timedeltas)

# convert to numeric
if values.dtype == np.object_:
if convert_numeric:
try:
new_values = lib.maybe_convert_numeric(
values, set(), coerce_numeric=True)

# if we are all nans then leave me alone
if not isnull(new_values).all():
values = new_values

except:
pass
else:

# soft-conversion
values = lib.maybe_convert_objects(values)
elif not is_object_dtype(values.dtype):
# If not object, do not attempt conversion
return values

# If coerce is True, ensure only one conversion target is requested
if coerce:
if conversion_count > 1:
raise ValueError("Only one of 'datetime', 'numeric' or "
"'timedelta' can be True when coerce=True.")

# Immediate return if coerce
if datetime:
return pd.to_datetime(values, coerce=True, box=False)
elif timedelta:
return pd.to_timedelta(values, coerce=True, box=False)
elif numeric:
return lib.maybe_convert_numeric(values, set(), coerce_numeric=True)

# Soft conversions
if datetime:
values = lib.maybe_convert_objects(values,
convert_datetime=datetime)

if timedelta and is_object_dtype(values.dtype):
# Object check to ensure only run if previous did not convert
values = lib.maybe_convert_objects(values,
convert_timedelta=timedelta)

if numeric and is_object_dtype(values.dtype):
try:
converted = lib.maybe_convert_numeric(values,
set(),
coerce_numeric=True)
# If all NaNs, then do not alter the values
values = converted if not isnull(converted).all() else values
except:
pass

return values
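
A minimal sketch of the behaviour this helper gives the public
``convert_objects`` (illustrative only; the dtypes in the comments assume the
0.17 code shown above):

    import pandas as pd

    s = pd.Series(['not a date', 'also not one'], dtype=object)

    # coerce now guarantees the requested dtype even when every value fails
    # to convert; previously an all-NaT/all-NaN result was discarded and the
    # original object values were returned
    s.convert_objects(datetime=True, coerce=True)   # dtype datetime64[ns], all NaT
    s.convert_objects(numeric=True, coerce=True)    # dtype float64, all NaN

    # coerce accepts only a single conversion target
    s.convert_objects(datetime=True, numeric=True, coerce=True)  # ValueError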

11 changes: 8 additions & 3 deletions pandas/core/frame.py
@@ -3351,7 +3351,7 @@ def combine(self, other, func, fill_value=None, overwrite=True):
return self._constructor(result,
index=new_index,
columns=new_columns).convert_objects(
convert_dates=True,
datetime=True,
copy=False)

def combine_first(self, other):
@@ -3830,7 +3830,9 @@ def _apply_standard(self, func, axis, ignore_failures=False, reduce=True):

if axis == 1:
result = result.T
result = result.convert_objects(copy=False)
result = result.convert_objects(datetime=True,
timedelta=True,
copy=False)

else:

@@ -3958,7 +3960,10 @@ def append(self, other, ignore_index=False, verify_integrity=False):
combined_columns = self.columns.tolist() + self.columns.union(other.index).difference(self.columns).tolist()
other = other.reindex(combined_columns, copy=False)
other = DataFrame(other.values.reshape((1, len(other))),
index=index, columns=combined_columns).convert_objects()
index=index,
columns=combined_columns)
other = other.convert_objects(datetime=True, timedelta=True)

if not self.columns.equals(combined_columns):
self = self.reindex(columns=combined_columns)
elif isinstance(other, list) and not isinstance(other[0], DataFrame):
33 changes: 19 additions & 14 deletions pandas/core/generic.py
@@ -2433,22 +2433,26 @@ def copy(self, deep=True):
data = self._data.copy(deep=deep)
return self._constructor(data).__finalize__(self)

def convert_objects(self, convert_dates=True, convert_numeric=False,
convert_timedeltas=True, copy=True):
@deprecate_kwarg(old_arg_name='convert_dates', new_arg_name='datetime')
@deprecate_kwarg(old_arg_name='convert_numeric', new_arg_name='numeric')
@deprecate_kwarg(old_arg_name='convert_timedeltas', new_arg_name='timedelta')
def convert_objects(self, datetime=False, numeric=False,
timedelta=False, coerce=False, copy=True):
"""
Attempt to infer better dtype for object columns
Parameters
----------
convert_dates : boolean, default True
If True, convert to date where possible. If 'coerce', force
conversion, with unconvertible values becoming NaT.
convert_numeric : boolean, default False
If True, attempt to coerce to numbers (including strings), with
datetime : boolean, default False
If True, convert to date where possible.
numeric : boolean, default False
If True, attempt to convert to numbers (including strings), with
unconvertible values becoming NaN.
convert_timedeltas : boolean, default True
If True, convert to timedelta where possible. If 'coerce', force
conversion, with unconvertible values becoming NaT.
timedelta : boolean, default False
If True, convert to timedelta where possible.
coerce : boolean, default False
If True, force conversion with unconvertible values converted to
nulls (NaN or NaT)
copy : boolean, default True
If True, return a copy even if no copy is necessary (e.g. no
conversion was done). Note: This is meant for internal use, and
@@ -2459,9 +2463,10 @@ def convert_objects(self, convert_dates=True, convert_numeric=False,
converted : same as input object
"""
return self._constructor(
self._data.convert(convert_dates=convert_dates,
convert_numeric=convert_numeric,
convert_timedeltas=convert_timedeltas,
self._data.convert(datetime=datetime,
numeric=numeric,
timedelta=timedelta,
coerce=coerce,
copy=copy)).__finalize__(self)
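
The ``deprecate_kwarg`` decorators above keep the old keyword names working by
remapping them to the new ones and issuing a deprecation warning, roughly as
follows (sketch; the exact warning text is not part of this diff):

    df.convert_objects(convert_numeric=True)   # warns, treated as numeric=True
    df.convert_objects(numeric=True)           # preferred spelling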

#----------------------------------------------------------------------
@@ -2859,7 +2864,7 @@ def replace(self, to_replace=None, value=None, inplace=False, limit=None,
'{0!r}').format(type(to_replace).__name__)
raise TypeError(msg) # pragma: no cover

new_data = new_data.convert(copy=not inplace, convert_numeric=False)
new_data = new_data.convert(copy=not inplace, numeric=False)

if inplace:
self._update_inplace(new_data)
32 changes: 20 additions & 12 deletions pandas/core/groupby.py
@@ -111,7 +111,7 @@ def f(self):
except Exception:
result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
if _convert:
result = result.convert_objects()
result = result.convert_objects(datetime=True)
return result

f.__doc__ = "Compute %s of group values" % name
@@ -2700,7 +2700,7 @@ def aggregate(self, arg, *args, **kwargs):
self._insert_inaxis_grouper_inplace(result)
result.index = np.arange(len(result))

return result.convert_objects()
return result.convert_objects(datetime=True)

def _aggregate_multiple_funcs(self, arg):
from pandas.tools.merge import concat
@@ -2939,18 +2939,25 @@ def _wrap_applied_output(self, keys, values, not_indexed_same=False):

# if we have date/time like in the original, then coerce dates
# as we are stacking can easily have object dtypes here
if (self._selected_obj.ndim == 2
and self._selected_obj.dtypes.isin(_DATELIKE_DTYPES).any()):
cd = 'coerce'
if (self._selected_obj.ndim == 2 and
self._selected_obj.dtypes.isin(_DATELIKE_DTYPES).any()):
result = result.convert_objects(numeric=True)
date_cols = self._selected_obj.select_dtypes(
include=list(_DATELIKE_DTYPES)).columns
result[date_cols] = (result[date_cols]
.convert_objects(datetime=True,
coerce=True))
else:
cd = True
result = result.convert_objects(convert_dates=cd)
result = result.convert_objects(datetime=True)

return self._reindex_output(result)

else:
# only coerce dates if we find at least 1 datetime
cd = 'coerce' if any([ isinstance(v,Timestamp) for v in values ]) else False
return Series(values, index=key_index).convert_objects(convert_dates=cd)
coerce = True if any([ isinstance(v,Timestamp) for v in values ]) else False
return (Series(values, index=key_index)
.convert_objects(datetime=True,
coerce=coerce))

else:
# Handle cases like BinGrouper
@@ -3053,7 +3060,8 @@ def transform(self, func, *args, **kwargs):
if any(counts == 0):
results = self._try_cast(results, obj[result.columns])

return DataFrame(results,columns=result.columns,index=obj.index).convert_objects()
return (DataFrame(results,columns=result.columns,index=obj.index)
.convert_objects(datetime=True))

def _define_paths(self, func, *args, **kwargs):
if isinstance(func, compat.string_types):
@@ -3246,7 +3254,7 @@ def _wrap_aggregated_output(self, output, names=None):
if self.axis == 1:
result = result.T

return self._reindex_output(result).convert_objects()
return self._reindex_output(result).convert_objects(datetime=True)

def _wrap_agged_blocks(self, items, blocks):
if not self.as_index:
@@ -3264,7 +3272,7 @@ def _wrap_agged_blocks(self, items, blocks):
if self.axis == 1:
result = result.T

return self._reindex_output(result).convert_objects()
return self._reindex_output(result).convert_objects(datetime=True)

def _reindex_output(self, result):
"""
