Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use argument dtype to inform coercion #17779

Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Oct 4, 2017

Master:

>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object

This is more consistent with 0.20.3, while keeping the benefits from #16821

Closes #17767


I may play around with this a bit more. I feel like this is a bit too numpy specific, for example, adding a DataFrame and a pa.Scalar still fails:

In [39]: x = pa.array([1])[0]

In [40]: df = pd.DataFrame({"A": [1, 2]})

In [41]: df + x
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1292         try:
-> 1293             values, values_mask, other, other_mask = self._try_coerce_args(
   1294                 transf(values), other)

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in _try_coerce_args(self, values, other)
    677             raise TypeError("cannot convert {} to an {}".format(
--> 678                 type(other).__name__,
    679                 type(self).__name__.lower().replace('Block', '')))

TypeError: cannot convert Int64Value to an intblock

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/ops.py in na_op(x, y)
   1199             result = expressions.evaluate(op, str_rep, x, y,
-> 1200                                           raise_on_error=True, **eval_kwargs)
   1201         except TypeError:

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
    210         return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error,
--> 211                          **eval_kwargs)
    212     return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error)

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65

TypeError: unsupported operand type(s) for +: 'int' and 'pyarrow.lib.Int64Value'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1351         try:
-> 1352             with np.errstate(all='ignore'):
   1353                 result = get_result(other)

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in get_result(other)
   1320                 result = func(values, other)
-> 1321             else:
   1322                 result = func(values, other)

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/ops.py in na_op(x, y)
   1225                     with np.errstate(all='ignore'):
-> 1226                         result[mask] = op(xrav, y)
   1227             else:

TypeError: unsupported operand type(s) for +: 'int' and 'pyarrow.lib.Int64Value'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-41-a01bb1dd67a6> in <module>()
----> 1 df + x

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/ops.py in f(self, other, axis, level, fill_value)
   1263                 self = self.fillna(fill_value)
   1264
-> 1265             return self._combine_const(other, na_op)
   1266
   1267     f.__name__ = name

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/frame.py in _combine_const(self, other, func, raise_on_error, try_cast)
   3836         new_data = self._data.eval(func=func, other=other,
   3837                                    raise_on_error=raise_on_error,
-> 3838                                    try_cast=try_cast)
   3839         return self._constructor(new_data)
   3840

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in eval(self, **kwargs)
   3375         return self.apply('where', **kwargs)
   3376
-> 3377     def eval(self, **kwargs):
   3378         return self.apply('eval', **kwargs)
   3379

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3269                                                  copy=align_copy)
   3270
-> 3271             kwargs['mgr'] = self
   3272             applied = getattr(b, f)(**kwargs)
   3273             result_blocks = _extend_blocks(applied, result_blocks)

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1296             block = self.coerce_to_target_dtype(orig_other)
   1297             return block.eval(func, orig_other,
-> 1298                               raise_on_error=raise_on_error,
   1299                               try_cast=try_cast, mgr=mgr)
   1300

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1357         except ValueError as detail:
   1358             raise
-> 1359         except Exception as detail:
   1360             result = handle_error()
   1361

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in handle_error()
   1340             if raise_on_error:
   1341                 # The 'detail' variable is defined in outer scope.
-> 1342                 raise TypeError('Could not operate %s with block values %s' %
   1343                                 (repr(other), str(detail)))  # noqa
   1344             else:

TypeError: Could not operate 1 with block values unsupported operand type(s) for +: 'int' and 'pyarrow.lib.Int64Value'

Do we want that to work?

@TomAugspurger TomAugspurger added API Design Dtype Conversions Unexpected or buggy dtype conversions labels Oct 4, 2017
@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Oct 4, 2017
@TomAugspurger
Copy link
Contributor Author

cc @wesm for the arrow part.

The background is that dask has scalar objects that duck-type numpy scalars. We'd like df + scalar to know that scalar has a compatible dtype, and so doesn't have to cast df to object dtype. My current fix is pretty specific to objects having a .dtype attribute that's similar to a numpy.dtype.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems reasonable if all passing

maybe want to check some perf
but i doubt it affect things

@@ -483,6 +483,39 @@ def infer_dtype_from_array(arr, pandas_dtype=False):
return arr.dtype, arr


def _maybe_infer_dtype_type(element):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deprivatize (no leading _)

Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767
@TomAugspurger TomAugspurger force-pushed the can_hold_type-coercion branch from a3a2ee2 to b34c61f Compare October 4, 2017 18:24
@codecov
Copy link

codecov bot commented Oct 4, 2017

Codecov Report

Merging #17779 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17779      +/-   ##
==========================================
- Coverage   91.24%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       49916    49923       +7     
==========================================
- Hits        45544    45542       -2     
- Misses       4372     4381       +9
Flag Coverage Δ
#multiple 89.02% <100%> (ø) ⬆️
#single 40.26% <46.42%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 87.99% <100%> (+0.17%) ⬆️
pandas/core/internals.py 94.37% <100%> (-0.01%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️
pandas/core/indexes/period.py 92.78% <0%> (+0.01%) ⬆️
pandas/io/excel.py 80.4% <0%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 48d0460...7ed82be. Read the comment docs.

(1.0, 'f8'),
(1j, 'complex128'),
(True, 'bool'),
# (np.timedelta64(20, 'ns'), '<m8[ns]'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this commented for a reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, forgot to uncomment it while testing.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. just a small comment. merge when satisfied.

@wesm
Copy link
Member

wesm commented Oct 5, 2017

@TomAugspurger no one's put in any effort on making the pyarrow scalar types usable in any kind of analytical setting, they're mostly for display purposes and coercion to Python types for the moment. We will need to address it at some point in the not-too-distant future

@TomAugspurger
Copy link
Contributor Author

We will need to address it at some point in the not-too-distant future

Ok. I don't think the changes here do anything to help with that, but they shouldn't make anything more difficult.

(1.0, 'f8'),
(1j, 'complex128'),
(True, 'bool'),
(np.timedelta64(20, 'ns'), '<m8[ns]'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually ought to add Timedelta/Timestamp, NaT, np.nan, Period

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The missing data ones are already covered in test_nanops. Timedelta / timestamps in handled in tests/scalar/. Period doesn't support binops. So we should be good.

@TomAugspurger TomAugspurger merged commit 6773694 into pandas-dev:master Oct 5, 2017
@TomAugspurger TomAugspurger deleted the can_hold_type-coercion branch October 5, 2017 12:43
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
* Use argument dtype to inform coercion

Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767

* Compat for older numpy where bool(dtype) is False

* Added timedelta
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
* Use argument dtype to inform coercion

Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767

* Compat for older numpy where bool(dtype) is False

* Added timedelta
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
* Use argument dtype to inform coercion

Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767

* Compat for older numpy where bool(dtype) is False

* Added timedelta
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants