Use argument dtype to inform coercion #17779
Conversation
cc @wesm for the arrow part. The background is that dask has scalar objects that duck-type numpy scalars. We'd like …
Seems reasonable if all passing. Maybe want to check some perf, but I doubt it affects things.
pandas/core/dtypes/cast.py (Outdated)

```
@@ -483,6 +483,39 @@ def infer_dtype_from_array(arr, pandas_dtype=False):
     return arr.dtype, arr


def _maybe_infer_dtype_type(element):
```
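Only the signature of the new helper appears in the diff hunk above. A minimal self-contained sketch of the idea it implements — probing a duck-typed scalar for a `.dtype` attribute instead of falling back to `object` — might look like this (`DuckScalar` is a hypothetical stand-in, and this is not the exact pandas implementation):

```python
import numpy as np

def maybe_infer_dtype_type(element):
    """Return a numpy dtype if ``element`` advertises one, else None.

    Sketch of the idea in this PR: objects that duck-type numpy scalars
    (e.g. dask's Scalar) expose a ``.dtype`` attribute that coercion
    logic can consult rather than upcasting the result to object.
    """
    return getattr(element, "dtype", None)

class DuckScalar:
    # hypothetical stand-in for a dask-style scalar wrapping an i8 value
    dtype = np.dtype("i8")

print(maybe_infer_dtype_type(DuckScalar()))  # int64
print(maybe_infer_dtype_type(10))            # None (plain Python int has no .dtype)
```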
deprivatize (no leading _)
Master:

```python
>>> import dask.dataframe as dd
>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object
```

Head:

```python
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still keeping most of the changes in pandas-dev#16821.

Closes pandas-dev#17767
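The same fix can be illustrated without dask: a plain numpy scalar also carries a `.dtype` attribute, so arithmetic with it should preserve the frame's integer dtypes rather than upcasting to object. A quick check, assuming a pandas version that includes this change:

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                    'b': [7, 6, 5, 4, 3, 2, 1]})

# np.int64(10) duck-types a numpy scalar and exposes .dtype,
# so the addition should keep int64 columns instead of object
out = pdf + np.int64(10)
print(out.dtypes)
```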
Compare a3a2ee2 to b34c61f
Codecov Report
```
@@ Coverage Diff @@
## master #17779 +/- ##
==========================================
- Coverage 91.24% 91.22% -0.02%
==========================================
Files 163 163
Lines 49916 49923 +7
==========================================
- Hits 45544 45542 -2
- Misses 4372 4381 +9
```
Continue to review full report at Codecov.
```
(1.0, 'f8'),
(1j, 'complex128'),
(True, 'bool'),
# (np.timedelta64(20, 'ns'), '<m8[ns]'),
```
is this commented for a reason?
Fixed, forgot to uncomment it while testing.
lgtm. just a small comment. merge when satisfied.
@TomAugspurger no one's put in any effort on making the pyarrow scalar types usable in any kind of analytical setting; they're mostly for display purposes and coercion to Python types for the moment. We will need to address it at some point in the not-too-distant future.
Ok. I don't think the changes here do anything to help with that, but they shouldn't make anything more difficult.
```
(1.0, 'f8'),
(1j, 'complex128'),
(True, 'bool'),
(np.timedelta64(20, 'ns'), '<m8[ns]'),
```
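The expected dtype strings in these test cases match what numpy itself infers for each scalar, which can be verified standalone:

```python
import numpy as np

# each pair is (scalar, dtype string the test expects numpy to infer)
cases = [
    (1.0, 'f8'),
    (1j, 'complex128'),
    (True, 'bool'),
    (np.timedelta64(20, 'ns'), '<m8[ns]'),
]
for scalar, expected in cases:
    inferred = np.array(scalar).dtype
    assert inferred == np.dtype(expected), (scalar, inferred)
print("all dtype expectations hold")
```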
actually ought to add Timedelta/Timestamp, NaT, np.nan, Period
The missing-data ones are already covered in test_nanops. Timedelta / Timestamp are handled in tests/scalar/. Period doesn't support binops. So we should be good.
* Use argument dtype to inform coercion
* Compat for older numpy where bool(dtype) is False
* Added timedelta
This is more consistent with 0.20.3, while keeping the benefits from #16821
Closes #17767
I may play around with this a bit more. I feel like this is a bit too numpy-specific; for example, adding a DataFrame and a pa.Scalar still fails. Do we want that to work?
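The failure mode with a pyarrow scalar can be sketched without pyarrow installed: an object that wraps a value but exposes no numpy-compatible `.dtype` gives the dtype-probing logic nothing to work with, so coercion falls back to object. Both `ArrowLikeScalar` and this `maybe_infer_dtype_type` are hypothetical stand-ins, not real APIs:

```python
import numpy as np

class ArrowLikeScalar:
    """Hypothetical stand-in for pyarrow's Scalar: it wraps a value
    but does not expose a numpy-compatible ``.dtype`` attribute."""
    def __init__(self, value):
        self.value = value

def maybe_infer_dtype_type(element):
    # sketch of the dtype probing discussed in this PR
    return getattr(element, 'dtype', None)

print(maybe_infer_dtype_type(np.int64(10)))         # int64
print(maybe_infer_dtype_type(ArrowLikeScalar(10)))  # None: nothing for coercion to use
```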