-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: dispatch to EA.astype #22343
API: dispatch to EA.astype #22343
Conversation
@@ -391,7 +392,7 @@ def astype(self, dtype, copy=True): | |||
|
|||
# coerce | |||
data = self._coerce_to_ndarray() | |||
return data.astype(dtype=dtype, copy=False) | |||
return astype_nansafe(data, dtype, copy=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was necessary so that
In [8]: s.values
Out[8]: IntegerArray([0, nan, 2], dtype='Int8')
In [9]: s.astype('uint32')
still raises. Previously, it did
In [4]: pd.core.arrays.IntegerArray([1, None, 2], dtype='uint8').astype('uint32')
Out[4]: array([ 1, 2415919104, 2], dtype=uint32)
which I don't think we want. This was only tested at the Series[IntegerArray].astype
level, which never called EA.astype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for this?
@@ -15,6 +15,17 @@ class DecimalDtype(ExtensionDtype): | |||
name = 'decimal' | |||
na_value = decimal.Decimal('NaN') | |||
|
|||
def __init__(self, context=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seemed like a good idea to have a parameterized ExtensionDtype in the tests folder. For .astype
, we're testing that Series[decimal].astype(DecimalDtype(new_context))
is passed all the way through.
def __init__(self, context=None): | ||
self.context = context or decimal.getcontext() | ||
|
||
def __eq__(self, other): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our default eq is dangerous for parametrized dtypes, depending on whether the parameters appear in the name :/
Codecov Report
@@ Coverage Diff @@
## master #22343 +/- ##
==========================================
+ Coverage 92.05% 92.05% +<.01%
==========================================
Files 169 169
Lines 50709 50712 +3
==========================================
+ Hits 46679 46682 +3
Misses 4030 4030
Continue to review full report at Codecov.
|
@@ -205,6 +205,24 @@ def test_dataframe_constructor_with_dtype(): | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
|
|||
@pytest.mark.parametrize("frame", [True, False]) | |||
def test_astype_dispatches(frame): | |||
data = pd.Series(DecimalArray([decimal.Decimal(2)]), name='a') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here why this test is added here (instead of as generic extension test)
Code changes look good! Regarding the discussion items you mention, I am wondering if we need some kind of "negotation protocol" between (EA) dtypes for this. Basically something like:
That would in principle enable |
Your negotiation seems sensible I think. I worry a bit about a divergence between edit: though it would be strange (wrong) for a method like |
That's a good point, yeah, that would be strange. Such a negotiation does not really fit for a method, only for functions probably .. |
How about raising (and catching) an |
Perhaps. We'll leave that for another issue though I think, if we discover a need for it in practice. |
@@ -637,22 +637,25 @@ def _astype(self, dtype, copy=False, errors='raise', values=None, | |||
# force the copy here | |||
if values is None: | |||
|
|||
if issubclass(dtype.type, | |||
(compat.text_type, compat.string_types)): | |||
if self.is_extension: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger are you really really sure this is needed?
all of this already works with IntegerArray so not t really sure what problem you are trying to solve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed for #22325.
Sparse has special semantics for .astype
ing, Series[sparse].astype(np.dtype)
is interpreted as astyping the underlying .values.sp_values
, rather than densifying and asytping. I'd like to deprecate this, but that's another matter.
In general though, it seems like EAs should have a say in how they're astyped, rather than always going through .get_values
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already do this is my point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not on master? That hit's https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L652, which converts to an ndarray, before ever calling the extension array's astype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least, my new tests fail on master without these changes. I'm not sure if / how IntegerArray is being handled differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean the current code in master or this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this PR
there is a ton of code in astype to dispatch to extension types already
this adds yet another branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point to this "ton of code"? I don't see any other dispatch to EAs.
What is done is the generic conversion to array and then the use of astype_nansafe
(which has a dispatch to EAs, but for the target EA dtype, not for the calling EA dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a 2 line change, this extra if condition. Which of those two lines is the convoluted one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I could swear I add an .is_extension
branch already, that's why this looks weird. In any event this has lots of if/then condition. Please make a note / issue to clean this up.
pandas/core/dtypes/cast.py
Outdated
try: | ||
return arr.view(dtype) | ||
except TypeError: | ||
if copy is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary? what breaks this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was some object astype that failed (not sure if it's to or from object). I've reverted this change, we'll see what fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [10]: np.array([1, 2, 3], dtype='int64').view(object)
...
TypeError: Cannot change data-type for object array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. thanks. Pushed a simpler fix that hopefully works.
pandas/core/dtypes/cast.py
Outdated
if copy is None: | ||
# allowed to copy if necessary (e.g. object) | ||
return arr.astype(dtype, copy=True) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need an else
@@ -637,22 +637,25 @@ def _astype(self, dtype, copy=False, errors='raise', values=None, | |||
# force the copy here | |||
if values is None: | |||
|
|||
if issubclass(dtype.type, | |||
(compat.text_type, compat.string_types)): | |||
if self.is_extension: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I could swear I add an .is_extension
branch already, that's why this looks weird. In any event this has lots of if/then condition. Please make a note / issue to clean this up.
pandas/core/dtypes/cast.py
Outdated
copy : bool or None, default True | ||
Whether to copy during the `.astype` (True) or | ||
just return a view (False). Passing `copy=None` will | ||
attempt to return a view, but will copy if necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the copy keyword still exists, so I would not remove it completely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small test comment. is the references closing issue at the top-of-the PR correct? (it is the original EA dtype registry)?
@@ -205,6 +205,27 @@ def test_dataframe_constructor_with_dtype(): | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
|
|||
@pytest.mark.parametrize("frame", [True, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we add a test in base which raises NotImplementedError so that authors are forced to cover this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But is it useful to force EA authors to do it? This is basically checking the pandas dispatch (which we can do here), not the actual EA.astype implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or how about something like
@pytest.mark.parametrize('typ, check', [
('category', 'is_categorical_dtype'),
...
])
def test_astype_category(self, data):
assert check(data.astype(dtype))
Would that make sense as a base test? I think our default implementation would need to be updated to not fail that.
My main concern is that it would be difficult to override (e.g. skip) just some of the dtypes, so maybe we would have to write those as separate tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mildly concerned that EA authors will not correctly handle extension types. Our base implementation currently fails to handle them. Although we document it as Cast to a NumPy array with 'dtype'.
. Should we make the return type Union[ndarray, ExtensionArray]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave such a test until we better figure out how to handle those cross EA/non-EA astype calls (the discussion we were having above in this PR )
Should we make the return type Union[ndarray, ExtensionArray]?
Yeah, probably yes in any case (since EAs authors that provide multiple dtypes already do that, like IntegerArray)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add an issue to make this test, not blocking this PR on it.
Fixed the referenced issue. |
That's the same issue as for #22345? |
---------- | ||
arr : ndarray | ||
dtype : np.dtype | ||
copy : bool, default True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you specify here what happens with False? (will always do view (except for object), even if the itemsize is incompatible)
@@ -205,6 +205,27 @@ def test_dataframe_constructor_with_dtype(): | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
|
|||
@pytest.mark.parametrize("frame", [True, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But is it useful to force EA authors to do it? This is basically checking the pandas dispatch (which we can do here), not the actual EA.astype implementation
|
||
with tm.assert_raises_regex( | ||
ValueError, 'cannot convert float NaN to integer'): | ||
arr.astype('uint32') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this as a stand-alone test at the bottom of this file? (it's not overwriting a base test, but it is a custom integer-specific test)
Ugh, too early. #21296 is the one. |
@jreback was your comment about the wrong issue number all that was left from your point of view? If so, this should be good to go. |
thanks @TomAugspurger maybe need to modify text in pandas/arrays/base/ExtensionArray (the doc-string) to add astype)? |
Closes #21296
split from #22325
Note that this is just a partial fix to get sparse working. We have some issues:
for better or worse, categorical is still special cased.
Series[EA].astype('category')
will not yet dispatch toEA.astype
. That's good non-pandas EAs won't have to worry about specially handling categorical. It's bad, since it's a special case.And that's the second issue, in general, how should we handle astyping from EA1 -> EA2? In principle, we always have
ndarray[object]
as the lowest common denominator. So something likeEA2(EA.astype(object))
may work. But that won't necessarily be efficient. I'm not sure what's best here.cc @xhochy