Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: dispatch to EA.astype #22343

Merged
merged 10 commits into from
Aug 20, 2018
Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Aug 14, 2018

Closes #21296

split from #22325

Note that this is just a partial fix to get sparse working. We have some issues:

for better or worse, categorical is still special cased. Series[EA].astype('category') will not yet dispatch to EA.astype. That's good non-pandas EAs won't have to worry about specially handling categorical. It's bad, since it's a special case.

And that's the second issue, in general, how should we handle astyping from EA1 -> EA2? In principle, we always have ndarray[object] as the lowest common denominator. So something like EA2(EA.astype(object)) may work. But that won't necessarily be efficient. I'm not sure what's best here.

cc @xhochy

@TomAugspurger TomAugspurger added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels Aug 14, 2018
@@ -391,7 +392,7 @@ def astype(self, dtype, copy=True):

# coerce
data = self._coerce_to_ndarray()
return data.astype(dtype=dtype, copy=False)
return astype_nansafe(data, dtype, copy=None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was necessary so that

In [8]: s.values
Out[8]: IntegerArray([0, nan, 2], dtype='Int8')

In [9]: s.astype('uint32')

still raises. Previously, it did

In [4]: pd.core.arrays.IntegerArray([1, None, 2], dtype='uint8').astype('uint32')
Out[4]: array([         1, 2415919104,          2], dtype=uint32)

which I don't think we want. This was only tested at the Series[IntegerArray].astype level, which never called EA.astype

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this?

@@ -15,6 +15,17 @@ class DecimalDtype(ExtensionDtype):
name = 'decimal'
na_value = decimal.Decimal('NaN')

def __init__(self, context=None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seemed like a good idea to have a parameterized ExtensionDtype in the tests folder. For .astype, we're testing that Series[decimal].astype(DecimalDtype(new_context)) is passed all the way through.

def __init__(self, context=None):
self.context = context or decimal.getcontext()

def __eq__(self, other):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our default eq is dangerous for parametrized dtypes, depending on whether the parameters appear in the name :/

@codecov
Copy link

codecov bot commented Aug 14, 2018

Codecov Report

Merging #22343 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22343      +/-   ##
==========================================
+ Coverage   92.05%   92.05%   +<.01%     
==========================================
  Files         169      169              
  Lines       50709    50712       +3     
==========================================
+ Hits        46679    46682       +3     
  Misses       4030     4030
Flag Coverage Δ
#multiple 90.46% <100%> (ø) ⬆️
#single 42.25% <72.72%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 93.84% <100%> (ø) ⬆️
pandas/core/dtypes/cast.py 88.58% <100%> (ø) ⬆️
pandas/core/arrays/integer.py 94.69% <100%> (+0.02%) ⬆️
pandas/core/frame.py 97.24% <0%> (ø) ⬆️
pandas/core/series.py 93.72% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf70d11...2606d02. Read the comment docs.

@@ -205,6 +205,24 @@ def test_dataframe_constructor_with_dtype():
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("frame", [True, False])
def test_astype_dispatches(frame):
data = pd.Series(DecimalArray([decimal.Decimal(2)]), name='a')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here why this test is added here (instead of as generic extension test)

@jorisvandenbossche
Copy link
Member

Code changes look good!

Regarding the discussion items you mention, I am wondering if we need some kind of "negotation protocol" between (EA) dtypes for this. Basically something like:

  • consider Series(EA).astype(dtype)
  • first try EA.astype(dtype) -> if the EA knows how to handle the specific dtype, return a new EA/array; if it does not know how to convert itself to dtype, return NotImplemented
  • if the above returned NotImplemented, ask dtype if it knows how to convert the EA values to its own dtype -> if it knows, return a new EA/array; otherwise return NotImplemented
  • if both don't know (above return NotImplemented again), either raise an error or fall back to the base implementation (np.array(self, dtype=dtype,, which might also raise an error)

That would in principle enable EA.astype(EA2), and might also mitigate the 'category' problem you mention.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Aug 14, 2018

Your negotiation seems sensible I think. I worry a bit about a divergence between Series.astype(dtype) and Series.values.astype(dtype), but that's maybe unavoidable.

edit: though it would be strange (wrong) for a method like ExtensionArray.astype to return NotImplemented, so this would have to be a back-channel like ExtensionArray._astype, which is the same as astype, but can return NotImplemented? This seems messy.

@jorisvandenbossche
Copy link
Member

I worry a bit about a divergence between Series.astype(dtype) and Series.values.astype(dtype)

That's a good point, yeah, that would be strange. Such a negotiation does not really fit for a method, only for functions probably ..

@h-vetinari
Copy link
Contributor

Such a negotiation does not really fit for a method, only for functions probably...

How about raising (and catching) an UnsupportedConversionError (but letting other errors pass through)? That should easily be compatible with method calls.

@TomAugspurger
Copy link
Contributor Author

Perhaps. We'll leave that for another issue though I think, if we discover a need for it in practice.

@@ -637,22 +637,25 @@ def _astype(self, dtype, copy=False, errors='raise', values=None,
# force the copy here
if values is None:

if issubclass(dtype.type,
(compat.text_type, compat.string_types)):
if self.is_extension:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger are you really really sure this is needed?
all of this already works with IntegerArray so not t really sure what problem you are trying to solve

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed for #22325.

Sparse has special semantics for .astypeing, Series[sparse].astype(np.dtype) is interpreted as astyping the underlying .values.sp_values, rather than densifying and asytping. I'd like to deprecate this, but that's another matter.

In general though, it seems like EAs should have a say in how they're astyped, rather than always going through .get_values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already do this is my point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not on master? That hit's https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L652, which converts to an ndarray, before ever calling the extension array's astype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least, my new tests fail on master without these changes. I'm not sure if / how IntegerArray is being handled differently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean the current code in master or this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR

there is a ton of code in astype to dispatch to extension types already
this adds yet another branch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point to this "ton of code"? I don't see any other dispatch to EAs.
What is done is the generic conversion to array and then the use of astype_nansafe (which has a dispatch to EAs, but for the target EA dtype, not for the calling EA dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a 2 line change, this extra if condition. Which of those two lines is the convoluted one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I could swear I add an .is_extension branch already, that's why this looks weird. In any event this has lots of if/then condition. Please make a note / issue to clean this up.

try:
return arr.view(dtype)
except TypeError:
if copy is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary? what breaks this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some object astype that failed (not sure if it's to or from object). I've reverted this change, we'll see what fails.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [10]: np.array([1, 2, 3], dtype='int64').view(object)
...
TypeError: Cannot change data-type for object array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. thanks. Pushed a simpler fix that hopefully works.

if copy is None:
# allowed to copy if necessary (e.g. object)
return arr.astype(dtype, copy=True)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need an else

@@ -637,22 +637,25 @@ def _astype(self, dtype, copy=False, errors='raise', values=None,
# force the copy here
if values is None:

if issubclass(dtype.type,
(compat.text_type, compat.string_types)):
if self.is_extension:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I could swear I add an .is_extension branch already, that's why this looks weird. In any event this has lots of if/then condition. Please make a note / issue to clean this up.

copy : bool or None, default True
Whether to copy during the `.astype` (True) or
just return a view (False). Passing `copy=None` will
attempt to return a view, but will copy if necessary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the copy keyword still exists, so I would not remove it completely

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small test comment. is the references closing issue at the top-of-the PR correct? (it is the original EA dtype registry)?

@@ -205,6 +205,27 @@ def test_dataframe_constructor_with_dtype():
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("frame", [True, False])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we add a test in base which raises NotImplementedError so that authors are forced to cover this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is it useful to force EA authors to do it? This is basically checking the pandas dispatch (which we can do here), not the actual EA.astype implementation

Copy link
Contributor Author

@TomAugspurger TomAugspurger Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or how about something like

@pytest.mark.parametrize('typ, check', [
    ('category', 'is_categorical_dtype'),
    ...
])
def test_astype_category(self, data):
    assert check(data.astype(dtype))

Would that make sense as a base test? I think our default implementation would need to be updated to not fail that.

My main concern is that it would be difficult to override (e.g. skip) just some of the dtypes, so maybe we would have to write those as separate tests?

Copy link
Contributor Author

@TomAugspurger TomAugspurger Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mildly concerned that EA authors will not correctly handle extension types. Our base implementation currently fails to handle them. Although we document it as Cast to a NumPy array with 'dtype'.. Should we make the return type Union[ndarray, ExtensionArray]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave such a test until we better figure out how to handle those cross EA/non-EA astype calls (the discussion we were having above in this PR )

Should we make the return type Union[ndarray, ExtensionArray]?

Yeah, probably yes in any case (since EAs authors that provide multiple dtypes already do that, like IntegerArray)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add an issue to make this test, not blocking this PR on it.

@jreback jreback added this to the 0.24.0 milestone Aug 16, 2018
@TomAugspurger
Copy link
Contributor Author

Fixed the referenced issue.

@jorisvandenbossche
Copy link
Member

That's the same issue as for #22345?

----------
arr : ndarray
dtype : np.dtype
copy : bool, default True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you specify here what happens with False? (will always do view (except for object), even if the itemsize is incompatible)

@@ -205,6 +205,27 @@ def test_dataframe_constructor_with_dtype():
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("frame", [True, False])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is it useful to force EA authors to do it? This is basically checking the pandas dispatch (which we can do here), not the actual EA.astype implementation


with tm.assert_raises_regex(
ValueError, 'cannot convert float NaN to integer'):
arr.astype('uint32')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this as a stand-alone test at the bottom of this file? (it's not overwriting a base test, but it is a custom integer-specific test)

@TomAugspurger
Copy link
Contributor Author

Ugh, too early. #21296 is the one.

@TomAugspurger
Copy link
Contributor Author

@jreback was your comment about the wrong issue number all that was left from your point of view? If so, this should be good to go.

@jreback jreback merged commit a3c50a6 into pandas-dev:master Aug 20, 2018
@jreback
Copy link
Contributor

jreback commented Aug 20, 2018

thanks @TomAugspurger

maybe need to modify text in pandas/arrays/base/ExtensionArray (the doc-string) to add astype)?

@TomAugspurger TomAugspurger deleted the ea-astype-dispatch branch August 20, 2018 12:12
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series[EA].astype always casts intermediate to NumPy array
4 participants