API: astype mechanism for extension arrays #22384

Open
TomAugspurger opened this issue Aug 16, 2018 · 41 comments
Labels
API Design Astype Closing Candidate May be closeable, needs more eyeballs Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@TomAugspurger
Contributor

In #22343, we're bumping into a few problems around casting to / from extension arrays.

  1. Our current default implementation for ExtensionArray.astype doesn't handle target dtypes that are extension types (so extension_array.astype('category') fails). At the moment, each EA has to implement its own astyping, which is error-prone and difficult.
  2. Some EAs may be more opinionated than others about how astyping should be done. There may exist fastpaths between certain EAs, but only one of the source and destination types may know about that fast path. This (I think) calls for a kind of dispatch mechanism, where each gets to say whether they know how to handle the astyping.

I'm not sure how far down this road we want to go. Should we instead encourage users to use explicit constructors like .assign(x=pd.Categorical(...)) rather than .astype('category')?

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 16, 2018
@TomAugspurger
Contributor Author

To be clear, there are a couple levels we could take on here.

At a minimum, I think .astype('dtype_name') where dtype_name is in our registry, should work, even if it's inefficient. That can be solved before going down a full dispatch mechanism.

@jorisvandenbossche jorisvandenbossche changed the title astype mechanism for extension arrays API: astype mechanism for extension arrays Aug 16, 2018
@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Aug 16, 2018
@jorisvandenbossche
Member

Let's try to summarise the (ideal) requirements:

  • We want EAs to be able to implement the logic for converting themselves to another dtype, so we at least need some way to dispatch to EAs
  • But we also don't want EAs to have to be aware of all possible target dtypes; moreover, some target dtypes might want to take over the conversion (such as categorical, see below)
  • We want EAs to have control over how values are converted to them when the target dtype is an EA
    • This is currently already done by calling dtype.construct_array_type()._from_sequence(..) if dtype is an EA dtype
  • The EA.astype function itself should be usable directly by users (e.g. it should not return NotImplemented as part of some dispatch mechanism)

Does that sound right?


In the PR I mentioned the idea of a "negotiation protocol" (#22343 (comment)) similar to how operators work in Python:

  • When doing an astype operation (eg Series(EA).astype(dtype)), we first check if the values (the array backing the series) know how to do this conversion:
    • if values is a numpy array, we call astype if dtype is a numpy dtype; otherwise continue with the next step below
    • if values is an EA, we call EA.__astype(dtype) -> if the EA knows how to handle the specific dtype, it returns a new EA/array; if it does not know how to convert itself to dtype, it returns NotImplemented and we continue with the next step
  • Next, we check if the target dtype knows how to convert the values to its own dtype -> if it knows, it returns a new EA/array; otherwise it returns NotImplemented
  • If both don't know (both return NotImplemented), either raise an error or fall back to the base implementation (np.array(self, dtype=dtype), which might also raise an error)

I used __astype here instead of astype, because we don't want the public EA.astype to return NotImplemented (one of the concerns raised in the PR). As Tom noted, it's a bit messy, but not too much I think.
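A minimal sketch of this negotiation protocol (the __astype hook on the array and the __astype_from hook on the dtype are hypothetical names for illustration, not pandas API):

import numpy as np

def dispatch_astype(values, dtype):
    # 1. ask the source array first
    astype_hook = getattr(values, "__astype", None)
    if astype_hook is not None:
        result = astype_hook(dtype)
        if result is not NotImplemented:
            return result
    # 2. then ask the target dtype
    from_hook = getattr(dtype, "__astype_from", None)
    if from_hook is not None:
        result = from_hook(values)
        if result is not NotImplemented:
            return result
    # 3. fall back to the base implementation (which may itself raise)
    return np.array(values, dtype=dtype)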

The other concern raised is the divergence between Series.astype(dtype) and Series.values.astype(dtype) (where the first would be able to handle more dtypes; e.g. Series.values.astype('category') would not work if the EA didn't implement that itself).

However, I think it should be possible (if the EA author wants that) to make this more consistent, and we could implement a base class astype example that does this by calling dtype.construct_array_type()._from_sequence(..), just as we do in pandas.


Another possibility would be that we provide some helper function for EA authors (and for internal use) that basically does the above, but simply inside the astype method. And then just always dispatch to EA.astype.

For example, the EA.astype method could look like:

def astype(self, dtype):
    if ...:  # check whether I know what to do for this dtype
        ...
        return values
    else:
        return pandas_astype(self, dtype)

with

def pandas_astype(values, dtype):
    if is_extension_dtype(dtype):
        return dtype.construct_array_type()._from_sequence(values)
    else:
        return np.array(values, dtype=dtype)

But that then counts on the EA's actually doing this.

@TomAugspurger
Contributor Author

Sorry I missed your comment earlier Joris. Your list of ideal requirements sounds correct. Haven't thought much about the implementation yet.

Regardless, not something we can solve for 1.0, so pushing it off the milestone.

@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Jan 2, 2020
@Dr-Irv
Contributor

Dr-Irv commented Jan 27, 2020

@jorisvandenbossche Based on something I wrote here (#31204 (comment)) with respect to StringDtype, I'm wondering if the rules you wrote above should be reversed.

You wrote:

When doing an astype operation (eg Series(EA).astype(dtype)), we first check if the values (the array backing the series) know how to do this conversion:
if values is a numpy array, we call astype if dtype is a numpy dtype; otherwise continue with the next step below
if values is an EA, we call EA.__astype(dtype) -> if the EA knows how to handle the specific dtype, it returns a new EA/array; if it does not know how to convert itself to dtype, it returns NotImplemented and we continue with the next step
Next, we check if the target dtype knows how to convert the values to its own dtype -> if it knows, it returns a new EA/array; otherwise it returns NotImplemented

I think we ought to consider doing the reverse, namely:

When doing an astype operation (eg Series(EA).astype(dtype)), we first check if the target dtype knows how to convert the values to its own dtype -> if it knows, it returns a new EA/array; otherwise it returns NotImplemented
Then check if the values (the array backing the series) know how to do this conversion:

  • if values is a numpy array, we call astype if dtype is a numpy dtype; otherwise continue with the next step below
  • if values is an EA, we call EA.__astype(dtype) -> if the EA knows how to handle the specific dtype, it returns a new EA/array; if it does not know how to convert itself to dtype, it returns NotImplemented and we continue with the next step

In the case of StringDtype, because the elements of an EA should have __str__ implemented, the conversion is straightforward. Asking each EA to "know" about StringDtype seems like overkill.

As I wrote there, I would think that the behavior of pd.Series(some_list, dtype=EAtype) and s.astype(EAtype) would be similar, namely if the EAtype knows how to convert some_list, then it just does it, so we prioritize letting the target dtype do the work, not the other way around.
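Sketched against the same hypothetical hooks as the protocol sketch earlier in the thread, the reversed order would be:

import numpy as np

def dispatch_astype_target_first(values, dtype):
    # 1. ask the target dtype first
    from_hook = getattr(dtype, "__astype_from", None)
    if from_hook is not None:
        result = from_hook(values)
        if result is not NotImplemented:
            return result
    # 2. only then ask the source array
    astype_hook = getattr(values, "__astype", None)
    if astype_hook is not None:
        result = astype_hook(dtype)
        if result is not NotImplemented:
            return result
    # 3. fall back to the base implementation
    return np.array(values, dtype=dtype)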

@giuliobeseghi

giuliobeseghi commented Feb 15, 2020

Just a few comments on this, not sure if they're helpful.

Input parameters

pd.Categorical and astype are not equivalent. You can't pass any "input" argument to astype ("input" defined as an argument that influences the conversion, like categories or ordered in pd.Categorical).

Should these arguments be allowed with astype (possibly as args and kwargs)? Probably not, because this is in conflict with how astype is designed. But then you don't have the functionality of pd.Categorical. So +1 on pd.Categorical, which is more versatile.

Return

pd.Categorical and astype return a different object.

import pandas as pd

series = pd.Series(["a", "c", "b", "c", "a", "b", "a", "c"])

explicit = pd.Categorical(series)  # result is a Categorical array
implicit = series.astype("category")  # result is a Series with category dtype

pd.Categorical converts the input to a categorical array. astype preserves the input type, which I think is more in line with how pandas is designed, as it can be chained properly.
.assign(x=pd.Categorical(...)) would convert the categorical array to a series, so it could be chained too, but it obviously works only with dataframes and not series. So +1 on astype.

Consistency with pd.to_datetime

I've always found type conversion with pandas types a bit confusing. For example, with datetimes:

series = pd.Series([str(y) for y in range(2000, 2020)])
series.astype('datetime64[ns]')  # this works
pd.to_datetime(series)  # this works too

Which one should I use? I guess they're interchangeable, because no warning is raised. But it seems to me that pd.to_datetime is somehow preferred.

By the way, why not a pd.to_categorical as pd.to_datetime? You could pass the args of pd.Categorical and you would always get the right type as output. Also, it unifies the convention of using a "to_pandasdtype" method to convert pandas objects to the right pandas type. I'm kind of +3 on this, before hearing your thoughts :)
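A rough sketch of what such a hypothetical pd.to_categorical could look like (no such function exists in pandas today; names are illustrative):

import pandas as pd

def to_categorical(values, categories=None, ordered=None):
    # accept the same arguments as pd.Categorical ...
    cat = pd.Categorical(values, categories=categories, ordered=ordered)
    # ... but preserve the Series container, like astype does
    if isinstance(values, pd.Series):
        return pd.Series(cat, index=values.index, name=values.name)
    return cat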

@TomAugspurger
Contributor Author

You would astype to a CategoricalDtype, not Categorical.

@giuliobeseghi

Ok, I got confused with pd.Categorical. Sorry for that!

@jbrockmendel
Member

@jorisvandenbossche some suggestions to add to the list of desired behavior (sketched as checks below):

  • dtype.construct_array_type()._from_sequence(ea, dtype=dtype) should match ea.astype(dtype)
  • Series(ea).astype(dtype) should match Series(ea.astype(dtype))
  • For a masked dtype and a NA-less array, arr.astype(dtype).to_numpy() should match arr.astype(dtype.numpy_dtype)
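Expressed as (roughly) executable checks, using the nullable Int64/Float64 dtypes for concreteness (the third invariant is a proposal, so it may not hold on all pandas versions):

import numpy as np
import pandas as pd

ea = pd.array([1, 2, 3], dtype="Int64")  # masked array without NAs
dtype = pd.Float64Dtype()

# 1. _from_sequence and astype should agree
expected = ea.astype(dtype)
assert dtype.construct_array_type()._from_sequence(ea, dtype=dtype).equals(expected)

# 2. casting at the Series level should match casting the array first
assert pd.Series(ea).astype(dtype).equals(pd.Series(ea.astype(dtype)))

# 3. the masked result viewed as numpy should match the direct numpy cast
np.testing.assert_array_equal(
    ea.astype(dtype).to_numpy(), ea.astype(dtype.numpy_dtype)
)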

@jbrockmendel
Member

but there are certainly advanced use cases of wanting to convert it to integers.

I see this as exactly a use case of .view('i8').

Is there a viable option to punt on this before we go around in circles?

@jorisvandenbossche
Member

Is there a viable option to punt on this before we go around in circles?

We can at least move it to a separate discussion. Will open a new issue or PR for this.

@jorisvandenbossche
Member

jorisvandenbossche commented Sep 22, 2021

An overview of some dtype casting combinations, and the kind / safety of casting that they could allow (a short np.can_cast illustration follows the list):

  • int8 -> int64 (or general to larger bitwidth)
    • Numpy: safe cast
    • Always strictly safe
  • int64 -> int8 (or general to smaller bitwidth)
    • Numpy: same-kind / unsafe cast
    • Not generally safe, but can be equality-preserving at runtime (if within int8 bounds)
  • int64 -> float64
    • Numpy: safe cast
    • But could fail at runtime when an integer is too large to be represented exactly as a float
  • float64 -> int64
    • Numpy: unsafe cast
    • Not generally safe, but can be equality-preserving at runtime (if "round" floats without decimals, and if no NaNs are present)
      • We currently already check for NaNs
      • Question: would it be useful to distinguish the two failure cases (decimals and NaNs)? With a single keyword safe=True/False or casting="safe"/"unsafe", that would be difficult to express
  • datetime64[s] -> datetime64[ns] (or generally to higher resolution)
    • Numpy: same_kind / unsafe cast
    • Not generally safe, but can be equality-preserving at runtime (if not out-of-bounds)
  • datetime64[ns] -> datetime64[s] (or generally to lower resolution)
    • Numpy: safe cast
    • Generally safe, but we could let it fail at runtime (if nanoseconds would be truncated)
  • int/float/datetime64 -> string
    • Numpy: safe cast
    • Not equality-preserving, but generally roundtrip-preserving, so considered as safe
  • string -> int/float/datetime64
    • Numpy: unsafe cast
    • Not generally safe, but runtime can pass (if each string can be parsed for the specific dtype)
  • float64 -> datetime64:
    • Numpy: unsafe cast
    • This can pass safely at runtime if the float64->int64 cast passes, and then int64 can be cast to datetime64, but I would personally say we could also simply let this error (TypeError: unsupported cast) ? (currently we follow numpy, and this casting works in pandas, so that would be a deprecation)
  • timedelta64 -> datetime64:
    • Numpy: unsafe cast
    • Somewhat similar to the case above, this could in theory be supported (casting the underlying integers, which is what numpy does), but ideally this gives a TypeError (in this case pandas already does this)
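For reference, numpy's static classification for a few of these pairs can be queried with np.can_cast (datetime cases omitted):

import numpy as np

np.can_cast(np.int8, np.int64, casting="safe")         # True: widening
np.can_cast(np.int64, np.int8, casting="safe")         # False
np.can_cast(np.int64, np.int8, casting="same_kind")    # True
np.can_cast(np.int64, np.float64, casting="safe")      # True, despite possible precision loss at runtime
np.can_cast(np.float64, np.int64, casting="same_kind") # False
np.can_cast(np.float64, np.int64, casting="unsafe")    # True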

@jorisvandenbossche
Member

General observation: it might actually be sufficient to consider only two kinds of casting: supported and unsupported casts (i.e. casts that always raise a TypeError). And for the supported casts, we could always require a safe cast by default (i.e. value-preserving), with the option to turn off checks to get an unsafe cast.

If you look in the overview above at the casts that can work in some cases, numpy considers some safe and some same_kind or unsafe. But in practice, for all those cases, the cast can be equality-preserving (depending on the actual values). So from pandas' point of view, I would not make this distinction up-front about safe/unsafe casts (based on the dtypes), but only handle runtime errors (out of bound values, values that would overflow or get truncated, etc).
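A sketch of what such a value-based check could look like for the float -> int case (illustrative only, not pandas' actual implementation):

import numpy as np

def safe_astype_float_to_int(arr, dtype=np.int64):
    # reject NaNs up front instead of silently producing garbage values
    if np.isnan(arr).any():
        raise ValueError("cannot safely cast: array contains NaN")
    result = arr.astype(dtype)
    # the round-trip comparison catches truncated decimals and overflow alike
    if not (result == arr).all():
        raise ValueError("cannot safely cast: values would be changed")
    return result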

@TomAugspurger
Contributor Author

I think we should agree on a couple principles, and work from those:

  1. pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).
  2. pandas should by default error at runtime when type casting causes loss of equality (integer overflow, float -> int truncation, ...), with the option to disable that check (since it will be expensive).

Using these rules, especially 2, I think I agree with Joris' point:

I would not make this distinction up-front about safe/unsafe casts (based on the dtypes), but only handle runtime errors (out of bound values, values that would overflow or get truncated, etc).

@jbrockmendel
Member

@seberg I'm curious if any of the types of safety discussed here are going to be handled differently in the new numpy casting machinery you're working on.

@Dr-Irv
Contributor

Dr-Irv commented Sep 22, 2021

  • pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).

I agree with this, and I would like to bring up what I wrote at #22384 (comment), since I think it is related to the current discussion:

We have to decide where the astype is implemented. Let's say that you have

s1 = pd.Series(somelist, dtype=dtype1)
s2 = s1.astype(dtype2)

where dtype1 and dtype2 are arbitrary types. They could be integer, datetimes, or any EA type.

I believe that in pandas today, the following happens (please correct me if I'm wrong):

  • When s1 is created, the constructor for the underlying array supporting dtype1 is responsible for converting whatever is in somelist to dtype1 (or raising if not possible)
  • When s2 is created, the implementation of the underlying array supporting dtype1 is responsible for converting to dtype2.

I think the second behavior should change. When astype is called on a series of dtype1, the underlying EA implementing dtype2 should be responsible for converting objects of type dtype1 to dtype2. Then the same implementation used for converting lists/arrays of any type when constructing a Series of dtype2 can be reused when a Series of any dtype calls astype(dtype2).

I believe that if we think of it this way, we accomplish the goal of making Series(values, dtype=dtype) and values.astype(dtype=dtype) consistent, and then it is up to the "receiving" dtype of astype() to decide (and possibly document) how it will convert objects of various types.
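In code, that consistency requirement amounts to both paths funnelling through the same target-dtype constructor (a sketch using the existing _from_sequence pattern; the helper name is illustrative):

def construct_for_dtype(values, dtype):
    # the target dtype's array type does the conversion in both cases
    return dtype.construct_array_type()._from_sequence(values, dtype=dtype)

# Series(values, dtype=dtype)  ->  construct_for_dtype(values, dtype)
# series.astype(dtype2)        ->  construct_for_dtype(series.array, dtype2)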

This then lets us separate discussions into questions about how each receiving type should convert objects of different types based on the receiving type.

Hope this makes sense.

@seberg
Contributor

seberg commented Sep 22, 2021

The casting safety is something I have not really attempted to fix/attack. The one general comment that I have (which is probably either useless or enlightening) is that I consider "common dtype" and casting safety as completely unrelated. The other one is that "safe" could also mean safe from a type perspective (i.e. bool → int is not really safe), but actually means more "round-tripping" right now.

From a user-perspective, I also think that an "attempt" casting safety would make sense (I guess your safe=True), which tries the cast but fails if the result would have lost any information (i.e. a typical "safe" cast should never do this, but many unsafe casts could be attempted).

I designed parts of the new API in a way that we could add new casting safety levels at any time, but attempting a cast that may be safe is a bit different, since it requires propagating the information to the actual casting function: It is possible to grow in NumPy, but we have to design new API!

pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).

This one is a tricky one :(. NumPy force-coerces (effectively force casts) for np.array(values, dtype=dtype). I would love to get away from this also. This requires some API decisions though, and it is trickier than for normal casting. We can add that new API though! (My main annoyance is that the normal casting code can be designed in an extensible way, but I am not sure that is feasible for the scalar coercion code.)
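For example:

import numpy as np

# np.array force-casts with no error, even though information is lost
np.array([1.7, -2.9], dtype=np.int64)
# array([ 1, -2])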

(Not sure these notes actually help, basically, I have some thoughts about it, but little API decisions. And I think NumPy should try to move a bit eventually, but I am not exactly sure how/where.)

EDIT: sorry, the comment about common dtype is esoteric, probably just ignore it... The reasoning behind it is something like arr.astype(dtype=Integral, casting="safe"), which could fall back to the common-dtype operation to decide which integral to use before checking casting.

@seberg
Contributor

seberg commented Sep 22, 2021

Maybe I should just say what I do for NumPy's normal casting: I return the casting level from its implementation. I.e. a cast is described by an object:

class Int64ToFloat64Cast(ArrayMethod):
    DTypes = [Int64, Float64]  # classes, not instances

    def resolve_descriptors(self, input_output_dtypes):
        # Handles all preparation logic, but does not know the values;
        # returns the dtype instances supported by the C cast and the casting safety.
        return (Int64(), Float64()), "unsafe"

    def get_loop(...):
        # The C cast function (basically a ufunc inner-loop).
        return c_function_that_does_the_cast

So the resolve_descriptors tells NumPy whether it is a safe cast (or actually if it is just a view). That could indicate if an "attempt" cast is feasible, but I would require some new API for get_loop or the actual inner-loop to handle the rest.

(The distinct resolve_descriptors step is very important, because it allows, for example, chaining multiple casts and allocating the result arrays, details of which may be important for get_loop)

@jbrockmendel
Member

A suggestion for the list of principles: obj.astype(dtype).dtype == dtype (doesn't currently hold for Sparse, xref #34457)

@adamgranthendry

adamgranthendry commented Oct 31, 2021

Per #44117, please consider adding a keyword argument errors (default value raise) to astype, to optionally coerce values that cannot be cast to the specified type (to pd.NA or np.nan). (Users often know the data types they want to convert their columns to at creation.)

e.g. Currently, the following raises a ValueError:

>>> a = pd.Series(['a', 1, 3, ''], dtype=np.int)
ValueError: invalid literal for int() with base 10: 'a'

However, it would be nice to have (something like) the following capability:

>>> a = pd.Series(['a', 1, 3, '']).astype(np.int, errors="coerce")
>>> a
0    NaN
1    1
2    3
3    NaN
dtype: int32

Such an enhancement would centralize functionality existing in the multiple to_...() methods and also mimic the converters argument of read_excel.

(ASIDE: DataFrame.convert_dtypes is insufficient since it doesn't have coerce functionality, so the Series a in the above example would simply be converted to type object rather than the desired dtype)
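For the numeric case, something close to this is already achievable today by combining pd.to_numeric with a nullable integer dtype (a workaround, not the proposed astype(..., errors="coerce") API):

import pandas as pd

s = pd.Series(["a", 1, 3, ""])
result = pd.to_numeric(s, errors="coerce").astype("Int64")
# 0    <NA>
# 1       1
# 2       3
# 3    <NA>
# dtype: Int64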

@jbrockmendel
Member

xref #45151

@jbrockmendel
Member

Re-reading this thread, I think it can be split into two more-or-less distinct issues:

  1. How do we determine whether .astype dispatches to _from_sequence or the other way around?

  2. What types of safety/coercion do we want to enforce/allow?

For 1), ATM the base class EA.astype dispatches to _from_sequence, so de facto I think we've landed on a Reasonable Default (tm). Is there still a need for something like a negotiation protocol? I move that we declare this problem solved until a compelling use case is spotted.

For 2), I'm increasingly skeptical of a one-size-fits-all keyword argument with a fixed set of options. I'm wavering between a) "one-size-fits-most to handle the most common 90%", b) "something customizable", c) "the status quo isn't so bad"

@jorisvandenbossche
Member

So for the second point:

2. What types of safety/coercion do we want to enforce/allow?

I opened a new, dedicated issue (-> #45588), so we can further focus here on the first point (potential dispatch mechanism)

Per #44117, please consider adding a keyword argument errors (default value raise) to astype, to optionally coerce values that cannot be cast to the specified type (to pd.NA or np.nan).

@adamgranthendry yes, we have always had a bit of a divide between "specialized" conversion functions (eg to_datetime, to_numeric) that have this functionality (and more) and the generic casting method (astype), and the question of to what extent astype should be able to do everything the more specialized ones can do.
I think we also have older issues about this topic, but for now I reopened your issue, because I think we can keep that as a separate discussion (whatever mechanism we decide on here, there will always be some place to pass extra keywords, just as we will have to pass a keyword for casting safety).

@jorisvandenbossche
Member

jorisvandenbossche commented Feb 2, 2022

On the actual topic of a "mechanism" to dispatch to the source or target dtype to perform the actual cast, I made some exploratory code mock-ups last week, to think through what a potential mechanism would look like and what the advantages / disadvantages would be.

Numpy-like CastImplementation registry

First, I tried to make a mockup of a mechanism that mimics numpy (https://numpy.org/neps/nep-0042-new-dtypes.html#the-cast-operation), but a little bit simplified. It keeps the idea of a CastImplementation class, instances of which can be registered for a specific (source_dtype -> target_dtype) combination.

The abstract API for a base class and the registry:

class CastImplementation:

    def resolve_dtypes(
        self, dtypes: Tuple[ExtensionDtype, ExtensionDtype], **kwargs
    ) -> Tuple[ExtensionDtype, ExtensionDtype]:
        """given the from/to dtypes, for which dtypes is the cast implemented"""
        pass

    def do_cast(
        self, array: ExtensionArray, dtype: ExtensionDtype, **kwargs
    ) -> ExtensionArray:
        """does the actual cast"""
        pass


cast_registry = {}


def register_cast_impl(dtypes, cast_impl: CastImplementation | None):
    """Register cast implementation for (source, target) dtype combination.

    Register as `None` to explicitly mark the cast as unsupported.
    """
    if dtypes in cast_registry:
        raise ValueError("cast already registered")
    cast_registry[dtypes] = cast_impl


def get_cast_impl(dtypes) -> CastImplementation | None:
    return cast_registry.get(dtypes, DefaultFallBackCast)

(the DefaultFallBackCast could be one that falls back to _from_sequence of the target dtype)

Dtypes can then implement and register those; here, as a simple example, a subset of what would be registered for (Float, Int):

class FloatToIntCast(CastImplementation):

    def resolve_dtypes(self, dtypes, **kwargs):
        # non-parametrized dtypes, so always return the input
        return dtypes

    def do_cast(self, array, dtype, **kwargs):
        return astype_float_to_int(array, dtype, **kwargs)


for dtype_from in [pd.Float32Dtype, pd.Float64Dtype]:
    for dtype_to in [pd.Int8Dtype, pd.Int16Dtype, pd.Int32Dtype, pd.Int64Dtype]:
        register_cast_impl((dtype_from, dtype_to), FloatToIntCast)

and then the actual astype method can be implemented as a single function that handles all potential dtypes:

def astype(arr, dtype, **kwargs):
    # get the cast implementation class (based on dtype classes, not instances)
    cast = get_cast_impl((type(arr.dtype), type(dtype)))
    
    # if cast is known to be not supported:
    if cast is None:
        raise TypeError(f"Cannot cast from {arr.dtype} to {dtype}.")

    # in case of parametrized dtypes, check from which to which dtype we can cast
    resolved_dtype_from, resolved_dtype_to = cast.resolve_dtypes((arr.dtype, dtype))
    
    # if needed, first cast the input array to a different parametrization
    if arr.dtype != resolved_dtype_from:
        cast_from = get_cast_impl((type(arr.dtype), type(arr.dtype)))
        arr = cast_from.do_cast(arr, resolved_dtype_from)
    
    # do the main cast
    arr = cast.do_cast(arr, resolved_dtype_to)
    
    # if needed, cast the result array to a different parametrization
    if dtype != resolved_dtype_to:
        cast_to = get_cast_impl((type(resolved_dtype_to), type(resolved_dtype_to)))
        arr = cast_to.do_cast(arr, dtype)

    return arr

This is a quite complicated API (and also, with its class-based design, quite different from other interfaces we currently have in pandas), and maybe too complicated for what we need. But some nice things about it:

  • ExtensionDtypes can register the dtype casts they implement, and then the astype method is done automatically for them (so no long "if this dtype then .. elif this dtype then .. elif ... else" chain in every EA's astype).
  • For parametrized dtypes, you don't have to implement every combination.
    • As a concrete example (considering geopandas' geometry dtype), suppose one does Series[geometry].astype("string[pyarrow]"). Casting to strings is something the GeometryDtype knows best how to do, so they will implement that (and not the StringDtype), but with the above mechanism they don't necessarily need to implement casting exactly to the pyarrow-backed string dtype. Suppose GeometryDtype only implements (geometry -> string[python]); then the above mechanism will automatically split this cast in two steps: (geometry -> string[python]) and (string[python] -> string[pyarrow]), without GeometryDtype having to account for this (every parametrized dtype would be expected to register such a cast implementation between two instances of itself).

While in the code snippet above the cast implementation is a class, this could maybe also be simplified to registering cast functions for given dtype combinations (a kind of function dispatch based on the (source, target) dtypes), as sketched below.
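That simplification could look roughly like this (illustrative only):

import pandas as pd

# function-based registry instead of CastImplementation classes
cast_functions = {}

def register_cast(dtype_from, dtype_to):
    def decorator(func):
        cast_functions[(dtype_from, dtype_to)] = func
        return func
    return decorator

@register_cast(pd.Float64Dtype, pd.Int64Dtype)
def cast_float64_to_int64(array, dtype, **kwargs):
    ...  # the actual Float64 -> Int64 conversion logic goes here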

Simplified cast from/to ExtensionDtype methods

Something simpler, but still with the possibility for either the target dtype or the source dtype to implement the actual cast, would be to add two methods for this to the ExtensionDtype class:

class ExtensionDtype:

    def _cast_from(self, array: ExtensionArray, dtype: ExtensionDtype, **kwargs) -> ExtensionArray | None:
        """cast from this dtype to another dtype"""
        if isinstance(dtype, IntegerDtype):
            return astype_to_int(array, dtype)
        elif isinstance(dtype, StringDtype):
            result = to_string(array)
            # calling out to the string dtype's cast to ensure we have the exact dtype parametrization
            return result.astype(dtype, copy=False)
        elif ...
        
        return None

    def _cast_to(self, array: ExtensionArray, **kwargs) -> ExtensionArray | None:
        """cast from another dtype to this dtype"""
        ...
        return None

Those two methods either return a resulting array if they know how to cast (i.e. if they are the dtype that implements the cast) or otherwise they return None.
And then, the actual astype method could also be implemented generically as:

def astype(arr, dtype, **kwargs):

    # does the target dtype know how to cast to it?
    result = dtype._cast_to(arr, **kwargs)
    if result is not None:
        return result
    
    # else try if the array itself knows how to cast to other
    result = arr.dtype._cast_from(arr, dtype, **kwargs)
    if result is not None:
        return result
    
    # fall back to from_sequence
    return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, **kwargs)

This is a bit simpler than the numpy-like CastImplementation mechanism, and might be sufficient for pandas. Note however that while the actual astype here looks simpler than in the version above, this is also because some of the complexity is moved into _cast_from/_cast_to of each dtype class. For example, the case for parametrized dtypes that I mentioned above now needs to be handled manually in each _cast_from.

Formalizing the current situation

As @jbrockmendel brought up just above, formalizing what we currently already do basically in practice for most dtypes is also an option:

For 1), ATM the base class EA.astype dispatches to _from_sequence, so de facto I think we've landed on a Reasonable Default (tm). Is there still a need for something like a negotiation protocol? I move that we declare this problem solved until a compelling use case is spotted.

In practice, this means that every ExtensionArray subclass needs to define its own astype starting with a couple of checks for dtypes it knows how to handle, and otherwise let the target dtype do it:

class ExtensionArray:

    def astype(self, dtype: ExtensionDtype, **kwargs) -> ExtensionArray:
        # first handle the cases where the array itself knows how to cast
        # to the target dtype
        if isinstance(dtype, ...):
            return ...
        elif ...
        
        # if we don't know ourselves, leave it to the target dtype
        return dtype.construct_array_type()._from_sequence(self, dtype)

(the above leaves out the fact that we actually convert to a numpy array for numpy target dtypes; it is just illustrative of the flow when handling extension dtypes)

If we would decide this is Good Enough, then I would argue for using a different method than _from_sequence, specific to this use case (the base class implementation can still use _from_sequence for backwards compatibility), in case an ExtensionDtype would like to distinguish behaviour for casting.
Although this is maybe then just the ExtensionDtype._cast_to from the proposal above. And if you put the initial if/else block in ExtensionDtype._cast_from for code organization, then you basically have the same as above.

But it is indeed good to ask ourselves: is a negotiation protocol actually needed? (apart from a single _from_sequence-like fallback method)

I think the main use case I can think about now for the more complicated interface (the first proposal with CastImplementation) is the nicety about not having to know each parametrization of dtypes (the .astype("string[pyarrow]") example). But I don't know if that is sufficient to warrant an actual more complex protocol.

@Dr-Irv
Contributor

Dr-Irv commented Feb 2, 2022

In practice, this means that every ExtensionArray subclass needs to define its own astype starting with a couple of checks for dtypes it knows how to handle, and otherwise let the target dtype do it:

If we would decide this is Good Enough, then I would argue for using a different method than _from_sequence, specific to this use case (the base class implementation can still use _from_sequence for backwards compatibility), in case an ExtensionDtype would like to distinguish behaviour for casting.
Although this is maybe then just the ExtensionDtype._cast_to from the proposal above. And if you put the initial if/else block in ExtensionDtype._cast_from for code organization, then you basically have the same as above.

I think it is "good enough".

You also have to consider what happens for user-implemented extension dtypes. If we just require the extension dtypes to implement _cast_to(), meaning that if you call Series.astype(MyExtensionDtype), then pandas just calls _cast_to() under the hood, then the responsibility of converting from any dtype to the user-implemented dtype is delegated to the extension dtype implementation. _cast_to() can convert dtypes it knows how to handle, and raise an exception for ones it doesn't. For example, imagine two user-implemented extension dtypes MyExtensionDtype_1 and MyExtensionDtype_2. Maybe it is easy to go from MyExtensionDtype_1 to MyExtensionDtype_2 but not easy to go from MyExtensionDtype_2 to MyExtensionDtype_1. The implementations of _cast_to() for each of those classes decide which types they can convert and which types they can't.

I don't think you then need to have _cast_from() at all (or at least I don't see the benefit). For me, if I'm implementing an extension dtype, I don't want to think about how to convert from my type to all the other types that might be out there. But I'm willing to think about how to convert specific types to my extension dtypes.

@seberg
Contributor

seberg commented Feb 2, 2022

I will note a couple of things about my API for NumPy.

  • np.can_cast drops out "for free". It sometimes does a bit more work than you might like, but you don't have to duplicate logic.
  • "Promotion" drops out in the same step. The resolve_dtypes method also answers the question that S8 -> Unicode cast ends up as S8 -> U8 (fills in the U8 without having to do the actual cast).
  • CastingImpl is actually ArrayMethod and shared identically for ufuncs (there are some things ufuncs don't use, but it is very little). The actual implementation adds *view_offset as an output, which allows you to signal that astype can actually just return a view into the old array (NumPy does not use this for astype currently/yet – it does use it for ufuncs that require an internal cast).

But, while I think my API choice is awesome ;), I am not sure which parts fit pandas ExtensionArrays. The important point is that NumPy provides the container and DTypes are designed to be completely container-agnostic. So, for example, NumPy has to allocate the result array, but the ExtensionArray approach ties the extension dtype to a specialized container, so it naturally delegates this.

The point of having both directions available (in some form), is that if you write a bfloat16 and I write an arbitrary_precision_float, I can define the cast to and from your bfloat16. You don't have to know about my extension dtype even existing.
Effectively, NumPy does this using a simple dictionary:

MyExtensionDType_1._casts_to[MyExtensionDType_2](values)

Right now the API is a bit restrictive about adding new casts to the dictionary, but effectively that is what defining a cast means: Adding an ArrayMethod/CastingImpl to such an internal dictionary. I think a simple callable there will do for you, so long as you are in Python, you can always decide to extend it with methods (replace it with a CastingImpl class).

@jorisvandenbossche
Member

I don't think you then need to have _cast_from() at all (or at least I don't see the benefit). For me, if I'm implementing an extension dtype, I don't want to think about how to convert from my type to all the other types that might be out there. But I'm willing to think about how to convert specific types to my extension dtypes.

@Dr-Irv I think it's important to be able to control the cast from your dtype to any other dtype as well, and not only to your dtype. Sebastian gave an example of doing a custom float dtype. And another example that I often use to think this through is the GeometryDtype from geopandas. I mentioned above the cast between geometry and string, and in this case as extension dtype author I want to be able to control both the string -> geometry and geometry -> string casts.
While the StringDtype indeed can cast any other dtype to string ("In the case of StringDtype, because the elements of an EA should have __str__ implemented, the conversion is straightforward. Asking each EA to 'know' about StringDtype seems like overkill.", as you mentioned a long time ago above, #22384 (comment)), in this specific case geopandas is much better suited to do the conversion to strings (we have a ufunc that does such a conversion, which is much faster than calling str(..) on each element).

(now, as long as the EA can override astype and add some logic to cast from itself to any dtype before calling the other dtype's _cast_to, then you of course don't necessarily "need" _cast_from, but the comment above is mostly about the idea that an ExtensionDtype wants to control this, regardless of whether we let it do that in _cast_from or astype itself)

But what complicates the "asking each EA to know about StringDtype" is that there is not a single StringDtype, but that it is parametrized (python or pyarrow storage). And the nice thing about the first (numpy-like) proposal is that it doesn't require the EA to know about every StringDtype parametrization, but allows to only know about one.

But, while I think my API choice is awesome ;), I am not sure which parts fit to pandas

@seberg I like your API as well :), but it's indeed not necessarily needed for pandas. In numpy it operates on a lower level, interacts with ufuncs, etc. And as you mention, we don't only have the dtype, but also the array container (the EA subclass), which is already supposed to hold some of this logic (eg if casting is needed in an operation such as arr1 + arr2, this is currently expected to be handled in your EA.__add__)

@jbrockmendel
Member

I think it's important to be able to control the cast from your dtype to any other dtype as well, and not only to your dtype

I think the existing pattern discussed above can do this with the exception of a corner case I'll describe below. I'm curious if anyone thinks this corner case is A Big Deal.

Suppose Sally is implementing FooDtype/FooArray and wants to specify the behavior of BarDtype/BarArray.astype (of which she is not the author). Suppose that BarArray.astype uses the pattern discussed above of the form:

def astype(self, dtype, copy):
    if self.has_special_logic_for_dtype(dtype):
        [...]
    else:
        return dtype.construct_array_type()._from_sequence(self, dtype=dtype, copy=copy)

Then in FooArray._from_sequence she can do:

@classmethod
def _from_sequence(cls, scalars, dtype, copy):
    if isinstance(dtype, BarDtype):
        return my_special_logic(...)
    [...]

So Sally can accomplish the two-way control using the existing pattern, except for when BarArray.has_special_logic_for_dtype already has special logic for converting to FooDtype. I guess this is relevant if Sally thinks BarArray.astype is specifically wrong? My impression is that the relevant cases are more about when BarArray.astype doesn't know anything about FooDtype. Am I missing something here?

@Dr-Irv
Contributor

Dr-Irv commented Feb 3, 2022

(now, as long as the EA can override astype and add some logic to cast from itself to any dtype before calling the other dtype's _cast_to, then you of course don't necessarily "need" _cast_from, but the comment above is mostly about the idea that an ExtensionDtype wants to control this, regardless of whether we let it do that in _cast_from or astype itself)

yes, that makes sense. I now see that the symmetry of _cast_from() and _cast_to() makes sense.

@Dr-Irv
Contributor

Dr-Irv commented Feb 3, 2022

My impression is that the relevant cases are more about when BarArray.astype doesn't know anything about FooDtype.

So if Sally implements FooDtype._cast_to(), then BarArray.astype(any_dtype) can just call any_dtype._cast_to() without having to even know that FooDtype exists. Right?

@jbrockmendel
Member

So if Sally implements FooDtype._cast_to(), then BarArray.astype(any_dtype) can just call any_dtype._cast_to() without having to even know that FooDtype exists. Right?

That's my understanding of _cast_to, yes. My point was that this behavior already exists with the existing pattern with _from_sequence taking the place of _cast_to.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Apr 20, 2023