Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtensionArray constructor issue #19906

Closed
TomAugspurger opened this issue Feb 26, 2018 · 13 comments
Closed

ExtensionArray constructor issue #19906

TomAugspurger opened this issue Feb 26, 2018 · 13 comments
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.
Milestone

Comments

@TomAugspurger
Copy link
Contributor

I don't think that pandas should make any assumptions on how subclasses __init__ method. Currently we assume that the first positional argument accepts an instance of the class or a sequence of the class's scalar type.

I'd rather break that into two distinct methods.

@classmedthod
def from_extension_array(cls, extension_array: ExtensionArray) -> ExtensionArray:
    """Construct a new ExtensionArray from an instance of the same type

    Parameters
    ----------
    extension_array : ExtensionArray
        An instance of 'cls'

    Returns
    -------
    ExtensionArray
    """


ScalarType = ExtensionArray.dtype.type

@classmethod
def from_scalars(cls, scalars: Sequence[ScalarType]) -> ExtensionArray:
    """Construct a new ExtensionArray from a sequence of the scalar type

    Parameters
    ----------
    scalars : Sequence[ScalarType]
        A sequence of cls.dtype.type, the scalar type for this array

    Returns
    -------
    ExtensionArray
    """

I think these should be abstract. For many subclasses, a simple cls(arg) should suffice.

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Feb 26, 2018
@TomAugspurger TomAugspurger added ExtensionArray Extending pandas with custom dtypes or arrays. API Design labels Feb 26, 2018
@jreback
Copy link
Contributor

jreback commented Feb 26, 2018

what is the reason for this?

@TomAugspurger
Copy link
Contributor Author

In case a subclass has a different __init__ signature.

@jorisvandenbossche
Copy link
Member

+1

I only would make them private _from_scalars and _from_extension_array, to give the extension author to give them more meaningful public names if they want to add something like that (eg in geopandas we currently call 'from_scalars' from_shapely).

Also we could here make clear expectations about data being copied or not (and for that, maybe add a copy keyword.

@TomAugspurger
Copy link
Contributor Author

Also we could here make clear expectations about data being copied or not

Yes, I thought about making that part of the _from_extension_array one, but I wasn't sure if all subclasses would be able to honor it, and what implications that would have for pandas using it, since we couldn't really rely on it. But if it's just a performance thing, subclasses maybe having to copy, even with copy=Falsse, isn't such a big deal.

@jorisvandenbossche
Copy link
Member

I think the fact we cannot rely upon is if it always can be done without copy, but we can rely on requiring a copy? (similar to the copy keyword in np.array)

@TomAugspurger
Copy link
Contributor Author

similar to the copy keyword in np.array

Yes, that sounds right.

Also +1 to making them private

Are there any other cases we should consider? In cyberpandas I have two others

@jreback
Copy link
Contributor

jreback commented Feb 26, 2018

and we already have and use _construtor in pandas proper. Having alternative constructors is fine, except we have no way to actively select when / how one is called. So this is a huge break with current pracitce. I am aa huge -1 on doing this w/o actually changing pandas to do this. I don't think this is possible.

@TomAugspurger
Copy link
Contributor Author

This is as much (or more so) for external code as it is for pandas. We can't impose the same restrictions on 3rd party libraries as we do internally.

The use of these constructors will be limited. The only place returning ExtensionArrays will be in ExtensionArray itself. It's not something that's going to pollute all of our internals.

@jorisvandenbossche
Copy link
Member

We are adding functionality here. There is no place in pandas where we currently call this for ExtensionArrays, so each call we add we can evaluate to see which one we should use.

For example, Categorical (currenlty the only ExtensionArray in pandas) has a _constructor, but it is used in a very specific way (it is always used with passing codes (with fastpath=True)), so that is not something we could rely upon in general anyhow. I think it would be clearer if we had an internal _from_codes or _constructor_from_codes (with less checks than the public from_codes) instead of a _constructor(..., fastpath=True)

@jorisvandenbossche
Copy link
Member

The use of these constructors will be limited. The only place returning ExtensionArrays will be in ExtensionArray itself. It's not something that's going to pollute all of our internals.

Indeed, eg the _from_scalars would currently mainly be used in the tests (apart from the 'fallback' unique implementation)

@shoyer
Copy link
Member

shoyer commented Feb 26, 2018

I agree with this change.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 26, 2018
Adds two new (private, but part of the interface) constructors to EA.

Closes pandas-dev#19906
@jorisvandenbossche
Copy link
Member

Do we actually currently have cases where we would use _from_extension_array? Because if we already know it is an extension array, why passing it through a constructor instead of just returning it or taking a copy of it directly?

@TomAugspurger
Copy link
Contributor Author

Do we actually currently have cases where we would use _from_extension_array?

Good point. I initially included it because that's a case I ran into with cyberpandas' IPArray.__init__. But indeed, we don't need it on pandas side since we can just return the EA (copying if needed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

No branches or pull requests

4 participants