Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: 2D support for 1D ExtensionArray subclasses #26954

Closed
wants to merge 3 commits into from

Conversation

jbrockmendel
Copy link
Member

[ci skip]

Companion to #26914 trying out a way to implement 2D methods for EA subclasses that don't do it natively.

Chose the arrow EA as the example to start with, then discovered that equality checks return a scalar instead of operating elementwise, so mostly just comparing the shapes.

@jbrockmendel jbrockmendel changed the title D2wrap POC: 2D support for 1D ExtensionArray subclasses Jun 20, 2019
@TomAugspurger
Copy link
Contributor

Very interesting, thanks for putting this together.

I'll take a closer look later, but for now: can you think of ways to not require making that requirement on __init__? Thus far we've avoided saying anything about it, which I think has been useful. Is there anything in your ReshapeWrapper mixin that absolutely needs to be done in this init?

I wonder if we're able to achieve the same thing by just requiring that .shape be settable? All of the reshape operations are supposed to be zero-copy, right?

@jbrockmendel
Copy link
Member Author

can you think of ways to not require making that requirement on __init__? [...] I wonder if we're able to achieve the same thing by just requiring that .shape be settable?

Probably. Pinning attributes outside of __init__ is a pattern I try to avoid, but if the relevant tradeoffs make it worthwhile, so it goes. Probably not going to bother for the proof of concept.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 20, 2019

Pinning attributes outside of init is a pattern I try to avoid, but if the relevant tradeoffs make it worthwhile, so it goes.

Complete agree with the point about __init__, but I don't think there's a tradeoff here. My point is that if we allow something similar to NumPy

In [17]: a = np.arange(12)

In [18]: a.shape = (3, 4)

In [19]: a
Out[19]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

by making an ExtensionArray.shape.setter, then we don't need to changes to __init__. Anywhere in this POC where you call

return type(self)(self, shape=shape)

could be replaced with

self.shape = shape
return self

That's my hope anyway.

@TomAugspurger
Copy link
Contributor

Not necessarily worth deciding here, but we should think about whether we want this methods (reshape, T, ravel, etc.) surfacing on the user-facing classes. I think the answer right now is "yes". It'll be a bit unfortunate for users to try and do something like array.reshape(3, 4) and have it raise saying that only (N, 1) or (1, N) is allowed, but I think that's unavoidable if we want to avoid a bunch of boilerplate type checking in our internals.

@jbrockmendel
Copy link
Member Author

by making an ExtensionArray.shape.setter, then we don't need to changes to __init__

I'd be fine with that.

could be replaced with

self.shape = shape
return self

Presumably we'd want to be operating on a non-deep copy, so something like:

def _with_shape(self, shape):
    other = self.copy(deep=False)
    other.shape = shape
    return other

Would that satisfy all of the discussed criteria?

@TomAugspurger
Copy link
Contributor

Good point about the (non deep) copy. I forgot that the equivalent ndarray operations were no-copy, but not inplace.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 20, 2019

@jorisvandenbossche you may want to skip through this discussion before the meeting, if you have a chance.


Focusing on the minute for ExtensionArray.take, how do we achieve the goal of not leaking this 2-D complexity down to an ExtensionArray subclass? My tentative preference is to add a class decorator to ExtensionArray that does a bit of black magic to rewrite the subclasses .take method. In the re-written .take method, we

  1. Simply pass through indices when self.ndim == 1
  2. Rewrite the 2D indices to the equivalent 1D indices when self.ndim ==1, pass that through to the original .take, and appropriately reshape the result.

Is that feasible? Is it too magical? What other methods would we need to patch (__getitem__, ...)?

@jorisvandenbossche
Copy link
Member

Yes, will try to look at this PR in more detail before the meeting.

@TomAugspurger
Copy link
Contributor

FWIW, I think the conversation is more important than the PR at this point (no slight intended, Brock).

@jbrockmendel
Copy link
Member Author

What other methods would we need to patch (getitem, ...)?

__getitem__ is the main one that got non-trivially patched in the other PR. Things like astype can get simply patched to give super().astype(*args, **kwargs).reshape(self.shape)

Other things not patched in the other PR that would give simplifications include shift and diff.

how do we achieve the goal of not leaking this 2-D complexity down to an ExtensionArray subclass?

Can you expand a bit on the problem you're trying to avoid? Do you think asking users to mix-in/wrap/metaclass is too invasive? Or are you concerned about the secretly-2D variant accidentally getting exposed to the user?

My tentative preference is to add a class decorator to ExtensionArray that does a bit of black magic

This is definitely prettier than the mixin version used in the arrow test here, but yah, the black magic could be an issue.

It'll be a bit unfortunate for users to try and do something like array.reshape(3, 4) and have it raise saying that only (N, 1) or (1, N) is allowed

We don't necessarily have to, and in fact for unstack there is some simplification to be gotten by allowing arbitrary 2D (which we can then separate back to rowlike if we really want; see DatetimeTZBlock.unstack in the other PR).

@TomAugspurger
Copy link
Contributor

Do you think asking users to mix-in/wrap/metaclass is too invasive?

It may be me misunderstanding the MRO, but right now ExtensionArray.take raises AbstractMethodError.

If I as a 3rd-party EA author want to implement .take,

class MyEA(ExtensionArray, ReshapeMixin):
    def take(self, ...):
        # what do I do to not worry about 2d here? How do I get to ReshapeMixin.take?

A class decorator on ExtensionArray will solve that since we can overwrite MyEA.take with our own 2D-aware .take, which uses the original MyEA.take interanlly. (and as a side-benefit, it's source compatible with older pandas since MyEA won't need to be updated).

@gfyoung gfyoung added Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Jun 20, 2019
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 21, 2019

@jbrockmendel a sketch of my thoughts, diff is from master

The basic idea is to have ExtensionArray inherit from a metaclass that patches the subclasses class definition. Just writing that up makes me uncomfortable, but it's a minor patch :)

Things like _rewrite_for_take is just a stub.

diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py
index c709cd9e9..e9bb6fb2b 100644
--- a/pandas/core/arrays/base.py
+++ b/pandas/core/arrays/base.py
@@ -29,7 +29,33 @@ _not_implemented_message = "{} does not implement {}."
 _extension_array_shared_docs = dict()
 
 
-class ExtensionArray:
+def _rewrite_for_take(indices, shape):
+    return indices
+
+
+def rewrite_sized_ops(cls: 'ExtensionArray'):
+    original_take = cls.take
+
+    def take(self, indices, allow_fill=False, fill_value=None):
+        print(type(self), type(indices))
+        indices = _rewrite_for_take(indices, self.shape)
+        return original_take(self,
+                             indices,
+                             allow_fill=allow_fill,
+                             fill_value=fill_value)
+
+    print(f'patching take for {cls}')
+    cls.take = take
+    return cls
+
+
+class Rewriter(type):
+    def __init__(cls, name, bases, clsdict):
+        rewrite_sized_ops(cls)
+        super().__init__(name, bases, clsdict)
+
+
+class ExtensionArray(metaclass=Rewriter):
     """
     Abstract base class for custom 1-D array types.
 
@@ -112,6 +138,9 @@ class ExtensionArray:
     # Don't override this.
     _typ = 'extension'
 
+    def __init__(self):
+        self._shape = len(self),
+
     # ------------------------------------------------------------------------
     # Constructors
     # ------------------------------------------------------------------------
@@ -298,14 +327,23 @@ class ExtensionArray:
         """
         Return a tuple of the array dimensions.
         """
-        return (len(self),)
+        return self._shape
+
+    @shape.setter
+    def shape(self, value):
+        value = tuple(value)
+        assert len(value) <= 2
+        if len(value) == 2:
+            assert any(v == 1 for v in value)
+
+        self._shape = value
 
     @property
     def ndim(self) -> int:
         """
         Extension Arrays are only allowed to be 1-dimensional.
         """
-        return 1
+        return len(self.shape)
 
     @property
     def nbytes(self) -> int:
diff --git a/pandas/tests/extension/arrow/bool.py b/pandas/tests/extension/arrow/bool.py
index 2263f5354..e8426d127 100644
--- a/pandas/tests/extension/arrow/bool.py
+++ b/pandas/tests/extension/arrow/bool.py
@@ -48,6 +48,7 @@ class ArrowBoolArray(ExtensionArray):
         assert values.type == pa.bool_()
         self._data = values
         self._dtype = ArrowBoolDtype()
+        super().__init__()
 
     def __repr__(self):
         return "ArrowBoolArray({})".format(repr(self._data))

The downsides are

  1. ✨ magic ✨
  2. subclasses need to call super().__init__() so that we can set self._shape. That's best practice, but probably isn't always done. Although we could solve it with more ✨ magic ✨ :)

@jbrockmendel
Copy link
Member Author

The basic idea is to have ExtensionArray inherit from a metaclass that patches the subclasses class definition. Just writing that up makes me uncomfortable, but it's a minor patch :)

Thanks for this; I'll definitely try this out soon since the mixin variant is giving me unexpected MRO behavior (trying to apply the proof of concept here to Categorical)

What you've written here is just for take, but I'm assuming this would also be used for __getitem__ etc? (The other one I've identified as being easy+necessary is __iter__)

def __iter__(self):
    if self.ndim > 1:
        for n in range(len(self)):
            yield self[n]
        return
    for item in super().__iter__():
        yield item

The downsides are

✨ magic ✨
subclasses need to call super().init() so that we can set self._shape. That's best practice, but probably isn't always done. Although we could solve it with more ✨ magic ✨ :)

Yah. But there's pushback on anything more than literally-zero impact on downstream authors, and so far I don't see any other way to achieve that.

What happens with this approach of the downstream author either a) implements 2D natively (possibly using something like the mixin from the other PR) or b) has their own metaclass?

@TomAugspurger
Copy link
Contributor

I think there should some class attribute a subclass can set to disable all the magic.

@jbrockmendel
Copy link
Member Author

    def __init__(self):
        self._shape = len(self),

runs into the problem that len(self) is defined in general as self.shape[0]

@jreback
Copy link
Contributor

jreback commented Jun 22, 2019

    def __init__(self):
        self._shape = len(self),

runs into the problem that len(self) is defined in general as self.shape[0]

i would define it more like this (probably make it a method)

def _shape_2d():
return 1, len(self))

@jbrockmendel
Copy link
Member Author

@jreback the whole point of this is to avoid having to special-case code for whether it is dealing with EA vs ndarray (you've advocated for this elsewhere).

@jreback
Copy link
Contributor

jreback commented Jun 22, 2019

maybe you don’t understand

you also modify PandasArray

then you easily have coherence

we don’t have raw ndarrays any more just EAs

@jbrockmendel
Copy link
Member Author

you also modify PandasArray [...] we don’t have raw ndarrays any more just EAs

This is definitely a benefit to allowing 2D.

then you easily have coherence

You have coherence when accessing block.values from within block, but unfortunately there are other places that treat block.values as non-private.

_shape_2d, _take_2d etc are more workarounds, and the goal here is to get rid of the need for workarounds. That said, it is likely an improvement on the status quo, so let's keep it as plan C in case Tom and I can't figure out a way to make the metaclass approach work.

@jreback
Copy link
Contributor

jreback commented Jun 22, 2019

you also modify PandasArray [...] we don’t have raw ndarrays any more just EAs

This is definitely a benefit to allowing 2D.

then you easily have coherence

You have coherence when accessing block.values from within block, but unfortunately there are other places that treat block.values as non-private.

_shape_2d, _take_2d etc are more workarounds, and the goal here is to get rid of the need for workarounds. That said, it is likely an improvement on the status quo, so let's keep it as plan C in case Tom and I can't figure out a way to make the metaclass approach work.

you can certainly try to get the metaclass approach go work and see how far you can simplify things

but it may be too much magic - that said by all means see how far you can get

a

@jbrockmendel
Copy link
Member Author

you can certainly try to get the metaclass approach go work and see how far you can simplify things
but it may be too much magic

I maintain the best approach is allow-but-don't-require 2D, avoid metaclass magic, and offer a ReshapeMixin (like the other PR) for EAs that wrap ndarray. Then any authors who want the benefits of 2D can implement it themselves, and we can get a lot of the simplification internally. This is also the only approach I see that permits incremental progress.

@jreback
Copy link
Contributor

jreback commented Jun 22, 2019

sure allow a mixin is fine as well

@TomAugspurger
Copy link
Contributor

Can you clarify how allow-but-don’t-require 2D EAs solves the shape inconsistency. I’ve been assuming that we have get all EAs to be reshapeable to 2D somehow, either through our magic or requiring authors to update.

@jbrockmendel
Copy link
Member Author

Can you clarify how allow-but-don’t-require 2D EAs solves the shape inconsistency. I’ve been assuming that we have get all EAs to be reshapeable to 2D somehow, either through our magic or requiring authors to update.

allow-but-don't-require is an intermediate step that allows us to move the ball down the field (in a rollback-able manner) while we figure out how to get to require-2D-compat.

else:
return self[n, :]
if n == -1:
seq = [fill_value] * self.shape[1]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should be self.shape[0]. either that or above on 1250 it should be self.shape[0]

@jbrockmendel jbrockmendel deleted the d2wrap branch November 20, 2021 23:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants