POC: 2D support for 1D ExtensionArray subclasses #26954

jbrockmendel · 2019-06-20T01:52:41Z

[ci skip]

Companion to #26914 trying out a way to implement 2D methods for EA subclasses that don't do it natively.

Chose the arrow EA as the example to start with, then discovered that equality checks return a scalar instead of operating elementwise, so mostly just comparing the shapes.

TomAugspurger · 2019-06-20T03:05:29Z

Very interesting, thanks for putting this together.

I'll take a closer look later, but for now: can you think of ways to not require making that requirement on __init__? Thus far we've avoided saying anything about it, which I think has been useful. Is there anything in your ReshapeWrapper mixin that absolutely needs to be done in this init?

I wonder if we're able to achieve the same thing by just requiring that .shape be settable? All of the reshape operations are supposed to be zero-copy, right?

jbrockmendel · 2019-06-20T14:18:09Z

can you think of ways to not require making that requirement on __init__? [...] I wonder if we're able to achieve the same thing by just requiring that .shape be settable?

Probably. Pinning attributes outside of __init__ is a pattern I try to avoid, but if the relevant tradeoffs make it worthwhile, so it goes. Probably not going to bother for the proof of concept.

TomAugspurger · 2019-06-20T14:21:52Z

Pinning attributes outside of init is a pattern I try to avoid, but if the relevant tradeoffs make it worthwhile, so it goes.

Complete agree with the point about __init__, but I don't think there's a tradeoff here. My point is that if we allow something similar to NumPy

In [17]: a = np.arange(12)

In [18]: a.shape = (3, 4)

In [19]: a
Out[19]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

by making an ExtensionArray.shape.setter, then we don't need to changes to __init__. Anywhere in this POC where you call

return type(self)(self, shape=shape)

could be replaced with

self.shape = shape
return self

That's my hope anyway.

TomAugspurger · 2019-06-20T14:24:26Z

Not necessarily worth deciding here, but we should think about whether we want this methods (reshape, T, ravel, etc.) surfacing on the user-facing classes. I think the answer right now is "yes". It'll be a bit unfortunate for users to try and do something like array.reshape(3, 4) and have it raise saying that only (N, 1) or (1, N) is allowed, but I think that's unavoidable if we want to avoid a bunch of boilerplate type checking in our internals.

jbrockmendel · 2019-06-20T15:14:02Z

by making an ExtensionArray.shape.setter, then we don't need to changes to __init__

I'd be fine with that.

could be replaced with
self.shape = shape
return self

Presumably we'd want to be operating on a non-deep copy, so something like:

def _with_shape(self, shape):
    other = self.copy(deep=False)
    other.shape = shape
    return other

Would that satisfy all of the discussed criteria?

TomAugspurger · 2019-06-20T15:17:57Z

Good point about the (non deep) copy. I forgot that the equivalent ndarray operations were no-copy, but not inplace.

TomAugspurger · 2019-06-20T15:26:12Z

@jorisvandenbossche you may want to skip through this discussion before the meeting, if you have a chance.

Focusing on the minute for ExtensionArray.take, how do we achieve the goal of not leaking this 2-D complexity down to an ExtensionArray subclass? My tentative preference is to add a class decorator to ExtensionArray that does a bit of black magic to rewrite the subclasses .take method. In the re-written .take method, we

Simply pass through indices when self.ndim == 1
Rewrite the 2D indices to the equivalent 1D indices when self.ndim ==1, pass that through to the original .take, and appropriately reshape the result.

Is that feasible? Is it too magical? What other methods would we need to patch (__getitem__, ...)?

jorisvandenbossche · 2019-06-20T15:43:21Z

Yes, will try to look at this PR in more detail before the meeting.

TomAugspurger · 2019-06-20T15:49:26Z

FWIW, I think the conversation is more important than the PR at this point (no slight intended, Brock).

jbrockmendel · 2019-06-20T15:56:23Z

What other methods would we need to patch (getitem, ...)?

__getitem__ is the main one that got non-trivially patched in the other PR. Things like astype can get simply patched to give super().astype(*args, **kwargs).reshape(self.shape)

Other things not patched in the other PR that would give simplifications include shift and diff.

how do we achieve the goal of not leaking this 2-D complexity down to an ExtensionArray subclass?

Can you expand a bit on the problem you're trying to avoid? Do you think asking users to mix-in/wrap/metaclass is too invasive? Or are you concerned about the secretly-2D variant accidentally getting exposed to the user?

My tentative preference is to add a class decorator to ExtensionArray that does a bit of black magic

This is definitely prettier than the mixin version used in the arrow test here, but yah, the black magic could be an issue.

It'll be a bit unfortunate for users to try and do something like array.reshape(3, 4) and have it raise saying that only (N, 1) or (1, N) is allowed

We don't necessarily have to, and in fact for unstack there is some simplification to be gotten by allowing arbitrary 2D (which we can then separate back to rowlike if we really want; see DatetimeTZBlock.unstack in the other PR).

TomAugspurger · 2019-06-20T16:04:03Z

Do you think asking users to mix-in/wrap/metaclass is too invasive?

It may be me misunderstanding the MRO, but right now ExtensionArray.take raises AbstractMethodError.

If I as a 3rd-party EA author want to implement .take,

class MyEA(ExtensionArray, ReshapeMixin):
    def take(self, ...):
        # what do I do to not worry about 2d here? How do I get to ReshapeMixin.take?

A class decorator on ExtensionArray will solve that since we can overwrite MyEA.take with our own 2D-aware .take, which uses the original MyEA.take interanlly. (and as a side-benefit, it's source compatible with older pandas since MyEA won't need to be updated).

TomAugspurger · 2019-06-21T18:39:59Z

@jbrockmendel a sketch of my thoughts, diff is from master

The basic idea is to have ExtensionArray inherit from a metaclass that patches the subclasses class definition. Just writing that up makes me uncomfortable, but it's a minor patch :)

Things like _rewrite_for_take is just a stub.

diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py
index c709cd9e9..e9bb6fb2b 100644
--- a/pandas/core/arrays/base.py
+++ b/pandas/core/arrays/base.py
@@ -29,7 +29,33 @@ _not_implemented_message = "{} does not implement {}."
 _extension_array_shared_docs = dict()
 
 
-class ExtensionArray:
+def _rewrite_for_take(indices, shape):
+    return indices
+
+
+def rewrite_sized_ops(cls: 'ExtensionArray'):
+    original_take = cls.take
+
+    def take(self, indices, allow_fill=False, fill_value=None):
+        print(type(self), type(indices))
+        indices = _rewrite_for_take(indices, self.shape)
+        return original_take(self,
+                             indices,
+                             allow_fill=allow_fill,
+                             fill_value=fill_value)
+
+    print(f'patching take for {cls}')
+    cls.take = take
+    return cls
+
+
+class Rewriter(type):
+    def __init__(cls, name, bases, clsdict):
+        rewrite_sized_ops(cls)
+        super().__init__(name, bases, clsdict)
+
+
+class ExtensionArray(metaclass=Rewriter):
     """
     Abstract base class for custom 1-D array types.
 
@@ -112,6 +138,9 @@ class ExtensionArray:
     # Don't override this.
     _typ = 'extension'
 
+    def __init__(self):
+        self._shape = len(self),
+
     # ------------------------------------------------------------------------
     # Constructors
     # ------------------------------------------------------------------------
@@ -298,14 +327,23 @@ class ExtensionArray:
         """
         Return a tuple of the array dimensions.
         """
-        return (len(self),)
+        return self._shape
+
+    @shape.setter
+    def shape(self, value):
+        value = tuple(value)
+        assert len(value) <= 2
+        if len(value) == 2:
+            assert any(v == 1 for v in value)
+
+        self._shape = value
 
     @property
     def ndim(self) -> int:
         """
         Extension Arrays are only allowed to be 1-dimensional.
         """
-        return 1
+        return len(self.shape)
 
     @property
     def nbytes(self) -> int:
diff --git a/pandas/tests/extension/arrow/bool.py b/pandas/tests/extension/arrow/bool.py
index 2263f5354..e8426d127 100644
--- a/pandas/tests/extension/arrow/bool.py
+++ b/pandas/tests/extension/arrow/bool.py
@@ -48,6 +48,7 @@ class ArrowBoolArray(ExtensionArray):
         assert values.type == pa.bool_()
         self._data = values
         self._dtype = ArrowBoolDtype()
+        super().__init__()
 
     def __repr__(self):
         return "ArrowBoolArray({})".format(repr(self._data))

The downsides are

✨ magic ✨
subclasses need to call super().__init__() so that we can set self._shape. That's best practice, but probably isn't always done. Although we could solve it with more ✨ magic ✨ :)

jbrockmendel · 2019-06-22T00:04:23Z

The basic idea is to have ExtensionArray inherit from a metaclass that patches the subclasses class definition. Just writing that up makes me uncomfortable, but it's a minor patch :)

Thanks for this; I'll definitely try this out soon since the mixin variant is giving me unexpected MRO behavior (trying to apply the proof of concept here to Categorical)

What you've written here is just for take, but I'm assuming this would also be used for __getitem__ etc? (The other one I've identified as being easy+necessary is __iter__)

def __iter__(self):
    if self.ndim > 1:
        for n in range(len(self)):
            yield self[n]
        return
    for item in super().__iter__():
        yield item

The downsides are

✨ magic ✨
subclasses need to call super().init() so that we can set self._shape. That's best practice, but probably isn't always done. Although we could solve it with more ✨ magic ✨ :)

Yah. But there's pushback on anything more than literally-zero impact on downstream authors, and so far I don't see any other way to achieve that.

What happens with this approach of the downstream author either a) implements 2D natively (possibly using something like the mixin from the other PR) or b) has their own metaclass?

TomAugspurger · 2019-06-22T01:30:55Z

I think there should some class attribute a subclass can set to disable all the magic.

jbrockmendel · 2019-06-22T14:49:35Z

    def __init__(self):
        self._shape = len(self),

runs into the problem that len(self) is defined in general as self.shape[0]

jreback · 2019-06-22T14:56:09Z

    def __init__(self):
        self._shape = len(self),
runs into the problem that len(self) is defined in general as self.shape[0]

i would define it more like this (probably make it a method)

def _shape_2d():
return 1, len(self))

jbrockmendel · 2019-06-22T14:57:31Z

@jreback the whole point of this is to avoid having to special-case code for whether it is dealing with EA vs ndarray (you've advocated for this elsewhere).

jreback · 2019-06-22T15:05:12Z

maybe you don’t understand

you also modify PandasArray

then you easily have coherence

we don’t have raw ndarrays any more just EAs

jbrockmendel · 2019-06-22T16:15:07Z

you also modify PandasArray [...] we don’t have raw ndarrays any more just EAs

This is definitely a benefit to allowing 2D.

then you easily have coherence

You have coherence when accessing block.values from within block, but unfortunately there are other places that treat block.values as non-private.

_shape_2d, _take_2d etc are more workarounds, and the goal here is to get rid of the need for workarounds. That said, it is likely an improvement on the status quo, so let's keep it as plan C in case Tom and I can't figure out a way to make the metaclass approach work.

jreback · 2019-06-22T16:19:51Z

you also modify PandasArray [...] we don’t have raw ndarrays any more just EAs

This is definitely a benefit to allowing 2D.

then you easily have coherence

You have coherence when accessing block.values from within block, but unfortunately there are other places that treat block.values as non-private.

_shape_2d, _take_2d etc are more workarounds, and the goal here is to get rid of the need for workarounds. That said, it is likely an improvement on the status quo, so let's keep it as plan C in case Tom and I can't figure out a way to make the metaclass approach work.

you can certainly try to get the metaclass approach go work and see how far you can simplify things

but it may be too much magic - that said by all means see how far you can get

a

jbrockmendel · 2019-06-22T16:26:32Z

you can certainly try to get the metaclass approach go work and see how far you can simplify things
but it may be too much magic

I maintain the best approach is allow-but-don't-require 2D, avoid metaclass magic, and offer a ReshapeMixin (like the other PR) for EAs that wrap ndarray. Then any authors who want the benefits of 2D can implement it themselves, and we can get a lot of the simplification internally. This is also the only approach I see that permits incremental progress.

jreback · 2019-06-22T16:38:07Z

sure allow a mixin is fine as well

TomAugspurger · 2019-06-22T17:48:34Z

Can you clarify how allow-but-don’t-require 2D EAs solves the shape inconsistency. I’ve been assuming that we have get all EAs to be reshapeable to 2D somehow, either through our magic or requiring authors to update.

jbrockmendel · 2019-06-22T18:31:30Z

Can you clarify how allow-but-don’t-require 2D EAs solves the shape inconsistency. I’ve been assuming that we have get all EAs to be reshapeable to 2D somehow, either through our magic or requiring authors to update.

allow-but-don't-require is an intermediate step that allows us to move the ball down the field (in a rollback-able manner) while we figure out how to get to require-2D-compat.

jbrockmendel · 2019-07-22T16:46:28Z

pandas/core/arrays/base.py

+                else:
+                    return self[n, :]
+            if n == -1:
+                seq = [fill_value] * self.shape[1]


i think this should be self.shape[0]. either that or above on 1250 it should be self.shape[0]

jbrockmendel added 3 commits June 19, 2019 16:11

implement ReshapeWrapper

ed03c75

wrap getitem

c48be14

implement take

b22f866

jbrockmendel changed the title ~~D2wrap~~ POC: 2D support for 1D ExtensionArray subclasses Jun 20, 2019

gfyoung added Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Jun 20, 2019

TomAugspurger mentioned this pull request Jun 21, 2019

WIP/POC: Allow (N, 1) and (1,N) EAs #26914

Closed

jbrockmendel mentioned this pull request Jun 22, 2019

Non-Deep copy for Categorical? #26995

Closed

3 tasks

jbrockmendel closed this Jul 1, 2019

jbrockmendel mentioned this pull request Jul 3, 2019

EA: support basic 2D operations #27142

Closed

jbrockmendel commented Jul 22, 2019

View reviewed changes

jbrockmendel deleted the d2wrap branch November 20, 2021 23:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC: 2D support for 1D ExtensionArray subclasses #26954

POC: 2D support for 1D ExtensionArray subclasses #26954

jbrockmendel commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

jbrockmendel commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019 •

edited

Loading

TomAugspurger commented Jun 20, 2019

jbrockmendel commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019 •

edited

Loading

jorisvandenbossche commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

jbrockmendel commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

TomAugspurger commented Jun 21, 2019 •

edited

Loading

jbrockmendel commented Jun 22, 2019

TomAugspurger commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jreback commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jreback commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jreback commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jreback commented Jun 22, 2019

TomAugspurger commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jbrockmendel Jul 22, 2019

POC: 2D support for 1D ExtensionArray subclasses #26954

POC: 2D support for 1D ExtensionArray subclasses #26954

Conversation

jbrockmendel commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

jbrockmendel commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019 • edited Loading

TomAugspurger commented Jun 20, 2019

jbrockmendel commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019 • edited Loading

jorisvandenbossche commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

jbrockmendel commented Jun 20, 2019

TomAugspurger commented Jun 20, 2019

TomAugspurger commented Jun 21, 2019 • edited Loading

jbrockmendel commented Jun 22, 2019

TomAugspurger commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jreback commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jreback commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jreback commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jreback commented Jun 22, 2019

TomAugspurger commented Jun 22, 2019

jbrockmendel commented Jun 22, 2019

jbrockmendel Jul 22, 2019

Choose a reason for hiding this comment

TomAugspurger commented Jun 20, 2019 •

edited

Loading

TomAugspurger commented Jun 20, 2019 •

edited

Loading

TomAugspurger commented Jun 21, 2019 •

edited

Loading