POC: 2D support for 1D ExtensionArray subclasses #26954
Very interesting, thanks for putting this together. I'll take a closer look later, but for now: can you think of ways to not require making that requirement on I wonder if we're able to achieve the same thing by just requiring that |
Probably. Pinning attributes outside of |
Completely agree with the point about
In [17]: a = np.arange(12)
In [18]: a.shape = (3, 4)
In [19]: a
Out[19]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
by making an
could be replaced with
That's my hope anyway. |
Not necessarily worth deciding here, but we should think about whether we want these methods (reshape, T, ravel, etc.) surfacing on the user-facing classes. I think the answer right now is "yes". It'll be a bit unfortunate for users to try and do something like |
I'd be fine with that.
Presumably we'd want to be operating on a non-deep copy, so something like:
Would that satisfy all of the discussed criteria? |
Good point about the (non deep) copy. I forgot that the equivalent ndarray operations were no-copy, but not inplace. |
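To make the "no-copy, but not inplace" idea concrete, here is a minimal sketch in plain Python. `FlatArray` and its `reshape` method are hypothetical names invented for illustration, not pandas API, and the shape rule (at most 2D, with one dimension of length 1) is borrowed from the `shape` setter in the diff further down this thread.

```python
import copy


class FlatArray:
    """Hypothetical 1D container standing in for an ExtensionArray."""

    def __init__(self, data):
        self._data = list(data)
        self._shape = (len(self._data),)

    @property
    def shape(self):
        return self._shape

    def reshape(self, shape):
        shape = tuple(shape)
        size = 1
        for dim in shape:
            size *= dim
        if size != len(self._data):
            raise ValueError("cannot reshape: total size must stay the same")
        if len(shape) > 2 or (len(shape) == 2 and 1 not in shape):
            raise ValueError("only 1D, (N, 1) or (1, N) shapes are supported")
        # copy.copy is shallow: the new object shares self._data,
        # so no element data is duplicated.
        result = copy.copy(self)
        result._shape = shape
        return result


arr = FlatArray([10, 20, 30])
col = arr.reshape((3, 1))
```

After this runs, `arr.shape` is still `(3,)`, `col.shape` is `(3, 1)`, and both objects hold the same underlying list, which is the behaviour described above for the equivalent ndarray operations.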
@jorisvandenbossche you may want to skip through this discussion before the meeting, if you have a chance. Focusing on the minute for
Is that feasible? Is it too magical? What other methods would we need to patch ( |
Yes, will try to look at this PR in more detail before the meeting. |
FWIW, I think the conversation is more important than the PR at this point (no slight intended, Brock). |
Other things not patched in the other PR that would give simplifications include
Can you expand a bit on the problem you're trying to avoid? Do you think asking users to mix-in/wrap/metaclass is too invasive? Or are you concerned about the secretly-2D variant accidentally getting exposed to the user?
This is definitely prettier than the mixin version used in the arrow test here, but yah, the black magic could be an issue.
We don't necessarily have to, and in fact for |
It may be me misunderstanding the MRO, but right now if I as a 3rd-party EA author want to implement
class MyEA(ExtensionArray, ReshapeMixin):
    def take(self, ...):
        # what do I do to not worry about 2d here? How do I get to ReshapeMixin.take?
A class decorator on ExtensionArray will solve that since we can overwrite |
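The MRO concern can be reproduced with stand-in classes. In the sketch below, `Base`, `ReshapeMixin`, `MyEA`, and the `wrap_take` decorator are all hypothetical stand-ins rather than the real pandas objects. It only illustrates that a `take` defined on the subclass shadows the mixin's version, while a class decorator can still intercept the call by wrapping whatever `take` the class ends up with.

```python
class Base:
    """Stand-in for ExtensionArray."""

    def take(self, indices):
        raise NotImplementedError


class ReshapeMixin:
    """Stand-in for a mixin that wants to intercept take()."""

    def take(self, indices):
        return "ReshapeMixin.take"


class MyEA(ReshapeMixin, Base):
    def take(self, indices):
        # The author's 1D implementation; it shadows ReshapeMixin.take
        # no matter where the mixin sits in the list of bases.
        return "MyEA.take"


def wrap_take(cls):
    """Hypothetical class decorator that wraps the class's own take()."""
    original = cls.take

    def take(self, indices):
        # A real version would translate 2D indices to 1D here.
        return "wrapped(" + original(self, indices) + ")"

    cls.take = take
    return cls


@wrap_take
class MyWrappedEA(Base):
    def take(self, indices):
        return "MyWrappedEA.take"


plain_result = MyEA().take([0])
wrapped_result = MyWrappedEA().take([0])
```

Here `plain_result` never touches the mixin, while `wrapped_result` passes through the wrapper before reaching the author's method.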
@jbrockmendel a sketch of my thoughts, diff is from master.
The basic idea is to have ExtensionArray use a metaclass that patches each subclass's class definition. Just writing that up makes me uncomfortable, but it's a minor patch :) Things like
diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py
index c709cd9e9..e9bb6fb2b 100644
--- a/pandas/core/arrays/base.py
+++ b/pandas/core/arrays/base.py
@@ -29,7 +29,33 @@ _not_implemented_message = "{} does not implement {}."
_extension_array_shared_docs = dict()
-class ExtensionArray:
+def _rewrite_for_take(indices, shape):
+ return indices
+
+
+def rewrite_sized_ops(cls: 'ExtensionArray'):
+ original_take = cls.take
+
+ def take(self, indices, allow_fill=False, fill_value=None):
+ print(type(self), type(indices))
+ indices = _rewrite_for_take(indices, self.shape)
+ return original_take(self,
+ indices,
+ allow_fill=allow_fill,
+ fill_value=fill_value)
+
+ print(f'patching take for {cls}')
+ cls.take = take
+ return cls
+
+
+class Rewriter(type):
+ def __init__(cls, name, bases, clsdict):
+ rewrite_sized_ops(cls)
+ super().__init__(name, bases, clsdict)
+
+
+class ExtensionArray(metaclass=Rewriter):
"""
Abstract base class for custom 1-D array types.
@@ -112,6 +138,9 @@ class ExtensionArray:
# Don't override this.
_typ = 'extension'
+ def __init__(self):
+ self._shape = len(self),
+
# ------------------------------------------------------------------------
# Constructors
# ------------------------------------------------------------------------
@@ -298,14 +327,23 @@ class ExtensionArray:
"""
Return a tuple of the array dimensions.
"""
- return (len(self),)
+ return self._shape
+
+ @shape.setter
+ def shape(self, value):
+ value = tuple(value)
+ assert len(value) <= 2
+ if len(value) == 2:
+ assert any(v == 1 for v in value)
+
+ self._shape = value
@property
def ndim(self) -> int:
"""
Extension Arrays are only allowed to be 1-dimensional.
"""
- return 1
+ return len(self.shape)
@property
def nbytes(self) -> int:
diff --git a/pandas/tests/extension/arrow/bool.py b/pandas/tests/extension/arrow/bool.py
index 2263f5354..e8426d127 100644
--- a/pandas/tests/extension/arrow/bool.py
+++ b/pandas/tests/extension/arrow/bool.py
@@ -48,6 +48,7 @@ class ArrowBoolArray(ExtensionArray):
assert values.type == pa.bool_()
self._data = values
self._dtype = ArrowBoolDtype()
+ super().__init__()
def __repr__(self):
return "ArrowBoolArray({})".format(repr(self._data))
The downsides are
|
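For readers who want to try the idea outside the pandas tree, here is a self-contained sketch of the same metaclass pattern. `Array` and `Child` are hypothetical stand-ins, `_rewrite_for_take` stays a placeholder as in the diff, and the `"take" in clsdict` guard is an assumption added here, not something the diff does: without it, a subclass that merely inherits `take` would have the already-wrapped method wrapped a second time.

```python
calls = []


def _rewrite_for_take(indices, shape):
    # Placeholder, as in the diff: a real version would translate indices.
    calls.append(shape)
    return indices


class Rewriter(type):
    def __init__(cls, name, bases, clsdict):
        super().__init__(name, bases, clsdict)
        # Assumption (differs from the diff): only wrap a take() defined on
        # this class itself, so an inherited, already-wrapped take() is not
        # wrapped again.
        if "take" in clsdict:
            original_take = clsdict["take"]

            def take(self, indices, allow_fill=False, fill_value=None):
                indices = _rewrite_for_take(indices, self.shape)
                return original_take(
                    self, indices, allow_fill=allow_fill, fill_value=fill_value
                )

            cls.take = take


class Array(metaclass=Rewriter):
    """Hypothetical stand-in for ExtensionArray."""

    def __init__(self, data):
        self._data = list(data)
        self.shape = (len(self._data),)

    def take(self, indices, allow_fill=False, fill_value=None):
        return [self._data[i] for i in indices]


class Child(Array):
    pass  # inherits the wrapped take(); nothing is re-wrapped


result = Child([5, 6, 7]).take([2, 0])
```

The subclass author writes nothing special, yet the call still goes through `_rewrite_for_take` exactly once, which is the "zero impact on downstream authors" property under discussion.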
Thanks for this; I'll definitely try this out soon since the mixin variant is giving me unexpected MRO behavior (trying to apply the proof of concept here to Categorical) What you've written here is just for
Yah. But there's pushback on anything more than literally-zero impact on downstream authors, and so far I don't see any other way to achieve that. What happens with this approach if the downstream author either a) implements 2D natively (possibly using something like the mixin from the other PR) or b) has their own metaclass? |
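Case (b) can be checked with stand-ins. In this sketch `Rewriter`, `Base`, `AuthorMeta`, and `CombinedMeta` are hypothetical names; it shows the standard Python behaviour that two unrelated metaclasses conflict, and one possible way out in which the author's metaclass derives from both.

```python
class Rewriter(type):
    """Stand-in for the patching metaclass sketched in the diff."""


class Base(metaclass=Rewriter):
    """Stand-in for ExtensionArray."""


class AuthorMeta(type):
    """Stand-in for a metaclass a downstream author already uses."""


conflict = None
try:
    class Broken(Base, metaclass=AuthorMeta):
        pass
except TypeError as err:
    # Python refuses: AuthorMeta is not a subclass of Rewriter.
    conflict = str(err)


# One way out: the author's metaclass derives from both metaclasses.
class CombinedMeta(AuthorMeta, Rewriter):
    pass


class Works(Base, metaclass=CombinedMeta):
    pass
```

So an author with their own metaclass would have to change it, which is more than zero downstream impact for that group.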
I think there should be some class attribute a subclass can set to disable all the magic. |
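A rough sketch of what such an opt-out could look like follows. The flag name `_rewrite_2d` is invented for illustration and is not a pandas attribute; the classes are hypothetical stand-ins, and the wrapper only tags the result so the effect is visible.

```python
class Rewriter(type):
    def __init__(cls, name, bases, clsdict):
        super().__init__(name, bases, clsdict)
        # _rewrite_2d is a hypothetical flag name, not a pandas attribute.
        if not getattr(cls, "_rewrite_2d", True):
            return
        if "take" in clsdict:
            original = clsdict["take"]

            def take(self, indices):
                return ("patched", original(self, indices))

            cls.take = take


class Base(metaclass=Rewriter):
    def take(self, indices):
        return list(indices)


class Native2D(Base):
    _rewrite_2d = False  # author handles 2D natively; skip the patching

    def take(self, indices):
        return list(indices)


class Plain(Base):
    def take(self, indices):
        return list(indices)


native_result = Native2D().take([1])
plain_result = Plain().take([1])
```

One limitation of this sketch: a class that sets the flag but does not define its own `take` would still inherit the patched version from its parent.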
runs into the problem that len(self) is defined in general as self.shape[0] |
i would define it more like this (probably make it a method)
def _shape_2d(): |
@jreback the whole point of this is to avoid having to special-case code for whether it is dealing with EA vs ndarray (you've advocated for this elsewhere). |
Maybe you don’t understand: you also modify PandasArray, then you easily have coherence. We don’t have raw ndarrays any more, just EAs.
This is definitely a benefit to allowing 2D.
You have coherence when accessing block.values from within block, but unfortunately there are other places that treat block.values as non-private. _shape_2d, _take_2d etc are more workarounds, and the goal here is to get rid of the need for workarounds. That said, it is likely an improvement on the status quo, so let's keep it as plan C in case Tom and I can't figure out a way to make the metaclass approach work. |
you can certainly try to get the metaclass approach to work and see how far you can simplify things, but it may be too much magic - that said by all means see how far you can get a |
I maintain the best approach is allow-but-don't-require 2D, avoid metaclass magic, and offer a ReshapeMixin (like the other PR) for EAs that wrap ndarray. Then any authors who want the benefits of 2D can implement it themselves, and we can get a lot of the simplification internally. This is also the only approach I see that permits incremental progress. |
sure, allowing a mixin is fine as well |
Can you clarify how allow-but-don’t-require 2D EAs solves the shape inconsistency? I’ve been assuming that we have to get all EAs to be reshapeable to 2D somehow, either through our magic or by requiring authors to update.
allow-but-don't-require is an intermediate step that allows us to move the ball down the field (in a rollback-able manner) while we figure out how to get to require-2D-compat. |
else:
return self[n, :]
if n == -1:
seq = [fill_value] * self.shape[1] |
i think this should be self.shape[0]. either that or above on 1250 it should be self.shape[0]
[ci skip]
Companion to #26914 trying out a way to implement 2D methods for EA subclasses that don't do it natively.
Chose the arrow EA as the example to start with, then discovered that equality checks return a scalar instead of operating elementwise, so mostly just comparing the shapes.