-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EA: support basic 2D operations #27142
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u add a test that asserts shape is defined
and that len is also reasonable (if it’s overridden)
Remind me, why is this necessary? Is this for the reshape mixin? This will be a breaking change for EA authors. |
Yes. It is a breaking change, but it is also much more robust than the other options (e.g. metaclass) which I found to be fragile MRO-wise. |
Updated to include parts of #27153, the upshot of which is that this now includes all the expected breaking changes for EA authors.
Other breaking changes that aren't EA-author-specific:
|
can you update the title of the PR to better reflect things |
from pandas._libs.lib import is_integer | ||
|
||
|
||
def implement_2d(cls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the reasoning behind making this a decorator as opposed to just having base class method? it seems much simpler and more obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main pushback on this idea was the onus put on EA authors who only support 1D. ATM this decorator mainly just renames __len__
to size
so that it can re-define shape
in a 2D-compatible way. So asking authors to use the decorator instead of doing that re-definition themselves is kind of a wash.
But the step after this is to patch __getitem__
, __setitem__
, take
, __iter__
, all of which are easier to do with the decorator than by asking authors to do it themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the plan still to have an "opt me out of this, I natively support 2D arrays"? If we go down the route of putting every Block.values inside an EA (with PandasArray for the current NumPy-backed Blocks), then we'll want that, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see you're applying this to subclasses, rather than ExtensionArray itself. Motivation for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Motivation for that?
less magic than a metaclass (and my first attempt at using a metaclass failed)
Is the plan still to have an "opt me out of this, I natively support 2D arrays"?
Defined EA._allows_2d = False
. Authors set that to True if they implement this themselves. This decorator should be updated to check that and be a no-op in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, because simply decorating
@implements_2d
class ExtensionArray:
would work for subclasses right? It has to be a metaclass. I'll poke around at the meta class approach to see what's going on. May be too magical.
But if we do go with a
@jbrockmendel is this the PR to focus on for 2D EA stuff? Is there anything you definitely want to have in the release / release candidate? |
Yes.
If we're going to do a deprecation cycle for the behavior of Categorical.ravel, I'd like to get that in. |
I am a bit lost in the different PRs / POC you made. |
Fair enough. Quick summary:
The approach for this PR centers around two goals:
We can patch most of the relevant methods of a 1D-only EA using a decorator, here implemented as One thing that AFAICT we can't patch and do need the EA authors to implement is Finally, the |
@jreback @jbrockmendel are you free for a call this afternoon? I'm free any time after ~1:00 central. |
i would be available at 4pm est or after
or thurs / friday
… On Jul 23, 2019, at 1:01 PM, Tom Augspurger ***@***.***> wrote:
@jreback @jbrockmendel are you free for a call this afternoon? I'm free any time after ~1:00 central.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Today after 4 EST works for me. |
Let's do 4:00 EST / 3:00 Central / 1:00 Pacific then. meet.google.com/bxs-pjaf-oqi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still looking through the patched methods.
|
||
def __iter__(self): | ||
if self.ndim == 1: | ||
for obj in orig_iter(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does just return orig_iter(self)
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC from one of the proofs-of-concept this actually mattered, but I dont remember which direction it mattered in
for obj in orig_iter(self): | ||
yield obj | ||
else: | ||
for n in range(self.shape[0]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this case be written as something like
reduced = self[:, 0]
return orig_iter(reduced)`
? I worry a bit about the cost of calling __getitem__
on each iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely dont want to call orig_iter here since we want to yield ndim-1 arrays
from pandas._libs.lib import is_integer | ||
|
||
|
||
def implement_2d(cls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the plan still to have an "opt me out of this, I natively support 2D arrays"? If we go down the route of putting every Block.values inside an EA (with PandasArray for the current NumPy-backed Blocks), then we'll want that, right?
|
||
key = expand_key(key, self.shape) | ||
if is_integer(key[0]): | ||
assert key[0] in [0, -1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this at a glance. Who does it have to be in [0, -1]
? (I get that we're only allowing N x 1
and 1 X N
arrays, but I don't get why key[0]
is the one being checked).
Just from looking at it, I would expect this to break this __getitem__
on something like
arr = pd.array([1, 2, 3], dtype="Int64")
# arr = arra.reshape(...)
arr[1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I wrote this with only (1, N) in mind
""" | ||
if len(shape) == 1: | ||
return True | ||
if len(shape) > 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this come up in practice? I'm having a hard time judging whether NotImplementedError or False is the right behavior in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think NotImplementedError is more accurate, but for the current implementation this should never be reached
def tuplify_shape(size: int, shape, restrict=True) -> Tuple[int, ...]: | ||
""" | ||
Convert a passed shape into a valid tuple. | ||
Following ndarray.reshape, we accept either `reshape(a, b)` or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't suppose NumPy has a function we can borrow here? I'm not aware of one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numpy/numpy#13768 is the closest thing I found, but I agree it seems like the kind of thing that should exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a somewhat dumb approach:
def tuplify_shape(size, shape):
return np.broadcast_to(None, size).reshape(*shape).shape
Tested as:
In [20]: tuplify_shape(100, (2, -1, 5))
Out[20]: (2, 10, 5)
In [21]: tuplify_shape(100, ((2, -1, 5)))
Out[21]: (2, 10, 5)
In [22]: tuplify_shape(100, ((2, 11, 5)))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-22-f71a2f488ea8> in <module>()
----> 1 tuplify_shape(100, ((2, 11, 5)))
<ipython-input-19-6b4139eccdff> in tuplify_shape(size, shape)
1 def tuplify_shape(size, shape):
----> 2 return np.broadcast_to(None, size).reshape(*shape).shape
ValueError: cannot reshape array of size 100 into shape (2,11,5)
pandas/core/arrays/base.py
Outdated
----- | ||
- This must return a *new* object, not self. | ||
- The only case that *must* be implemented is with dtype=None, | ||
giving a view with the same dtype as self. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can / should we make any requirements on this being no-copy? i.e.
a = my_array(...)
b = a.view(dtype=int)
b[0] = 10
assert a[0] == b[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitely needs to be no-copy; i'll add that to the docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a test for the no-copy bit
@@ -932,6 +977,26 @@ def _formatting_values(self) -> np.ndarray: | |||
# Reshaping | |||
# ------------------------------------------------------------------------ | |||
|
|||
def reshape(self, *shape): | |||
""" | |||
Return a view on this array with the given shape. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to document that only (1, N)
and (N, 1)
is allowed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thinking. ATM I wrote most of this with only (1, N) in mind
from pandas._libs.lib import is_integer | ||
|
||
|
||
def implement_2d(cls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see you're applying this to subclasses, rather than ExtensionArray itself. Motivation for that?
@TomAugspurger what is the google link? |
Did you have a call? If so, is there a summary of what has been discussed? |
@jorisvandenbossche yes we did, but nothing written AFAIK. The main takeaway I got is that jreback is on board contingent on there being zero additional (required) complexity for downstream authors. |
Which essentially means a either
My preference is for 1 if it's doable. |
@@ -888,6 +888,7 @@ Other API changes | |||
- :meth:`ExtensionArray.argsort` places NA values at the end of the sorted array. (:issue:`21801`) | |||
- :meth:`DataFrame.to_hdf` and :meth:`Series.to_hdf` will now raise a ``NotImplementedError`` when saving a :class:`MultiIndex` with extention data types for a ``fixed`` format. (:issue:`7775`) | |||
- Passing duplicate ``names`` in :meth:`read_csv` will now raise a ``ValueError`` (:issue:`17346`) | |||
- :meth:`Categorical.ravel` will now return a :class:`Categorical` instead of a NumPy array. (:issue:`27153`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change in 0.25? (IIRC yes?), if not can you break it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we only recently deprecated the old behavior. will double-check and see whats appropriate
|
||
cls.__iter__ = __iter__ | ||
|
||
return cls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we are still doing this approach, but if we are
I would break out each of the implementations into a well named function that takes cls. then impelemented_2d
is pretty straightforward to read w/o having to understand the details, you can immediately see what is being changed and the details exist in the functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
ATM the relevant discussion about how to proceed is in Tom's PR here
Closing pending #28389. |
This is one of (I think) two things we will actually require of EA authors. The other is we will need something like view to return a new EA object with a view on the same data. That will be implemented in a separate PR following discussion.
cc @jreback @jorisvandenbossche @TomAugspurger