Datetime ExtensionArray #22845
Comments
Do you have plans to work on this? If so I can push my branch and focus on PeriodArray for now. |
I've got a gameplan for getting the tests ready before actually subclassing EA, but no specific timeline. Hoping to make a big push after the meeting. |
Sounds good. I'll switch over to period for the moment then and check back
in later.
|
Any particular reason for two instead of one (like DatetimeIndex)?
This (and an efficient
One of my favorite design choices recently was adopting |
Do we like the design of
Noted :) Transpose is an extreme case where a 2D array can be done with zero copy, while an array split will need lots of data copying. That said, I don't view transpose as a really "core" dataframe operation, since it doesn't work with heterogeneous data. A more compelling case is a wide dataframe of homogeneous EAs. With (consolidated) 2D EAs that would be relatively fast. But I think the hope is that we can rewrite the BlockManager in Cython so that the non-consolidated case is much closer to the consolidated one. |
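To make the copy / no-copy distinction above concrete, here is a minimal NumPy sketch. It is not pandas internals: plain int64 arrays stand in for EAs, and all variable names are illustrative assumptions.

```python
import numpy as np

# One consolidated 2D block: two columns stored as rows, three values each.
block = np.arange(6, dtype="int64").reshape(2, 3)

# Transposing the 2D block only swaps strides; no data is copied.
transposed = block.T
zero_copy = np.shares_memory(block, transposed)

# The same data held as independent 1D arrays, one per column.
columns = [block[0].copy(), block[1].copy()]

# "Transposing" these means building new 1D arrays element by element.
rows = [np.array([col[i] for col in columns]) for i in range(3)]
copied = not any(np.shares_memory(r, c) for r in rows for c in columns)

print(zero_copy, copied)
```

The 2D case reuses the original buffer, while the collection of 1D arrays has to allocate every output row.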
I personally see To that end, I would personally be fine with having datetime64[ns] columns no longer being able to be consolidated (I think the cases where people rely on this or its performance will be even more rare with datetime data). @TomAugspurger in your example, you are using
I still don't really see why 2D EAs are needed here. Yes, it will give performance hits, but we were still considering getting rid of the Block-based internals in a possible pandas 2.0 to have a simpler concept of a collection of 1D arrays, and that is AFAIR also one of the reasons not to move to full EAs for int/float (numpy) dtypes yet. @TomAugspurger I think it is useful anyway to push the branch; it could already give some concrete background for the meeting tomorrow |
err, yes obviously. My bad. |
Yes, fits with the "DatetimeArray is just a vectorized Timestamp" paradigm.
Definitely no need to go further off on this tangent. I'll try to consolidate my thoughts on the matter in the agenda doc. |
I agree here: tz=None is fine, no need to have more than one DatetimeArray. Also OK with 'breaking' this, meaning we change .values to return an EA for a naive datetime dtype. |
@jbrockmendel what inheritance structure do you have in mind for DatetimeArray / DatetimeIndex? I'm struggling a bit with the PeriodArray refactor right now because things are getting mixed up. If we follow the Categorical pattern, we have
but it seems like the DatetimeArrayMixin / PeriodArrayMixin has something like
IMO, PeriodIndex / DatetimeIndex shouldn't inherit from PeriodArrayMixin, since PeriodArray, since they'll have different
Here's my WIP branch: master...TomAugspurger:ea-period
On that branch, I've removed PeriodArrayMixin and subsumed its methods into PeriodArray (which inherits from DatetimelikeArrayMixin). Ops and indexing aren't working yet. |
I'll take a look at the branch. I haven't given much thought to the |
I essentially changed PeriodArrayMixin to be PeriodArray. I'm turning to ops right now, so I'll see what I can get done in the next 30 minutes, but initially everything broke. |
Yes, IMO the Index classes should no longer inherit from the current DatetimeArrayMixins (that was also the reason I was not really fond of merging those things already). It might be good if one of you could list some of those discussion points for the call, if there are others of course. This is maybe the main "big" design-wise question. |
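A rough sketch of the composition-style layout being discussed, where the index wraps an array rather than inheriting from the array mixin. Every class here (ToyDatetimelikeArrayMixin, ToyPeriodArray, ToyPeriodIndex) is hypothetical and only illustrates the shape of the idea; none of it is the actual pandas code or the code on the WIP branch.

```python
class ToyDatetimelikeArrayMixin:
    """Shared array-level behaviour (hypothetical, not the pandas class)."""

    def __len__(self):
        return len(self._ordinals)

    def shift(self, n):
        # Element-wise arithmetic lives on the array, not on the index.
        return type(self)([o + n for o in self._ordinals])


class ToyPeriodArray(ToyDatetimelikeArrayMixin):
    """1D container; owns the data and the vectorized ops."""

    def __init__(self, ordinals):
        self._ordinals = list(ordinals)


class ToyPeriodIndex:
    """Index that wraps an array instead of inheriting from its mixin."""

    def __init__(self, array):
        self._data = array  # composition: the index holds an array

    def __len__(self):
        return len(self._data)

    def shift(self, n):
        # Delegate to the array, then re-wrap the result as an index.
        return ToyPeriodIndex(self._data.shift(n))

    def get_loc(self, ordinal):
        # Lookup is index-only behaviour; the array does not need it.
        return self._data._ordinals.index(ordinal)


idx = ToyPeriodIndex(ToyPeriodArray([10, 11, 12]))
shifted = idx.shift(2)
print(len(idx), shifted.get_loc(13))
```

With this layout the array and the index can each keep their own attributes without clashing, at the cost of writing explicit delegation on the index.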
@TomAugspurger can you open a PR with the WIP branch? That would make it easier to comment inline. I'm writing up notes from the chat, will review Sparse Extension shortly. |
Closing in favor of #23185 |
Collecting some design thoughts here. I've spent ~4 hours today building a DatetimeTZArray. That's mostly working on its own, but now all of pandas needs to be updated to use it.
I think the most sensible way forward is to implement two new EAs
Initially I tried just DatetimeTZArray, since numpy natively supports datetimes without timezones. But that will require a lot of checks internal to pandas. Better to have both be EAs.
One potential issue here is that DatetimeBlock can apparently be consolidated (it doesn't inherit from NonConsolidatableMixin like DatetimeTZBlock). However, I've been unable to construct a DataFrame with consolidated DatetimeTZ blocks. @jreback do you know if that's possible?
Note the two blocks.
As far as user-facing changes go, we still haven't settled on the types of
Series[DatetimeDtype].values
Series[DatetimeTZDtype].values
We can support anything. Backwards compatibility would make those ndarray[datetime64ns] (after a conversion to UTC & dropping the timezone for TZ-aware). Consistency with other EAs would have those be EAs. We could please nobody and make tz-naive an ndarray and tz-aware an EA. It's not clear to me what's best here.
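For reference, the "conversion to UTC & dropping the timezone" step can be sketched with the standard library alone. This is only an illustration of what the backwards-compatible option discards; the helper name to_naive_utc and the fixed UTC-05:00 offset are assumptions made up for the example, not pandas API.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-05:00 offset, chosen only for illustration.
eastern = timezone(timedelta(hours=-5))
aware = [datetime(2018, 9, 26, 15, 32, tzinfo=eastern)]


def to_naive_utc(values):
    """Convert tz-aware datetimes to UTC, then drop the tzinfo."""
    return [v.astimezone(timezone.utc).replace(tzinfo=None) for v in values]


naive = to_naive_utc(aware)
print(naive[0], naive[0].tzinfo)
```

The instant in time is preserved, but the result no longer records which timezone the values came from, which is the information an EA-returning .values would keep.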