Datetime ExtensionArray #22845

TomAugspurger · 2018-09-26T20:05:13Z

Collecting some design thoughts here. I've spent ~4 hours today building a DatetimeTZArray. That's mostly working on its own, but now all of pandas needs to be updated to use it.

I think the most sensible way forward is to implement two new EAs

DatetimeArray
DatetimeTZArray

Initially I tried just DatetimeTZArray, since numpy natively supports datetimes without timezones. But that will require a lot of checks internal to pandas. Better to have both be EAs. One potential issue here is that DatetimeBlock can apparently be consolidated (it doesn't inherit from NonConsolidatableMixin like DatetimeTZBlock). However, I've been unable to construct a DataFrame with consolidated DatetimeTZ blocks. @jreback do you know if that's possible?

In [20]: pd.DataFrame({"A": pd.date_range('2017', periods=12, tz='UTC'), 'B': pd.date_range('2017', periods=12, tz="UTC")})._data.consolidate()
Out[20]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=12, step=1)
DatetimeTZBlock: slice(0, 1, 1), 1 x 12, dtype: datetime64[ns, UTC]
DatetimeTZBlock: slice(1, 2, 1), 1 x 12, dtype: datetime64[ns, UTC]

note the two blocks.

As far as user-facing changes go, we still haven't settled on the types of

Series[DatetimeDtype].values
Series[DatetimeTZDtype].values

We can support anything. Backwards compatibility would make those ndarray[datetime64ns] (after a conversion to UTC & dropping the timezone for TZ-aware). Consistency with other EAs would have those be EAs). We could please nobody and make tz-naive an ndarray and tz-aware an EA. It's not clear to me what's best here.

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2018-09-26T20:15:13Z

cc @jbrockmendel

Do you have plans to work on this? If so I can push my branch and focus on PeriodArray for now.

jbrockmendel · 2018-09-26T20:32:41Z

I've got a gameplan for getting the tests ready before actually subclassing EA, but no specific timeline. Hoping to make a big push after the meeting.

TomAugspurger · 2018-09-26T20:39:24Z

Sounds good. I'll switch over to period for the moment then and check back in later.

…

On Wed, Sep 26, 2018 at 3:32 PM jbrockmendel ***@***.***> wrote: I've got a gameplan for getting the tests ready before actually subclassing EA, but no specific timeline. Hoping to make a big push after the meeting. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#22845 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHInyziG7wvrv6nHJBGR0r27D00JdGks5ue-RwgaJpZM4W7YwC> .

jbrockmendel · 2018-09-26T20:39:54Z

two new EAs

Any particular reason for two instead of one (like DatetimeIndex)?

However, I've been unable to construct a DataFrame with consolidated DatetimeTZ blocks.

This (and an efficient transpose) are big parts of why I think we should reconsider allowing 2D EAs.

As far as user-facing changes go

One of my favorite design choices recently was adopting _ndarray_values with a very specific, very clear "this is guaranteed to always be an np.ndarray (but may be lossy)". Making .values "this is guaranteed to never be lossy" is my preference. Possibly always-returns-EA, but that's a ways away.

TomAugspurger · 2018-09-26T21:05:18Z

Any particular reason for two instead of one (like DatetimeIndex)?

Do we like the design of DatetimeIndex.tz being None? If so, then one EA should suffice, though I haven't looked closely enough at ops, etc. to know whether that's doable.

This (and an efficient transpose) are big parts of why I think we should reconsider allowing 2D EAs.

Noted :) Transpose is an extreme case where a 2D array can be done with zero copy, while an array split will need lots of data copying. That said, I don't view transpose as a really "core" dataframe operation, since it doesn't work with heterogenous data. A more compelling case is a wide dataframe of homogenous EAs. With (consolidated) 2D EAs that would be relatively fast. But, I think the hope is we can rewrite the BlockManager in Cython so that the non-consolidated case is much closer to the consolidated one.

jorisvandenbossche · 2018-09-26T21:14:18Z

I personally see tz as metadata (parametrized dtype), so if that works fine, I think going with a single EA is OK

To that end, I would personally be fine with having datetime64[ns] columns no longer being able to be consolidated (I think the cases where people rely on this or its performance will be even more rare with datetime data).

@TomAugspurger in your example, you are using tz, so you are showing that DatetimeTZBlock is not consolidatable, as expected. But without a tz, it indeed consolidates:

In [152]: pd.DataFrame({"A": pd.date_range('2017', periods=12), 'B': pd.date_range('2017', periods=12)})._data.consolidate()
Out[152]: 
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=12, step=1)
DatetimeBlock: slice(0, 2, 1), 2 x 12, dtype: datetime64[ns]

However, I've been unable to construct a DataFrame with consolidated DatetimeTZ blocks.

This (and an efficient transpose) are big parts of why I think we should reconsider allowing 2D EAs.

I still don't really see why 2D EA's are needed here. Yes, it will give performance hits, but we were still considering getting rid of the Block based internals in a possible pandas 2.0 to have a simpler concept of collection of 1D arrays, and that is AFAIR also one of the reasons to not yet move to full EAs also for int/floats (numpy) dtypes.
But maybe let's touch on this during the meeting tomorrow.

@TomAugspurger I think it is useful anyway to push the branch, potentially it can already give some concrete background for the meeting tomorrow

TomAugspurger · 2018-09-26T21:27:42Z

@TomAugspurger in your example, you are using tz, so you are showing that DatetimeTZBlock is not consolidatable, as expected. But without a tz, it indeed consolidates:

err, yes obviously. My bad.

master...TomAugspurger:datetimetz-ea

jbrockmendel · 2018-09-26T23:49:05Z

Do we like the design of DatetimeIndex.tz being None?

Yes, fits with the "DatetimeArray is just a vectorized Timestamp" paradigm.

I still don't really see why 2D EA's are needed here. [...] But maybe let's touch on this during the meeting tomorrow.

Definitely no need to go further off on this tangent. I'll try to consolidate my thoughts on the matter in the agenda doc.

jreback · 2018-09-27T00:16:48Z

i agree here

tz=None is fine, no need to have more than one DatetimeArray

also ok with ‘breaking’ this meaning we change .values to return a EA for a naive datetime dtype

TomAugspurger · 2018-09-27T15:47:12Z

@jbrockmendel what inheritance structure do you have in mind for DatetimeArray / DatetimeIndex?

I'm struggling a bit with the PeriodArray refactor right now because things are getting mixed up. If we follow the Categorical pattern, we have

Categorical(ExtensionArray)
CategoricalIndex(Index)

but it seems like the DatetimeArrayMixin / PeriodArrayMixin has something like

PeriodArray(PeriodArrayMixin, ExtensionArray)
PeriodIndex(PeriodArrayMixin, DatelikeOps, DatetimeIndexOpsMixin, Int64Index):

IMO, the fact that PeriodIndex / DatetimeIndex shouldn't inherit from PeriodArrayMixin, since PeriodArray, since they'll have different .values. For PeriodArray, that's an ndarray[int]. For PeriodIndex that's a PeriodArray.

Here's my WIP branch: master...TomAugspurger:ea-period

On that branch, I've removed PeriodArrayMixin, and just subsumed its methods in PeriodArray (which inherits from DatetimelikeArrayMixin).

Ops and indexing aren't working yet.

jbrockmendel · 2018-09-27T16:07:19Z

what inheritance structure do you have in mind for DatetimeArray / DatetimeIndex?

I'll take a look at the branch.

I haven't given much thought to the .values bit. The vague thought I've had is that when the tests are ready the FooArrayMixin classes could just become FooArray, with the inheritance structure unchanged. The less-vague thought is that I know @jorisvandenbossche would rather use composition than inheritance for these, so that is an unlikely outcome.

TomAugspurger · 2018-09-27T16:10:33Z

The vague thought I've had is that when the tests are ready the FooArrayMixin classes could just become FooArray, with the inheritance structure unchanged.

I essentially changed PeriodArrayMixin to be PeriodArray. I'm turning to ops right now, so I'll see what I can get done in the next 30 minutes, but initially everything broke.

jorisvandenbossche · 2018-09-27T16:43:31Z

Yes, IMO the Index classes should no longer inherit from the current DatetimeArrayMixins (that was also the reason I was not really fond of merging those things already).
That would be a similar pattern as you did for IntervalArray I think, @TomAugspurger ?

It might be good if one of you could list some of those discussion points for the call, if there are others of course. This is maybe the main "big" design-wise question.

jbrockmendel · 2018-09-27T19:16:33Z

@TomAugspurger can you open a PR with the WIP branch? That would make it easier to comment inline.

I'm writing up notes from the chat, will review Sparse Extension shortly.

TomAugspurger · 2018-09-27T19:43:11Z

#22862

TomAugspurger · 2018-11-08T12:45:12Z

Closing in favor of #23185

TomAugspurger added Datetime Datetime data dtype Timezones Timezone data dtype ExtensionArray Extending pandas with custom dtypes or arrays. labels Sep 26, 2018

jorisvandenbossche added this to the 0.24.0 milestone Sep 26, 2018

TomAugspurger closed this as completed Nov 8, 2018

TomAugspurger mentioned this issue Dec 1, 2018

REF: Refactor Date/TimeLikeOps #24038

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Datetime ExtensionArray #22845

Datetime ExtensionArray #22845

TomAugspurger commented Sep 26, 2018

TomAugspurger commented Sep 26, 2018

jbrockmendel commented Sep 26, 2018

TomAugspurger commented Sep 26, 2018 via email

jbrockmendel commented Sep 26, 2018

TomAugspurger commented Sep 26, 2018

jorisvandenbossche commented Sep 26, 2018

TomAugspurger commented Sep 26, 2018

jbrockmendel commented Sep 26, 2018

jreback commented Sep 27, 2018

TomAugspurger commented Sep 27, 2018 •

edited

Loading

jbrockmendel commented Sep 27, 2018

TomAugspurger commented Sep 27, 2018

jorisvandenbossche commented Sep 27, 2018

jbrockmendel commented Sep 27, 2018

TomAugspurger commented Sep 27, 2018

TomAugspurger commented Nov 8, 2018

Datetime ExtensionArray #22845

Datetime ExtensionArray #22845

Comments

TomAugspurger commented Sep 26, 2018

TomAugspurger commented Sep 26, 2018

jbrockmendel commented Sep 26, 2018

TomAugspurger commented Sep 26, 2018 via email

jbrockmendel commented Sep 26, 2018

TomAugspurger commented Sep 26, 2018

jorisvandenbossche commented Sep 26, 2018

TomAugspurger commented Sep 26, 2018

jbrockmendel commented Sep 26, 2018

jreback commented Sep 27, 2018

TomAugspurger commented Sep 27, 2018 • edited Loading

jbrockmendel commented Sep 27, 2018

TomAugspurger commented Sep 27, 2018

jorisvandenbossche commented Sep 27, 2018

jbrockmendel commented Sep 27, 2018

TomAugspurger commented Sep 27, 2018

TomAugspurger commented Nov 8, 2018

TomAugspurger commented Sep 27, 2018 •

edited

Loading