Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datetime ExtensionArray #22845

Closed
TomAugspurger opened this issue Sep 26, 2018 · 16 comments
Closed

Datetime ExtensionArray #22845

TomAugspurger opened this issue Sep 26, 2018 · 16 comments
Labels
Datetime Datetime data dtype ExtensionArray Extending pandas with custom dtypes or arrays. Timezones Timezone data dtype
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Collecting some design thoughts here. I've spent ~4 hours today building a DatetimeTZArray. That's mostly working on its own, but now all of pandas needs to be updated to use it.

I think the most sensible way forward is to implement two new EAs

  • DatetimeArray
  • DatetimeTZArray

Initially I tried just DatetimeTZArray, since numpy natively supports datetimes without timezones. But that will require a lot of checks internal to pandas. Better to have both be EAs. One potential issue here is that DatetimeBlock can apparently be consolidated (it doesn't inherit from NonConsolidatableMixin like DatetimeTZBlock). However, I've been unable to construct a DataFrame with consolidated DatetimeTZ blocks. @jreback do you know if that's possible?

In [20]: pd.DataFrame({"A": pd.date_range('2017', periods=12, tz='UTC'), 'B': pd.date_range('2017', periods=12, tz="UTC")})._data.consolidate()
Out[20]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=12, step=1)
DatetimeTZBlock: slice(0, 1, 1), 1 x 12, dtype: datetime64[ns, UTC]
DatetimeTZBlock: slice(1, 2, 1), 1 x 12, dtype: datetime64[ns, UTC]

note the two blocks.

As far as user-facing changes go, we still haven't settled on the types of

  • Series[DatetimeDtype].values
  • Series[DatetimeTZDtype].values

We can support anything. Backwards compatibility would make those ndarray[datetime64ns] (after a conversion to UTC & dropping the timezone for TZ-aware). Consistency with other EAs would have those be EAs). We could please nobody and make tz-naive an ndarray and tz-aware an EA. It's not clear to me what's best here.

@TomAugspurger TomAugspurger added Datetime Datetime data dtype Timezones Timezone data dtype ExtensionArray Extending pandas with custom dtypes or arrays. labels Sep 26, 2018
@TomAugspurger
Copy link
Contributor Author

cc @jbrockmendel

Do you have plans to work on this? If so I can push my branch and focus on PeriodArray for now.

@jbrockmendel
Copy link
Member

I've got a gameplan for getting the tests ready before actually subclassing EA, but no specific timeline. Hoping to make a big push after the meeting.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 26, 2018 via email

@jbrockmendel
Copy link
Member

two new EAs

Any particular reason for two instead of one (like DatetimeIndex)?

However, I've been unable to construct a DataFrame with consolidated DatetimeTZ blocks.

This (and an efficient transpose) are big parts of why I think we should reconsider allowing 2D EAs.

As far as user-facing changes go

One of my favorite design choices recently was adopting _ndarray_values with a very specific, very clear "this is guaranteed to always be an np.ndarray (but may be lossy)". Making .values "this is guaranteed to never be lossy" is my preference. Possibly always-returns-EA, but that's a ways away.

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Sep 26, 2018
@TomAugspurger
Copy link
Contributor Author

Any particular reason for two instead of one (like DatetimeIndex)?

Do we like the design of DatetimeIndex.tz being None? If so, then one EA should suffice, though I haven't looked closely enough at ops, etc. to know whether that's doable.

This (and an efficient transpose) are big parts of why I think we should reconsider allowing 2D EAs.

Noted :) Transpose is an extreme case where a 2D array can be done with zero copy, while an array split will need lots of data copying. That said, I don't view transpose as a really "core" dataframe operation, since it doesn't work with heterogenous data. A more compelling case is a wide dataframe of homogenous EAs. With (consolidated) 2D EAs that would be relatively fast. But, I think the hope is we can rewrite the BlockManager in Cython so that the non-consolidated case is much closer to the consolidated one.

@jorisvandenbossche
Copy link
Member

I personally see tz as metadata (parametrized dtype), so if that works fine, I think going with a single EA is OK

To that end, I would personally be fine with having datetime64[ns] columns no longer being able to be consolidated (I think the cases where people rely on this or its performance will be even more rare with datetime data).

@TomAugspurger in your example, you are using tz, so you are showing that DatetimeTZBlock is not consolidatable, as expected. But without a tz, it indeed consolidates:

In [152]: pd.DataFrame({"A": pd.date_range('2017', periods=12), 'B': pd.date_range('2017', periods=12)})._data.consolidate()
Out[152]: 
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=12, step=1)
DatetimeBlock: slice(0, 2, 1), 2 x 12, dtype: datetime64[ns]

However, I've been unable to construct a DataFrame with consolidated DatetimeTZ blocks.

This (and an efficient transpose) are big parts of why I think we should reconsider allowing 2D EAs.

I still don't really see why 2D EA's are needed here. Yes, it will give performance hits, but we were still considering getting rid of the Block based internals in a possible pandas 2.0 to have a simpler concept of collection of 1D arrays, and that is AFAIR also one of the reasons to not yet move to full EAs also for int/floats (numpy) dtypes.
But maybe let's touch on this during the meeting tomorrow.

@TomAugspurger I think it is useful anyway to push the branch, potentially it can already give some concrete background for the meeting tomorrow

@TomAugspurger
Copy link
Contributor Author

@TomAugspurger in your example, you are using tz, so you are showing that DatetimeTZBlock is not consolidatable, as expected. But without a tz, it indeed consolidates:

err, yes obviously. My bad.

master...TomAugspurger:datetimetz-ea

@jbrockmendel
Copy link
Member

Do we like the design of DatetimeIndex.tz being None?

Yes, fits with the "DatetimeArray is just a vectorized Timestamp" paradigm.

I still don't really see why 2D EA's are needed here. [...] But maybe let's touch on this during the meeting tomorrow.

Definitely no need to go further off on this tangent. I'll try to consolidate my thoughts on the matter in the agenda doc.

@jreback
Copy link
Contributor

jreback commented Sep 27, 2018

i agree here

tz=None is fine, no need to have more than one DatetimeArray

also ok with ‘breaking’ this meaning we change .values to return a EA for a naive datetime dtype

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 27, 2018

@jbrockmendel what inheritance structure do you have in mind for DatetimeArray / DatetimeIndex?

I'm struggling a bit with the PeriodArray refactor right now because things are getting mixed up. If we follow the Categorical pattern, we have

  • Categorical(ExtensionArray)
  • CategoricalIndex(Index)

but it seems like the DatetimeArrayMixin / PeriodArrayMixin has something like

  • PeriodArray(PeriodArrayMixin, ExtensionArray)
  • PeriodIndex(PeriodArrayMixin, DatelikeOps, DatetimeIndexOpsMixin, Int64Index):

IMO, the fact that PeriodIndex / DatetimeIndex shouldn't inherit from PeriodArrayMixin, since PeriodArray, since they'll have different .values. For PeriodArray, that's an ndarray[int]. For PeriodIndex that's a PeriodArray.

Here's my WIP branch: master...TomAugspurger:ea-period

On that branch, I've removed PeriodArrayMixin, and just subsumed its methods in PeriodArray (which inherits from DatetimelikeArrayMixin).

Ops and indexing aren't working yet.

@jbrockmendel
Copy link
Member

what inheritance structure do you have in mind for DatetimeArray / DatetimeIndex?

I'll take a look at the branch.

I haven't given much thought to the .values bit. The vague thought I've had is that when the tests are ready the FooArrayMixin classes could just become FooArray, with the inheritance structure unchanged. The less-vague thought is that I know @jorisvandenbossche would rather use composition than inheritance for these, so that is an unlikely outcome.

@TomAugspurger
Copy link
Contributor Author

The vague thought I've had is that when the tests are ready the FooArrayMixin classes could just become FooArray, with the inheritance structure unchanged.

I essentially changed PeriodArrayMixin to be PeriodArray. I'm turning to ops right now, so I'll see what I can get done in the next 30 minutes, but initially everything broke.

@jorisvandenbossche
Copy link
Member

Yes, IMO the Index classes should no longer inherit from the current DatetimeArrayMixins (that was also the reason I was not really fond of merging those things already).
That would be a similar pattern as you did for IntervalArray I think, @TomAugspurger ?


It might be good if one of you could list some of those discussion points for the call, if there are others of course. This is maybe the main "big" design-wise question.

@jbrockmendel
Copy link
Member

@TomAugspurger can you open a PR with the WIP branch? That would make it easier to comment inline.

I'm writing up notes from the chat, will review Sparse Extension shortly.

@TomAugspurger
Copy link
Contributor Author

#22862

@TomAugspurger
Copy link
Contributor Author

Closing in favor of #23185

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype ExtensionArray Extending pandas with custom dtypes or arrays. Timezones Timezone data dtype
Projects
None yet
Development

No branches or pull requests

4 participants