Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/API: DatetimeTZDtype #23990

Merged
merged 20 commits into from
Dec 3, 2018
Merged

REF/API: DatetimeTZDtype #23990

merged 20 commits into from
Dec 3, 2018

Conversation

TomAugspurger
Copy link
Contributor

A cleanup of DatetimeTZDtype

  • Remove magic constructor from string. It seemed dangerous to overload unit with either unit or a full datetime64[ns, tz] string. This required changing DatetimeTZDtype.construct_from_string and changing a number of places to use construct_from_string rather than the main __new__
  • Change __new__ to __init__ and remove caching. It seemed to be causing more problems that it was worth. You could too easily create nonsense DatetimeTZDtypes like
In [3]: t = pd.core.dtypes.dtypes.DatetimeTZDtype(unit=None, tz=None)
  • Change .tz and .unit to properties instead of attributes. I've not provided setters. We could in theory, since we're getting rid of caching, but I'd rather wait till there's a demand..

The remaining changes in the DatetimeArray PR will be to

  1. Inherit from ExtensionDtype
  2. Implement construct_array_type
  3. Register the dtype

* Remove magic constructor from string
* Remove Caching

The remaining changes in the DatetimeArray PR will be to

1. Inherit from ExtensionDtype
2. Implement construct_array_type
3. Register
@TomAugspurger TomAugspurger added Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Timezones Timezone data dtype labels Nov 29, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 29, 2018
@pep8speaks
Copy link

Hello @TomAugspurger! Thanks for submitting the PR.

@jreback
Copy link
Contributor

jreback commented Nov 29, 2018

pls perf test things

caching is essential as these are created quite a lot and they are the same virtually every time

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 29, 2018

On microbenchmarks, things are fine

master:

In [2]: %time a = pd.core.dtypes.dtypes.DatetimeTZDtype(unit='ns', tz="UTC")
CPU times: user 39 µs, sys: 31 µs, total: 70 µs
Wall time: 75.6 µs

In [3]: %time a = pd.core.dtypes.dtypes.DatetimeTZDtype(unit='ns', tz="UTC")
CPU times: user 15 µs, sys: 0 ns, total: 15 µs
Wall time: 18.8 µs

PR:

In [2]: %time a = pd.core.dtypes.dtypes.DatetimeTZDtype(unit='ns', tz="UTC")
CPU times: user 29 µs, sys: 23 µs, total: 52 µs
Wall time: 56 µs

In [3]: %time a = pd.core.dtypes.dtypes.DatetimeTZDtype(unit='ns', tz="UTC")
CPU times: user 16 µs, sys: 0 ns, total: 16 µs
Wall time: 19.8 µs

ASV for the timeseries, timestamps, and offsets

       before           after         ratio
     [6b3490f4]       [982c169a]
-         143±9ms          127±4ms     0.89  
timeseries.DatetimeIndex.time_add_timedelta('tz_aware')
-     2.73±0.05μs      2.45±0.02μs     0.90  timestamp.TimestampOps.time_tz_convert(tzutc())

Line-profiling reveals that basically all the time is spent on the timezone check.


In [5]: %lprun -s -f DatetimeTZDtype.__init__ DatetimeTZDtype(tz='utc')
Timer unit: 1e-06 s

Total time: 4e-05 s
File: /Users/taugspurger/sandbox/pandas/pandas/core/dtypes/dtypes.py
Function: __init__ at line 494

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   494                                               def __init__(self, unit="ns", tz=None):
   495                                                   """
   496                                                   An ExtensionDtype for timezone-aware datetime data.
   497
   498                                                   Parameters
   499                                                   ----------
   500                                                   unit : str, default "ns"
   501                                                       The precision of the datetime data. Currently limited
   502                                                       to ``"ns"``.
   503                                                   tz : str, int, or datetime.tzinfo
   504                                                       The timezone.
   505
   506                                                   Raises
   507                                                   ------
   508                                                   pytz.UnknownTimeZoneError
   509                                                       When the requested timezone cannot be found.
   510
   511                                                   Examples
   512                                                   --------
   513                                                   >>> pd.core.dtypes.dtypes.DatetimeTZDtype(tz='UTC')
   514                                                   datetime64[ns, UTC]
   515
   516                                                   >>> pd.core.dtypes.dtypes.DatetimeTZDtype(tz='dateutil/US/Central')
   517                                                   datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]
   518                                                   """
   519         1          7.0      7.0     17.5          if isinstance(unit, DatetimeTZDtype):
   520                                                       unit, tz = unit.unit, unit.tz
   521
   522         1          1.0      1.0      2.5          if unit != 'ns':
   523                                                       raise ValueError("DatetimeTZDtype only supports ns units")
   524
   525         1          0.0      0.0      0.0          if tz:
   526         1         27.0     27.0     67.5              tz = timezones.maybe_get_tz(tz)
   527                                                   elif tz is not None:
   528                                                       raise pytz.UnknownTimeZoneError(tz)
   529                                                   elif tz is None:
   530                                                       raise TypeError("A 'tz' is required.")
   531
   532         1          4.0      4.0     10.0          self._unit = unit
   533         1          1.0      1.0      2.5          self._tz = tz

I switched the properties to cache_readonly (which we can do, since we aren't using _cache for caching instances now).

_metadata = ('unit', 'tz')
_match = re.compile(r"(datetime64|M8)\[(?P<unit>.+), (?P<tz>.+)\]")
_cache = {}
# TODO: restore caching? who cares though? It seems needlessly complex.
# np.dtype('datetime64[ns]') isn't a singleton
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a huge performance penalty w/o caching

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see the perf numbers I posted in
#23990 (comment)? It seems to be slightly faster without caching (though within noise).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try running a good part of the test suite. Its the repeated consruction that's a problem, not the single contruction which is fine. W/o caching you end up creating a huge number of these

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess could remove the comment now

pandas/core/dtypes/dtypes.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 29, 2018 via email

@jreback
Copy link
Contributor

jreback commented Nov 29, 2018

maybe things changed since I created this originally, but w/o caching perf was not great.

* Use pandas_dtype
* removed cache_readonly
@codecov
Copy link

codecov bot commented Nov 29, 2018

Codecov Report

Merging #23990 into master will increase coverage by 49.87%.
The diff coverage is 97.95%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #23990       +/-   ##
===========================================
+ Coverage   42.38%   92.25%   +49.87%     
===========================================
  Files         161      161               
  Lines       51701    51689       -12     
===========================================
+ Hits        21914    47687    +25773     
+ Misses      29787     4002    -25785
Flag Coverage Δ
#multiple 90.66% <97.95%> (?)
#single 42.51% <63.26%> (+0.12%) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/common.py 95.62% <ø> (+25.23%) ⬆️
pandas/core/arrays/datetimelike.py 96.34% <100%> (+45.91%) ⬆️
pandas/core/arrays/datetimes.py 98.43% <100%> (+34.63%) ⬆️
pandas/core/internals/blocks.py 93.69% <100%> (+41.32%) ⬆️
pandas/core/dtypes/missing.py 93.1% <100%> (+35.63%) ⬆️
pandas/core/dtypes/dtypes.py 95.33% <97.14%> (+19.17%) ⬆️
pandas/core/computation/pytables.py 92.37% <0%> (+0.3%) ⬆️
pandas/io/pytables.py 92.3% <0%> (+0.92%) ⬆️
pandas/util/_test_decorators.py 93.24% <0%> (+4.05%) ⬆️
pandas/compat/__init__.py 58.36% <0%> (+8.17%) ⬆️
... and 125 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08395af...5cde369. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

I've removed _coerce_to_dtype.

@jreback
Copy link
Contributor

jreback commented Nov 29, 2018

dont you need to remove the _coerce_to_dtype tests?

@TomAugspurger
Copy link
Contributor Author

7ab2a74 removed all the _coerce_to_dtype tests.

@TomAugspurger
Copy link
Contributor Author

Do we typically add release notes for something like this (deprecations that aren't part of the public API, but are possibly in use downstream)?

@TomAugspurger
Copy link
Contributor Author

Added to the deprecation log.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

Do we typically add release notes for something like this (deprecations that aren't part of the public API, but are possibly in use downstream)?

this IS certainly part of the public API as DatetimeTZDtype is public.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

I know pyarrow uses this at the very least. (I don't know if the 2 form of the construction is used though). cc @wesm

@TomAugspurger
Copy link
Contributor Author

Ah, I didn't realize it was exported in api.types, it's not in the API docs. I'll add it to api.rst at the top-level (as of #23581)

@TomAugspurger
Copy link
Contributor Author

Added a release note. Going to followup on api.rst later, since we don't have an array section for datetime array there yet.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

thanks @TomAugspurger lgtm. let's merge on green. can you add a ref into #6581

@TomAugspurger
Copy link
Contributor Author

@datapythonista
Copy link
Member

Haven't seen those errors before, I guess something is failing in their side. @davidstaheli do you mind taking a look at https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=4597

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

@TomAugspurger needs a rebase as well

@datapythonista
Copy link
Member

@TomAugspurger I was trying to create an incident for azure, but the url does some weird redirect with the language and it crashes. Not sure if it's Linux specific, can you check if the "Create incident" button works for you: https://azure.microsoft.com/en-us/support/create-ticket/ (the Basic technical support one)

Or @vtbassmatt, may be you can help with the above? See the error in https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=4597, we are getting that often

@datapythonista
Copy link
Member

I was checking, and numba is getting the same error from azure. I guess something is down for the whole azure-pipelines.

https://dev.azure.com/numba/numba/_build/results?buildId=471

@vtbassmatt
Copy link
Contributor

I suspect it's a little bit of abuse fighting that was turned up a little too high. I think we've addressed it now.

@jreback jreback merged commit 3fe697f into pandas-dev:master Dec 3, 2018
@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

thanks @TomAugspurger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants