Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xr.infer_freq #4033

Merged
merged 20 commits into from
May 30, 2020
Merged

xr.infer_freq #4033

merged 20 commits into from
May 30, 2020

Conversation

aulemahal
Copy link
Contributor

@aulemahal aulemahal commented May 5, 2020

  • Tests added
  • Passes isort -rc . && black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

This PR adds a xr.infer_freq method to copy pandas infer_freq but on CFTimeIndex objects. I tried to subclass pandas _FrequencyInferer and to only override as little as possible.

Two things are problematic right now and I would like to get feedback on how to implement them if this PR gets the dev's approval.

  1. pd.DatetimeIndex.asi8 returns integers representing nanoseconds since 1970-1-1, while xr.CFTimeIndex.asi8 returns microseconds. In order not to break the API, I patched the _CFTimeFrequencyInferer to store 1000x the values. Not sure if this is the best, but it works.

  2. As of now, xr.infer_freq will fail on weekly indexes. This is because pandas is using datetime.weekday() at some point but cftime objects do not implement that (they use dayofwk instead). I'm not sure what to do? Cftime could implement it to completly mirror python's datetime or pandas could use dayofwk since it's available on the TimeStamp objects.

Another option, cleaner but longer, would be to reimplement _FrequencyInferer from scratch. I may have time for this, cause I really think a xr.infer_freq method would be useful.

@TomNicholas TomNicholas requested a review from spencerkclark May 7, 2020 12:37
Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aulemahal this is super cool -- thanks for starting on it!

To give some feedback on your two points:

  1. Yes, it was an intentional choice to use microseconds as the unit for asi8 on the CFTimeIndex. cftime dates have microsecond precision, and this allows us to represent dates from a much longer time range than pd.Timestamp objects, which have nanosecond precision.

    • You noted this as an option toward the end of your post, and indeed I would actually encourage you to vendor/re-write pandas's _FrequencyInferrer class such that it is specifically designed for cftime dates and microsecond precision. This is for a couple reasons:

      • Multiplying asi8 by 1000 can lead to problems due to integer overflow, e.g. for frequencies greater than ~292 years:
      In [2]: times = xr.cftime_range("2000", periods=10, freq="106945D")
      
      In [3]: xr.infer_freq(times)
      Out[3]: '9240048S'
      
      • _FrequencyInferrer is private API in pandas, which means that they could change and break things we might rely on without warning. I've had a look at pandas's implementation and it doesn't seem to depend too much on other internal functionality to pandas -- in fact you've already implemented some of the more significant internal functions it depends on, build_field_sarray and month_position_check -- which means that hopefully it shouldn't be too onerous to copy the rest over.
  2. Given that I am recommending you do not subclass _FrequencyInferrer, I think this is no longer super relevant (i.e. use cftime.datetime.dayofwk in your implementation where needed). I will note that we do not currently implement business-related or weekly cftime offset objects, which will limit the utility of inferring those kinds of frequencies for the time being, but you're definitely not required to add those in this PR!

@aulemahal
Copy link
Contributor Author

@spencerkclark Thanks for the feedback!

I rewrote the _CFTimeFrequencyInferer class independently of pandas private objects. I removed outputs that wouldn't match offsets defined in xarray: weekly frequencies, week-of-month (WOM) and all business-related anchors. It wouldn't be difficult to add them later on if need be and if cftime_offsets.py defines them. Note that infer_freq is quite independent of the rest of xarray, this choice was purely so that the behavior matches the rest of xr. I did leave the sub-seconds frequencies (L for ms and U for µs).

I expect the performance of xr.infer_freq to be slightly worst than pandas' since they use optimized and compiled code whereas I only used common numpy and xarray stuff. However, I assume that this part of the workflow is far from being the bottleneck of most xarray usecases...

Will write up some docs soon.

Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aulemahal, this is looking really nice. I made a first pass of a review.

I removed outputs that wouldn't match offsets defined in xarray: weekly frequencies, week-of-month (WOM) and all business-related anchors. It wouldn't be difficult to add them later on if need be and if cftime_offsets.py defines them. Note that infer_freq is quite independent of the rest of xarray, this choice was purely so that the behavior matches the rest of xr.

Yeah this is totally fine. No one so far has requested these types of frequencies for cftime objects, and it sort of makes sense. While we can somewhat arbitrarily define them for non-standard calendars, the distinction between weekdays and weekends generally only really matters for the kinds of data that would be indexed with a standard calendar.

I expect the performance of xr.infer_freq to be slightly worst than pandas' since they use optimized and compiled code whereas I only used common numpy and xarray stuff. However, I assume that this part of the workflow is far from being the bottleneck of most xarray usecases...

It's indeed slower for cftime objects, but I think much of that is due to the slowness of CFTimeIndex.asi8 compared to DatetimeIndex.asi8 (not any code you wrote in this PR!). I agree this is likely not a bottleneck for most xarray use-cases.

xarray/tests/test_cftimeindex.py Outdated Show resolved Hide resolved
xarray/tests/test_cftimeindex.py Show resolved Hide resolved
xarray/coding/frequencies.py Show resolved Hide resolved
xarray/coding/frequencies.py Show resolved Hide resolved
xarray/coding/frequencies.py Show resolved Hide resolved
xarray/tests/test_cftimeindex.py Show resolved Hide resolved
xarray/coding/frequencies.py Outdated Show resolved Hide resolved
xarray/coding/frequencies.py Show resolved Hide resolved
xarray/coding/frequencies.py Show resolved Hide resolved
xarray/coding/frequencies.py Outdated Show resolved Hide resolved
@aulemahal aulemahal changed the title [WIP] xr.infer_freq xr.infer_freq May 25, 2020
@aulemahal aulemahal requested a review from spencerkclark May 25, 2020 22:43
Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the updates! I think this is almost ready to go from my perspective. I just have a few more very minor requests for tests, based on the latest coverage report:

  • Test that a non-monotonic index results in None returned as a frequency.
  • Test that an index with non-unique values results in None returned as a frequency.
  • Test that an index with more than one monthly delta results in None as a returned frequency, e.g. ["2000-01-01", "2000-02-01", "2000-04-01"].
  • Test that an index with more than one quarterly delta results in None as a returned frequency, e.g. ["2000-01-01", "2000-04-01", "2000-10-01"].

Finally could you add an issue requesting that microsecond and millisecond cftime offsets be added? Those would be straightforward to add for someone motivated, but are not required as part of this PR.

xarray/coding/frequencies.py Show resolved Hide resolved
doc/weather-climate.rst Outdated Show resolved Hide resolved
doc/whats-new.rst Outdated Show resolved Hide resolved
doc/weather-climate.rst Outdated Show resolved Hide resolved
aulemahal and others added 2 commits May 26, 2020 14:46
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
@aulemahal
Copy link
Contributor Author

@spencerkclark Here you go! Tests added for all cases you mentioned.

Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @aulemahal — I’ll merge this in a few days if there are no further comments.

@dcherian dcherian mentioned this pull request May 27, 2020
23 tasks
@spencerkclark spencerkclark merged commit fd9e620 into pydata:master May 30, 2020
@spencerkclark
Copy link
Member

Thanks @aulemahal! Nice work.

dcherian added a commit to TomNicholas/xarray that referenced this pull request Jun 24, 2020
…o-combine

* 'master' of github.com:pydata/xarray: (81 commits)
  use builtin python types instead of the numpy alias (pydata#4170)
  Revise pull request template (pydata#4039)
  pint support for Dataset (pydata#3975)
  drop eccodes in docs (pydata#4162)
  Update issue templates inspired/based on dask (pydata#4154)
  Fix failing upstream-dev build & remove docs build (pydata#4160)
  Improve typehints of xr.Dataset.__getitem__ (pydata#4144)
  provide a error summary for assert_allclose (pydata#3847)
  built-in accessor documentation (pydata#3988)
  Recommend installing cftime when time decoding fails. (pydata#4134)
  parameter documentation for DataArray.sel (pydata#4150)
  speed up map_blocks (pydata#4149)
  Remove outdated note from datetime accessor docstring (pydata#4148)
  Fix the upstream-dev pandas build failure (pydata#4138)
  map_blocks: Allow passing dask-backed objects in args (pydata#3818)
  keep attrs in reset_index (pydata#4103)
  Fix open_rasterio() for WarpedVRT with specified src_crs (pydata#4104)
  Allow non-unique and non-monotonic coordinates in get_clean_interp_index and polyfit (pydata#4099)
  update numpy's intersphinx url (pydata#4117)
  xr.infer_freq (pydata#4033)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants