
Update xcdat.open_mfdataset time decoding logic #161

Merged
merged 8 commits into main from bugfix/158-mfdataset-cfimtes-logic on Jan 4, 2022

Conversation

@pochedls (Collaborator) commented Nov 18, 2021

Description

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@codecov-commenter commented Nov 18, 2021

Codecov Report

Merging #161 (cd9491d) into main (201a575) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #161   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            8         8           
  Lines          381       423   +42     
=========================================
+ Hits           381       423   +42     
Impacted Files Coverage Δ
xcdat/__init__.py 100.00% <100.00%> (ø)
xcdat/dataset.py 100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tomvothecoder (Collaborator) commented Nov 22, 2021

The check for CF compliance in _check_dataset_for_cf_compliant_time() is repeated in decode_time_units().
https://github.com/XCDAT/xcdat/blob/b010930ecf96117c793ce7fc04d89f5074ba8271/xcdat/dataset.py#L329-L341
We should remove these checks in decode_time_units() and refactor _check_dataset_for_cf_compliant_time() so that it works for both a single dataset (through xcdat.open_dataset()) and multi-file datasets (through xcdat.open_mfdataset()).
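For illustration, here is a minimal, hypothetical sketch of what a single shared compliance check could look like (not the actual xcdat implementation; it assumes the coordinate is named `time` and treats "months since"/"years since" as the non-CF compliant cases discussed in this thread):

    from typing import List, Union

    import xarray as xr

    # Units that xarray/cftime cannot decode directly (assumed list, for
    # illustration only).
    NON_CF_UNITS = ("months since", "years since")

    def _has_cf_compliant_time(paths: Union[str, List[str]]) -> bool:
        # Inspect the un-decoded "units" attribute; for a multi-file dataset,
        # only the first file is checked here to keep the sketch short.
        first = paths[0] if isinstance(paths, list) else paths
        ds = xr.open_dataset(first, decode_times=False)
        units = ds.time.attrs.get("units", "")
        return not units.startswith(NON_CF_UNITS)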

Comments from #158

The code excerpt above shows that xcdat yields a tas time series with only 12 time steps, while xarray yields 1872. We can reproduce the xcdat result by adding decode_times=False (which is included in the xcdat.open_mfdataset() logic here).

I expected xr.open_mfdataset(decode_times=False) to still be able to join all non-decoded timestamps, but apparently it doesn't. Instead, it only keeps the last file's timestamps, probably for the reason quoted below.

"I think (this is a guess) this may arise because the time units in each netCDF in this directory have different units (but perhaps the same time values)."

I believe your explanation might be right, since each file has its own units ("days since ..."). This might be preventing xarray from joining the timestamps across files.


Another important note is that the updates in this PR only handle multi-file datasets with CF compliant time.
We still have to handle multi-file datasets that have non-CF compliant time, which forces us to use decode_times=False and thus leads to timestamps being dropped. I'm trying to see why xarray only keeps the last file's timestamps in certain cases (e.g., diff units).

Example:

    # if cf_compliant, let xarray decode the time units
    # otherwise, decode using decode_time_units
    if cf_compliant:
        ds = xr.open_mfdataset(paths, decode_times=True, data_vars=data_vars, **kwargs)
    else:
        # FIXME: If non-CF compliant, using `decode_times=False` will still only
        # keep the last file's timestamps. There's an issue with how
        # `decode_times=False` works in `xr.open_mfdataset()`.
        ds = xr.open_mfdataset(paths, decode_times=False, data_vars=data_vars, **kwargs)
        if ds.cf.dims.get("T") is not None:
            ds = decode_time_units(ds)

@tomvothecoder (Collaborator)

In PR #107, I'm running into a weird xarray issue when masking specific time bounds coordinates for incomplete DJF seasons in datasets with CF compliant time units that are decoded using xr.decode_cf().

I opened an issue on the xarray repo (pydata/xarray#6015). The updates in this branch should let us avoid xr.decode_cf() by decoding through decode_time_units() instead.

@pochedls (Collaborator, Author)

Another important note is that the updates in this PR only handle multi-file datasets with CF compliant time.
We still have to handle multi-file datasets that have non-CF compliant time, which forces us to use decode_times=False and thus leads to timestamps being dropped. I'm trying to see why xarray only keeps the last file's timestamps in certain cases (e.g., diff units).

Okay - I did not realize this - I see that this is not working for non-CF compliant multi-file datasets:

flist = ['/p/user_pub/climate_work/pochedley1/cmip6_msu/tmt_Amon_CanESM5_historical_r1i1p1f1_gr_185001-201412.nc',
         '/p/user_pub/climate_work/pochedley1/cmip6_msu/tmt_Amon_CanESM5_ssp370_r1i1p1f1_gr_201501-210012.nc']
ds = xcdat.open_mfdataset(flist)
print(ds.time)

<xarray.DataArray 'time' (time: 3012)>
array(['1800-01-01T00:00:00.000000000', '1800-02-01T00:00:00.000000000',
       '1800-03-01T00:00:00.000000000', ..., '2050-10-01T00:00:00.000000000',
       '2050-11-01T00:00:00.000000000', '2050-12-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1800-01-01 1800-02-01 ... 2050-12-01
Attributes:
    bounds:        time_bnds
    units:         months since 1800
    calendar:      noleap
    axis:          T
    realtopology:  linear

The decoded time only goes to 2050 (it should go to 2100). It seems like this is an issue with decode_time_units. Is that the correct interpretation? Note that when I try to open either of these files using xcdat.open_dataset() I get a time vector that (incorrectly) begins in January 1800.

@tomvothecoder added the "type: enhancement" label on Nov 29, 2021
@tomvothecoder (Collaborator)

The decoded time only goes to 2050 (it should go to 2100). It seems like this is an issue with decode_time_units. Is that the correct interpretation? Note that when I try to open either of these files using xcdat.open_dataset() I get a time vector that (incorrectly) begins in January 1800.

I will investigate this issue using your example files once /user_pub is back online.

@tomvothecoder (Collaborator) commented Nov 29, 2021

The decoded time only goes to 2050 (it should go to 2100). It seems like this is an issue with decode_time_units. Is that the correct interpretation? Note that when I try to open either of these files using xcdat.open_dataset() I get a time vector that (incorrectly) begins in January 1800.

Yes, the issue is related to decode_time_units().

For your example, which includes non-CF time units, the reference date of January 1800 is being set as the start for the time series. Instead, it should be January 1800 + the first coordinate value (600 months in this case when decode_times=False).

>>> ds_xr1 = xr.open_dataset(flist[0], decode_times=False)
>>> ds_xr1.time.units
'months since 1800'
>>> ds_xr1.time.data[0]
600.0

Related code:
https://github.com/XCDAT/xcdat/blob/69dd39d249fdf63a78e39ba44dca2e968b714e83/xcdat/dataset.py#L345-L353
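For reference, a minimal sketch of the offset-based approach (illustrative only, not the exact fix in this PR; it assumes "months since <date>" units and ignores the calendar attribute):

    import pandas as pd
    import xarray as xr

    # flist as defined in the example above.
    ds = xr.open_dataset(flist[0], decode_times=False)

    units, ref_date = ds.time.attrs["units"].split(" since ")  # -> ("months", "1800")
    ref_date = pd.Timestamp(ref_date)

    # Apply each encoded value as an offset to the reference date, instead of
    # starting a fixed date range at the reference date itself.
    decoded = [ref_date + pd.DateOffset(months=int(v)) for v in ds.time.values]
    print(decoded[0])  # Timestamp('1850-01-01 00:00:00'), i.e. 1800 + 600 months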

I'll include a fix in this PR, thanks!

@tomvothecoder (Collaborator)

Continuing work in PR #173, which will merge into this PR.

@tomvothecoder mentioned this pull request on Dec 1, 2021
@tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic branch from 9483cf9 to ec6e20c on December 2, 2021
@tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic branch from d003a9f to 0e904c4 on December 16, 2021
@tomvothecoder (Collaborator) left a comment


This PR looks good to me now. I added an additional fix, which I've pointed out in a comment.

Feel free to merge after you do a final review.

@pochedls (Collaborator, Author)

@tomvothecoder: I am assuming these kinds of pytest messages are fine?

....2021-12-23 14:15:35,270 [DEBUG]: dataset.py(infer_or_keep_var:461) >> This dataset contains more than one regular data variable ('tas', 'ts'). If desired, pass the data_var kwarg to reduce down to one regular data var.
2021-12-23 14:15:35,270 [DEBUG]: dataset.py(infer_or_keep_var:461) >> This dataset contains more than one regular data variable ('tas', 'ts'). If desired, pass the data_var kwarg to reduce down to one regular data var.
......2021-12-23 14:15:35,328 [DEBUG]: dataset.py(get_inferred_var:534) >> The data variable 'ts' was inferred from the Dataset attr 'xcdat_infer' for this operation.
2021-12-23 14:15:35,328 [DEBUG]: dataset.py(get_inferred_var:534) >> The data variable 'ts' was inferred from the Dataset attr 'xcdat_infer' for this operation.

@tomvothecoder (Collaborator)

@pochedls Yeah, those are debug-level messages that appear in the test suite. I might end up removing them because an error is thrown anyway if xcdat_infer isn't set, and they get distracting in the test suite.

@pochedls (Collaborator, Author)

@tomvothecoder - This looks good to me. Could you merge it (or I can merge it with some guidance from you after the holidays)? Thanks for all your contributions to this bug/PR!

@tomvothecoder (Collaborator)

@pochedls Thanks for the final review! I'll handle rebasing and merging the PR today.

pochedls and others added 6 commits January 4, 2022 08:49
- Fixes issue with incorrect time values being decoded in `decode_time_units()` for non-CF compliant time units
  -  The fix is to use the time values as offsets to the reference date in the "units" attribute
- Fixes `open_mfdataset(decode_times=False)` dropping time values when datasets have the same numerically encoded time values but differing non-CF compliant time units (e.g., "months since 2000-01-01", "months since 2001-01-01")

Summary of Changes
- Add optional boolean kwarg `decode_times` to `open_dataset()` and `open_mfdataset()`
  -  Add conditionals to handle this kwarg when True or False
- Add optional callable kwarg  `preprocess` to `open_mfdataset()`
  - Add `_preprocess_non_cf_dataset()` to decode each dataset's non-CF compliant time values before concatenation (handles cases where datasets share the same numeric time values but have different time units, which would otherwise drop time values; see the sketch after this list)
- Update `decode_non_cf_time()`
  - Rename from `decode_time_units()` to `decode_non_cf_time()`
  - Remove logic for checking cf compliance, which is now handled by `_has_cf_compliant_time()`
  - Fix the incorrect start date for decoded time coordinates by forming them from offsets applied to the reference date, instead of using the reference date as the start of a fixed `pd.date_range()`
    - Using `pd.date_range()` incorrectly assumed no gaps/missing data and that coordinate points start at the beginning of the month. It also did not handle calendar types correctly (e.g., leap years), and would reset offsets to the beginning of the month or year if they did not already fall there.
  - Add decoding of time bounds
- Add utility function `_split_time_units_attr()` for splitting "units" attribute into units and reference date strings
- Update docstrings of methods
- Update test fixtures for correctness and readable syntax
- Fix type annotation in docstrings
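
A rough, hypothetical sketch of how `_split_time_units_attr()` and the `preprocess` callable could fit together (a simplification of the changes listed above; it assumes a coordinate named `time`, integer offsets, and ignores calendars and time bounds):

    from typing import Tuple

    import pandas as pd
    import xarray as xr

    def _split_time_units_attr(units_attr: str) -> Tuple[str, str]:
        # e.g. "months since 2000-01-01" -> ("months", "2000-01-01")
        units, reference_date = units_attr.split(" since ")
        return units, reference_date

    def _preprocess_non_cf_dataset(ds: xr.Dataset) -> xr.Dataset:
        # Decode each file's time values against its own reference date before
        # xarray concatenates the files, so files that share numeric time
        # values but have different reference dates no longer collapse.
        units, ref_date = _split_time_units_attr(ds.time.attrs["units"])
        ref_date = pd.Timestamp(ref_date)
        decoded = [ref_date + pd.DateOffset(**{units: int(v)}) for v in ds.time.values]
        return ds.assign_coords(time=decoded)

    # Usage (paths as in the earlier example in this thread):
    # ds = xr.open_mfdataset(paths, decode_times=False, preprocess=_preprocess_non_cf_dataset)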
@tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic branch from fa374e1 to bd80636 on January 4, 2022
@tomvothecoder merged commit 5d2dda1 into main on Jan 4, 2022
@tomvothecoder deleted the bugfix/158-mfdataset-cfimtes-logic branch on January 4, 2022
Labels
type: enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

open_mfdataset not decoding times correctly
3 participants