
Update xcdat.open_mfdataset time decoding logic #161

Merged
merged 8 commits into main from bugfix/158-mfdataset-cfimtes-logic on Jan 4, 2022

Conversation

@pochedls (Collaborator) commented Nov 18, 2021

Description

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@codecov-commenter commented Nov 18, 2021

Codecov Report

Merging #161 (cd9491d) into main (201a575) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #161   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            8         8           
  Lines          381       423   +42     
=========================================
+ Hits           381       423   +42     
Impacted Files Coverage Δ
xcdat/__init__.py 100.00% <100.00%> (ø)
xcdat/dataset.py 100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tomvothecoder (Collaborator) commented Nov 22, 2021

The check for CF compliance in _check_dataset_for_cf_compliant_time() is repeated in decode_time_units().
https://github.com/XCDAT/xcdat/blob/b010930ecf96117c793ce7fc04d89f5074ba8271/xcdat/dataset.py#L329-L341
We should remove these checks in decode_time_units() and refactor _check_dataset_for_cf_compliant_time() so that it works for both a single dataset (through xcdat.open_dataset()) and multi-file datasets (through xcdat.open_mfdataset()).
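For illustration, here is a minimal, hypothetical sketch of what a single shared compliance check could look like (not the actual xcdat implementation; it assumes the coordinate is named `time` and treats "months since"/"years since" as the non-CF compliant cases discussed in this thread):

    from typing import List, Union

    import xarray as xr

    # Units that xarray/cftime cannot decode directly (assumed list, for
    # illustration only).
    NON_CF_UNITS = ("months since", "years since")

    def _has_cf_compliant_time(paths: Union[str, List[str]]) -> bool:
        # Inspect the un-decoded "units" attribute; for a multi-file dataset,
        # only the first file is checked here to keep the sketch short.
        first = paths[0] if isinstance(paths, list) else paths
        ds = xr.open_dataset(first, decode_times=False)
        units = ds.time.attrs.get("units", "")
        return not units.startswith(NON_CF_UNITS)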

Comments from #158

The code excerpt above shows that xcdat yields a tas time series with only 12 time steps, while xarray yields 1872. We can reproduce the xcdat result by adding decode_times=False (which is included in the xcdat.open_mfdataset() logic here).

I expected xr.open_mfdataset(decode_times=False) to still be able to join all non-decoded timestamps, but apparently it doesn't. Instead, it only keeps the last file's timestamps, probably for the reason quoted below.

"I think (this is a guess) this may arise because the time units in each netCDF in this directory have different units (but perhaps the same time values)."

I believe your explanation might be right, since each file has its own units ("days since ..."). This might be preventing xarray from joining the timestamps across files.


Another important note is that the updates in this PR only handle multi-file datasets with CF compliant time.
We still have to handle multi-file datasets that have non-CF compliant time, which forces us to use decode_times=False and thus leads to timestamps being dropped. I'm trying to see why xarray only keeps the last file's timestamps in certain cases (e.g., diff units).

Example:

    # if cf_compliant, let xarray decode the time units
    # otherwise, decode using decode_time_units
    if cf_compliant:
        ds = xr.open_mfdataset(paths, decode_times=True, data_vars=data_vars, **kwargs)
    else:
        # FIXME: If non-CF compliant, using `decode_times=False` will still only
        # keep the last file's timestamps. There's an issue with how
        # `decode_times=False` works in `xr.open_mfdataset()`.
        ds = xr.open_mfdataset(paths, decode_times=False, data_vars=data_vars, **kwargs)
        if ds.cf.dims.get("T") is not None:
            ds = decode_time_units(ds)

@tomvothecoder (Collaborator)

In PR #107, I'm running into a weird xarray issue when masking specific time bounds coordinates for incomplete DJF seasons in datasets with CF compliant time units that are decoded using xr.decode_cf().

I opened an issue on the xarray repo (pydata/xarray#6015). The updates in this branch should let us avoid xr.decode_cf() by decoding through decode_time_units() instead.

@pochedls (Collaborator, Author)

Another important note is that the updates in this PR only handle multi-file datasets with CF compliant time.
We still have to handle multi-file datasets that have non-CF compliant time, which forces us to use decode_times=False and thus leads to timestamps being dropped. I'm trying to see why xarray only keeps the last file's timestamps in certain cases (e.g., diff units).

Okay - I did not realize this - I see that this is not working for non-CF compliant multi-file datasets:

flist = ['/p/user_pub/climate_work/pochedley1/cmip6_msu/tmt_Amon_CanESM5_historical_r1i1p1f1_gr_185001-201412.nc',
         '/p/user_pub/climate_work/pochedley1/cmip6_msu/tmt_Amon_CanESM5_ssp370_r1i1p1f1_gr_201501-210012.nc']
ds = xcdat.open_mfdataset(flist)
print(ds.time)

<xarray.DataArray 'time' (time: 3012)>
array(['1800-01-01T00:00:00.000000000', '1800-02-01T00:00:00.000000000',
       '1800-03-01T00:00:00.000000000', ..., '2050-10-01T00:00:00.000000000',
       '2050-11-01T00:00:00.000000000', '2050-12-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1800-01-01 1800-02-01 ... 2050-12-01
Attributes:
    bounds:        time_bnds
    units:         months since 1800
    calendar:      noleap
    axis:          T
    realtopology:  linear

The decoded time only goes to 2050 (it should go to 2100). It seems like this is an issue with decode_time_units. Is that the correct interpretation? Note that when I try to open either of these files using xcdat.open_dataset() I get a time vector that (incorrectly) begins in January 1800.

@tomvothecoder added the "type: enhancement" label on Nov 29, 2021
@tomvothecoder (Collaborator)

The decoded time only goes to 2050 (it should go to 2100). It seems like this is an issue with decode_time_units. Is that the correct interpretation? Note that when I try to open either of these files using xcdat.open_dataset() I get a time vector that (incorrectly) begins in January 1800.

I will investigate this issue using your example files once /user_pub is back online.

@tomvothecoder (Collaborator) commented Nov 29, 2021

The decoded time only goes to 2050 (it should go to 2100). It seems like this is an issue with decode_time_units. Is that the correct interpretation? Note that when I try to open either of these files using xcdat.open_dataset() I get a time vector that (incorrectly) begins in January 1800.

Yes, the issue is related to decode_time_units().

For your example, which includes non-CF time units, the reference date of January 1800 is being set as the start for the time series. Instead, it should be January 1800 + the first coordinate value (600 months in this case when decode_times=False).

>>> ds_xr1 = xr.open_dataset(flist[0], decode_times=False)
>>> ds_xr1.time.units
'months since 1800'
>>> ds_xr1.time.data[0]
600.0

Related code:
https://github.com/XCDAT/xcdat/blob/69dd39d249fdf63a78e39ba44dca2e968b714e83/xcdat/dataset.py#L345-L353
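For reference, a minimal sketch of the offset-based approach (illustrative only, not the exact fix in this PR; it assumes "months since <date>" units and ignores the calendar attribute):

    import pandas as pd
    import xarray as xr

    # flist as defined in the example above.
    ds = xr.open_dataset(flist[0], decode_times=False)

    units, ref_date = ds.time.attrs["units"].split(" since ")  # -> ("months", "1800")
    ref_date = pd.Timestamp(ref_date)

    # Apply each encoded value as an offset to the reference date, instead of
    # starting a fixed date range at the reference date itself.
    decoded = [ref_date + pd.DateOffset(months=int(v)) for v in ds.time.values]
    print(decoded[0])  # Timestamp('1850-01-01 00:00:00'), i.e. 1800 + 600 months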

I'll include a fix in this PR, thanks!

@tomvothecoder (Collaborator)

Continuing work in PR #173, which will merge into this PR.

@tomvothecoder mentioned this pull request on Dec 1, 2021
@tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic branch from 9483cf9 to ec6e20c on December 2, 2021
@tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic branch from d003a9f to 0e904c4 on December 16, 2021
@tomvothecoder (Collaborator) left a comment


This PR looks good to me now. I added an additional fix, which I've pointed out in a comment.

Feel free to merge after you do a final review.

@pochedls (Collaborator, Author)

@tomvothecoder: I am assuming these kinds of pytest messages are fine?

....2021-12-23 14:15:35,270 [DEBUG]: dataset.py(infer_or_keep_var:461) >> This dataset contains more than one regular data variable ('tas', 'ts'). If desired, pass the data_var kwarg to reduce down to one regular data var.
2021-12-23 14:15:35,270 [DEBUG]: dataset.py(infer_or_keep_var:461) >> This dataset contains more than one regular data variable ('tas', 'ts'). If desired, pass the data_var kwarg to reduce down to one regular data var.
......2021-12-23 14:15:35,328 [DEBUG]: dataset.py(get_inferred_var:534) >> The data variable 'ts' was inferred from the Dataset attr 'xcdat_infer' for this operation.
2021-12-23 14:15:35,328 [DEBUG]: dataset.py(get_inferred_var:534) >> The data variable 'ts' was inferred from the Dataset attr 'xcdat_infer' for this operation.

@tomvothecoder (Collaborator)

@pochedls Yeah, those are debug-level messages that appear in the test suite. I might end up removing them because an error is thrown anyway if xcdat_infer isn't set, and they get distracting in the test suite.

@pochedls (Collaborator, Author)

@tomvothecoder - This looks good to me. Could you merge it (or I can merge it with some guidance from you after the holidays)? Thanks for all your contributions to this bug/PR!

@tomvothecoder (Collaborator)

@pochedls Thanks for the final review! I'll handle rebasing and merging the PR today.

pochedls and others added 6 commits January 4, 2022 08:49
- Fixes issue with incorrect time values being decoded in `decode_time_units()` for non-CF compliant time units
  -  The fix is to use the time values as offsets to the reference date in the "units" attribute
- Fixes `open_mfdataset(decode_times=False)` dropping time values when datasets have the same numerically encoded time values but differing non-CF compliant time units (e.g., "months since 2000-01-01", "months since 2001-01-01")

Summary of Changes
- Add optional boolean kwarg `decode_times` to `open_dataset()` and `open_mfdataset()`
  -  Add conditionals to handle this kwarg when True or False
- Add optional callable kwarg  `preprocess` to `open_mfdataset()`
  - Add `_preprocess_non_cf_dataset()` to decode each dataset's non-CF compliant time values before concatenation (handles cases where datasets share the same numeric time values but have different time units, which would otherwise drop time values; see the sketch after this list)
- Update `decode_non_cf_time()`
  - Rename from `decode_time_units()` to `decode_non_cf_time()`
  - Remove logic for checking cf compliance, which is now handled by `_has_cf_compliant_time()`
  - Fix the incorrect start date for decoded time coordinates by forming them from offsets applied to the reference date, instead of using the reference date as the start of a fixed `pd.date_range()`
    - Using `pd.date_range()` incorrectly assumed no gaps/missing data and that coordinate points start at the beginning of the month. It also did not handle calendar types correctly (e.g., leap years), and would reset offsets to the beginning of the month or year if they did not already fall there.
  - Add decoding of time bounds
- Add utility function `_split_time_units_attr()` for splitting "units" attribute into units and reference date strings
- Update docstrings of methods
- Update test fixtures for correctness and readable syntax
- Fix type annotation in docstrings
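
A rough, hypothetical sketch of how `_split_time_units_attr()` and the `preprocess` callable could fit together (a simplification of the changes listed above; it assumes a coordinate named `time`, integer offsets, and ignores calendars and time bounds):

    from typing import Tuple

    import pandas as pd
    import xarray as xr

    def _split_time_units_attr(units_attr: str) -> Tuple[str, str]:
        # e.g. "months since 2000-01-01" -> ("months", "2000-01-01")
        units, reference_date = units_attr.split(" since ")
        return units, reference_date

    def _preprocess_non_cf_dataset(ds: xr.Dataset) -> xr.Dataset:
        # Decode each file's time values against its own reference date before
        # xarray concatenates the files, so files that share numeric time
        # values but have different reference dates no longer collapse.
        units, ref_date = _split_time_units_attr(ds.time.attrs["units"])
        ref_date = pd.Timestamp(ref_date)
        decoded = [ref_date + pd.DateOffset(**{units: int(v)}) for v in ds.time.values]
        return ds.assign_coords(time=decoded)

    # Usage (paths as in the earlier example in this thread):
    # ds = xr.open_mfdataset(paths, decode_times=False, preprocess=_preprocess_non_cf_dataset)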
@tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic branch from fa374e1 to bd80636 on January 4, 2022
@tomvothecoder merged commit 5d2dda1 into main on Jan 4, 2022
@tomvothecoder deleted the bugfix/158-mfdataset-cfimtes-logic branch on January 4, 2022
Labels
type: enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

open_mfdataset not decoding times correctly
3 participants