Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional fixes for PR #158 #173

Conversation

tomvothecoder
Copy link
Collaborator

@tomvothecoder tomvothecoder commented Nov 30, 2021

Description

This PR addresses the following issues (for PR #161):

  • Incorrect time values being decoded in decode_time_units() for non-CF compliant time units
    • The fix is to use the time values as offsets to the reference date in the "units" attribute
  • Calling open_mfdataset(decode_times=False) when datasets have the same numerically encoded time values, but differing non-CF compliant time units (e.g., "months since 2000-01-01", "months since 2001-01-01"), resulting in time values being dropped.
    • The fix is decoding time for each dataset individually in the preprocessing step before concatenating.
    • Performance is comparable between decoding each dataset before concatenating, and concatenating then decoding:
    # 1. Decode each individual dataset before concatenating (required for non-CF compliant and units not the same)
    # 551 ms ± 49.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit -n 3 ds_decode_before = xcdat.open_mfdataset(["input/dummy/test1.nc", "input/dummy/test2.nc"])
    
    # 2. Decode after concatenating datasets (only for non-CF compliant and units the same)
    # 530 ms ± 35.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    def decode_after():
        ds = xr.open_mfdataset(["input/dummy/test1.nc", "input/dummy/test3.nc"],decode_times=False)
        ds = decode_non_cf_time(ds)
        return ds
    %timeit -n 3 decode_after()

Summary of Changes

  • Add optional boolean kwarg decode_times to open_dataset() and open_mfdataset()
    • Add conditionals to handle this kwarg when True or False
  • Add optional callable kwarg preprocess to open_mfdataset()
    • Add _preprocess_non_cf_dataset() function to decode datasets' time values with non-CF compliant units before concatenating (handles cases where the datasets have the same time values and different time units, which would otherwise result in dropping of time values)
  • Update decode_non_cf_time()
    • Rename from decode_time_units() to decode_non_cf_time()
    • Remove logic for checking cf compliance, which is now handled by _has_cf_compliant_time()
    • Fix incorrect start date for decoded time coordinates by forming them using offsets and reference dates, instead of reference date as the start point and a fixed pd.date_range()
      • Using pd.date_range() incorrectly assumes no gaps/missing data and that coordinate points started at the beginning of the month. It also did not handle calendar types correctly (e.g,. leap years), and would reset offsets at the beginning of the month or year if they weren't already.
    • Add decoding of time bounds
  • Add utility function _split_time_units_attr() for splitting "units" attribute into units and reference date strings
  • Update docstrings of methods
  • Update test fixtures for correctness and readable syntax

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder added type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. Priority: High labels Nov 30, 2021
@tomvothecoder tomvothecoder self-assigned this Nov 30, 2021
@tomvothecoder tomvothecoder marked this pull request as ready for review December 2, 2021 20:07

ds = dataset[[data_var] + bounds_vars]
ds.attrs["xcdat_infer"] = data_var
# FIXME: This doesn't handle pathlib paths or a list of lists
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to handle this still

@tomvothecoder tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic-tom branch from f91be30 to b15a776 Compare December 2, 2021 23:05
@tomvothecoder tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic branch from 9483cf9 to ec6e20c Compare December 2, 2021 23:07
@tomvothecoder tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic-tom branch from b15a776 to 378798f Compare December 2, 2021 23:09
- Rename to `decode_non_cf_time()` and remove logic for cf unit decoding
- Fix incorrect start date for pandas date range by adding offset from first coordinate point

Refactor function and variable names
- Update `xcdat_infer` attr from None to "None" since `to_netcdf` does not support None type
- Update fixtures for non cf time bounds and `generate_dataset()`

Update `decode_non_cf_time()` to use offsets correctly
- Using pd.date_range() is incorrect because it assumes no gaps or missing data. It also resets offsets to beginning of the month and year and does not consider leap years
- pd.DataOffset considers relative arithmetic operations based on the calendar

Update `decode_non_cf_time()` to handle time bounds
- Update tests for `decode_non_cf_time()`

Add functions to `__init__.py`
- Update time fixtures with correct attributes

Add `_preprocess_non_cf_dataset()`
- Add test for callable in `_preprocess_non_cf_dataset`
@tomvothecoder tomvothecoder force-pushed the bugfix/158-mfdataset-cfimtes-logic-tom branch from 378798f to 10ffc6b Compare December 2, 2021 23:11
Copy link
Collaborator

@pochedls pochedls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your catch on this issue and fix is really valuable. I read through the code and did not see problems, but I think merging this and testing it with a wide swath of CMIP (and non-CMIP) data will help to further evaluate / validate these methods.

xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
@tomvothecoder
Copy link
Collaborator Author

I think your catch on this issue and fix is really valuable. I read through the code and did not see problems, but I think merging this and testing it with a wide swath of CMIP (and non-CMIP) data will help to further evaluate / validate these methods.

Thanks for the review! I'm glad to hear that the changes made sense to you. I also agree, we should stress test these changes using many different datasets.

I will merge this PR to your PR.

@tomvothecoder tomvothecoder merged commit d003a9f into bugfix/158-mfdataset-cfimtes-logic Dec 6, 2021
@tomvothecoder tomvothecoder deleted the bugfix/158-mfdataset-cfimtes-logic-tom branch December 6, 2021 19:50
tomvothecoder added a commit that referenced this pull request Dec 16, 2021
- Fixes issue with incorrect time values being decoded in `decode_time_units()` for non-CF compliant time units
  -  The fix is to use the time values as offsets to the reference date in the "units" attribute
- Fixes calling `open_mfdataset(decode_times=False)` when datasets have the same numerically encoded time values, but differing non-CF compliant time units (e.g., "months since 2000-01-01", "months since 2001-01-01"), resulting in time values being dropped.

Summary of Changes
- Add optional boolean kwarg `decode_times` to `open_dataset()` and `open_mfdataset()`
  -  Add conditionals to handle this kwarg when True or False
- Add optional callable kwarg  `preprocess` to `open_mfdataset()`
  - Add `_preprocess_non_cf_dataset()` function to decode datasets' time values with non-CF compliant units before concatenating (handles cases where the datasets have the same time values and different time units, which would otherwise result in dropping of time values)
- Update `decode_non_cf_time()`
  - Rename from `decode_time_units()` to `decode_non_cf_time()`
  - Remove logic for checking cf compliance, which is now handled by `_has_cf_compliant_time()`
  - Fix incorrect start date for decoded time coordinates by forming them using offsets and reference dates, instead of reference date as the start point and a fixed `pd.date_range()`
    - Using `pd.date_range()` incorrectly assumes no gaps/missing data and that coordinate points started at the beginning of the month. It also did not handle calendar types correctly (e.g,. leap years), and would reset offsets at the beginning of the month or year if they weren't already.
  - Add decoding of time bounds
- Add utility function `_split_time_units_attr()` for splitting "units" attribute into units and reference date strings
- Update docstrings of methods
- Update test fixtures for correctness and readable syntax
tomvothecoder added a commit that referenced this pull request Jan 4, 2022
- Fixes issue with incorrect time values being decoded in `decode_time_units()` for non-CF compliant time units
  -  The fix is to use the time values as offsets to the reference date in the "units" attribute
- Fixes calling `open_mfdataset(decode_times=False)` when datasets have the same numerically encoded time values, but differing non-CF compliant time units (e.g., "months since 2000-01-01", "months since 2001-01-01"), resulting in time values being dropped.

Summary of Changes
- Add optional boolean kwarg `decode_times` to `open_dataset()` and `open_mfdataset()`
  -  Add conditionals to handle this kwarg when True or False
- Add optional callable kwarg  `preprocess` to `open_mfdataset()`
  - Add `_preprocess_non_cf_dataset()` function to decode datasets' time values with non-CF compliant units before concatenating (handles cases where the datasets have the same time values and different time units, which would otherwise result in dropping of time values)
- Update `decode_non_cf_time()`
  - Rename from `decode_time_units()` to `decode_non_cf_time()`
  - Remove logic for checking cf compliance, which is now handled by `_has_cf_compliant_time()`
  - Fix incorrect start date for decoded time coordinates by forming them using offsets and reference dates, instead of reference date as the start point and a fixed `pd.date_range()`
    - Using `pd.date_range()` incorrectly assumes no gaps/missing data and that coordinate points started at the beginning of the month. It also did not handle calendar types correctly (e.g,. leap years), and would reset offsets at the beginning of the month or year if they weren't already.
  - Add decoding of time bounds
- Add utility function `_split_time_units_attr()` for splitting "units" attribute into units and reference date strings
- Update docstrings of methods
- Update test fixtures for correctness and readable syntax
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants