Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for N axis dimensions mapped to N coordinates #343

Merged
merged 13 commits into from
Nov 8, 2022

Conversation

tomvothecoder
Copy link
Collaborator

@tomvothecoder tomvothecoder commented Sep 12, 2022

Description

Related feature specification document: https://docs.google.com/document/d/1BnwNGzb-_pSUFzH5513x99Yv2IOanZlTtA7RvIF7hJA/edit#heading=h.disypbipz1ic

Overview:

  • xcdat APIs previously expected a 1:1 map between dimensions and coordinates. For example, the time dimension would be mapped to time coord var. However, datasets can have N dimensions mapped to N coordinate variables
  • Examples:
    • E3SM datasets can have an N:N map, which might have a "Z" axis with the dimensions and coords "ilev" and "lev"
    • IPSL datasets might have 1 time dimension mapped to multiple time coordinates ("coord vars" and "aux coord vars" in CF terms)
  • More information: https://docs.xarray.dev/en/stable/user-guide/data-structures.html#coordinates

Feature Updates:

  • Update bounds accessor methods to operate on the dataset or a variable within the dataset using the new var_key kwarg
  • Update add_missing_bounds() to loop over all axes and their coordinate vars and attempt to add bounds for each coordinate var if they do not exist
  • Update decoding function (decode_time()) to decode CF and non-CF compliant time coordiantes as cftime objects
    • This function is now performed lazily, which means a significant decrease in performance runtime!
  • Update dataset longitude swapping function (swap_lon_axis()) to loop over longitude coords and their bounds to swap them
  • Update spatial and temporal accessor class methods to refer to the dimension coordinate variable on the data_var being operated on, rather than the parent dataset

Bug Fixes:

  • Fix add_bounds() not ignoring 0-dim singleton coords in addition to 1-dim singleton coords

Refactor:

  • Refactor open_dataset() and open_mfdataset() to remove legacy private functions for checking CF compliance before attempting to decode
    • decode_time() now checks for CF compliance
  • Rename get_axis_coord() to get_dim_coords() and get_axis_dim() to get_dim_keys() for clarity
  • Update fixtures in fixtures.py to use cftime objects and add decode_times to generate_dataset()
  • Extract utility function _if_multidim_dask_array_then_load() and reuse in several places in the codebase, usually when manipulating bounds

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder added type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. type: enhancement New enhancement request Priority: High labels Sep 12, 2022
@tomvothecoder tomvothecoder self-assigned this Sep 12, 2022
@tomvothecoder tomvothecoder changed the title Update get_axis_coord() for phase 1 Support datasets with axes represented by N dimensions mapped to N coordinates Sep 12, 2022
@tomvothecoder tomvothecoder force-pushed the bugfix/183-285-multiple-axis-vars branch 2 times, most recently from 963cec5 to f0b8715 Compare September 15, 2022 23:24
@tomvothecoder tomvothecoder changed the title Support datasets with axes represented by N dimensions mapped to N coordinates [Bug]: Support datasets with axes represented by N dimensions mapped to N coordinates Sep 16, 2022
xcdat/axis.py Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Sep 26, 2022

Codecov Report

Base: 100.00% // Head: 100.00% // No change to project coverage 👍

Coverage data is based on head (8575a09) compared to base (31c0ad1).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #343   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           13        13           
  Lines         1202      1242   +40     
=========================================
+ Hits          1202      1242   +40     
Impacted Files Coverage Δ
xcdat/axis.py 100.00% <100.00%> (ø)
xcdat/bounds.py 100.00% <100.00%> (ø)
xcdat/dataset.py 100.00% <100.00%> (ø)
xcdat/regridder/accessor.py 100.00% <100.00%> (ø)
xcdat/regridder/grid.py 100.00% <100.00%> (ø)
xcdat/spatial.py 100.00% <100.00%> (ø)
xcdat/temporal.py 100.00% <100.00%> (ø)
xcdat/utils.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial self-review of the PR.

tests/fixtures.py Outdated Show resolved Hide resolved
tests/fixtures.py Outdated Show resolved Hide resolved
tests/fixtures.py Show resolved Hide resolved
tests/test_axis.py Outdated Show resolved Hide resolved
tests/test_axis.py Outdated Show resolved Hide resolved
xcdat/bounds.py Outdated Show resolved Hide resolved
xcdat/bounds.py Outdated Show resolved Hide resolved
xcdat/bounds.py Outdated Show resolved Hide resolved
xcdat/temporal.py Outdated Show resolved Hide resolved
xcdat/utils.py Outdated Show resolved Hide resolved
@tomvothecoder tomvothecoder force-pushed the bugfix/183-285-multiple-axis-vars branch from c0ca5ab to 29efb0c Compare September 30, 2022 22:04
@tomvothecoder tomvothecoder marked this pull request as ready for review September 30, 2022 22:33
Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pochedls, this PR is ready for review. We were spot-on with our estimate of about a month to get this ready.

Let me know if you would like me to tag others for review since this is a big PR, and I will also do a self-review. I marked areas of interest under this comment if you'd like to look over the implementation details.

Some helpful review resources:

Thanks!

xcdat/axis.py Outdated Show resolved Hide resolved
xcdat/axis.py Outdated Show resolved Hide resolved
xcdat/axis.py Outdated Show resolved Hide resolved
xcdat/axis.py Show resolved Hide resolved
xcdat/axis.py Show resolved Hide resolved
xcdat/spatial.py Show resolved Hide resolved
xcdat/spatial.py Show resolved Hide resolved
xcdat/temporal.py Outdated Show resolved Hide resolved
xcdat/temporal.py Show resolved Hide resolved
xcdat/utils.py Show resolved Hide resolved
@pochedls
Copy link
Collaborator

pochedls commented Oct 11, 2022

I am still reviewing, but I keep bumping into one major issue:

import xcdat as xc
fn = '/p/css03/esgf_publish/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/tas/gn/v20191108/tas_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc`
ds = xc.open_dataset(fn)

2022-10-11 11:38:22,958 [WARNING]: dataset.py(decode_time:333) >> height does not have a 'calendar' attribute set, so it could not be decoded. Set the 'calendar' attribute (ds.height.attrs['calendar]) and try decoding them again

KeyError Traceback (most recent call last)
File .../bounds.py:271, in BoundsAccessor.add_bounds(self, axis, width)
270 try:
--> 271 bounds_key = ds[coord.name].attrs["bounds"]
272 ds[bounds_key]

KeyError: 'bounds'
...

File .../xcdat/bounds.py:352, in BoundsAccessor._create_bounds(self, axis, coord_var, width)
319 """Creates bounds for an axis using its coordinate points.
320
321 Parameters
(...)
349 .. [2] https://cf-xarray.readthedocs.io/en/latest/generated/xarray.Dataset.cf.add_bounds.html#
350 """
351 # Validate coordinate shape.
--> 352 if coord_var.shape[0] <= 1:
353 raise ValueError(
354 f"Cannot generate bounds for coordinate variable '{coord_var.name}'"
355 " which has a length <= 1."
356 )
358 # Retrieve coordinate dimension to calculate the diffs between points.

IndexError: tuple index out of range

I think this issue is related to a singleton height coordinate in this dataset.

If I open with xarray, things work okay (either with cftime=True or False).

docs/faqs.rst Outdated Show resolved Hide resolved
docs/faqs.rst Outdated Show resolved Hide resolved
docs/index.rst Outdated
Comment on lines 46 to 76
* Optional selection of single data variable to keep in the Dataset (bounds are also
kept if they exist)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happily surprised this is re-implemented – looking forward to checking this out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This specific option has been available in open_dataset()/open_mfdataset() but not documented in the features list:

xcdat/xcdat/dataset.py

Lines 51 to 53 in 36a3539

data_var: Optional[str], optional
The key of the non-bounds data variable to keep in the Dataset,
alongside any existing bounds data variables, by default None.

I think you're referring to xcdat guessing the data variable to use in single variable datasets.

xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/bounds.py Show resolved Hide resolved
xcdat/axis.py Outdated Show resolved Hide resolved
xcdat/axis.py Outdated Show resolved Hide resolved
xcdat/axis.py Outdated Show resolved Hide resolved
@tomvothecoder tomvothecoder force-pushed the bugfix/183-285-multiple-axis-vars branch from 3c34fdb to c76f9fc Compare October 12, 2022 21:53
docs/faqs.rst Outdated Show resolved Hide resolved
xcdat/axis.py Outdated Show resolved Hide resolved
@lee1043
Copy link
Collaborator

lee1043 commented Nov 4, 2022

@pochedls great, thanks for confirming. I will work on this tomorrow or early next week.

keys = keys + obj.cf.coordinates[cf_attrs["coordinate"]]
except KeyError:
pass

Copy link
Collaborator

@pochedls pochedls Nov 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if CF_ATTR_MAP['T']['coordinate'] in obj.coords.keys():
keys = keys + [CF_ATTR_MAP[axis]['coordinate']]

This would fix this issue. I think this logic is reasonable and it seems like all unit tests still pass.

Copy link
Collaborator

@pochedls pochedls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomvothecoder - PR looks great. The lazy loading functionality you loaded goes above and beyond and is a really nice addition. Over all, I like how you cleaned up the open_*dataset and decode_time functionality.

I've verified that the current PR (with my one suggestion) passes all tests, resolves the linked issues/bugs, and spatial averaging and regridding look good.

I have one suggestion, which we can discuss if you're not a fan.

@lee1043
Copy link
Collaborator

lee1043 commented Nov 4, 2022

@tomvothecoder @pochedls - I've verified that current PR does not influence to the temporal and spatial average operations. My statistics (correlation, rmse, standard deviation, etc) that includes spatial and/or temporal average during its calculations were not changed by this PR.

xcdat/dataset.py Outdated Show resolved Hide resolved
@tomvothecoder tomvothecoder force-pushed the bugfix/183-285-multiple-axis-vars branch from 2171067 to f622a8d Compare November 7, 2022 19:27
xcdat/axis.py Outdated
Comment on lines 32 to 41
# A dictionary that maps common variable names to coordinate variables. This
# map is used as fall-back when coordinate variables don't have CF attributes
# set for ``cf_xarray`` to interpret using `CF_ATTR_MAP`.
VAR_NAME_MAP: Dict[CFAxisKey, List[str]] = {
"X": ["longitude", "lon"],
"Y": ["latitude", "lat"],
"T": ["time"],
"Z": coordinate_criteria["Z"]["standard_name"],
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mapping table addresses the issue here #343 (comment).

Copy link
Collaborator Author

@tomvothecoder tomvothecoder Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used cf_xarray's coordinate_criteria["Z"] because there are way too many to list ourselves. Let me know if there is a shorter static list we might want to use instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final self-review code updates

xcdat/bounds.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/temporal.py Outdated Show resolved Hide resolved
Overview:
  - `xcdat` APIs previously expected a 1:1 map between axis dimensions and coordinates. For example, the `time` axis dimension would be mapped to `time` coord var. However, datasets can have N dimensions mapped to N coordinate variables
  - For example, datasets that have N dims mapped to N coord vars include E3SM, which might have a "Z" axis with the dimensions and coords "ilev" and "lev"

Feature Updates:
  - Update bounds accessor methods to operate on the dataset or a variable within the dataset using the new `var_key` kwarg
  - Update `add_missing_bounds()` to loop over all axes and their coordinate vars and attempt to add bounds for each coordinate var if they do not exist
  - Update decoding function (`decode_time()`) to decode CF and non-CF compliant time coordiantes as `cftime` objects
  - Update dataset longitude swapping function (`swap_lon_axis()`) to loop over longitude coords and their bounds to swap them
- Update spatial and temporal accessor class methods to refer to the dimension coordinate variable on the `data_var` being operated on, rather than the parent dataset

Bug Fixes:
  - Fix `add_bounds()` not ignoring 0-dim singleton coords in addition to 1-dim singleton coords

Refactor:
  - Refactor `open_dataset()` and `open_mfdataset()` to remove unnecessary preprocessing functions
  - Rename `get_axis_coord()` to `get_dim_coords()` and `get_axis_dim()` to `get_dim_keys()`
  - Update fixtures in `fixtures.py` to use `cftime` objects and add `decode_times` to `generate_dataset()`
  - Extract utility function `_if_multidim_dask_array_then_load()` and reuse in several places in the codebase, usually when manipulating bounds
- This improves runtime performance, since `.values` performs type conversions to `np.array()`
- Refactor `_get_cftime_coords()` to use `xr.coding.times.decode_cf_datetime()` for decoding CF compliant time units
- Update `_is_decoded()` to check `.encoding` attributes rather than the object types in the array for performance purposes, since accessing `.values` can involve type conversion
- Refactor imports and fix docstrings
- This is required for cases where time units are different between files in multi-file datasets, but the offsets are the same. More info in the `_preprocess()` docstring
- Add raise ValueError for decoding datasets with time coords without CF attrs set
- Add `decode_time()` test for units in days
to interpret common var names
- Update tests
@tomvothecoder tomvothecoder force-pushed the bugfix/183-285-multiple-axis-vars branch from 246f30b to 6591295 Compare November 7, 2022 23:54
Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit-picking updates

xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/bounds.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
@tomvothecoder
Copy link
Collaborator Author

Thank you @pochedls for your extensive review, and @lee1043 for jumping in to test the temporal APIs.

I performed a final review and now we are good to go. I will be merging this now!

@tomvothecoder tomvothecoder changed the title [Bug]: Support datasets with axes represented by N dimensions mapped to N coordinates Support for N axis dimensions mapped to N coordinates Nov 8, 2022
@tomvothecoder tomvothecoder merged commit c6587b8 into main Nov 8, 2022
@tomvothecoder tomvothecoder deleted the bugfix/183-285-multiple-axis-vars branch November 8, 2022 00:29
@pochedls
Copy link
Collaborator

pochedls commented Nov 8, 2022

@tomvothecoder – this is an awesome PR – thanks for all of your work on it the past 2 months – it is a great advance for xcdat!

@lee1043
Copy link
Collaborator

lee1043 commented Nov 8, 2022

Excellent work @tomvothecoder and @pochedls, thank you for your hard work for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. type: docs Updates to documentation type: enhancement New enhancement request
Projects
None yet
4 participants