Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making coordinate variables searchable #212

Open
wants to merge 3 commits into
base: 186-return-type
Choose a base branch
from

Conversation

charles-turner-1
Copy link
Contributor

@charles-turner-1 charles-turner-1 commented Oct 3, 2024

Functionality changes in this branch:

  • source.builders.BaseBuilder.parse_access_ncfile indexes all variables (for var in ds.data_vars => for var in ds.variables), rather than just data variables, & tests have been updated to reflect additional data obtained.
    • NOTE: This is dependent on changes to intake-esm proposed by @dougiesquire in isssue 660.
    • Coordinate variables are accessible through the variable flag in the access-nri intake catalogue, rather than a separate coord flag, as suggested in Making coordinate variables searchable #201 by @anton-seaice.
      • When searching for coordinate variables only, only coordinate variables are returned. No concatenation of assets is performed - only the first asset in the dataset is. This avoids crashing resulting from xr.combine_by_coords.
      • When searching for coordinate & data_variables, the data & coordinate variables searched, as well as coordinate variables that the searched data variables depend on.
      • Other searches will return all data & coordinate variables, as before.

Refactoring

  • Updated source.builders.BaseBuilder to use a dataclass (source.utils._AccessNCFileInfo) to pass around information about variables, rather than an outputs tuple.
  • Added a _DataVarInfo dataclass to hold attribute information from data variables, rather than appending them all into separate lists passed to the outputs tuple.

Minor changes

  • Add a bunch of type checking (type annotations, added mypy to pre-commit, etc)
  • Replaced clauses of the pattern
    try:
        x = x[key]
    except:
       pass
    with
    x = x.get(key,x)
    
  • A couple of minor docstring updates

Notes

intake-esm

The code changes in this branch should lead to catalogues being built with coordinate variables included in their variable flag: eg

from access_nri_intake.source.builders import AccessOm2Builder

builder = AccessOm2Builder(
    path="/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091"
)

builder.build()

CATALOG_PATH = ...

builder.save(
    directory='CATALOG_PATH',
    name="ACCESS-OM2_01deg_jra55v13_ryf9091", 
    description="...",
)

will build a catalog which coordinate variables are searchable from. However, unless you're working with a version of intake-esm which also has coordinate variables searchable (alterations noted in Functionality changes), this will present two issues:

  1. Files comprising only coordinate variables, eg. ocean_grid.nc files, will not be indexed.
  2. Searchs purely for a coordinate variable will fail when calling .to_dask() will fail on coordinate indexing if a multi file dataset is obtained from the search, or return an empty dataset if a single file is returned.

Issue 1. appears to be ESMDataSource._open_dataset thinking the .nc files are empty unless they contain coordinate variables & ignoring them.
Issue 2. is the result of the requested_variables used in intake-esm.

Neither of these is testable or fixable in access-nri-intake-catalog, as these issues are handled by intake-esm.

PR size

This PR contains a couple of code changes, as well as some other stuff (typing/pre-commit, etc). I'm happy to split this apart into separate pull requests if that makes things more tractable to review.

@charles-turner-1 charles-turner-1 linked an issue Oct 3, 2024 that may be closed by this pull request
Copy link

codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.26%. Comparing base (f5644e7) to head (ac97823).

Additional details and impacted files
@@               Coverage Diff                @@
##           186-return-type     #212   +/-   ##
================================================
  Coverage            97.26%   97.26%           
================================================
  Files                    9        9           
  Lines                  657      657           
================================================
  Hits                   639      639           
  Misses                  18       18           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@charles-turner-1
Copy link
Contributor Author

I've been through and linked a bunch of issues that I think the changes here resolve - about to verify.

Copy link
Collaborator

@marc-white marc-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but this PR doesn't actually do anything to read the coordinate variables itself (compared to the read in the main version of the code currently does) - it's simply been prepped to handle the coordinates once intake-esm gets updated?

for var in ds.data_vars:
dvars = _DataVarInfo()

for var in ds.variables:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behaviour - scans all variables in dataset, including coordinate variables

@charles-turner-1
Copy link
Contributor Author

Not precisely - it does additionally scan .nc files for coordinates (see above), but crucially also requires the esm_datastore object to do the same. If we have the same behaviour in the esm_datastore (defined in intake-esm) there this PR won't affect the variables we read.

@charles-turner-1
Copy link
Contributor Author

charles-turner-1 commented Oct 11, 2024

I've checked, and the changes in this PR don't resolve #117 or #192.

In both instances, the issues specifically result from attempted concatenation of grid variables, rather than coordinate variables. Grid variables are stored as data variables in the netCDF files that hold them, but are don't depend on time & so the time variable is dropped before concatenation, which is the source of the 'no dimension to concatenate along' errors.

I've changed this PR & the concomitant PR on intake-esm to draft status whilst I work out how closely related these issues are & whether we can neatly tie together all the changes we need.

Edit: The obvious resolution to the grid variable problems mentioned above would be to retain all dimensions until the final dataset is produced - currently looking into that.

Requesting time as a coordinate variable along with the desired grid also resolves the problems.

@charles-turner-1 charles-turner-1 marked this pull request as draft October 11, 2024 08:15
Append attributes to the DataVarInfo object, if the attribute has a
'long_name' key.

TODO: Why do we need a long name key? seems important
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the code using the presence of 'long_name' as a proxy for "this is a real variable, not just additional info"?

@charles-turner-1 charles-turner-1 changed the base branch from main to 186-return-type October 15, 2024 03:22
@charles-turner-1 charles-turner-1 marked this pull request as ready for review October 15, 2024 03:26
@charles-turner-1 charles-turner-1 marked this pull request as draft October 15, 2024 03:27
@charles-turner-1 charles-turner-1 marked this pull request as ready for review October 15, 2024 03:35
@charles-turner-1
Copy link
Contributor Author

I've split this into a series of stacked PR's:
#221: Adds typing, mypy type check to pre-commit. Merge into main, addresses #13.
#222: Refactors BaseBuilder.parse_access_ncfile to return a dataclass, rather than a tuple. Closes #186 and the PR merges into the branch introduced by #221
#212: (This PR) - changes data_vars => variables and updates the relevant tests to ensure we are catching all the correct coordinate variables. PR merges into the branch introduced by #222.

@charles-turner-1 charles-turner-1 linked an issue Oct 15, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants