Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DATA REQUEST] Add COSIMA Panantarctic / GFDL_OM4 Builder & Data #175

Open
4 of 5 tasks
anton-seaice opened this issue Jun 25, 2024 · 75 comments · May be fixed by #258
Open
4 of 5 tasks

[DATA REQUEST] Add COSIMA Panantarctic / GFDL_OM4 Builder & Data #175

anton-seaice opened this issue Jun 25, 2024 · 75 comments · May be fixed by #258
Assignees
Labels
data request Add data to the catalog

Comments

@anton-seaice
Copy link
Collaborator

anton-seaice commented Jun 25, 2024

Description of the data product

<Please replace this text with a description of the data product to add to the ACCESS-NRI catalog. What data does it contain? What format is it in? Who is it useful for?>

Location of the data product on Gadi

Checklist

Add a "x" between the brackets to all that apply

  • This data product is stable (unlikely to change substantially or move)
  • This data product is of use to the broader community
  • This data product is documented: link
  • This data product is licensed under
  • Those who want to access this data can be added to the project that houses it
@anton-seaice anton-seaice added the data request Add data to the catalog label Jun 25, 2024
@anton-seaice
Copy link
Collaborator Author

Following on from COSIMA/cosima-recipes#369 , I am suggesting maybe adding OM4_025.JRA_RYF to the intake catalog.

@dougiesquire - As this is a different model configuration, I guess this would require a new datastore "builder", so maybe its not worth the effort? The runs are used in cosima recipes to show examples of handling MOM6 data.

@adele-morrison - Are their companion runs to OM4_025.JRA_RYF which also should be added? Can you help with the "Description of the data product" and "Location of the data product on Gadi" sections, and then I will edit the original post please?

@anton-seaice
Copy link
Collaborator Author

I tried using the access-om3 builder, and got these errors when using builder.parser:

{'INVALID_ASSET': '/g/data/ik11/outputs/mom6-om4-025/OM4_025.JRA_RYF/output000/19000101.ice_daily.nc', 'TRACEBACK': 'Traceback (most recent call last):\n File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/access_nri_intake/source/builders.py", line 329, in parser\n raise ParserError(f"Cannot determine realm for file {file}")\naccess_nri_intake.source.builders.ParserError: Cannot determine realm for file /g/data/ik11/outputs/mom6-om4-025/OM4_025.JRA_RYF/output000/19000101.ice_daily.nc\n'}

{'INVALID_ASSET': '/g/data/ik11/outputs/mom6-om4-025/OM4_025.JRA_RYF/output000/19000101.ocean_daily.nc', 'TRACEBACK': 'Traceback (most recent call last):\n File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/access_nri_intake/source/builders.py", line 329, in parser\n raise ParserError(f"Cannot determine realm for file {file}")\naccess_nri_intake.source.builders.ParserError: Cannot determine realm for file /g/data/ik11/outputs/mom6-om4-025/OM4_025.JRA_RYF/output000/19000101.ocean_daily.nc\n'}

@dougiesquire
Copy link
Collaborator

Ah, yet another permutation of file naming. It might be safest just to write a dedicated builder, which is straightforward. I guess it would be an Om4Builder?

Is this output structured in a similar way to the regional MOM6 output? If so, it may be worth thinking about writing a builder that handles both?

@adele-morrison
Copy link

Apologies for being slow. Yes, lets add the panan experiments to Intake. We'd still to like delete a bunch of the daily data for the 1/20th panan, is that ok to do after it's added to Intake? After that frees up space on ol01 ideally I'd also like to move the 1/10th panan from ik11 to ol01. But the current locations are as follows:
/g/data/ol01/outputs/mom6-panan/panant-0025-zstar-ACCESSyr2/
and
/g/data/ol01/outputs/mom6-panan/panant-005-zstar-ACCESSyr2/
and
/g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/

@dougiesquire
Copy link
Collaborator

We'd still to like delete a bunch of the daily data for the 1/20th panan, is that ok to do after it's added to Intake?

I think if we know this is going to happen then it would be better to wait until it is done. We can get a Builder set up and ready to go though.

@marc-white
Copy link
Collaborator

@anton-seaice could you please add the precise location(s) of the data on Gadi?

@anton-seaice
Copy link
Collaborator Author

@adele-morrison is more on top of it than I am ? Noting the comments above about possibly moving it.

/g/data/ik11/outputs/mom6-om4-025/OM4_025.JRA_RYF seem to be appropriate to get some sample data if thats what you are after ?

@adele-morrison
Copy link

Yes that’s the right location. Would be great to get this in the catalog so we can keep switching all the COSIMA recipes over. What do you need in terms of documentation?

@adele-morrison
Copy link

I don’t think there’s any plans to move OM4_025.JRA_RYF. The panan data location is still in flux. I will try to keep that moving forward.

@marc-white
Copy link
Collaborator

OK, I'll start taking a look at the current data structure and builders to see what needs to happen to get these data ingested. Stay tuned...

@marc-white
Copy link
Collaborator

The filenames all look pretty coherent, but there's a couple of things I haven't been able to work out on my own:

  • What is the 'static' frequency, e.g., 19000101.ocean_static.nc? I'm assuming this is some sort of snapshot - should this file be ingested?
  • There are 'scalar' versions of some files, e.g., 19000101.ocean_annual.nc and 19000101.ocean_scalar_annual.nc. Again, what do these represent, and should they be ingested?

@minghangli-uni
Copy link

What is the 'static' frequency, e.g., 19000101.ocean_static.nc

It contains fields that do not change in frequency, such as grid-related data. It is saved once per run.

19000101.ocean_annual.nc

contains annually-averaged 2d fields

19000101.ocean_scalar_annual.nc

contains annually-averaged 0d fields

@anton-seaice
Copy link
Collaborator Author

I think we want all of those files - there is a frequency = 'fx' for the static files which exists in OM2 and OM3 datastores (and maybe others)

@marc-white
Copy link
Collaborator

Ah yes, I've found the fx frequency down in the utils package - I might variable that out so it's clearer

@marc-white
Copy link
Collaborator

Dumping this here so I can find it later (for building workable test data): https://stackoverflow.com/questions/15141563/python-netcdf-making-a-copy-of-all-variables-and-attributes-but-one

@marc-white
Copy link
Collaborator

I now have what I think is a functional AccessOm4Builder that works on /g/data/ik11/outputs/mom6-om4-025/OM4_025.JRA_RYF. Are there some other data locations that I should be attacked as a check?

@dougiesquire
Copy link
Collaborator

@marc-white, we definitely don't want to call this AccessOm4Builder. The "OM4"
data at /g/data/ik11/outputs/mom6-om4-025/OM4_025.JRA_RYF is from GFDL OM4 (I think - @adele-morrison can you confirm?), not an ACCESS model.

I'd suggest seeing if the data mentioned in this comment can use the same builder. If so, then we could possibly call the builder Mom6Builder

@marc-white
Copy link
Collaborator

/g/data/ol01/outputs/mom6-panan/panant-0025-zstar-ACCESSyr2/
and
/g/data/ol01/outputs/mom6-panan/panant-005-zstar-ACCESSyr2/
and
/g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/

I've updated the Builder to be able to read the filenames found in these directories. However, I've come across an interesting conundrum whilst trying to test the resulting catalog; the data in those three directories are, when ingested in to the catalog, pretty much identical, to the point where I can't figure out how to, say, get only the data from 0025-zstar (without resorting to the obvious solution of building a catalog only from that directory). This is causing me to have issues in forming a Dask array, because the catalog doesn't understand how to merge the files (I think it is ending up with three 'layers' of the same time series, and bombs out).

For the uninitiated like myself, what is the difference between these three runs, and how can I differentiate between them in an intake/access-nri-intake way?

@dougiesquire
Copy link
Collaborator

@marc-white, each of the experiments should be separate intake-esm datastores within the catalog.

@marc-white
Copy link
Collaborator

HI @anton-seaice and @adele-morrison , I'm now at the point where I'm ready to try an all-up ingest of the data. However, the metadata.yaml for OM4_025.JRA_RYF is incomplete, and doesn't exist for the mom6-panan datasets. Could you please add one for each dataset? Instructions are here: https://access-nri-intake-catalog.readthedocs.io/en/latest/management/building.html#metadata-yaml-files

@adele-morrison
Copy link

adele-morrison commented Aug 30, 2024

I've updated the metadata.yaml for OM4_025.JRA_RYF. I think @AndyHoggANU ran it, so some of the entries are currently just me guessing what the simulation is.

We're not quite ready to add the panan simulations ending in zstar-ACCESSyr2 to Intake yet (as above), because we still need to delete a bunch of that data and shift the 1/10th deg to ol01.

But we could add /g/data/ik11/outputs/mom6-panan/panant-01-zstar-v13 and panant-01-hycom1-v13 to Intake now.

@adele-morrison
Copy link

@AndyHoggANU any chance you want to create the metdata.yamls for panant-01-zstar-v13 and panant-01-hycom1-v13? Or @julia-neme perhaps you could do the panant-01-zstar-v13 one? That's what you used in your paper right?

@adele-morrison
Copy link

I've confirmed with @AndyHoggANU and metadata.yaml for OM4_025.JRA_RYF is good to go.

@marc-white
Copy link
Collaborator

I'll escalate this to the NCI help desk to see if they have any ideas of what's going on.

@marc-white
Copy link
Collaborator

@rbeucher turns out the issue was the openmpi module needs explicit loading in the build script - not sure how we escaped without that before.

@marc-white
Copy link
Collaborator

marc-white commented Oct 22, 2024

We're getting there! Just one more request of @julia-neme and @AndyHoggANU - please ensure your experiment metadata.yaml files contain a model description (Andy, I know at least one of yours already does). This should be a list of models the experiment used, e.g., based on the description of OM4_025.JRA_RYF, a suitable entry may be:

model:
- MOM6
- SIS2

(I know that model isn't listed as required, but it turns out it is if the data doesn't contain this information - see #223 )

Also, could I please confirm that this list of experiments that we want to ingest is the correct one (I know there's a lot of back-and-forth above about this):

  • /g/data/ik11/outputs/mom6-om4-025/OM4_025.JRA_RYF
  • /g/data/ik11/outputs/mom6-panan/panant-01-hycom1-v13
  • /g/data/ik11/outputs/mom6-panan/panant-01-zstar-v13

@adele-morrison
Copy link

@marc-white confirming that your list of 3 experiments above to ingest is correct. Thanks!

@AndyHoggANU
Copy link

OK, I added the model flag to the OM4 and hycom1 case. @julia-neme will need to do it for zstar.

@marc-white
Copy link
Collaborator

@julia-neme have you been able to make this update for the zstar data?

@julia-neme
Copy link

Yes, done now. Sorry I was on leave!

@marc-white
Copy link
Collaborator

@charles-turner-1 I'm having some troubles on the branch for this issue with the tests (see here, specifically tests/test_builders::test_parse_access_ncfile).

Each of the new MOM6 test files is missing a coordinate(s) when doing the intake-esm-to-direct values comparison. I've had a look at a couple of the examples via PDB, and it seems like intake-esm is dropping the coords that are marked as only being the edges of data variables (e.g., z_i is being dropped from 19000101.ocean_annual_z.nc by intake-esm). Is this something you encountered in #222 or #232 ?

@charles-turner-1
Copy link
Collaborator

charles-turner-1 commented Nov 14, 2024

Interesting - I don't remember this being an issue in access-nri-intake-catalog, but I do vaguely recall something similar to this happening with the tests in intake-esm. If you look here, I think it was necessary to request all variables known to be in the dataset when requesting coords.

I think that the varname parameter was being populated somewhere with variables indexed by the esm-datastore in our use case. Plausibly theses edges coords are being dropped here for reasons related to this?

I think the call to open_esm_datastore should pass all the variables that the parser has found in by ingesting variable column name. My guess is that for some reason the parser didn't index all of these & so intake-esm is throwing them away?

@marc-white
Copy link
Collaborator

@charles-turner-1 having stepped through the code line-by-line, I can see where, e.g., xTe and yTe are being dropped from 19000101.ice_daily: intake_esm/source.py. The offending line is:

ds = ds[variables]

Before this line, xTe and yTe still appear in the ds printout:

Pdb) ds.coords
Coordinates:
  * xT       (xT) float64 8B 0.0
  * xTe      (xTe) float64 8B 0.0
  * yT       (yT) float64 8B 0.0
  * yTe      (yTe) float64 8B 0.0
  * time     (time) object 3kB 1900-01-01 12:00:00 ... 1900-12-31 12:00:00
  * nv       (nv) float64 16B 1.0 2.0

And then, after that line:

(Pdb) ds
<xarray.Dataset> Size: 20kB
Dimensions:     (time: 365, yT: 1, xT: 1, nv: 2)
Coordinates:
  * xT          (xT) float64 8B 0.0
  * yT          (yT) float64 8B 0.0
  * time        (time) object 3kB 1900-01-01 12:00:00 ... 1900-12-31 12:00:00
  * nv          (nv) float64 16B 1.0 2.0

I'm presuming the ds=ds[variables] line is down-selecting to only those coordinates that appear to be relevant to the variables requested. xTe and yTe are not listed as being axes for any of the variables, but they are edges for a couple of them:

(Pdb) ds.variables
Frozen({'xT': <xarray.IndexVariable 'xT' (xT: 1)> Size: 8B
array([0.])
Attributes:
    domain_decomposition:  [   1 1440    1 1440]
    units:                 degrees_E
    long_name:             T point nominal longitude
    axis:                  X
    edges:                 xTe, 'xTe': <xarray.IndexVariable 'xTe' (xTe: 1)> Size: 8B
array([0.])
Attributes:
    units:      degrees_E
    long_name:  T-cell edge nominal longitude
    axis:       X, 'yT': <xarray.IndexVariable 'yT' (yT: 1)> Size: 8B
array([0.])
Attributes:
    domain_decomposition:  [   1 1080    1 1080]
    units:                 degrees_N
    long_name:             T point nominal latitude
    axis:                  Y
    edges:                 yTe, 'yTe': <xarray.IndexVariable 'yTe' (yTe: 1)> Size: 8B

The question here is, should these edges values exist in the catalog?

@charles-turner-1
Copy link
Collaborator

Okay, couple of things I can think of here.

  1. It looks to me like xT, xTe, yT, and yTe are all scalar variables. I know that there's some additional logic that intake-esm performs related to scalar variables, just before dropping variables that haven't been requested - are you able to check whether using an array that's got two coordinate values in the xT, xTe, yT, and yTe dimensions stops intake-esm dropping them?
  2. Even though intake-esm is dropping scalar variables here, this logic should be repeated by the test. It's strange that it isn't - is the test manipulating xTe and yTe in any way at all?

I think that in the context of the test, xTe and yTe shouldn't be contained in the final output. However, they should still make it into the catalog - the test assumes that we've requested a variable & filters for that accordingly - the replicated logic being applied to xr_ds is from the if requested variables: ... block in intake_esm.source._open_dataset.

I think we might want to add additional tests which don't apply this logic?

@marc-white
Copy link
Collaborator

  1. I'll try manipulating the test files and giving that a go.
  2. The tests shouldn't be manipulating anything - it's simply your comparison test from Correctness checks for parse_access_ncfile #232 with new files added.

@charles-turner-1
Copy link
Collaborator

charles-turner-1 commented Nov 17, 2024

Sorry, could have been clearer: I'd be curious to see whether xTe and yTe are altered by the test specifically here (starred lines):

1201    xr_ds = xr.open_dataset(file, **xarray_open_kwargs)
1202
1203 *   scalar_variables = [v for v in xr_ds.data_vars if len(xr_ds[v].dims) == 0]
1204 *   xr_ds = xr_ds.set_coords(scalar_variables)
1205 
1206   xr_ds = xr_ds[expected.variable]

@marc-white
Copy link
Collaborator

I've already removed my stack-tracing so I can have a go at de-scalaring the scalar variables, but I didn't think so? From memory, the reason I dived down into the intake-esm code was because I couldn't find a line in the test itself where I lost those variables. I can double-check that later though.

@marc-white
Copy link
Collaborator

Sorry, could have been clearer: I'd be curious to see whether xTe and yTe are altered by the test specifically here (starred lines):

1201 xr_ds = xr.open_dataset(file, **xarray_open_kwargs)
1202
1203 * scalar_variables = [v for v in xr_ds.data_vars if len(xr_ds[v].dims) == 0]
1204 * xr_ds = xr_ds.set_coords(scalar_variables)
1205
1206 xr_ds = xr_ds[expected.variable]

I think you might have this backwards - xr_ds (which is the 'direct open' of the relevant file) contains the missing xTe and yTe coordinates, while ie_ds (the intake-esm open of the file) does not. A pdb inspection of xr_ds before your highlighted lines confirms that xTe and yTe are already there:

(Pdb) xr_ds
<xarray.Dataset> Size: 20kB
Dimensions:     (xT: 1, xTe: 1, yT: 1, yTe: 1, time: 365, nv: 2)
Coordinates:
  * xT          (xT) float64 8B 0.0
  * xTe         (xTe) float64 8B 0.0
  * yT          (yT) float64 8B 0.0
  * yTe         (yTe) float64 8B 0.0
  * time        (time) object 3kB 1900-01-01 12:00:00 ... 1900-12-31 12:00:00
  * nv          (nv) float64 16B 1.0 2.0
Data variables:
    siconc      (time, yT, xT) float32 1kB dask.array<chunksize=(365, 1, 1), meta=np.ndarray>
    sithick     (time, yT, xT) float32 1kB dask.array<chunksize=(365, 1, 1), meta=np.ndarray>
    average_T1  (time) datetime64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    average_T2  (time) datetime64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    average_DT  (time) timedelta64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    time_bnds   (time, nv) timedelta64[ns] 6kB dask.array<chunksize=(365, 2), meta=np.ndarray>

Interestingly, xTe isn't actually scalar - it's a single-element array:

(Pdb) xr_ds["xTe"]
<xarray.DataArray 'xTe' (xTe: 1)> Size: 8B
array([0.])
Coordinates:
  * xTe      (xTe) float64 8B 0.0
Attributes:
    units:      degrees_E
    long_name:  T-cell edge nominal longitude
    axis:       X

This is the case both before and after your starred lines.

@charles-turner-1
Copy link
Collaborator

Sorry, could have been clearer: I'd be curious to see whether xTe and yTe are altered by the test specifically here (starred lines):
1201 xr_ds = xr.open_dataset(file, **xarray_open_kwargs)
1202
1203 * scalar_variables = [v for v in xr_ds.data_vars if len(xr_ds[v].dims) == 0]
1204 * xr_ds = xr_ds.set_coords(scalar_variables)
1205
1206 xr_ds = xr_ds[expected.variable]

I think you might have this backwards - xr_ds (which is the 'direct open' of the relevant file) contains the missing xTe and yTe coordinates, while ie_ds (the intake-esm open of the file) does not. A pdb inspection of xr_ds before your highlighted lines confirms that xTe and yTe are already there:

Yeah, I think this is what I would expect - am I correctly understanding that the subsequent operations being applied to xr_ds aren't dropping these variables then?

(Pdb) xr_ds
<xarray.Dataset> Size: 20kB
Dimensions:     (xT: 1, xTe: 1, yT: 1, yTe: 1, time: 365, nv: 2)
Coordinates:
  * xT          (xT) float64 8B 0.0
  * xTe         (xTe) float64 8B 0.0
  * yT          (yT) float64 8B 0.0
  * yTe         (yTe) float64 8B 0.0
  * time        (time) object 3kB 1900-01-01 12:00:00 ... 1900-12-31 12:00:00
  * nv          (nv) float64 16B 1.0 2.0
Data variables:
    siconc      (time, yT, xT) float32 1kB dask.array<chunksize=(365, 1, 1), meta=np.ndarray>
    sithick     (time, yT, xT) float32 1kB dask.array<chunksize=(365, 1, 1), meta=np.ndarray>
    average_T1  (time) datetime64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    average_T2  (time) datetime64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    average_DT  (time) timedelta64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    time_bnds   (time, nv) timedelta64[ns] 6kB dask.array<chunksize=(365, 2), meta=np.ndarray>

Interestingly, xTe isn't actually scalar - it's a single-element array:

I think xarray stores scalars internally as single element arrays, & then handles scalars by checking for the dimensions that these variables depend on? I figure this is to make the internal data structures consistent - eg. this issue.

(Pdb) xr_ds["xTe"]
<xarray.DataArray 'xTe' (xTe: 1)> Size: 8B
array([0.])
Coordinates:
  * xTe      (xTe) float64 8B 0.0
Attributes:
    units:      degrees_E
    long_name:  T-cell edge nominal longitude
    axis:       X

This is the case both before and after your starred lines.

I don't think the operations in these lines would do anything to the xTe, yTe variables themselves - instead, they're just being set to coordinates. My understanding of how ds[variables] then works is that it will retain all the data variables in variables, all the coordinates that those data variables depend on, and any coordinate variables explicitly requested.

I think I've misunderstood the source of the error: are xTe and yTe being retained by

  1. ie_ds and not xr_ds?
  2. xr_ds and not ie_ds?

If I'm understanding correctly now and it's case 2., what have you requested as variables? eg. in your test parametrisation, what is in VARS here?

_AccessNCFileInfo(
path = None, # type: ignore,
...
variable = VARS,
...

@marc-white
Copy link
Collaborator

  • ie_ds and not xr_ds?

  • xr_ds and not ie_ds?

@charles-turner-1 it's case #2:

(Pdb) ie_ds
<xarray.Dataset> Size: 20kB
Dimensions:     (time: 365, yT: 1, xT: 1, nv: 2)
Coordinates:
  * xT          (xT) float64 8B 0.0
  * yT          (yT) float64 8B 0.0
  * time        (time) object 3kB 1900-01-01 12:00:00 ... 1900-12-31 12:00:00
  * nv          (nv) float64 16B 1.0 2.0
Data variables:
    average_DT  (time) timedelta64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    siconc      (time, yT, xT) float32 1kB dask.array<chunksize=(365, 1, 1), meta=np.ndarray>
    average_T1  (time) datetime64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    time_bnds   (time, nv) timedelta64[ns] 6kB dask.array<chunksize=(365, 2), meta=np.ndarray>
    average_T2  (time) datetime64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    sithick     (time, yT, xT) float32 1kB dask.array<chunksize=(365, 1, 1), meta=np.ndarray>
Attributes:
    file_format:      NETCDF4
    NumFilesInSet:    1
    title:            #
    grid_type:        regular
    grid_tile:        N/A
    intake_esm_vars:  ['average_DT', 'siconc', 'average_T1', 'time_bnds', 'av...
(Pdb) xr_ds
<xarray.Dataset> Size: 20kB
Dimensions:     (xT: 1, xTe: 1, yT: 1, yTe: 1, time: 365, nv: 2)
Coordinates:
  * xT          (xT) float64 8B 0.0
  * xTe         (xTe) float64 8B 0.0
  * yT          (yT) float64 8B 0.0
  * yTe         (yTe) float64 8B 0.0
  * time        (time) object 3kB 1900-01-01 12:00:00 ... 1900-12-31 12:00:00
  * nv          (nv) float64 16B 1.0 2.0
Data variables:
    siconc      (time, yT, xT) float32 1kB dask.array<chunksize=(365, 1, 1), meta=np.ndarray>
    sithick     (time, yT, xT) float32 1kB dask.array<chunksize=(365, 1, 1), meta=np.ndarray>
    average_T1  (time) datetime64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    average_T2  (time) datetime64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    average_DT  (time) timedelta64[ns] 3kB dask.array<chunksize=(365,), meta=np.ndarray>
    time_bnds   (time, nv) timedelta64[ns] 6kB dask.array<chunksize=(365, 2), meta=np.ndarray>
Attributes:
    file_format:    NETCDF4
    NumFilesInSet:  1
    title:          #
    grid_type:      regular
    grid_tile:      N/A

The test parametrization is as follows (I worked this out by adding what was required to make the first half of the test pass):

(
            builders.Mom6Builder,
            "mom6/output000/19000101.ice_daily.nc",
            _AccessNCFileInfo(
                path=None,  # type: ignore
                filename="19000101.ice_daily.nc",
                file_id="XXXXXXXX_ice_daily",
                filename_timestamp="19000101",
                frequency="subhr",
                start_date="1900-01-01, 00:00:00",
                end_date="1900-01-01, 00:00:00",
                variable=[
                    "xT",
                    "xTe",
                    "yT",
                    "yTe",
                    "time",
                    "nv",
                    "siconc",
                    "sithick",
                    "average_T1",
                    "average_T2",
                    "average_DT",
                    "time_bnds",
                ],
                variable_long_name=[
                    "T point nominal longitude",
                    "T-cell edge nominal longitude",
                    "T point nominal latitude",
                    "T-cell edge nominal latitude",
                    "time",
                    "vertex number",
                    "ice concentration",
                    "ice thickness",
                    "Start time for average period",
                    "End time for average period",
                    "Length of average period",
                    "time axis boundaries",
                ],
                variable_standard_name=[
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                ],
                variable_cell_methods=[
                    "",
                    "",
                    "",
                    "",
                    "",
                    "",
                    "time: mean",
                    "time: mean",
                    "",
                    "",
                    "",
                    "",
                ],
                variable_units=[
                    "degrees_E",
                    "degrees_E",
                    "degrees_N",
                    "degrees_N",
                    "days since 1900-01-01 00:00:00",
                    "",
                    "0-1",
                    "m-ice",
                    "days since 1900-01-01 00:00:00",
                    "days since 1900-01-01 00:00:00",
                    "days",
                    "days",
                ],
            ),
        ),

This should be all committed in the 175 branch.

@charles-turner-1
Copy link
Collaborator

Cool, I'll check out the branch & see if I can figure out whats up.

@charles-turner-1
Copy link
Collaborator

@marc-white What version of intake-esm are you using to test against? I started digging into the tests - only to find they all mysteriously started passing.

In tests/conftest.py, I added some functionality which added xfails to these tests - but I hadn't written it as a regex so it didn't capture the additional cases by default. Unfortunately, lots of the test files don't have coordinate variables that are missed without coordinate discovery enabled, & so we can't easily use a regex to sort this.

Purely my cockup here - I really should have made reference to the conftest.py in that test - I've added it in the updated tests. Sorry for the holdup

@charles-turner-1
Copy link
Collaborator

charles-turner-1 commented Nov 18, 2024

I've pushed a commit which fixes the test failures to the head of 175 & tests are all passing now.

If you can see a way of improving the way these dynamic xfails are handled, I think that would be a great shout - I'd rather not cause any more of these painful issues if we can avoid it. Once we get a new release of intake-esm out it shouldn't be an issue, but in the interim I can see this causing some more issues if we're not careful.

@marc-white
Copy link
Collaborator

marc-white commented Nov 18, 2024

Ah right, so we are actually expecting to see unused coordinates thrown away. Gotcha.

I think there is a wider thing to consider here though. From what I can tell, the coordinates that have been thrown away in this instance are 'edge' coordinates. They seem to define the edge behaviour of the main coordinates. However, those coordinates are used. Therefore, should these 'second-order' coordinates be something we should be preserving?

(Also, looks like I'm currently on intake-esm=2023.11.10.)

@charles-turner-1
Copy link
Collaborator

charles-turner-1 commented Nov 18, 2024

I think there is a wider thing to consider here though. From what I can tell, the coordinates that have been thrown away in this instance are 'edge' coordinates. They seem to define the edge behaviour of the main coordinates. However, those coordinates are used. Therefore, should these 'second-order' coordinates be something we should be preserving?

Yeah, I think we probably do want to keep these hanging around. Are you able to open an issue about this on intake-esm - I'll start looking into it more closely once I get done with this E2E test.

(Also, looks like I'm currently on intake-esm=2023.11.10.)

Cool, so I think if you were to create a new environment & install the access-nri-intake-catalog package, then pip install git+https://github.com/intake/intake-esm && XFAILS=0 pytest tests should give you 320 passed tests, whereas pip install intake-esm && XFAILS=1 pytest tests should give you 313 passed tests & 7 xfailed.

Once we get a new release of intake-esm out, we should be able to get rid of the xfails completely. What this means in practical terms for the point above is that if the edges / second-order coordinates have been erroneously chucked away, the user will at least be able to directly search for them.

But to answer the question more directly, yes, I think the desired behaviour is probably that there should be some sort of 'coordinate tree' traversal, rather than just going up a single level & including those coordinates.

@marc-white
Copy link
Collaborator

Interesting, I'm now getting two test failures on the Gadi access-med-0.6 environment that I don't get on my laptop from the MOM6 data, despite having the same intake/intake-esm. I might generate a PR just to see what the CI comes up with...

@marc-white marc-white linked a pull request Nov 18, 2024 that will close this issue
@marc-white marc-white linked a pull request Nov 18, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data request Add data to the catalog
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

9 participants