Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Issues opening up datasets with multiple coord vars for the same axis mapped to single dim/independent dims (IPSL ESM files) #285

Closed
oliviermarti opened this issue Jul 28, 2022 · 11 comments · Fixed by #343
Assignees
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@oliviermarti
Copy link

oliviermarti commented Jul 28, 2022

What happened?

The IPSL Earth System Model output files has two time variables, and xcdat refuses to open them.

Developer Notes (@tomvothecoder, 8/30/22)

Dataset Info:

  • Single time dimension called time_counter
  • Two coord vars called time_counter and time_variable mapped to time_counter dimension

How xcdat handles coords:

  • xcdat is designed to handle single coord for an axis mapped to a single dim (1:1 mapping)
  • xcdat uses cf_xarray.cf["T"] to get the coord var, which breaks if there are multiple coord vars found
    • Have to use cf_xarray.cf[["T"]] to return a Dataset with all the coord vars mapped to an axis

What did you expect to happen?

No response

Minimal Complete Verifiable Example

# Test1 - Original IPSL file (not CF compliant)

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_noCF.nc'
dTS = xc.open_dataset (TS_File, add_bounds=False, use_cftime=True, decode_times=True)

# Test2 - Remove attributes referring to non-existing bounds
# This file is CF compliant (https://pumatest.nerc.ac.uk/cgi-bin/cf-checker.pl)

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF.nc'
dTS = xc.open_dataset (TS_File, add_bounds=False, use_cftime=True, decode_times=True)

# Test3 - remove attr 'coordinates: "time_centered nav_lat nav_lon"' from variable 'thetao' attributes

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc'
dTS = xc.open_dataset (TS_File, add_bounds=False, use_cftime=True, decode_times=True)

tos_year = dTS.temporal.group_average ("tos", freq="year", weighted=True)

Relevant log output

# Test 1
KeyError: "Receive multiple variables for key 'T': {'time_centered', 'time_counter'}. Expected only one. Please pass a list ['T'] instead to get all variables matching 'T'."

# Test 3
KeyError: "Receive multiple variables for key 'T': {'time_centered', 'time_counter'}. Expected only one. Please pass a list ['T'] instead to get all variables matching 'T'."

# Test 3
KeyError: "The coordinate variable 'time_counter' has no 'bounds' attr. Set the 'bounds' attr to the name of the bounds data variable."

We need to use `time_centered` and `time_centered_bounds` to compute the correct weighting. So we need to keep the `coordinates` attributes

Anything else we need to know?

No response

Environment

xcdat.version : '0.3.0'

xr.show_versions() :

INSTALLED VERSIONS

commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:03:09) [Clang 13.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 21.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.3
numpy: 1.22.4
scipy: 1.8.1
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: 1.4.1
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.6.1
distributed: 2022.6.1
matplotlib: 3.5.2
cartopy: 0.20.2
seaborn: None
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: 0.19.2
sparse: 0.13.0
setuptools: 63.1.0
pip: 22.1.2
conda: 4.13.0
pytest: None
IPython: 8.4.0
sphinx: None

@oliviermarti oliviermarti added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Jul 28, 2022
@pochedls
Copy link
Collaborator

Thanks for sharing this – we had not come across this case yet (looking at CMIP6 data). This file seems to have two time variables (with identical time values in each), but I could imagine that some netCDF files would have two distinct time axes.

I see that this issue results because xcdat is expecting one time axis and has logic built around checking/decoding a single time axis, but this file has time_counter and time_centered. Ideally, we could find all time axes and decode them iteratively. One concern is this might effect how we use cfxarray – we may generally expect a single dimension variable.

This is going to need a closer look.

@oliviermarti
Copy link
Author

Thank you for your interest. Note that we could specify the name of the right time axis at the call of open_dataset. However, if different variables in the file use different time axis, we are screwed with this method.
I'm not sure that IPSL file format is a very good one. But we have Tb of them ... !!

@rljacob
Copy link

rljacob commented Jul 29, 2022

By specifying the name when you open the dataset, you could at least work with all the variables that share that axis. Then open a new session for the variables with the other time axis.

@pochedls
Copy link
Collaborator

I like @rljacob's idea (at least as a possible stopgap solution). I think you could potentially implement this by taking in an optional variable in xcdat.open_[mf]dataset, determining the .dims (or maybe .coords) associated with that variable` and dropping any dim/coord not associated with the variable.

I am hitting a similar error opening CESM2 LE data (which signals this issue may affect E3SM files, too).

import xcdat

fns = '/p/user_pub/climate_work/pochedley1/cesm2le/TREFHT/*1301.020*'
ds = xcdat.open_mfdataset(fns)

AttributeError: cf_xarray can't wrap attribute 'dims' because there are multiple values for 'longitude'. There is no unique mapping from 'longitude' to a value in 'dims'.

As a temporary solution, I am opening this data with xarray. xcdat basically uses xr.open_[mv]dataset, but adds bounds (if they aren't there) and decodes some non-cf-compliant time axes to cftime arrays (if they are non-cf compliant). So xarray open dataset functionality is a reasonable workaround for many datasets.

@jypeter
Copy link

jypeter commented Aug 1, 2022

@pochedls is there a way you can make the Receive multiple variables for key 'T': {'time_centered', 'time_counter'}. Expected only one error more specific, by specifying where the different time coordinates come from. This could help the user figure out what is wrong. In the present case, I imagine one of the time axes is found thanks to the axis:T attribute, and the other thanks to the coordinates variable attribute

Also, I think adding something in the documentation about the logic of finding the axes and how the coordinates attribute is handled could be useful. I'm guessing you want to follow the CF Convention as much as possible (#292)

Clarifying how you handle the coordinates variable attribute in different cases would probably be useful, but I confess I'm not even sure how this attribute works. I have quickly checked the CF web site and found references to the coordinates attribute in

@tomvothecoder tomvothecoder changed the title [Bug]: xcdat does not open IPSL ESM files [Bug]: xcdat does not open files with multiple time variables (IPSL ESM files) Aug 3, 2022
@pochedls
Copy link
Collaborator

pochedls commented Aug 9, 2022

I looked into this a little further and have some ideas to address the issue in Test 1 and Test 2.

The immediate problem is when opening the dataset and getting the time axis in the _has_cf_compliant_time routine via ds.cf["T"], which returns:

KeyError: "Receive multiple variables for key 'T': {'time_centered', 'time_counter'}. Expected only one. Please pass a list ['T'] instead to get all variables matching 'T'."

This occurs because the dataset has multiple time axes. There are a couple potentially simple solutions to this:

  • Test for this issue specifically (e.g., here), print a warning, and return a dataset opened with xarray
    • This would likely fix most cases for opening the dataset, but this would still be problematic for datasets with non-cf_compliant time axes and there would likely still be issues in temporal operations (e.g., here), which expects one time axis.
  • pass in a list, e.g., ds.cf[["T"]] instead of ds.cf["T"]. You could check if one of the time coordinates is a dimension by checking this intersection: list(set(ds.cf[["T"]].coords) & set(ds.cf[["T"]].dims)). If one axis satisfies this criteria, you could then choose this axis (e.g., updating the selection here) within _has_cf_compliant_time.
    • This would likely cover most cases in opening datasets, but non-cf_compliant data would still have problems (you'd hit similar issues here) and you'd likely need to do something similar for temporal operations (as above). This would also be a problem if there were multiple temporal coordinate dimensions (e.g., a NetCDF with two variables and two different time axes).

A more complete solution might look like:

  1. In open_*dataset functions: if decode_times=True then get all temporal coordinates via time_axis_list = ds.cf[["T"]].coords.
  2. Loop over each temporal coordinate and test whether it is cf_compliant (you would need to re-write _has_cf_compliant_time to accept a DataArray rather than a data path). Decode non-cf_compliant axes within the loop and update the original dataset with the decoded axes and gather the cf_compliant DataArrays (and their attributes) into two dictionaries (e.g., variable_dict, attrs_dict).
    • Note that decoding non-cf_compliant axes here would require refactoring decode_non_cf_time to accept an axis DataArray. You would also need to write a new function to replace get_bounds to get the bounds from a specific axis (rather than by key).
  3. Decode the cf_compliant DataArrays using xr.conventions.cf_decoder(variable_dict, attrs_dict). Then update the original dataset with the decoded axes.

I think this would decode all time axes, which I think is what we should do (if possible). I do think temporal operations will still need to be updated to be able to select the appropriate time coordinate (e.g., by selecting the time coordinate associated with a specified data_var).

One note that might simplify the main logic above (at Step 2). Our version of _get_cftime_coords appears to work on both cf_compliant and non-cf_compliant data so a revised version of decode_non_cf_time (e.g., replaced with decode_time) might simply be applied to every time axis (which would save putting the cf_compliant data into a DataArray and having xarray decode). One reason we might not do this is that xarray may have more robust functionality to decode temporal data.

@pochedls
Copy link
Collaborator

pochedls commented Aug 9, 2022

In looking at Test 3, I see another issue here. The time axis needed for temporal operations is not actually assigned as a dimension. I thought this could be fixed by dropping time_counter and just using time_centered: this would get rid of the multiple time axes which is the issue for Test 1 and 2 and would also be the appropriate coordinate for temporal operations:

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc'
ds = xr.open_dataset(TS_File, decode_times=False)
ds['time_counter'] = ds['time_centered']
ds = ds.drop_vars('time_centered')
ds = ds.rename({'time_counter': 'time_centered'})
ds.to_netcdf('test.nc', 'w', 'NETCDF3_CLASSIC')

ds = xc.open_dataset('test.nc')
ds.temporal.group_average('thetao', freq='year', weighted=True)

But this yields a different error:

AttributeError: 'DataArray' object has no attribute 'time'

@tomvothecoder - do you think this is a separate issue in the temporal averaging logic?

@tomvothecoder
Copy link
Collaborator

In looking at Test 3, I see another issue here. The time axis needed for temporal operations is not actually assigned as a dimension. I thought this could be fixed by dropping time_counter and just using time_centered: this would get rid of the multiple time axes which is the issue for Test 1 and 2 and would also be the appropriate coordinate for temporal operations:

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc[](https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc)'
ds = xr.open_dataset(TS_File, decode_times=False)
ds['time_counter'] = ds['time_centered']
ds = ds.drop_vars('time_centered')
ds = ds.rename({'time_counter': 'time_centered'})
ds.to_netcdf('test.nc', 'w', 'NETCDF3_CLASSIC')

ds = xc.open_dataset('test.nc')
ds.temporal.group_average('thetao', freq='year', weighted=True)

But this yields a different error:

AttributeError: 'DataArray' object has no attribute 'time'

@tomvothecoder - do you think this is a separate issue in the temporal averaging logic?

Yes, this issue is related to the TemporalAccessor class methods incorrectly performing static references to "time" as the name of the time dimension.

I opened up a separate issue and the related PR that fixes this:

@tomvothecoder tomvothecoder changed the title [Bug]: xcdat does not open files with multiple time variables (IPSL ESM files) [Bug]: Issues opening up datasets with multiple coordinates mapped to a single dimension (IPSL ESM files) Aug 30, 2022
@tomvothecoder tomvothecoder changed the title [Bug]: Issues opening up datasets with multiple coordinates mapped to a single dimension (IPSL ESM files) [Bug]: Issues opening up datasets with multiple coordinates mapped to single/multiple dimension (IPSL ESM files) Aug 30, 2022
@tomvothecoder tomvothecoder changed the title [Bug]: Issues opening up datasets with multiple coordinates mapped to single/multiple dimension (IPSL ESM files) [Bug]: Issues opening up datasets with multiple coordinates mapped to single dim/multiple dims (IPSL ESM files) Aug 30, 2022
@tomvothecoder tomvothecoder changed the title [Bug]: Issues opening up datasets with multiple coordinates mapped to single dim/multiple dims (IPSL ESM files) [Bug]: Issues opening up datasets with multiple coord vars for the same axis mapped to single dim/independent dims (IPSL ESM files) Aug 30, 2022
@pochedls
Copy link
Collaborator

I just wanted to note that I hit this error message when using the regridder (use the code in this issue with one of the two datasets below).

KeyError: "Receive multiple variables for key 'longitude': {'longitude', 'lon'}. Expected only one. Please pass a list ['longitude'] instead to get all variables matching 'longitude'."

* /p/css03/esgf_publish/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/SImon/siconc/gn/v20200218/
* /p/css03/esgf_publish/CMIP6/CMIP/BCC/BCC-ESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/

It's possible that efforts to address this issue (e.g., here) will fix the regridding issue.

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Nov 10, 2022

Hi @oliviermarti and @jypeter,

Thank you for your patience as we worked to solve this issue. I just wanted to inform you that that this issue is now resolved in xcdat=0.4.0 (#343). You should be able to add bounds, decode times, and perform xcdat temporal operations now.
Give it a shot and let us know how it goes!

Example code:

import numpy as np,
import xarray as xr

import xcdat as xc

print(xc.__version)
# 0.4.0

TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc'
dTS = xc.open_dataset(TS_File, add_bounds=True, decode_times=True)

print(dTS)
# <xarray.Dataset>
# Dimensions:               (y: 149, x: 182, olevel: 31, time_counter: 12,
#                            axis_nbounds: 2, bnds: 2)
# Coordinates:
#   * olevel                (olevel) float32 5.0 15.0 25.0 ... 4.75e+03 5.25e+03
#   * time_counter          (time_counter) object 1850-01-16 12:00:00 ... 1850-...
# Dimensions without coordinates: y, x, axis_nbounds, bnds
# Data variables:
#     nav_lat               (y, x) float32 ...
#     nav_lon               (y, x) float32 ...
#     thetao                (time_counter, olevel, y, x) float32 ...
#     time_centered         (time_counter) float64 ...
#     time_centered_bounds  (time_counter, axis_nbounds) float64 ...
#     time_counter_bnds     (time_counter, bnds) object 1850-01-01 18:00:00 ......
#     olevel_bnds           (olevel, bnds) float32 -0.000237 10.0 ... 5.5e+03
# Attributes:
#     name:                            CM6-SW-pi-Par01_1m_18500101_18501231_grid_T
#     description:                     Created by xios
#     title:                           Created by xios
#     Conventions:                     CF-1.6
#     timeStamp:                       2021-Oct-22 08:27:34 GMT
#     uuid:                            bf661a39-5d4a-4999-aaae-56f09c3123c5
#     LongName:                        IPSLCM6.1.10-LR
#     NCO:                             netCDF Operators version 4.9.9 (Homepage...
#     _NCProperties:                   version=2,netcdf=4.7.4,hdf5=1.10.6
#     history:                         Thu Jul 28 16:24:47 2022: ncatted -a coo...
#     DODS_EXTRA.Unlimited_Dimension:  time_counter

tos_year = dTS.temporal.group_average("thetao", freq="year", weighted=True)

# <xarray.Dataset>
# Dimensions:       (y: 149, x: 182, olevel: 31, bnds: 2, time_counter: 1)
# Coordinates:
#   * olevel        (olevel) float32 5.0 15.0 25.0 ... 4.25e+03 4.75e+03 5.25e+03
#   * time_counter  (time_counter) object 1850-01-01 00:00:00
# Dimensions without coordinates: y, x, bnds
# Data variables:
#     nav_lat       (y, x) float32 ...
#     nav_lon       (y, x) float32 ...
#     olevel_bnds   (olevel, bnds) float32 -0.000237 10.0 10.0 ... 5e+03 5.5e+03
#     thetao        (time_counter, olevel, y, x) float64 nan nan nan ... nan nan
# Attributes:
#     name:                            CM6-SW-pi-Par01_1m_18500101_18501231_grid_T
#     description:                     Created by xios
#     title:                           Created by xios
#     Conventions:                     CF-1.6
#     timeStamp:                       2021-Oct-22 08:27:34 GMT
#     uuid:                            bf661a39-5d4a-4999-aaae-56f09c3123c5
#     LongName:                        IPSLCM6.1.10-LR
#     NCO:                             netCDF Operators version 4.9.9 (Homepage...
#     _NCProperties:                   version=2,netcdf=4.7.4,hdf5=1.10.6
#     history:                         Thu Jul 28 16:24:47 2022: ncatted -a coo...
#     DODS_EXTRA.Unlimited_Dimension:  time_counter

In your example file:

  • time_counter is the axis dimension.
  • time_counter has two sets of coordinates:
    1. time_counter (dimension coordinate variable) -- no bounds are found, xcdat adds them (time_counter_bnds)
    2. time_centered (non-dimension coordinate variable, aka "auxillary") -- has time_centered_bounds

What changed in xcdat=0.4.0

  • xcdat now loops over all time coordinates and adds bounds if they doesn't exist.
  • xcdat decodes all time coordinates and bounds.
  • xcdat operations use the dimension coordinate variable and the bounds related to the data variable you pass in.
    • In this case, time_counter and time_counter_bnds

@oliviermarti
Copy link
Author

oliviermarti commented Nov 23, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants