
to_netcdf with decoded time can create file with inconsistent time:units and time_bounds:units #2921

Closed
klindsay28 opened this issue Apr 25, 2019 · 4 comments · Fixed by #2965


@klindsay28

Code Sample, a copy-pastable example if possible

import numpy as np
import xarray as xr

# create time and time_bounds DataArrays for Jan-1850 and Feb-1850
time_bounds_vals = np.array([[0.0, 31.0], [31.0, 59.0]])
time_vals = time_bounds_vals.mean(axis=1)

time_var = xr.DataArray(time_vals, dims='time',
                        coords={'time':time_vals})
time_bounds_var = xr.DataArray(time_bounds_vals, dims=('time', 'd2'),
                               coords={'time':time_vals})

# create Dataset of time and time_bounds
ds = xr.Dataset(coords={'time':time_var}, data_vars={'time_bounds':time_bounds_var})
ds.time.attrs = {'bounds':'time_bounds', 'calendar':'noleap',
                 'units':'days since 1850-01-01'} 

# write Jan-1850 values to file 
ds.isel(time=slice(0,1)).to_netcdf('Jan-1850.nc', unlimited_dims='time')

# write Feb-1850 values to file
ds.isel(time=slice(1,2)).to_netcdf('Feb-1850.nc', unlimited_dims='time')

# use open_mfdataset to read in files, combining into 1 Dataset
decode_times = True
decode_cf = True
ds = xr.open_mfdataset(['Jan-1850.nc', 'Feb-1850.nc'],
                       decode_cf=decode_cf, decode_times=decode_times)

# write combined Dataset out
ds.to_netcdf('JanFeb-1850.nc', unlimited_dims='time')

Problem description

The above code initially creates 2 netCDF files, for Jan-1850 and Feb-1850, that have the variables time and time_bounds, and time:bounds='time_bounds'. It then reads the 2 files back in as a single Dataset, using open_mfdataset, and this Dataset is written back out to a netCDF file. ncdump of this final file is

netcdf JanFeb-1850 {
dimensions:
	time = UNLIMITED ; // (2 currently)
	d2 = 2 ;
variables:
	int64 time(time) ;
		time:bounds = "time_bounds" ;
		time:units = "hours since 1850-01-16 12:00:00.000000" ;
		time:calendar = "noleap" ;
	double time_bounds(time, d2) ;
		time_bounds:_FillValue = NaN ;
		time_bounds:units = "days since 1850-01-01" ;
		time_bounds:calendar = "noleap" ;
data:

 time = 0, 708 ;

 time_bounds =
  0, 31,
  31, 59 ;
}

The problem is that the units attributes of time and time_bounds differ in this file, contrary to what the CF conventions require.

The final call to to_netcdf creates a file in which time's units (and type) differ from what they are in the intermediate files. These transformations are not applied to time_bounds.

While the change to time's type is not necessarily an issue, I do find it surprising.

This inconsistency goes away if either decode_times or decode_cf is set to False in the Python code above; in that case, the transformations to time's units and type do not happen.

The inconsistency also goes away if open_mfdataset opens a single file. Here, too, the transformations to time's units and type do not happen.

I think that the desired behavior is to either not apply the units and type transformations to time, or to also apply them to time_bounds. The first option would be consistent with the current single-file behavior.
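The CF rule being violated here can be expressed as a simple check. Below is a minimal sketch using plain attribute dicts; the helper name and the list of shared attributes are illustrative, not part of xarray's API:

```python
# Sketch of the CF consistency rule for boundary variables: attributes
# that affect interpretation of the values must agree exactly with the
# parent coordinate. Helper name is made up for illustration.
CF_SHARED_ATTRS = ("units", "calendar", "leap_month", "leap_year", "month_lengths")

def bounds_attrs_consistent(coord_attrs, bounds_attrs):
    """True if every shared attribute present on the bounds variable
    matches the coordinate variable exactly."""
    return all(
        coord_attrs.get(key) == value
        for key, value in bounds_attrs.items()
        if key in CF_SHARED_ATTRS
    )

# The attributes from the ncdump output above:
time_attrs = {"units": "hours since 1850-01-16 12:00:00.000000",
              "calendar": "noleap"}
time_bounds_attrs = {"units": "days since 1850-01-01", "calendar": "noleap"}
print(bounds_attrs_consistent(time_attrs, time_bounds_attrs))  # False: units disagree
```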

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.12.62-60.64.8-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.5
distributed: 1.26.1
matplotlib: 3.0.3
cartopy: None
seaborn: None
setuptools: 40.8.0
pip: 19.0.3
conda: None
pytest: 4.3.1
IPython: 7.4.0
sphinx: None

@klindsay28
Author

It looks like ds.time.encoding is not getting set when open_mfdataset is opening multiple files. I suspect that this is leading to the surprising unit for time when the ds is written out. The code below demonstrates that ds.time.encoding is set by open_mfdataset in the single-file case and is not set in the multi-file case. However, ds.time_bounds.encoding is set in both the single- and multi-file cases.

The possibility of this is alluded to in a comment in #2436, which relates the issue to #1614.

import numpy as np
import xarray as xr

# create time and time_bounds DataArrays for Jan-1850 and Feb-1850
time_bounds_vals = np.array([[0.0, 31.0], [31.0, 59.0]])
time_vals = time_bounds_vals.mean(axis=1)

time_var = xr.DataArray(time_vals, dims='time',
                        coords={'time':time_vals})
time_bounds_var = xr.DataArray(time_bounds_vals, dims=('time', 'd2'),
                               coords={'time':time_vals})

# create Dataset of time and time_bounds
ds = xr.Dataset(coords={'time':time_var}, data_vars={'time_bounds':time_bounds_var})
ds.time.attrs = {'bounds':'time_bounds', 'calendar':'noleap',
                 'units':'days since 1850-01-01'}

# write Jan-1850 values to file
ds.isel(time=slice(0,1)).to_netcdf('Jan-1850.nc', unlimited_dims='time')

# write Feb-1850 values to file
ds.isel(time=slice(1,2)).to_netcdf('Feb-1850.nc', unlimited_dims='time')

# use open_mfdataset to read in files, combining into 1 Dataset
decode_times = True
decode_cf = True
ds = xr.open_mfdataset(['Jan-1850.nc'],
                       decode_cf=decode_cf, decode_times=decode_times)

print('time and time_bounds encoding, single-file open_mfdataset')
print(ds.time.encoding)
print(ds.time_bounds.encoding)

# use open_mfdataset to read in files, combining into 1 Dataset
decode_times = True
decode_cf = True
ds = xr.open_mfdataset(['Jan-1850.nc', 'Feb-1850.nc'],
                       decode_cf=decode_cf, decode_times=decode_times)

print('--------------------')
print('time and time_bounds encoding, multi-file open_mfdataset')
print(ds.time.encoding)
print(ds.time_bounds.encoding)

produces

time and time_bounds encoding, single-file open_mfdataset
{'zlib': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (512,), 'source': '/gpfs/fs1/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/Jan-1850.nc', 'original_shape': (1,), 'dtype': dtype('float64'), '_FillValue': nan, 'units': 'days since 1850-01-01', 'calendar': 'noleap'}
{'zlib': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 2), 'source': '/gpfs/fs1/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/Jan-1850.nc', 'original_shape': (1, 2), 'dtype': dtype('float64'), '_FillValue': nan, 'units': 'days since 1850-01-01', 'calendar': 'noleap'}
--------------------
time and time_bounds encoding, multi-file open_mfdataset
{}
{'zlib': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 2), 'source': '/gpfs/fs1/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/Jan-1850.nc', 'original_shape': (1, 2), 'dtype': dtype('float64'), '_FillValue': nan, 'units': 'days since 1850-01-01', 'calendar': 'noleap'}

@dcherian
Contributor

dcherian commented May 7, 2019

Thanks for the great report @klindsay28.

Looks like CF recommends that time_bounds (boundary variable) not have the units or calendar attributes:

Since a boundary variable is considered to be part of a coordinate variable’s metadata, it is not necessary to provide it with attributes such as long_name and units.

Boundary variable attributes which determine the coordinate type (units, standard_name, axis and positive) or those which affect the interpretation of the array values (units, calendar, leap_month, leap_year and month_lengths) must always agree exactly with the same attributes of its associated coordinate, scalar coordinate or auxiliary coordinate variable. To avoid duplication, however, it is recommended that these are not provided to a boundary variable.

We already have special treatment for time_bounds in the decode step. It makes sense to treat it specially in the encode step. In fact it looks like @spencerkclark identified this issue in #2571 👏
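A minimal sketch of what dropping these attributes from bounds variables on encode could look like, using plain dicts (the function name, attribute list, and data layout are illustrative assumptions; the real change would live in xarray's conventions/encoding code):

```python
# Hypothetical sketch: strip CF-duplicated attributes from any variable
# that is referenced by another variable's "bounds" attribute, before
# writing. Operates on plain name -> attrs dicts for illustration.
DROP_FROM_BOUNDS = (
    "units", "calendar", "leap_month", "leap_year", "month_lengths",
    "standard_name", "axis", "positive",
)

def strip_bounds_attrs(all_attrs):
    """all_attrs: dict mapping variable name -> attribute dict.
    Returns a new mapping with duplicated attrs removed from bounds vars."""
    bounds_vars = {
        attrs["bounds"] for attrs in all_attrs.values() if "bounds" in attrs
    }
    return {
        name: (
            {k: v for k, v in attrs.items() if k not in DROP_FROM_BOUNDS}
            if name in bounds_vars
            else dict(attrs)
        )
        for name, attrs in all_attrs.items()
    }

attrs = {
    "time": {"bounds": "time_bounds", "units": "days since 1850-01-01",
             "calendar": "noleap"},
    "time_bounds": {"units": "days since 1850-01-01", "calendar": "noleap"},
}
cleaned = strip_bounds_attrs(attrs)
print(cleaned["time_bounds"])  # {} -- duplicated attrs removed
print(cleaned["time"])         # the coordinate itself is untouched
```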

@klindsay28 Any interest in putting together a PR that would avoid setting these attributes on time_bounds during the encode step?

Ping @fmaussion for feedback.

@spencerkclark
Member

We already have special treatment for time_bounds in the decode step. It makes sense to treat it specially in the encode step.

+1 I didn't fully appreciate how strict the CF conventions were regarding this at the time of #2571. Where it is unambiguous I agree we should make an effort to comply (or preserve compliance).

@mathause
Collaborator

Today @lukasbrunner and I ran into this problem. Opening an mfdataset and saving it with to_netcdf led to the following units for time and time_bnds:

time_bnds:units = "days since 1850-01-01" ;
time:units = "days since 2000-01-01" ;

Opening the dataset in xarray works fine, but ncdump and ncview use the units from time for time_bnds. Thus if I do

ncdump -tv time_bnds test.nc 

the first date is '2150-01-01' (instead of '2000-01-01'), which is very confusing. (Panoply shows the correct time_bnds.)
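The roughly 150-year shift follows directly from the epoch mismatch: ncdump applies time's units ("days since 2000-01-01") to values that were written relative to 1850. A small sketch with the standard library (proleptic Gregorian rather than noleap, so the exact day can differ slightly):

```python
from datetime import datetime, timedelta

true_epoch = datetime(1850, 1, 1)    # time_bnds:units epoch
wrong_epoch = datetime(2000, 1, 1)   # time:units epoch, applied by ncdump

# A bounds value representing 2000-01-01 when decoded against the
# correct epoch:
value_days = (datetime(2000, 1, 1) - true_epoch).days

correct = true_epoch + timedelta(days=value_days)   # 2000-01-01
misread = wrong_epoch + timedelta(days=value_days)  # ~150 years later, near 2150

print(correct.date(), misread.date())
```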

Workaround

import xarray as xr
filenames = ['file1.nc', 'file2.nc']

ds = xr.open_mfdataset(filenames)
ds.load()

# make sure the encoding is really empty
assert not ds.time.encoding

# assign encoding, such that they are equal
ds.time.encoding.update(ds.time_bnds.encoding)

# save
ds.to_netcdf('~/F/test.nc')

Btw. thanks to @klindsay28 for the nice error report.

dcherian added a commit to dcherian/xarray that referenced this issue May 15, 2019
Fixes pydata#2921

1. Removes certain attributes from bounds variables on encode.
2. open_mfdataset: Sets encoding on variables based on encoding in first file.
dcherian added a commit that referenced this issue Jun 25, 2019
* Don't set attributes on bounds variables.

Fixes #2921

1. Removes certain attributes from bounds variables on encode.
2. open_mfdataset: Sets encoding on variables based on encoding in first file.

* remove whitespace stuff.

* Make sure variable exists in first file before assigning encoding

* Make sure we iterate over coords too.

* lint fix.

* docs/comment fixes.

* mfdataset encoding test.

* time_bounds attrs test + allow for slight CF non-compliance.

* I need to deal with encoding!

* minor fixes.

* another minor fix.

* review fixes.

* lint fixes.

* Remove encoding changes and xfail test.

* Update whats-new.rst
tsjackson-noaa added a commit to tsjackson-noaa/MDTF-diagnostics that referenced this issue Jan 18, 2021
Change output format from NETCDF4 to NETCDF4_CLASSIC (for backwards
compatibility; *believe* but haven't tested that this still permits
writing >2Gb output files).

If variable has a time axis, make it unlimited. Set units on the time
bounds variable, if present, to address issue noted in
pydata/xarray#2921 (comment)
st-bender added a commit to st-bender/xarray that referenced this issue Apr 10, 2024
Issue pydata#2921 is about mismatching time units between a time variable and
its "bounds" companion.
However, pydata#2965 does more than fixing pydata#2921, it removes all double
attributes from "bounds" variables which has the undesired side effect
that there is currently no way to save them to netcdf with xarray.
Since the mentioned link is a recommendation and not a hard requirement
for CF compliance, these attributes should be left to the caller to
prepare the dataset variables appropriately if required.
Reduces the amount of surprise that attributes are not written to disk
and fixes pydata#8368.