Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure attributes on coordinate variables are preserved during round-tripping #154

Merged
merged 5 commits into from
Jun 24, 2024

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Jun 23, 2024

@TomNicholas TomNicholas added the bug Something isn't working label Jun 23, 2024
@@ -322,7 +321,7 @@ def separate_coords(
# use workaround to avoid creating IndexVariables described here https://github.com/pydata/xarray/pull/8107#discussion_r1311214263
if len(var.dims) == 1:
dim1d, *_ = var.dims
coord_vars[name] = (dim1d, var.data)
coord_vars[name] = (dim1d, var.data, var.attrs)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this fixes the initial issue described in #155 (i.e. the lat and lon coordinate variables having their attrs get dropped) and makes the new test in this PR pass.

However it seems to introduce a new failure involving decoding the time variable:

        # assert equal to original dataset
>       xrt.assert_identical(roundtrip, ds)
E       AssertionError: Left and right Dataset objects are not identical
E       
E       Differing coordinates:
E       L * time     (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
E       R * time     (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
E           Differing variable attributes:
E               units: hours since 1800-01-01
E               calendar: standard

@jsignell wondering if you have any idea what's going on here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I set decode_times=True then I get

>       xrt.assert_identical(roundtrip, ds)
E       AssertionError: Left and right Dataset objects are not identical
E       
E       Differing coordinates:
E       L * time     (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
E       R * time     (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...

which seem extremely close, similar to in #122 (comment).

Copy link
Contributor

@jsignell jsignell Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I am understanding this properly:

  • if you don't set decode_times then the roundtripped version ends up as a datetime[ns] and the original version ends up with floats
  • if you do set decode_times=True then the both the roundtripped and the original end up as datetime[ns] but with a floating point error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I just tested a little theory. I think this has to do with the default attrs that xarray adds to the ds on load. When you read the saved version back in using xarray.open_dataset you end up with datetime[ns] dtype as well:

diff --git a/virtualizarr/tests/test_integration.py b/virtualizarr/tests/test_integration.py
index b758be6..0b370d5 100644
--- a/virtualizarr/tests/test_integration.py
+++ b/virtualizarr/tests/test_integration.py
@@ -52,6 +52,9 @@ class TestKerchunkRoundtrip:
         # save it to disk as netCDF (in temporary directory)
         ds.to_netcdf(f"{tmpdir}/air.nc")
 
+        # read it back in
+        original = xr.open_dataset(f"{tmpdir}/air.nc")
+
         # use open_dataset_via_kerchunk to read it as references
         vds = open_virtual_dataset(f"{tmpdir}/air.nc", indexes={})
 
@@ -62,12 +65,18 @@ class TestKerchunkRoundtrip:
         roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk")
 
         # assert equal to original dataset
-        xrt.assert_identical(roundtrip, ds)
+        xrt.assert_identical(roundtrip, original)
 
     def test_kerchunk_roundtrip_concat(self, tmpdir, format):
         # set up example xarray dataset
         ds = xr.tutorial.open_dataset("air_temperature", decode_times=False)
 
+        # save it to disk as netCDF (in temporary directory)
+        ds.to_netcdf(f"{tmpdir}/air.nc")
+
+        # read it back in
+        original = xr.open_dataset(f"{tmpdir}/air.nc")
+
         # split into two datasets
         ds1, ds2 = ds.isel(time=slice(None, 1460)), ds.isel(time=slice(1460, None))
 
@@ -89,7 +98,7 @@ class TestKerchunkRoundtrip:
         roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk")
 
         # assert equal to original dataset
-        xrt.assert_identical(roundtrip, ds)
+        xrt.assert_identical(roundtrip, original)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I just realised that if I make sure to set decode_times=False in both the original

# set up example xarray dataset
ds = xr.tutorial.open_dataset("air_temperature", decode_times=False)

and the roundtrip

# use fsspec to read the dataset from disk via the zarr store
roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk", decode_times=False)

calls then all the tests pass on this PR, even with assert_identical. So perhaps this is a red herring and what we should do is merge this PR with decode_times=False then rebase #112 and work on decode_times=True there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that makes sense to me. It seems like xarray might determine whether or not to decode times based on the attrs that are defined and since some attrs get added on load it might bump the version that you store in netcdf over the boundary.

@@ -322,7 +321,7 @@ def separate_coords(
# use workaround to avoid creating IndexVariables described here https://github.com/pydata/xarray/pull/8107#discussion_r1311214263
if len(var.dims) == 1:
dim1d, *_ = var.dims
coord_vars[name] = (dim1d, var.data)
coord_vars[name] = (dim1d, var.data, var.attrs)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets passed into the constructor of xr.Coordinates, which calls xr.as_variable.

@TomNicholas TomNicholas mentioned this pull request Jun 23, 2024
4 tasks
@TomNicholas TomNicholas changed the title Ensure roundtripping tests check attrs Ensure variable-level attributes are preserved during round-tripping Jun 24, 2024
@TomNicholas TomNicholas changed the title Ensure variable-level attributes are preserved during round-tripping Ensure attributes on coordinate variables are preserved during round-tripping Jun 24, 2024
@TomNicholas TomNicholas merged commit ef2429a into main Jun 24, 2024
7 checks passed
@TomNicholas TomNicholas deleted the roundtrip_variable_attrs branch June 24, 2024 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bugs with preserving variable-level attrs
2 participants