Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Losing top level name attribute when saving and then reopening using h5netcdf #9293

Closed
5 tasks done
melonora opened this issue Jul 30, 2024 · 15 comments · Fixed by #9298
Closed
5 tasks done

Losing top level name attribute when saving and then reopening using h5netcdf #9293

melonora opened this issue Jul 30, 2024 · 15 comments · Fixed by #9298
Labels
topic-DataTree Related to the implementation of a DataTree class topic-documentation

Comments

@melonora
Copy link

melonora commented Jul 30, 2024

What happened?

I have a DataTree that looks like this before saving with to_netcdf and engine set to h5netcdf:

<xarray.DataTree 'my_container'>
Group: /
│   Dimensions:           (y: 2, x: 2)
│   Dimensions without coordinates: y, x
│   Data variables:
│       latitude_series   (y, x) float64 32B ...
│       longitude_series  (y, x) float64 32B ...
└── Group: /temperature_dataset
        Dimensions:            (y: 2, x: 2, date: 2)
        Coordinates:
          * date               (date) datetime64[ns] 16B 2020-01-01 2020-01-02
            day_in_d           (date) int32 8B ...
        Dimensions without coordinates: y, x
        Data variables:
            temperatures_in_K  (x, y, date) float64 64B ...
        Attributes:
            name:               my_temperature
            latitude_in_deg:    my_latitude
            longitude_in_deg:   my_longitude
            reference_date:     2020-01-01
            conversion_factor:  1000.0

When reopening with open_datatree using the same engine I lose the name my_container of the root datatree.

What did you expect to happen?

I would expect the name of the root datatree to be conserved when reopening the datatree.

Minimal Complete Verifiable Example

import xarray as xr
from xarray.core.datatree import DataTree
from xarray.backends.api import open_datatree
import pandas as pd
import numpy as np

datatree = DataTree(name="my_container")
temperature_dataset_dict = {'attrs': {'name': 'my_temperature', 'latitude_in_deg': 'my_latitude', 'longitude_in_deg': 'my_longitude', 'reference_date': '2020-01-01', 'conversion_factor': 1000.0}, 'coords': {'date': {'data': pd.to_datetime(['2020-01-01', '2020-01-02']), 'dims': 'date'}, 'day_in_d': {'data': [0, 1], 'dims': 'date'}}, 'dims': {'x': 2, 'y': 2, 'date': 2}, 'data_vars': {'temperatures_in_K': {'data': np.array([[[0., 1.],
        [2., 3.]],
       [[4., 5.],
        [6., 7.]]]), 'dims': ['x', 'y', 'date']}}}
dataset = xr.Dataset.from_dict(temperature_dataset_dict)
datatree["temperature_dataset"] = DataTree(data=dataset)

latitude_dict = {'name': 'my_latitude', 'data': np.array([[1., 2.],
       [3., 4.]]), 'dims': ['y', 'x']}
longitude_dict = {'name': 'my_longitude', 'data': np.array([[5., 6.],
       [7., 8.]]), 'dims': ['y', 'x']}
datatree["latitude_series"] = xr.DataArray.from_dict(d=latitude_dict)
datatree["longitude_series"] = xr.DataArray.from_dict(d=longitude_dict)

datatree.to_netcdf("output.nc", engine="h5netcdf")
datatree = open_datatree("output.nc", engine="h5netcdf")
print(datatree)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

python 3.10
xarray 2024.7.0
netcdf4 1.7.1
h5netcdf 1.3.0

@melonora melonora added bug needs triage Issue that has not been reviewed by xarray team member labels Jul 30, 2024
Copy link

welcome bot commented Jul 30, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas TomNicholas added topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Jul 30, 2024
@kmuehlbauer
Copy link
Contributor

@melonora Thanks for raising. Short question, does it work with engine netcdf4 ?

@TomNicholas
Copy link
Member

@melonora thanks for the clear issue! Note if I do ncdump -hc output.nc I see

netcdf output {
dimensions:
	y = 2 ;
	x = 2 ;
variables:
	double latitude_series(y, x) ;
		latitude_series:_FillValue = NaN ;
	double longitude_series(y, x) ;
		longitude_series:_FillValue = NaN ;

group: temperature_dataset {
  dimensions:
  	x = 2 ;
  	y = 2 ;
  	date = 2 ;
  variables:
  	double temperatures_in_K(x, y, date) ;
  		temperatures_in_K:_FillValue = NaN ;
  		string temperatures_in_K:coordinates = "day_in_d" ;
  	int64 date(date) ;
  		string date:units = "days since 2020-01-01 00:00:00" ;
  		string date:calendar = "proleptic_gregorian" ;
  	int64 day_in_d(date) ;

  // group attributes:
  		string :name = "my_temperature" ;
  		string :latitude_in_deg = "my_latitude" ;
  		string :longitude_in_deg = "my_longitude" ;
  		string :reference_date = "2020-01-01" ;
  		:conversion_factor = 1000. ;
  } // group temperature_dataset
}

which doesn't have my_container as a name for the root group. So the open_datatree is working correctly, the information is being dropped during the saving to disk.

@kmuehlbauer
Copy link
Contributor

@melonora @TomNicholas I think there is no way to implement this, as the standard name of the root group (at least in the HDF5/NETCDF4 world) is /.

See also the HDF5 documentation on groups. Citing from that document:

Every HDF5 file has a single root group, with the name /.

Maybe properly documenting that is enough?

@TomNicholas
Copy link
Member

Okay thanks @kmuehlbauer! I thought there was a reason I didn't already do this.

Maybe properly documenting that is enough?

Yes. Also if you really really wanted to preserve that information I guess you could an extra level at the root, so your my_container group became effectively {un-named_root}/my_container.

@TomNicholas
Copy link
Member

I think this is not possible in Zarr either, as the v3 spec says:

The root node has a path of /.

@kmuehlbauer
Copy link
Contributor

Yes. Also if you really really wanted to preserve that information I guess you could an extra level at the root, so your my_container group became effectively {un-named_root}/my_container.

Yes, that would work, too. I think if the user needs that kind of metadata it should go in attrs anyway. Relying on implementation details of underlying file formats might not be the best solution.

@kmuehlbauer
Copy link
Contributor

@TomNicholas Can you point me to the doc part where this information should be added? I'm going to add this now.

@melonora
Copy link
Author

melonora commented Jul 31, 2024

Thanks, I will approach it as suggested. Would it be good to give a warning when saving that the name of the root datatree is dropped? I could open a PR for that

@TomNicholas
Copy link
Member

I'm going to add this now.

❤️

Can you point me to the doc part where this information should be added?

Either the new section for the IO page, or the DataTree.to_netcdf/to_zarr methods.

Be aware that the explanatory (non-API) docs currently live in a temporary location, and are being merged in to the main docs to be exposed publicly as part of #9033.

Would it be good to give a warning when saving that the name of the root datatree is dropped? I could open a PR for that.

Perhaps? If we do this then the warning should only trigger if the name of the parent is not None. I'm slightly wary if this idea just because even once the user knows they don't care about saving the name of the root group, they could only turn off this warning by explicitly setting the name to None, which seems like a waste of people's time. What do you both think?

@kmuehlbauer
Copy link
Contributor

Perhaps? If we do this then the warning should only trigger if the name of the parent is not None. I'm slightly wary if this idea just because even once the user knows they don't care about saving the name of the root group, they could only turn off this warning by explicitly setting the name to None, which seems like a waste of people's time. What do you both think?

I agree, that might be annoying for the majority of users. But I have also no other idea at the moment.

@kmuehlbauer
Copy link
Contributor

Doc PR in #9298

@melonora
Copy link
Author

Great, thanks for the quick reply! As for the warning, I thought more to check at write whether the group being written is the root and only in case name is then not None to give the warning. If that is a possibility

@kmuehlbauer
Copy link
Contributor

@melonora Yes, totally understand that. But users might assemble/disassemble DataTree's from/into subtrees and at some point in time just dump them to disk. Most users will just don't care and would have to reset the name to None to avoid such warning. I think that's what @TomNicholas meant above.

@melonora
Copy link
Author

I see

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class topic-documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants