Replies: 2 comments 1 reply
-
Basically, this seems to be a quirk in h5py, and I believe it goes down to HDF5. Given the maturity of HDF5, a flag likely needs to be provided to stop it from reusing the same file descriptor:

```python
from h5py import h5f

a = h5f.open(b'a.nc', h5f.ACC_RDONLY)
b = h5f.open(b'a.nc', h5f.ACC_RDONLY)
assert a.fileno == b.fileno
```
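For what it's worth, the same sharing appears to surface through the high-level API too. A minimal sketch (the file name is just a placeholder; `File.id` exposes the same low-level `FileID` used above):

```python
import h5py

# Two read-only handles opened on the same path: h5py/HDF5 back them
# with the same underlying file object, so the low-level filenos match.
f1 = h5py.File('a.nc', 'r')
f2 = h5py.File('a.nc', 'r')
assert f1.id.fileno == f2.id.fileno
f1.close()
f2.close()
```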
1 reply
-
My current inclination is that our proposed optimizations attempt to keep references to the datasets alive, and HDF5 is consequently keeping their cached values alive. The relevant file access property list (FAPL) documentation: https://docs.hdfgroup.org/hdf5/develop/group___f_a_p_l.html#ga034a5fc54d9b05296555544d8dd9fe89
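As a rough sketch, a file access property list can be built and handed to the low-level open call; whether any of the FAPL settings linked above actually prevents the handle sharing is the open question, and `set_fclose_degree` below is shown only as one illustrative knob, not a known fix:

```python
from h5py import h5f, h5p

# Build an explicit file access property list (FAPL) and pass it to the
# low-level open. h5py forwards these settings to HDF5; which property
# (if any) stops the file-object reuse would need to be confirmed
# against the FAPL docs linked above.
fapl = h5p.create(h5p.FILE_ACCESS)
fapl.set_fclose_degree(h5f.CLOSE_STRONG)  # illustrative candidate only
fid = h5f.open(b'a.nc', h5f.ACC_RDONLY, fapl=fapl)
```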
-
We are trying to speed up certain operations in h5netcdf in h5netcdf/h5netcdf#197. One of them seems like it would break the changes introduced in #4879.

I do not believe the issue is due to h5netcdf, but rather to how h5py (and likely HDF5) handles files internally. I adapted the problematic test and ran it with h5py version 3.7.0. While the first comparison passes its assertion, the second comparison fails because both values, a and b, point to the new data (increased by 100).
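The adapted test itself is not reproduced here, but the described behaviour can be sketched at the plain h5py level (file name, dataset name, and the exact comparisons are invented for illustration; the real test presumably goes through xarray/h5netcdf):

```python
import h5py
import numpy as np

# Create a small example file.
with h5py.File('a.nc', 'w') as f:
    f['v'] = np.arange(3.0)

# Two handles opened on the same path.
a = h5py.File('a.nc', 'r+')
b = h5py.File('a.nc', 'r+')

old = a['v'][:]
assert np.array_equal(b['v'][:], old)  # first comparison: passes

# Modify the data through one handle only.
b['v'][:] = old + 100

# Ideally 'a' would still see the old values, but because both handles
# share one underlying HDF5 file object, 'a' sees the new data too, and
# any check expecting the old values fails.
assert np.array_equal(a['v'][:], old + 100)
```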
It seems that h5py shares the same fileno. Mostly, I find the test a little odd. Is there a flag we should be supplying to h5py to force it to open a new file handle?