Efficient HDF5 chunk iteration via HDF5 1.14, h5py 3.8, and H5Dchunk_iter #286

Closed
mkitti opened this issue Jan 25, 2023 · 17 comments · Fixed by #331

Comments

@mkitti
Contributor

mkitti commented Jan 25, 2023

Currently, kerchunk iterates over HDF5 chunks by looping over a linear index passed to get_chunk_info. This uses H5Dget_chunk_info from the HDF5 C API.

kerchunk/kerchunk/hdf.py

Lines 523 to 524 in ff16c05

for index in range(num_chunks):
    blob = dsid.get_chunk_info(index)

Support for H5Dchunk_iter was recently released as part of HDF5 1.14.0:
https://docs.hdfgroup.org/hdf5/v1_14/group___h5_d.html#gac482c2386aa3aea4c44730a627a7adb8
This method iterates through all the chunks contained in a dataset, visiting each chunk once.

H5Dchunk_iter was incorporated into h5py 3.8.0 as h5py.h5d.DatasetID.chunk_iter() when used with HDF5 1.14:
h5py/h5py#2202

Support for H5Dchunk_iter is expected in HDF5 1.12.3 and 1.10.9.

@ajelenak implemented a test which checks the equivalence of that call with the current iteration method implemented here:
https://github.com/h5py/h5py/blob/d2e84badfa5e4d8095bcc5d3db81f8548c340919/h5py/tests/test_dataset.py#L1800-L1826
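
The equivalence that test checks can be sketched roughly as follows. This is a minimal illustration, not the h5py test itself: the helper names `chunks_by_index`, `chunks_by_iter`, and `same_chunks` are hypothetical, and `dsid` is assumed to be an `h5py.h5d.DatasetID` (for example `dataset.id`) from h5py 3.8 built against HDF5 1.14.

```python
def chunks_by_index(dsid):
    """Collect chunk info with one get_chunk_info call per linear index."""
    return [dsid.get_chunk_info(i) for i in range(dsid.get_num_chunks())]


def chunks_by_iter(dsid):
    """Collect chunk info in a single pass using chunk_iter."""
    found = []

    def visit(info):
        # The callback returns None, so iteration continues to the next chunk.
        found.append(info)

    dsid.chunk_iter(visit)
    return found


def same_chunks(dsid):
    """True if both methods report the same chunks, ignoring visiting order."""
    def by_offset(info):
        return tuple(info.chunk_offset)

    by_index = sorted(chunks_by_index(dsid), key=by_offset)
    by_iter = sorted(chunks_by_iter(dsid), key=by_offset)
    return by_index == by_iter
```

With a real file this would be called as `same_chunks(h5py.File(path)["name"].id)`, where `path` and `"name"` are placeholders.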

Kerchunk should take advantage of H5Dchunk_iter so that it can efficiently iterate over chunks with linear scaling.

@martindurant
Member

@mkitti, thanks for the info. Do you know how to implement this on our end? I assume we should wait a while until the new version of HDF5 becomes standard (and maybe maintain the old behaviour for a while anyway).

@mkitti
Contributor Author

mkitti commented Jan 31, 2023

You could switch implementations depending on h5py.version.hdf5_version_tuple
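
A minimal sketch of such a version switch is shown below. The function name `supports_chunk_iter` is hypothetical, and the threshold is an assumption: only HDF5 1.14.0 and later is treated as supported here, since that is the only release this thread confirms.

```python
def supports_chunk_iter(hdf5_version):
    """Return True if the given HDF5 version tuple is 1.14.0 or newer.

    Assumption: only the 1.14 series (and later) is treated as supported.
    """
    return tuple(hdf5_version) >= (1, 14, 0)


# Typical use (requires h5py):
#   import h5py
#   use_iter = supports_chunk_iter(h5py.version.hdf5_version_tuple)
```

Tuple comparison is element-wise, so `(1, 12, 2)` compares lower than `(1, 14, 0)` without any string parsing.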

@mkitti
Contributor Author

mkitti commented Feb 1, 2023

PyTables just merged an implementation based on H5Dchunk_iter:

PyTables/PyTables#997

@martindurant
Member

We should probably wait until at least HDF5==1.14 is on conda-forge (https://anaconda.org/conda-forge/hdf5/files). @ajelenak, will you have any appetite to implement faster iteration?
@mkitti, do you have evidence that the iteration in the current kerchunk.hdf module is particularly slow?

@mkitti
Contributor Author

mkitti commented Feb 2, 2023

@ajelenak added H5Dchunk_iter to h5py in h5py/h5py#2202

HDF5 1.14 is due to hit conda-forge on February 5th according to this pull request:
conda-forge/hdf5-feedstock#188

My experience is mainly low-level using the C-API directly. I did do some concrete benchmarks for the Julia interface to HDF5:
JuliaIO/HDF5.jl#1031 (comment)

Essentially, when we are processing 16,384 chunks, retrieving chunk information via dsid.get_chunk_info (H5Dget_chunk_info) takes on the order of 10 seconds. Retrieving chunk information via H5Dchunk_iter takes less than 0.1 seconds. H5Dget_chunk_info may be faster for very few chunks, but then we're talking tens of microseconds in either case.

@mkitti
Contributor Author

mkitti commented Feb 2, 2023

Here's a table summary of my Julia benchmarks:

| Number of Chunks | H5Dchunk_iter time (seconds) | H5Dget_chunk_info time (seconds) | H5Dget_chunk_info / H5Dchunk_iter Ratio |
|---|---|---|---|
| 4 | 0.000040029 | 0.000027406 | 0.7 |
| 16 | 0.000053692 | 0.000077566 | 1.4 |
| 64 | 0.000194085 | 0.000456561 | 2.4 |
| 256 | 0.000717275 | 0.004541661 | 6 |
| 1024 | 0.003004693 | 0.048780859 | 16 |
| 4096 | 0.011674653 | 0.662931619 | 57 |
| 16384 | 0.064214971 | 13.353451558 | 208 |

I would be happy to attempt a pull request if we determine a path to proceed here.

@martindurant
Member

Since we have extra work that we do for each chunk, and Python has generally higher overheads, I bet the difference is nowhere near as dramatic - but your point is taken!

@mkitti
Contributor Author

mkitti commented Feb 3, 2023

It's the 13 seconds that bothers me the most here. I think that far exceeds any overhead you might see from Python.

Basically, H5Dchunk_iter scales much better. This gets noticeable when we are talking about ~10K chunks.
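
One way to put a number on "scales much better" is to fit an exponent k in t ~ n**k through two rows of the Julia benchmark table earlier in this thread. The sketch below uses only the 4096 and 16384 chunk rows; the function name `scaling_exponent` is hypothetical, and a two-point fit is a rough estimate rather than a rigorous complexity measurement.

```python
import math

# (number of chunks, H5Dchunk_iter seconds, H5Dget_chunk_info seconds),
# copied from the last two rows of the Julia benchmark table in this thread.
BENCH = [
    (4096, 0.011674653, 0.662931619),
    (16384, 0.064214971, 13.353451558),
]


def scaling_exponent(n1, t1, n2, t2):
    """Exponent k of t ~ n**k fitted through two (n, t) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


(n1, iter1, info1), (n2, iter2, info2) = BENCH
iter_k = scaling_exponent(n1, iter1, n2, iter2)
info_k = scaling_exponent(n1, info1, n2, info2)

print(f"H5Dchunk_iter exponent:     {iter_k:.2f}")
print(f"H5Dget_chunk_info exponent: {info_k:.2f}")
```

On these two rows the exponent comes out a little above 1 for H5Dchunk_iter and a little above 2 for H5Dget_chunk_info, which is consistent with near-linear versus roughly quadratic growth at this scale.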

I have an interest in seeing HDF5, Zarr, and others scaling into that number of chunks.

@mkitti
Contributor Author

mkitti commented Feb 6, 2023

HDF5 1.14 is now available via conda-forge

@martindurant
Member

Thanks for letting us know

@ajelenak
Collaborator

ajelenak commented Feb 6, 2023

@mkitti Can you share your test file somehow?

@mkitti
Contributor Author

mkitti commented Feb 6, 2023

@ajelenak I created it via Julia:
JuliaIO/HDF5.jl#1031 (comment)

@ajelenak
Collaborator

ajelenak commented Feb 8, 2023

Here are benchmark results with an h5 file created using the code in JuliaIO/HDF5.jl#1031 (comment). chunk_info is the old method and chunk_iter is the new one.

  • Python 3.11.0
  • h5py-3.8.0
  • libhdf5-1.14.0
  • h5 file is loaded in memory
  • Each chunk location method is run 10 times and the best (quickest) time is used
4 chunks :: chunk info = 1.31e-05 s
4 chunks :: chunk iter = 4.861e-06 s
4 chunks :: chunk_info / chunk_iter = 2.69
16 chunks :: chunk info = 4.5497e-05 s
16 chunks :: chunk iter = 1.2756e-05 s
16 chunks :: chunk_info / chunk_iter = 3.57
64 chunks :: chunk info = 0.000231134 s
64 chunks :: chunk iter = 4.3405e-05 s
64 chunks :: chunk_info / chunk_iter = 5.33
256 chunks :: chunk info = 0.002054627 s
256 chunks :: chunk iter = 0.000166648 s
256 chunks :: chunk_info / chunk_iter = 12.33
1024 chunks :: chunk info = 0.026394162 s
1024 chunks :: chunk iter = 0.000703572 s
1024 chunks :: chunk_info / chunk_iter = 37.51
4096 chunks :: chunk info = 0.378013247 s
4096 chunks :: chunk iter = 0.003634743 s
4096 chunks :: chunk_info / chunk_iter = 104.00
16384 chunks :: chunk info = 9.026964752 s
16384 chunks :: chunk iter = 0.013217899 s
16384 chunks :: chunk_info / chunk_iter = 682.93
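
The "run 10 times and keep the best time" procedure described above could be sketched as below. This is not the script that produced these numbers; the helper name `best_time` is hypothetical and `time.perf_counter` is an assumed choice of clock.

```python
import time


def best_time(func, repeats=10):
    """Call func `repeats` times and return the quickest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Typical use, assuming dsid is an h5py.h5d.DatasetID:
#   t_info = best_time(lambda: [dsid.get_chunk_info(i)
#                               for i in range(dsid.get_num_chunks())])
#   t_iter = best_time(lambda: dsid.chunk_iter(lambda info: None))
#   print(f"chunk_info / chunk_iter = {t_info / t_iter:.2f}")
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on short measurements.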

@martindurant
Member

@ajelenak, are you likely to have the time to implement this?

@ajelenak
Collaborator

ajelenak commented Feb 8, 2023

Sure, will add it to my to-do list. Should the new method kick in when a suitable libhdf5 version is detected, or be a user option? I prefer the former.

@martindurant
Member

I think it's fine to use it when available without needing a new option.
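
"Use it when available" could look roughly like the sketch below, which detects the method on the dataset ID and otherwise falls back to the indexed loop. The helper name `visit_chunks` is hypothetical, and the sketch assumes h5py only exposes `chunk_iter` when it was built against an HDF5 version that supports it; it is not necessarily how kerchunk implements the switch.

```python
def visit_chunks(dsid, visit):
    """Call visit(info) once per chunk, using chunk_iter when it is available.

    `visit` should return None: h5py's chunk_iter stops iterating on any
    other return value.
    """
    if callable(getattr(dsid, "chunk_iter", None)):
        # Single pass over the chunk index (HDF5 1.14 with h5py 3.8).
        dsid.chunk_iter(visit)
    else:
        # Fallback: one H5Dget_chunk_info lookup per linear index.
        for index in range(dsid.get_num_chunks()):
            visit(dsid.get_chunk_info(index))
```

Because both branches hand the same kind of object to `visit`, the per-chunk processing code does not need to know which path was taken.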

@mkitti
Contributor Author

mkitti commented May 3, 2023

I gave this a shot in #331. With many chunks, the results are quite remarkable.

Time for SingleHdf5ToZarr.translate():

| Number of Chunks | Before this pull request, with get_chunk_info | After this pull request, with chunk_iter | Ratio |
|---|---|---|---|
| 16,384 | 13 seconds | 0.131 seconds | 99x |
| 32,768 | 74 seconds | 0.214 seconds | 346x |
| 65,536 | 393 seconds | 0.472 seconds | 832x |


3 participants