Efficient HDF5 chunk iteration via HDF5 1.14, h5py 3.8, and H5Dchunk_iter #286
Comments
@mkitti , thanks for the info. Do you know how to implement this on our end? I assume we should wait a while until the new version of HDF becomes standard (and maybe maintain the old behaviour for a while anyway). |
You could switch implementations depending on |
PyTables just merged an implementation based on `H5Dchunk_iter`. |
We should probably wait until at least HDF5==1.14 is on conda-forge ( https://anaconda.org/conda-forge/hdf5/files ). @ajelenak , will you have any appetite to implement faster iteration? |
@ajelenak HDF5 1.14 is due to hit conda-forge on February 5th according to this pull request. My experience is mainly low-level, using the C API directly, but I did some concrete benchmarks for the Julia interface to HDF5. Essentially, when we are processing 16,384 chunks, retrieving chunk information via `H5Dchunk_iter` is dramatically faster than repeated `H5Dget_chunk_info` calls. |
Here's a table summary of my Julia benchmarks:
I would be happy to attempt a pull request if we determine a path to proceed here. |
Since we have extra work that we do for each chunk, and Python generally has higher overheads, I bet the difference is nowhere near as dramatic - but your point is taken! |
It's the 13 seconds that bothers me the most here. I think that far exceeds any overhead you might see from Python. Basically, H5Dchunk_iter scales much better. This gets noticeable when we are talking about ~10K chunks. I have an interest in seeing HDF5, Zarr, and others scaling into that number of chunks. |
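The scaling gap has a simple explanation: each `H5Dget_chunk_info` call looks up one chunk by linear index, which itself costs roughly O(n) in the number of chunks, so visiting every chunk that way is O(n²), while `H5Dchunk_iter` walks the chunk index once. A toy cost model (pure Python, not touching HDF5 at all, and the per-lookup assumption is the commonly cited one, not measured here) makes the ratio concrete:

```python
# Toy cost model (assumption: each per-index lookup scans the chunk index,
# which is the usual explanation for quadratic H5Dget_chunk_info iteration).
def cost_linear_index(n_chunks: int) -> int:
    # n lookups, the i-th scanning ~i index entries -> ~n^2/2 total work
    return sum(range(1, n_chunks + 1))

def cost_chunk_iter(n_chunks: int) -> int:
    # a single pass over the chunk index visits each chunk once
    return n_chunks

ratio = cost_linear_index(16_384) / cost_chunk_iter(16_384)
```

At 16,384 chunks the modeled ratio is over 8,000x, which is at least consistent with seconds versus milliseconds in the Julia benchmarks above.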
HDF5 1.14 is now available via conda-forge |
Thanks for letting us know |
@mkitti Can you share somehow your test file? |
@ajelenak I created it via Julia: |
Here are benchmark results with an h5 file created according to this JuliaIO/HDF5.jl#1031 (comment) code.
@ajelenak , are you likely to have the time to implement this? |
Sure, I will add it to my to-do list. Should the new method kick in when a suitable libhdf5 version is detected, or be a user option? I prefer the former. |
I think it's fine to use it when available without needing a new option. |
I gave this a shot in #331. With many chunks, the results are quite remarkable. Time for
Currently, kerchunk iterates over HDF5 chunks by looping over a linear index passed to `get_chunk_info`. This uses `H5Dget_chunk_info` from the HDF5 C API (see `kerchunk/hdf.py`, lines 523 to 524 at ff16c05).
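For reference, the linear-index pattern looks roughly like this (a sketch, not kerchunk's actual code; `iter_chunks_by_index` is a hypothetical name):

```python
import h5py

def iter_chunks_by_index(dset: h5py.Dataset):
    """Yield (chunk_offset, byte_offset, size) via repeated H5Dget_chunk_info.

    Each get_chunk_info(i) call is an independent index lookup, which is why
    this loop degrades badly as the chunk count grows.
    """
    dsid = dset.id
    for index in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(index)  # h5py StoreInfo
        yield info.chunk_offset, info.byte_offset, info.size
```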
Support for `H5Dchunk_iter` was recently released as part of HDF5 1.14.0: https://docs.hdfgroup.org/hdf5/v1_14/group___h5_d.html#gac482c2386aa3aea4c44730a627a7adb8

This function iterates through all the chunks contained within a dataset, visiting each chunk once.

`H5Dchunk_iter` was incorporated into h5py 3.8.0 as `h5py.h5d.DatasetID.chunk_iter()` when used with HDF5 1.14: h5py/h5py#2202
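`chunk_iter` takes a callback that receives one `StoreInfo` per written chunk. A hedged sketch of how kerchunk-style offset/size references could be collected with it (the function name `collect_chunk_refs` is illustrative, and the call requires h5py 3.8+ on HDF5 1.14+):

```python
import h5py

def collect_chunk_refs(dset: h5py.Dataset) -> dict:
    """Map chunk offset -> (byte_offset, size) in one pass over the chunk index."""
    refs = {}

    def visitor(info):  # info: h5py.h5d.StoreInfo
        refs[info.chunk_offset] = (info.byte_offset, info.size)

    dset.id.chunk_iter(visitor)
    return refs
```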
Support for `H5Dchunk_iter` is also expected in HDF5 1.12.3 and 1.10.9.

@ajelenak implemented a test which checks the equivalence of that call with the current iteration method:
https://github.com/h5py/h5py/blob/d2e84badfa5e4d8095bcc5d3db81f8548c340919/h5py/tests/test_dataset.py#L1800-L1826
Kerchunk should take advantage of `H5Dchunk_iter` so that it can efficiently iterate over chunks with linear scaling.