Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] H5D_chunk_iter_op_t exposes size (nbytes) as uint32_t rather than hsize_t #2056

Closed
mkitti opened this issue Aug 22, 2022 · 8 comments · Fixed by #2074
Closed

[BUG] H5D_chunk_iter_op_t exposes size (nbytes) as uint32_t rather than hsize_t #2056

mkitti opened this issue Aug 22, 2022 · 8 comments · Fixed by #2074

Comments

@mkitti
Copy link
Contributor

mkitti commented Aug 22, 2022

Describe the bug
H5D_chunk_iter_op_t exposes the argument size (formerly nbytes) as uint32_t rather than hsize_t.

Expected behavior
The type of size should be hsize_t as in H5Dget_chunk_info.

Platform (please complete the following information)

  • 1.13, 1.12
  • All OSes, compilers, build systems

Additional context
nbytes was renamed to size in #1969

@mkitti
Copy link
Contributor Author

mkitti commented Sep 9, 2022

Hi @fortnern, @derobins, and @gheber . If you have a second, I would like to hear your opinion on this issue before I try to backport my proposed fix in #2074 to 1.12 and integrate this into my H5Dchunk_iter backport to 1.10.

Basically I think the type for the filter_mask and size arguments should be the same between H5D_chunk_iter_op_t and H5Dget_chunk_info.

Currently, we have the following in H5Dpublic.h.

typedef int (*H5D_chunk_iter_op_t)(
    const hsize_t *offset,
    uint32_t filter_mask,
    haddr_t addr,
    uint32_t size,
    void *op_data
);
H5_DLL herr_t H5Dget_chunk_info(
    hid_t dset_id,
    hid_t fspace_id,
    hsize_t chk_idx,
    hsize_t *offset,
    unsigned *filter_mask,
    haddr_t *addr,
    hsize_t *size
);

I think we should change the definition of H5D_chunk_iter_op such that filter_mask is unsigned and size is hsize_t as follows.

typedef int (*H5D_chunk_iter_op_t)(
    const hsize_t *offset,
    unsigned filter_mask,
    haddr_t addr,
    hsize_t size,
    void *op_data
);

I find this particularly problematic since I do not think we should expose an API where size is limited to 32-bits.

Please let me know if you agree. If so, I will then finish my current pull request series on this matter across 1.13, 1.12, and 1.10.

@Dave-Allured
Copy link
Contributor

The documented upper limit for an HDF5 chunk size is 4 Gb. This fits neatly within uint32. Also, the purpose of chunking is to break up a dataset into small pieces for access performance. I can not conceive of a valid purpose for a chunk size exceeding this limit. In my experience, most valid chunk sizing is between 10^3 and 10^8 bytes, far below 4 x 10^9.

Therefore, I suggest that exposed uint32's be left alone, unless I am missing something.

@mkitti
Copy link
Contributor Author

mkitti commented Sep 14, 2022

@Dave-Allured , it's not clear to me that that would always be the case that chunks will be limited to 4 GB. It was suggested on the forums that perhaps contiguous datasets should be a special case of chunked datasets with only a single chunk: https://forum.hdfgroup.org/t/what-do-you-want-to-see-in-hdf5-2-0/10003/19?u=kittisopikulm . In that case, we would want chunks as large as datasets can be now. The bug report here is not exactly about that. I'm not suggesting that we change the internal chunk size type here.

The bug I am reporting here is about consistency across the public API. The older API functions that provide size information on chunks all use hsize_t, a 64-bit unsigned integer in most cases now. What I am suggesting is that the new function H5Dchunk_iter and the H5D_chunk_iter_op_t type should be consistent with H5Dget_chunk_info.

Having the public API being able to accommodate sizes up to 64 bits would allow an internal move to 64-bit sizes in the future even though that is currently not possible. For now, the API should at least be consistent.

@derobins
Copy link
Member

There's not much we can do about H5Dget_chunk_info() now, but we shouldn't be propagating a lie about acceptable parameter values across the API. Using a 64-bit parameter when there are hard 32-bit limits in the library is misleading.

The real bug is that H5Dget_chunk_info() was released with an incorrect parameter type.

@Dave-Allured
Copy link
Contributor

If HDF Group says that unit32 was unintentional, then I can not object to changing H5Dget_chunk_info() for constency. @mkitti, thanks for working on it.

@mkitti
Copy link
Contributor Author

mkitti commented Sep 16, 2022

@derobins What the current API of H5Dget_chunk_info communicates to me is that the address could be a 64-bit value in the future despite the current internal limitations of the HDF5 library and format. This API is used purely for reading values. Allowing for 64-bit values in the current API allows for forward-compatibility.

As far as I know, the main interface to set chunk sizes is via H5Pset_chunk. This function takes the dimensions as hsize_t despite each dimension also being practically limited to less than 32-bits.

herr_t H5Pset_chunk(
   hid_t plist_id,
   int ndims,
   const hsize_t dim[] 
)

The corresponding H5Pget_chunk is also hsize_t. If having dimensions and sizes be of type hsize_t is incorrect, then the entire public chunk API since HDF5 1.0.0 is incorrect.

Changing H5Dget_chunk_info, H5Dget_chunk_info_by_coords,H5Pget_chunk, and H5Pset_chunk now would break a long-standing API. Adding 32-bit based H5Dget_chunk_info2, H5Dget_chunk_info_by_coords2,H5Pget_chunk2, and H5Pset_chunk2 would seem to be heading in a backwards direction.

Since H5D_chunk_iter_op_t is currently unreleased in stable branch, I think the easiest thing to do is just make H5D_chunk_iter_op_t consistent with the rest of the existing chunk API. I'm not sure if I see the benefits of introducing a 32-bit H5D_chunk_iter_op_t now, only to break it later in a possible HDF5 2.0.

That said I will update #1968 with the 32-bit version of H5D_chunk_iter_op_t and proceed from there.

@mkitti
Copy link
Contributor Author

mkitti commented Sep 16, 2022

The lesser consistency issue in the public API is between unsigned *filter_mask in H5Dget_chunk_info and uint32_t filter_mask in H5D_chunk_iter_op_t.

We should pick either unsigned or uint32_t for consistency.

@mkitti
Copy link
Contributor Author

mkitti commented Dec 19, 2022

I' m reopening this since we are moving forward with #2074 .

I will also use this to track the backport to 1.12.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants