
h5py (+PyTorch) memory leak since hdf5 1.12.1 #1256

Closed
gvtulder opened this issue Dec 2, 2021 · 8 comments · Fixed by #3942
Labels: Component - C Library · Priority - 0. Blocker ⛔ · Type - Bug / Bugfix
gvtulder commented Dec 2, 2021

In h5py, the Python process's memory usage explodes when an array is loaded from an h5py object and converted to a PyTorch tensor, but only if you also retrieve an attribute. See the original issue: h5py/h5py#2010.

I am not entirely sure whether this is an h5py issue or a problem in libhdf5, but it is connected to this change in hdf5: 98a27f1 (between libhdf5 1.12.0 and 1.12.1). Before this change the leak does not appear; after it, it does.

Example

On my system, I can produce this error by running pip install h5py numpy torch and then running this script:

import h5py
import numpy as np
import torch

with h5py.File('test.h5', 'w') as ds:
    ds.create_dataset('scan', data=np.ones(shape=(1000, 1000)))
    ds['scan'].attrs['example'] = 'abcdefg'

def load_leaky():
    with h5py.File('test.h5', 'r') as ds:
        x_i = ds['scan'][:]
        ds['scan'].attrs['example']  # reading an attribute is needed to trigger the leak
    torch.tensor(x_i)  # converting to a PyTorch tensor is also needed

while True:
    load_leaky()

Running this will cause the Python process's memory usage to explode. All three steps (loading an array, reading an attribute, and converting the array to a PyTorch tensor) are necessary to trigger the issue.
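One way to watch the growth without external tools is the stdlib resource module (a sketch; this is Unix-only, ru_maxrss is reported in KiB on Linux and bytes on macOS, and load_leaky refers to the function in the script above):

```python
import resource

def peak_rss():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def peak_growth(fn, iterations):
    """Call fn() repeatedly and report how much the peak RSS grew."""
    before = peak_rss()
    for _ in range(iterations):
        fn()
    return peak_rss() - before
```

With the reproducer above, something like peak_growth(load_leaky, 2500) returns a large and ever-growing number on an affected libhdf5 build, and stays roughly flat on 1.12.0.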

Bisection result

I repeated this with self-compiled h5py 3.6.0 against libhdf5 compiled from git at commits between 1.12.0 and 1.12.1. The problem depends on the hdf5 version: 1.12.0 works, 1.12.1 doesn't, and 98a27f1 is the first commit where it goes wrong.

Is this an hdf5 issue or is this something in h5py?

cc @tacaswell

@gvtulder (author)

The problem might be linked to this specific change in H5T__initiate_copy.

After the change it looks like this, and then it leaks (hdf5/src/H5T.c, lines 3433 to 3438 at 98a27f1):

    /* Increment ref count on owned VOL object */
    if (new_dt->shared->owned_vol_obj)
        (void)H5VL_object_inc_rc(new_dt->shared->owned_vol_obj);

    /* Reset vol_obj field */
    new_dt->vol_obj = NULL;

but the leak disappears if I change those lines back to their previous state (hdf5/src/H5T.c, lines 3349 to 3351 at 4aa4036):

    /* Reset VOL fields */
    new_dt->vol_obj               = NULL;
    new_dt->shared->owned_vol_obj = NULL;

Another potential clue: after the change, the functions H5T_copy and H5T__initiate_copy are called much more often than before. I added print statements to a few functions to count the calls while running the script from my previous post (calling the load_leaky() function 2500 times):

--- calls with leak
+++ calls without leak
    2501 H5T__own_vol_obj
    2501 H5VL_free_object not yet free 1
-   7502 H5VL_free_object not yet free 2
-   5002 H5VL_free_object not yet free 3
   12505 H5VL_free_object was free 0
-  38146 H5T_copy
+  18154 H5T_copy
   15003 H5T_copy_reopen
-  53149 H5T__initiate_copy
+  33157 H5T__initiate_copy

(for H5VL_free_object, I printed the value of vol_obj->rc after the function name)

I do not understand why this ref-counting change would lead to more copies.
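To make the suspicion concrete, here is a toy illustration (plain Python, not HDF5 code; VolObj and its methods are made up) of how an increment without a matching decrement leaks a manually ref-counted object:

```python
class VolObj:
    """Toy stand-in for a manually ref-counted object like H5VL_object_t."""
    def __init__(self):
        self.rc = 1  # created with one reference

    def inc(self):
        self.rc += 1

    def dec(self):
        self.rc -= 1
        return self.rc  # 0 would mean the object can be freed

obj = VolObj()
for _ in range(3):
    obj.inc()  # e.g. one increment per datatype copy
obj.dec()      # but teardown decrements only once
# obj.rc == 3: the count never reaches 0, so the object is never freed
```

If the copy path increments owned_vol_obj's count more often than the close path decrements it, the object (and everything it holds) stays alive indefinitely, which would match the growing H5T_copy counts above.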

@derobins derobins self-assigned this May 4, 2023
@derobins derobins added Priority - 1. High 🔼 These are important issues that should be resolved in the next release Component - C Library Core C library issues (usually in the src directory) Type - Bug / Bugfix Please report security issues to help@hdfgroup.org instead of creating an issue on GitHub labels May 4, 2023
@derobins derobins added this to the 1.14.3 milestone Oct 9, 2023
@derobins derobins modified the milestones: 1.14.3, 1.14.4 Oct 21, 2023
plule-ansys commented Nov 14, 2023

Hello, I've hit the same issue using the C++ library (triggered by the same commit, 98a27f1). Here are some additional findings in case they help:

  • It can be reproduced by opening a file in a loop and reading a variable-length attribute; reading an array does not seem necessary in that case
  • Writing a variable-length string attribute leads to the same leak
  • Fixed-length string attributes are not affected

Here is a minimal reproduction in the form of a C++ test:

static void
test_string_attr_many()
{
    SUBTEST("Write a string many times");
    StrType   vls_type(0, H5T_VARIABLE);
    DataSpace att_space(H5S_SCALAR);
    for (size_t i = 0; i < 10000; ++i) {
        H5File    fid1("file.h5", H5F_ACC_TRUNC);
        Group     root      = fid1.openGroup("/");
        Attribute gr_vlattr = root.createAttribute("attribute", vls_type, att_space);
        gr_vlattr.write(vls_type, std::string("value"));
    }
    PASSED();
}

Reverting the ref count change in H5T.c fixed it when writing the attribute too. Visual Studio shows the following allocation paths, which seem to match @gvtulder's findings:

[screenshot: Visual Studio allocation call paths]


ajelenak commented Nov 15, 2023 via email


jhendersonHDF commented Jan 11, 2024

@gvtulder @plule-ansys @tacaswell

Hi all, this issue is a bit old now; is this still a problem with the latest released version of h5py? I've tried to reproduce this issue with a C program, but can't reproduce the memory leak. @plule-ansys's example exhibits a slow memory leak, but this appears to be due to a missing call to H5Treclaim (or the older H5Dvlen_reclaim). Though not very well documented, this function should be called after reading an object (dataset or attribute) with a variable-length type in order to free memory that the library allocated as part of the read operation.

I looked through the h5py codebase and it appears to expose the H5Dvlen_reclaim API, but, in relation to object I/O, as far as I can tell the codebase only calls H5Dvlen_reclaim after a write to an attribute or dataset, in https://github.com/h5py/h5py/blob/master/h5py/_proxy.pyx#L67-L68 and https://github.com/h5py/h5py/blob/master/h5py/_proxy.pyx#L148-L149, when the type needs 'proxy buffering'. This makes sense, because calling H5Dvlen_reclaim after reading from an object would free the memory before the caller could use it, but it also presumably means that the user's Python code would be responsible for calling this API through the Python interface once finished with the data after reading from an object.

I read through the h5py docs briefly, but couldn't find any reference to calling this function in the docs for variable-length types. Based on that, I believe these memory leaks come from not cleaning up the memory that the library allocated. I'm not entirely sure what would have caused this to become a problem after the commit mentioned in this issue, but it may well be that the library was conveniently but unintentionally closing out these objects too early, and now properly keeps the objects around with reference counting until the user calls H5Dvlen_reclaim.

@tacaswell Are you able to comment on whether h5py has any information about dealing with the cleanup of memory allocated for reads of variable-length typed objects? Regardless, I think our documentation on this could be much better.

@tacaswell

I have not had time to dig into this, but my expectation is that h5py should be doing the copy before handing back to the (Python space) user or we need to extend our object tear down logic. Either way, it should not be on the end user to remember to do this.

@jhendersonHDF

> I have not had time to dig into this, but my expectation is that h5py should be doing the copy before handing back to the (Python space) user or we need to extend our object tear down logic. Either way, it should not be on the end user to remember to do this.

Thanks for the response! I plan to try and prototype this, but if this is something that h5py can manage such that the end user doesn't have to worry about it, that would be fantastic. I'm less familiar with Python, but I received some advice during a separate discussion pointing me towards context managers that look like they could be useful for this purpose. As long as h5py is still in control of the underlying C buffer when the Python space is finished with it, then I imagine h5py could be updated to call H5Dvlen_reclaim on the buffer to free the library-allocated memory before freeing the buffer itself without needing a copy. Otherwise, if that proves to be unworkable, it seems copying the buffer before giving it back to the Python space could work as well.
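The context-manager idea can be sketched generically; nothing below is real h5py API, and reclaim_fn is a hypothetical callback standing in for an H5Dvlen_reclaim/H5Treclaim call:

```python
from contextlib import contextmanager

@contextmanager
def reclaimed(buf, reclaim_fn):
    """Yield a library-allocated buffer and guarantee reclaim_fn(buf)
    runs exactly once when the caller is done with it, even on error."""
    try:
        yield buf
    finally:
        reclaim_fn(buf)

# Hypothetical usage, with read_vlen/reclaim standing in for the C-level calls:
# with reclaimed(read_vlen(dset), reclaim) as data:
#     process(data)
```

This keeps the library in control of the buffer's lifetime while still letting the Python caller use the data freely inside the with block.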

@plule-ansys

@jhendersonHDF Thank you for your answer.

> Though not very well-documented, this function should be called after reading an object (dataset or attribute) with a variable-length type in order to free memory that the library allocated as a part of the read operation.

Could you be a bit more specific about what should be freed with H5Treclaim, in either the C++ or Python example, please? In both cases the leak happens in a write-only loop, and the leaked read buffer is not directly used by the user code, so I'm a bit confused about which buffer should be reclaimed in that scenario.

Also, let me know if the C++ situation should be handled in a separate issue; my initial understanding was that it was the same, but now I'm unsure.

@jhendersonHDF

Hi @plule-ansys,

after digging more into the issue, it seems to be a bit of a multifaceted problem, but in your particular case with the C++ example, I think nothing actually needs to be freed with H5Treclaim; it was more of a mistake on my part translating between the Python example, the C++ example, and my own C example. That said, while I believe your example is fine, H5Treclaim is still something that needs to be called after reading an object with a variable-length type, even if that read buffer happens not to be used. For example, see https://github.com/HDFGroup/hdf5/blob/develop/HDF5Examples/C/H5T/h5ex_t_vlstring.c and, more particularly, this part of that file: https://github.com/HDFGroup/hdf5/blob/develop/HDF5Examples/C/H5T/h5ex_t_vlstring.c#L111-L118.

HDF5 generally needs to allocate memory for variable-length types during the read operation in order to properly hand data back to the user, so H5Treclaim is used to reclaim that memory. A user could also free that memory manually; H5Treclaim mostly just acts as a convenience function. Looking at the example above, a dims[0] * sizeof(char *)-sized char ** read buffer is allocated and passed into the H5Dread call. HDF5 will allocate each char * in that array and fill in the buffers before returning from H5Dread, at which point the user could free that memory by stepping through the array, or just by calling H5Treclaim with the proper arguments. Variable-length string types are fairly simple in this regard, while other variable-length types (think something like a recursive variable-length of variable-lengths) can get a little more complicated, and H5Treclaim can help manage the headache of trying to properly free buffers of those types. I've prototyped adding this functionality to h5py, but it's a little unwieldy from the end-user perspective, so I'm hoping to eventually work with the h5py folks and improve on that.

Having said all that, I finally found the source of the original issue reported here and am currently working on a fix for the HDF5 1.14.4 release. I believe that should fix the issue for both the Python example and your C++ example. Thanks to @gvtulder for giving me a place to start from that helped in pinpointing the problem!

@derobins derobins added Priority - 0. Blocker ⛔ This MUST be merged for the release to happen and removed Priority - 1. High 🔼 These are important issues that should be resolved in the next release labels Jan 19, 2024
jhendersonHDF added a commit to jhendersonHDF/hdf5 that referenced this issue Jan 24, 2024
lrknox added a commit that referenced this issue Feb 15, 2024
* Remove cached datatype conversion path table entries on file close (#3942)

* Add RELEASE.txt note for the fix for issue #1256 (#3955)