OOM error during inference. #19
I investigated this a little bit more and I am fairly certain that we don't have a bug in our callback. But let me know what you think @jonasteuwen. I did the following things:
Filename: /home/a.karkala/ahcore/ahcore/callbacks.py
The issue we have looks the same as what is reported here. They identified a problem in the hdf5 C library against which h5py is compiled. Unfortunately, the issue they pointed out is still not fixed in upstream hdf5. Check here. One of the contributors has taken this up and marked the issue for the next release; check here. For now, maybe we should compile h5py from source against a build of the hdf5 library that doesn't leak.
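For reference, h5py can be built from source against a specific HDF5 installation by pointing the build at it. A sketch of the recipe (the install path and version below are placeholders, not taken from this thread):

```shell
# Build h5py against a locally installed HDF5 instead of the bundled wheel.
# HDF5_DIR must point at your own HDF5 prefix (containing include/ and lib/).
export HDF5_DIR=/opt/hdf5-1.10.6
pip install --no-binary=h5py h5py
```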
Thanks @moerlemans! So, your working environment uses hdf5 1.10.6. Let me try with these versions.
I did some more checks and I am capturing them here. It's now starting to look like it wasn't a problem with h5py or the underlying library either (well, at most it may have added to the trouble). I installed older versions of h5py built against older versions of the hdf5 library and the problem still persisted. Specifically, I tried: Then, I started looking elsewhere. I investigated how much memory the child processes end up taking while doing the h5 writing. Following are some screenshots I made using the
When the prediction begins for the first image, the process with id 3328375 is the parent and 3336839 is the child process. After that's done, the child process exits properly. During this time, the resident set size and the virtual memory have significantly increased in the parent process. The next child process, with id 3387563, inherits the parent's memory (multiprocessing uses forking by default, so the memory isn't copied but shared, and copied only when the child tries to modify it). So, this rules out any problem with the multiprocessing we have in place currently. In fact, the problem doesn't seem to be in the h5 writing or the callback at all. So, I disabled all the callbacks and simply ran a prediction loop. Much to my surprise, it crashed very quickly. It turns out there is an open issue on the pytorch lightning 2.0 GitHub which is similar (but not exactly the same).
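The parent/child RSS numbers above can be reproduced with a small stdlib-only sketch (a minimal illustration, not the tooling used in this thread): it forks a child with multiprocessing and reports each process's peak resident set size via `resource.getrusage`.

```python
import multiprocessing as mp
import resource

def peak_rss_kb():
    # Peak resident set size of the calling process (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def child(queue):
    # Runs in the forked child: report its own peak RSS back to the parent.
    queue.put(peak_rss_kb())

ctx = mp.get_context("fork")  # fork is the default start method on Linux
queue = ctx.Queue()
proc = ctx.Process(target=child, args=(queue,))
proc.start()
child_rss = queue.get()
proc.join()
parent_rss = peak_rss_kb()
print(f"parent peak RSS: {parent_rss} kB, child peak RSS: {child_rss} kB")
```

Because the child is forked, it starts with the parent's pages mapped copy-on-write, which is why its RSS initially mirrors the parent's rather than starting from zero.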
I downgraded to pytorch_lightning 1.9.1 while retaining the latest pytorch (2.1.1). The issue doesn't seem to go away.
I downgraded pytorch to 1.12.1 and the issue didn't go away.
From the looks of it, the predictions are not being deallocated after the prediction step. So, to test this hypothesis, I returned None after Please note that, in this run, my batch size was 256 tiles and all the callbacks were switched off.
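The hypothesis that the predict loop itself retains every batch's output can be checked without any framework. This stdlib sketch (names are illustrative) uses weak references to show that collected predictions stay alive for the whole run, while dropped ones are freed immediately:

```python
import weakref

class Prediction:
    """Stand-in for one batch's large prediction tensor."""
    def __init__(self):
        self.data = bytearray(1_000_000)

def predict_loop(collect_outputs, n_batches=3):
    refs, outputs = [], []
    for _ in range(n_batches):
        preds = Prediction()
        refs.append(weakref.ref(preds))
        if collect_outputs:
            outputs.append(preds)  # what a loop that stores predictions does
        del preds  # mimics returning None: drop our reference each iteration
    # Count how many per-batch predictions are still alive in memory.
    return sum(r() is not None for r in refs)

print(predict_loop(collect_outputs=True))   # 3: every batch is retained
print(predict_loop(collect_outputs=False))  # 0: each batch is freed promptly
```

If I understand the Lightning API correctly, passing `return_predictions=False` to `trainer.predict(...)` should have the same effect as returning None, telling the loop not to collect the outputs at all.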
@EricMarcus-ai and I looked at the issue today. We disabled virtually everything that could cause memory leaks. Concretely, we commented out all the lines within the Just as a sanity check, we also downgraded the pytorch lightning version and repeated the run. It still broke.
Today, I investigated the role of the
@AjeyPaiK can this be closed?
Yes. I made the necessary changes in this PR.
Describe the bug
I am using the WriteH5Callback at inference time. I tracked the RSS memory which gets utilised during inference. Below is what I found. After writing each H5 file corresponding to one WSI from the inference dataset:
To Reproduce
Run the following after configuring this version of ahcore
Expected behavior
The RSS memory shouldn't keep increasing with every new image while performing inference.
Environment
dlup version: 0.3.32
Python version: 3.10
Operating System: Linux
Additional Context
I am trying to run inference on a large batch of WSIs from a clinical dataset (n=1072) with my trained models. That's when I encountered this problem.