
GridPatch with pin_memory=True significant slow-down in following epochs #6082

Open
drbeh opened this issue Feb 28, 2023 · 4 comments
Labels
bug Something isn't working Contribution wanted



drbeh commented Feb 28, 2023

Reported by a user (@kenza-bouzid) here:

When using num_workers > 1 and pin_memory=True, training time increases exponentially over epochs.

[screenshots: epoch-time plots]

I profiled and timed all intermediate steps and found that GridPatch was the culprit.

[screenshot: profiling results]

So I timed each intermediate step in the transform:

[screenshots: per-step timings]

It turned out that the operation taking too long was a memory allocation by np.array:

```python
patched_image = np.array(patches[0])
```

I eventually fixed it by setting pin_memory=False, which I explain by memory allocation being more expensive when it has to come from pinned (page-locked) memory.
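A minimal sketch of the reported workaround, assuming a standard PyTorch DataLoader setup (the TensorDataset below is an illustrative stand-in for the real GridPatch-based dataset, not the user's actual pipeline):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the real patch dataset.
dataset = TensorDataset(torch.zeros(8, 3, 32, 32))

# The reported workaround: disable pinning. The original report used
# num_workers > 1; 0 is used here only to keep the sketch portable.
loader = DataLoader(dataset, batch_size=4, num_workers=0, pin_memory=False)

shapes = [batch[0].shape for batch in loader]
```

With pin_memory=False the collated batches stay in ordinary pageable memory, so no page-locked allocations are requested per batch.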

Note that I am dealing with particularly large slides, ~50k × 50k pixels.
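For scale (an illustrative aside, not from the report): a 50,000 × 50,000 float32 array is roughly 10 GB, and `np.array` always allocates a fresh buffer and copies, even when its input is already an ndarray; `np.asarray` skips the copy in that case:

```python
import numpy as np

src = np.ones((1024, 1024), dtype=np.float32)  # small stand-in for a slide

copied = np.array(src)     # always allocates a new buffer and copies
aliased = np.asarray(src)  # no copy when input is already an ndarray

print(copied is src, aliased is src)  # False True
```

Note this only illustrates the allocation cost of the copy itself; in the transform, `patches[0]` may be a sequence, in which case a copy is unavoidable.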

Any thoughts on this?


drbeh commented Feb 28, 2023

@kenza-bouzid Just to double-check: when you set pin_memory=False, this issue goes away and every epoch takes the same time (a few seconds), right?


kenza-bouzid commented Feb 28, 2023

@drbeh Correct! To be honest, I don't have a proper explanation for why pin_memory=False fixes this issue.
Training looks healthy with pin_memory=False; epoch time is constant.

[screenshot: constant epoch-time plot]


drbeh commented Feb 28, 2023

@ericspod @Nic-Ma @wyli @rijobro do you have any insights here on why setting pin_memory=True in dataloader causes this behavior?

It seems that these conversions to and from numpy arrays have some effect here, but I thought memory pinning happens in the last step of the dataloader, so I don't see why it would be involved.
Also, there have been issues with pinned memory in PyTorch in the past, like this one from years ago, so could it be a PyTorch issue?


ericspod commented Mar 6, 2023

I'm honestly not sure what the issue would be here. I recall other problems we've encountered with pinned memory in the past, so I typically avoid it. I can only speculate that the pinned segment's lifetime continues after the np.array conversion, because of PyTorch's buffer reuse or some other issue in PyTorch or CUDA itself.

@drbeh drbeh changed the title GridPatch with pin_memory=True significantly slow-down in following epochs GridPatch with pin_memory=True significant slow-down in following epochs Mar 6, 2023
wyli pushed a commit that referenced this issue Mar 29, 2023
Fixes #6245 .

### Description
support GPU tensor for
* `GridPatch`, `GridPatchd`, `RandGridPatch` and `RandGridPatchd`
* `GridPatchDataset`, `PatchIter`, `PatchIterd` and `iter_patch`

#### Changes
* Currently, `GridPatch`, `GridPatchDataset` and their siblings must copy
GPU tensors to numpy arrays (or do not support GPU tensors at all) in
their pipelines. This PR enables them to support end-to-end GPU operations.
* Since the padding function `pad_nd` is used heavily, its robustness is
improved. Some strange behavior introduced by #6066 is fixed.
* The `GridPatch` issue #6082 might be fixed by this PR.
* `RandGridPatch`'s mode now defaults to `None` (consistent with docs).

#### Need to Notice
* The default padding mode of `PatchIter`, `PatchIterd` and `iter_patch` is
kept as `NumpyPadMode.WRAP`, whereas `GridPatch` defaults to `None`
(this might be inconsistent?).
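For reference, `NumpyPadMode.WRAP` corresponds to numpy's `"wrap"` padding, which fills the pad region by tiling the array periodically (a minimal numpy illustration, independent of MONAI):

```python
import numpy as np

patch = np.array([1, 2, 3])

# "wrap" pads by repeating the array periodically on both sides.
padded = np.pad(patch, (2, 2), mode="wrap")
print(padded)  # [2 3 1 2 3 1 2]
```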

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [x] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

---------


Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>
@KumoLiu KumoLiu added bug Something isn't working Contribution wanted labels Dec 7, 2023