
GridPatch with pin_memory=True significant slow-down in following epochs #6082

Open
drbeh opened this issue Feb 28, 2023 · 4 comments
Labels
bug Something isn't working Contribution wanted



drbeh commented Feb 28, 2023

Reported by a user (@kenza-bouzid) here:

When using num_workers > 1 and pin_memory=True, training time increases exponentially over epochs.

[screenshots: epoch-time plots]

I profiled and timed all intermediate steps and found that GridPatch was the culprit.

[screenshot: profiling results]

So I timed each intermediate step in the transform:

[screenshots: per-step timings]

It turned out that the operation taking too long was a memory allocation by np.array:

```python
patched_image = np.array(patches[0])
```

I eventually fixed it by setting pin_memory=False, which I explain by memory allocation being more expensive when it has to come from pinned (page-locked) memory.
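A minimal sketch of the reported workaround, assuming a standard PyTorch DataLoader setup (the TensorDataset below is an illustrative stand-in for the real GridPatch-based dataset, not the user's actual pipeline):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the real patch dataset.
dataset = TensorDataset(torch.zeros(8, 3, 32, 32))

# The reported workaround: disable pinning. The original report used
# num_workers > 1; 0 is used here only to keep the sketch portable.
loader = DataLoader(dataset, batch_size=4, num_workers=0, pin_memory=False)

shapes = [batch[0].shape for batch in loader]
```

With pin_memory=False the collated batches stay in ordinary pageable memory, so no page-locked allocations are requested per batch.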

Note that I am dealing with particularly large slides, ~50k × 50k pixels.
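For scale (an illustrative aside, not from the report): a 50,000 × 50,000 float32 array is roughly 10 GB, and `np.array` always allocates a fresh buffer and copies, even when its input is already an ndarray; `np.asarray` skips the copy in that case:

```python
import numpy as np

src = np.ones((1024, 1024), dtype=np.float32)  # small stand-in for a slide

copied = np.array(src)     # always allocates a new buffer and copies
aliased = np.asarray(src)  # no copy when input is already an ndarray

print(copied is src, aliased is src)  # False True
```

Note this only illustrates the allocation cost of the copy itself; in the transform, `patches[0]` may be a sequence, in which case a copy is unavoidable.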

Any thoughts on this?


drbeh commented Feb 28, 2023

@kenza-bouzid Just to double-check: when you set pin_memory=False, this issue goes away and every epoch takes the same time (a few seconds), right?


kenza-bouzid commented Feb 28, 2023

@drbeh Correct! To be honest, I don't have a proper explanation for why pin_memory=False fixes this issue.
Training looks healthy with pin_memory=False; epoch time is constant.

[screenshot: constant epoch-time plot]


drbeh commented Feb 28, 2023

@ericspod @Nic-Ma @wyli @rijobro do you have any insights here on why setting pin_memory=True in dataloader causes this behavior?

It seems that these conversions to and from numpy arrays have some effect here, but I thought memory pinning happens in the last step of the dataloader, so I don't see why it would be involved.
Also, there have been issues with pinned memory in PyTorch in the past, like this one from years ago, so could it be a PyTorch issue?


ericspod commented Mar 6, 2023

I'm honestly not sure what the issue would be here. I recall other problems we've encountered with pinned memory in the past, so I typically avoid it. I can only speculate that the pinned segment's lifetime continues after the np.array conversion, because of PyTorch's buffer reuse or some other issue in PyTorch or CUDA itself.

@drbeh drbeh changed the title GridPatch with pin_memory=True significantly slow-down in following epochs GridPatch with pin_memory=True significant slow-down in following epochs Mar 6, 2023
wyli pushed a commit that referenced this issue Mar 29, 2023
Fixes #6245 .

### Description
support GPU tensor for
* `GridPatch`, `GridPatchd`, `RandGridPatch` and `RandGridPatchd`
* `GridPatchDataset`, `PatchIter`, `PatchIterd` and `iter_patch`

#### Changes
* Currently, `GridPatch`, `GridPatchDataset` and their siblings must copy
GPU tensors to numpy arrays (or do not support GPU tensors at all) in
their pipelines. This PR enables them to support end-to-end GPU operations.
* Since the padding function `pad_nd` is used heavily, its robustness is
improved. Some strange behavior introduced by #6066 is fixed.
* The `GridPatch` issue #6082 might be fixed by this PR.
* `RandGridPatch`'s mode now defaults to `None` (consistent with docs).

#### Need to Notice
* The default padding mode of `PatchIter`, `PatchIterd` and `iter_patch` is
kept as `NumpyPadMode.WRAP`, whereas `GridPatch` defaults to `None`
(this might be inconsistent?).
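For reference, `NumpyPadMode.WRAP` corresponds to numpy's `"wrap"` padding, which fills the pad region by tiling the array periodically (a minimal numpy illustration, independent of MONAI):

```python
import numpy as np

patch = np.array([1, 2, 3])

# "wrap" pads by repeating the array periodically on both sides.
padded = np.pad(patch, (2, 2), mode="wrap")
print(padded)  # [2 3 1 2 3 1 2]
```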

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [x] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

---------


Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>
@KumoLiu KumoLiu added bug Something isn't working Contribution wanted labels Dec 7, 2023