[GraphBolt][CUDA] Use better memory allocation algorithm to avoid OOM. #7618
Conversation
To trigger regression tests:
Docs need to be added to let users know about this option. It might also be good to let users turn this allocation method on and off.
@TristonC The user can already turn it off by setting the PyTorch environment variable. This PR sets it only if the user has not already provided a value for it. Where in the documentation should this be documented? I feel 99% of users won't even touch this option.
Here is my concern, although I really want it on by default: PyTorch marks this option as experimental, and its default value is off.
@TristonC Do we let the users simply go OOM then? The new feature fetching pipeline creates a lot more feature tensors because the pipeline is now longer (Disk -> CPU Cache -> GPU Cache), and there can be a 2x memory-usage difference between enabling this option and not. Do we let users run our new features and conclude that they do not work well? I think a better way is to enable it and give the user a warning saying we enabled it. If they want to disable it, they can set the corresponding environment variable. That way, we don't need documentation, as the warning will provide the links to the user. Nobody will see the documentation anyway. A minimal sketch of this approach follows.
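A minimal sketch of the guard described above, assuming the option in question is `expandable_segments` from the linked PyTorch docs (the helper name and warning text are illustrative, not the exact PR code): set the allocator option only when the user has not already configured `PYTORCH_CUDA_ALLOC_CONF`, and warn so they know how to opt out.

```python
import os
import warnings

_ALLOC_CONF = "PYTORCH_CUDA_ALLOC_CONF"

def _maybe_enable_expandable_segments():
    # Respect any user-provided value; only supply a default when absent.
    if _ALLOC_CONF not in os.environ:
        os.environ[_ALLOC_CONF] = "expandable_segments:True"
        warnings.warn(
            f"{_ALLOC_CONF} has been set to 'expandable_segments:True' to "
            "reduce GPU memory fragmentation. Set the environment variable "
            "yourself (e.g. to 'expandable_segments:False') to override this."
        )
```

For the setting to take effect, this has to run before PyTorch's CUDA caching allocator is initialized, i.e. before the first CUDA allocation in the process.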
I like the idea of the warning.
Will change the PR accordingly, appreciate the review.
LGTM |
@TristonC @frozenbugs See pytorch/pytorch#130330 for the relevant discussion. pytorch/pytorch#130959 proposes to enable it by default, and it is approved. Since it looks like it is almost a production-level feature, I am hoping there is no harm in us enabling it by default; if a user ever has a problem, they can consider turning it off, since we display a warning.
Sounds good. Does it also mean that we can remove the flag-changing code in the future?
If it is turned on by default upstream, then in the future we can turn it on like this only for older torch versions and not have to display a warning; see the sketch below.
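A rough sketch of that version gating; the cutoff version below is a placeholder, since the release that flips the upstream default is not known:

```python
from packaging import version

import torch

# Placeholder cutoff, not a real release: the torch version in which
# expandable segments becomes the upstream default.
_DEFAULT_ON_SINCE = version.parse("9.9")

if version.parse(torch.__version__) < _DEFAULT_ON_SINCE:
    _maybe_enable_expandable_segments()  # helper sketched earlier
```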
Description
I have a 24 GB GPU. Without this change, some examples go OOM; with it, they don't even use half the GPU memory. This is due to the irregular sizes of the minibatches in GNN training: each minibatch has a different-sized `node_ids()`, for example. The default allocator does not work well in this use case, and the added option is much better suited for GNN minibatch training with GraphBolt. See https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf.
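Per the linked PyTorch docs, users can also opt in or out manually; the variable just has to be set before the first CUDA allocation in the process. A minimal usage sketch:

```python
import os

# Must be set before PyTorch's CUDA caching allocator is initialized,
# i.e. before the first CUDA tensor is allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1024, device="cuda")  # allocations now use expandable segments
```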
Checklist
Please feel free to remove inapplicable items for your PR.
Changes