[GraphBolt][CUDA] Use better memory allocation algorithm to avoid OOM. #7618
Conversation
To trigger regression tests:
Docs need to be added to let users know about this option. It might also be good to let users turn this allocation method on and off.
@TristonC The user can already turn it off by setting the PyTorch environment variable. This PR sets it only if the user has not already provided a value for it. Where in the documentation should this be documented? I feel 99% of users won't even touch this option.
Here is my concern, although I really want it on by default: PyTorch marks this option as experimental, and its default value is off.
@TristonC Do we let the users simply go OOM then? The new feature fetching pipeline creates a lot more feature tensors because the pipeline is now longer (Disk -> CPU Cache -> GPU Cache), and there can be a 2x memory-usage difference between enabling this option and not. Do we let users run our new features and conclude that they do not work well? I think a better way is to enable it and give the user a warning saying we enabled it. If they want to disable it, they can set the corresponding environment variable. That way, we don't need documentation, as the warning will provide the links to the user. Nobody will see the documentation anyway. A minimal sketch of this approach follows.
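A minimal sketch of the guard described above, assuming the option in question is `expandable_segments` from the linked PyTorch docs (the helper name and warning text are illustrative, not the exact PR code): set the allocator option only when the user has not already configured `PYTORCH_CUDA_ALLOC_CONF`, and warn so they know how to opt out.

```python
import os
import warnings

_ALLOC_CONF = "PYTORCH_CUDA_ALLOC_CONF"

def _maybe_enable_expandable_segments():
    # Respect any user-provided value; only supply a default when absent.
    if _ALLOC_CONF not in os.environ:
        os.environ[_ALLOC_CONF] = "expandable_segments:True"
        warnings.warn(
            f"{_ALLOC_CONF} has been set to 'expandable_segments:True' to "
            "reduce GPU memory fragmentation. Set the environment variable "
            "yourself (e.g. to 'expandable_segments:False') to override this."
        )
```

For the setting to take effect, this has to run before PyTorch's CUDA caching allocator is initialized, i.e. before the first CUDA allocation in the process.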
I like the idea of the warning.
Will change the PR accordingly, appreciate the review.
LGTM |
@TristonC @frozenbugs See pytorch/pytorch#130330 for the relevant discussion. pytorch/pytorch#130959 proposes to enable it by default, and it is approved. Since it looks like it is almost a production-level feature, I am hoping there is no harm in us enabling it by default; if a user ever has a problem, they can consider turning it off, since we display a warning.
Sounds good. Does it also mean that we can remove the flag-changing code in the future?
If it is turned on by default upstream, then in the future we can turn it on like this only for older torch versions and not have to display a warning; see the sketch below.
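A rough sketch of that version gating; the cutoff version below is a placeholder, since the release that flips the upstream default is not known:

```python
from packaging import version

import torch

# Placeholder cutoff, not a real release: the torch version in which
# expandable segments becomes the upstream default.
_DEFAULT_ON_SINCE = version.parse("9.9")

if version.parse(torch.__version__) < _DEFAULT_ON_SINCE:
    _maybe_enable_expandable_segments()  # helper sketched earlier
```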
Description
I have a 24 GB GPU. Without this change, some examples go OOM; with it, they don't even use half the GPU memory. This is due to the irregular sizes of the minibatches in GNN training: each minibatch has a different-sized `node_ids()`, for example. The default allocator does not work well in this use case, and the added option is much better suited for GNN minibatch training with GraphBolt. See https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf.
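Per the linked PyTorch docs, users can also opt in or out manually; the variable just has to be set before the first CUDA allocation in the process. A minimal usage sketch:

```python
import os

# Must be set before PyTorch's CUDA caching allocator is initialized,
# i.e. before the first CUDA tensor is allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1024, device="cuda")  # allocations now use expandable segments
```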
Checklist
Please feel free to remove inapplicable items for your PR.
Changes