[Feature][Performance][GPU] Introducing UnifiedTensor for efficient zero-copy host memory access from GPU #3086
Conversation
namespace impl {

template <typename DType, typename IdType>
NDArray IndexSelectCPUFromGPU(NDArray array, IdArray index);
Is this IndexSelectCPUToGPU or IndexSelectGPUFromCPU?
So the rule I've made here is that if I say FuncXFromY, then X is the memory location and Y is the operator (or processor). I think you mean it from the perspective of the actual data movement direction. If you think IndexSelectCPUToGPU is easier to understand, we can definitely change it.
We only need to document it clearly.
I think IndexSelectCPUFromGPU means selecting data in CPU according to the index from GPU and returning a GPU NDArray.
Can you add a doc string and also check the contexts of array and index using assertions?
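To make the intended semantics concrete, here is a hedged Python-level illustration (the real operator is the C++ template declared above; the function name and the pure-PyTorch emulation below are made up for this sketch):

import torch

def index_select_cpu_from_gpu(array, index):
    # Context checks of the kind requested above.
    assert not array.is_cuda, "array must reside in (pinned) CPU memory"
    assert index.is_cuda, "index must reside on the GPU"
    # Toy emulation only: the real kernel lets the GPU read the selected
    # rows directly from pinned host memory instead of round-tripping.
    return array[index.cpu()].to(index.device)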
#!/usr/bin/env python
# coding: utf-8

import ogb
ogb is not used.
In fact, this training example is an exact copy of https://github.com/dmlc/dgl/blob/master/examples/pytorch/ogb_lsc/MAG240M/train.py. Based on your comments, it seems several improvements could be made to the baseline DGL code. However, for consistency, it would be odd to make those updates only in this pull request. So I simply created a new example based on the commonly used GraphSAGE code (https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling.py). Someone else can update the baseline MAG240M DGL code later.
Yeah. In this PR we only focus on this example.
We can update the baseline code later in another PR.
import torch.nn.functional as F
import argparse

class RGAT(nn.Module):
Can we just use from .train_multi_gpus import RGAT?
Now the example code has been changed to GraphSAGE.
with tqdm.tqdm(train_dataloader) as tq:
    for i, (input_nodes, output_nodes, mfgs) in enumerate(tq):
        mfgs = [g.to('cuda') for g in mfgs]
        x = feats.gather_row(input_nodes.cuda()).float()
Can we use feats[input_nodes.cuda()]?
Furthermore, the dtype of feats can be inferred when we initialize the UnifiedTensor.
No problem. Both requests are reflected in the new GraphSAGE example.
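For concreteness, a sketch of the quoted loop with both suggestions applied (assuming feats is created with the source tensor's dtype, so the .float() cast is no longer needed):

with tqdm.tqdm(train_dataloader) as tq:
    for i, (input_nodes, output_nodes, mfgs) in enumerate(tq):
        mfgs = [g.to('cuda') for g in mfgs]
        # __getitem__ replaces gather_row; dtype is inherited from the source.
        x = feats[input_nodes.cuda()]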
for i, (input_nodes, output_nodes, mfgs) in enumerate(tqdm.tqdm(valid_dataloader)):
    with torch.no_grad():
        mfgs = [g.to('cuda') for g in mfgs]
        x = feats.gather_row(input_nodes.cuda()).float()
Same here.
Reflected in the newer example.
python/dgl/contrib/unified_tensor.py
        self._array = None
        self._input = None

    def gather_row(self, index):
Can we merge gather_row with __getitem__?
No problem. It should be merged now.
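A toy, self-contained stand-in (pure PyTorch; the class name and internals are invented for illustration) showing what the merged interface looks like, i.e. indexing via __getitem__ instead of a separate gather_row call:

import torch

class UnifiedTensorSketch:
    """Toy stand-in for UnifiedTensor; it only mimics the indexing API."""
    def __init__(self, input, device):
        self._input = input.pin_memory()  # pinned host memory, GPU-addressable
        self._device = device

    def __getitem__(self, index):
        # gather_row is folded into __getitem__: rows are selected by a GPU
        # index tensor and materialized on the GPU. The real class reads the
        # host memory directly from the GPU; this toy copies instead.
        return self._input[index.cpu()].to(self._device)

feats = UnifiedTensorSketch(torch.randn(8, 4), torch.device('cuda'))
x = feats[torch.tensor([0, 3, 7]).cuda()]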
In general, can you add some unit tests for UnifiedTensor?
I added a unit test for UnifiedTensor.
Overall LGTM. Left some comments.
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
I added a comment and also the asserts for the IndexSelectCPUFromGPU.
Hi @davidmin7, thanks for your great work!
But here I'm a little confused as to why the GPU memory usage is lower when using UnifiedTensor than with the original way (11MB vs 512MB before inference / 351MB vs 514MB after the first inference)? Looking forward to your reply.
Hi @maqy1995, thank you for your interest in using UnifiedTensor. Fortunately, @yaox12 rewrote the example here: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling.py. You can use the rewritten example there.
@maqy1995 The GPU memory usage reported in the training script is not accurate. It only counts the GPU memory allocated by PyTorch. Use nvidia-smi to check the actual usage.
Thanks for your quick reply @davidmin7 @yaox12.
I also used nvidia-smi to check; the possible reason may be on PyTorch's side.
Anyway, I'll try to use the latest example. Thanks again for your replies.
Hi @maqy1995, my guess is that PyTorch is reserving some extra space in the non-UnifiedTensor method. Unfortunately, I don't know what exactly would trigger it.
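For reference, the difference between these counters can be seen with standard PyTorch APIs (nothing below is specific to this PR):

import torch

# memory_allocated() counts only live PyTorch tensors; memory_reserved()
# also includes the caching allocator's pool. nvidia-smi additionally sees
# the CUDA context and any non-PyTorch allocations in the process.
print('allocated:', torch.cuda.memory_allocated() / 2**20, 'MiB')
print('reserved: ', torch.cuda.memory_reserved() / 2**20, 'MiB')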
Description
Unlike conventional NN training, GNN training often involves fetching input features in a non-contiguous, fine-grained manner. This hardly matters when the whole input feature tensor fits into GPU memory, but it becomes a delicate issue with much larger input datasets.
As of now, the most common practice to resolve this issue is to use the CPU to slice the original node feature tensor and then send the resulting tensor to GPU memory. However, the volume of data moved during this "reshaping" process is non-negligible, because the CPU needs to first read from the original tensor (1st memory access) and write to a temporary buffer (2nd memory access), and finally the content of the temporary buffer is DMAed to the GPU (3rd memory access). This means that to send X bytes of data to the GPU, we end up accessing 3X bytes of memory.
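In PyTorch terms, the conventional path looks roughly like this (tensor names and sizes are placeholders):

import torch

feats_cpu = torch.randn(1000, 128)           # node features resident in CPU memory
input_nodes = torch.randint(0, 1000, (64,))  # minibatch node IDs

# The CPU gathers the rows into a temporary tensor: one read of the
# original tensor plus one write of a temporary buffer (accesses 1 and 2).
x_tmp = feats_cpu[input_nodes]
# The temporary buffer is then DMAed to the GPU (access 3).
x = x_tmp.to('cuda')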
To overcome this issue, this PR introduces the zero-copy access capability of NVIDIA GPUs. Modern NVIDIA GPUs are capable of accessing CPU memory directly. If the GPU can directly access CPU memory, the only thing we need to send is the list of node feature indices, which is much smaller than the node features themselves. Once the indices are on the GPU, it can freely access the node features without involving the CPU.
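With this PR, the zero-copy path looks roughly as follows (a sketch based on the examples in this PR, with placeholder tensor names):

import torch
import dgl.contrib

node_feats = torch.randn(1000, 128)          # features stay in CPU memory
input_nodes = torch.randint(0, 1000, (64,))

# Pin the feature tensor and map it into the GPU address space.
feats = dgl.contrib.UnifiedTensor(node_feats, device=torch.device('cuda'))

# Only the small index tensor travels to the GPU; the feature rows are
# read by the GPU directly from pinned host memory (zero-copy).
x = feats[input_nodes.cuda()]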
In the case of training GraphSAGE on the Reddit dataset, this method reduces the training epoch time from 7.4s to 4.2s when the node features are kept in CPU memory.
This PR is mainly composed of two parts: first, adding the capability to pin CPU memory and map it into the GPU address space; second, adding the cross-device operators/functions. From the user perspective, both parts are simply packed into the new UnifiedTensor class.

Opting in to this feature is completely optional. Unless users actively declare the UnifiedTensor class in their code, this PR should have no side effects.

Checklist
Please feel free to remove inapplicable items for your PR.
Changes