
[Feature][Performance][GPU] Introducing UnifiedTensor for efficient zero-copy host memory access from GPU #3086

Merged · 20 commits · Jul 16, 2021

Conversation

@davidmin7 (Contributor) commented Jul 2, 2021

Description

Unlike conventional NN training, GNN training often involves fetching input features in a non-contiguous, fine-grained manner. This hardly matters when the whole input feature tensor fits into GPU memory, but it becomes a delicate issue with much larger input datasets.

As of now, the most common way to work around this is to use the CPU to slice the original node feature tensor and then send the resulting tensor to GPU memory. However, the volume of data moved during this "reshaping" process can be non-negligible, because the CPU first reads from the original tensor (1st memory access) and writes to a temporary buffer (2nd memory access), and finally the content of the temporary buffer is DMAed to the GPU (3rd memory access). This means that to send X bytes of data to the GPU, we end up accessing 3X bytes of memory.
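
As a rough PyTorch illustration of that conventional path (the names here are illustrative, not taken from this PR):

import torch

feat_cpu = torch.randn(1_000_000, 128)              # node features kept in host memory
idx = torch.randint(0, feat_cpu.shape[0], (1024,))  # minibatch node IDs

batch = feat_cpu[idx]                    # CPU reads the rows (1st access) and writes a temporary buffer (2nd access)
x = batch.to('cuda', non_blocking=True)  # the temporary buffer is DMAed to the GPU (3rd access)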

To overcome this issue, this PR introduces the zero-copy access capability of NVIDIA GPUs. Modern NVIDIA GPUs can access CPU memory directly. If the GPU can read CPU memory directly, all we need to send is the list of node feature indices, which is much smaller than the node features themselves. Once the indices are on the GPU, the GPU can freely access the node features without bothering the CPU.

When training GraphSAGE on the Reddit dataset with the node features kept in CPU memory, this method reduces the training epoch time from 7.4s to 4.2s.

This PR is mainly composed of two parts: first, the capability to pin CPU memory and map it into the GPU address space; second, cross-device operators/functions. From the user's perspective, both parts are packed into the new UnifiedTensor class.

Opting in to this feature is entirely optional. Unless users actively declare the UnifiedTensor class in their code, this PR should have no side effects.
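
As a minimal sketch of the intended opt-in usage (assuming, as the example code in this PR suggests, that the class lives under dgl.contrib and that the constructor takes the wrapped CPU tensor plus a target GPU device):

import torch
import dgl

feat_cpu = torch.randn(1_000_000, 128)  # node features stay in host memory
feats = dgl.contrib.UnifiedTensor(feat_cpu, device=torch.device('cuda'))

input_nodes = torch.randint(0, feat_cpu.shape[0], (1024,))
x = feats[input_nodes.cuda()]            # only the small index tensor is copied; the GPU reads the rows in place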

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

  • Feature1: Add in-place memory pinning/unpinning mechanisms to the DGL NDArray using cudaHostRegister()/cudaHostUnregister() (see the sketch after this list)
  • Feature2: Introduce a "UVM" kernel category whose kernels can operate in a cross-device manner, e.g., a GPU directly reading node features stored in CPU memory
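
As context for Feature1, a rough PyTorch-level contrast (illustrative only; it does not show the exact Python surface of the new pinning API): torch.Tensor.pin_memory() copies a tensor into a freshly allocated pinned buffer, whereas cudaHostRegister() pins an existing allocation in place, which is what lets a huge host feature tensor be exposed to the GPU without being duplicated.

import torch

feat = torch.randn(1_000_000, 128)
pinned_copy = feat.pin_memory()  # allocates a second, pinned buffer and copies into it
# In contrast, the in-place mechanism added in this PR registers feat's existing memory
# with cudaHostRegister(), so no second copy of the feature tensor is created.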

@dgl-bot (Collaborator) commented Jul 2, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@davidmin7 marked this pull request as ready for review July 2, 2021 05:11
namespace impl {

template <typename DType, typename IdType>
NDArray IndexSelectCPUFromGPU(NDArray array, IdArray index);
Collaborator:

Is this IndexSelectCPUToGPU or IndexSelectGPUFromCPU?

Contributor Author:

The rule I've used here is: in FuncXFromY, X is the memory location of the data and Y is the operator (or processor). I think you are reading it from the perspective of the actual data-movement direction. If you think IndexSelectCPUToGPU is easier to understand, we can definitely change it.

Contributor:

We only need to document it clearly.
I think IndexSelectCPUFromGPU means selecting data residing in CPU memory according to indices coming from the GPU, and returning a GPU NDArray.
Can you add a doc string and also check the contexts of array and index with assertions?
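
For illustration, the intended semantics could be written in PyTorch roughly as follows (a sketch of the behavior only, not the actual C++ implementation):

import torch

array = torch.randn(1000, 16)            # resides in (pinned) host memory
index = torch.tensor([1, 5, 7]).cuda()   # indices already on the GPU

# The slow, copy-based equivalent of what IndexSelectCPUFromGPU returns directly:
out = array[index.cpu()].to('cuda')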

#!/usr/bin/env python
# coding: utf-8

import ogb
Contributor:

ogb is not used.

Contributor Author:

In fact this training example is an exact copy of https://github.com/dmlc/dgl/blob/master/examples/pytorch/ogb_lsc/MAG240M/train.py. Based on your comments, it seems several improvements could be made to the baseline DGL code, but for consistency it would be odd to make those updates only in this pull request. So I simply created a new example based on the commonly used GraphSAGE code (https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling.py). Someone else can update the baseline MAG240M DGL code later.

Contributor:

Yeah. In this PR we only focus on this example.
We can update the baseline code later in another PR.

import torch.nn.functional as F
import argparse

class RGAT(nn.Module):
Contributor:

Can we just use from .train_multi_gpus import RGAT?

Contributor Author:

The example code has now been changed to GraphSAGE.

with tqdm.tqdm(train_dataloader) as tq:
    for i, (input_nodes, output_nodes, mfgs) in enumerate(tq):
        mfgs = [g.to('cuda') for g in mfgs]
        x = feats.gather_row(input_nodes.cuda()).float()
Contributor:

Can we use feats[input_nodes.cuda()]?
Furthermore, the dtype of feats can be inferred when we initialize the UnifiedTensor.
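
In other words, the suggestion is roughly (sketch):

# before: explicit gather plus a per-minibatch cast
x = feats.gather_row(input_nodes.cuda()).float()
# after: plain indexing, with the dtype fixed once when the UnifiedTensor is constructed
x = feats[input_nodes.cuda()]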

Contributor Author:

No problem. Both requests are reflected in the new GraphSAGE example.

for i, (input_nodes, output_nodes, mfgs) in enumerate(tqdm.tqdm(valid_dataloader)):
    with torch.no_grad():
        mfgs = [g.to('cuda') for g in mfgs]
        x = feats.gather_row(input_nodes.cuda()).float()
Contributor:

Same here.

Contributor Author:

Reflected in the newer example.

        self._array = None
        self._input = None

    def gather_row(self, index):
Contributor:

Can we merge gather_row with __getitem__ ?

Contributor Author:

No problem. It should be merged now.

src/array/cuda/uvm/array_index_select_uvm.cu (discussion resolved)
python/dgl/contrib/unified_tensor.py (discussion resolved)
@classicsong (Contributor):

In general, can you add some unit tests for UnifiedTensor?
You can add them here: https://github.com/dmlc/dgl/tree/master/tests/pytorch

@davidmin7 (Contributor Author):

In general, can you add some unit tests for UnifiedTensor?
You can add them here: https://github.com/dmlc/dgl/tree/master/tests/pytorch

I added a test_unified_tensor.py test in the directory you suggested.
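
A minimal shape such a test might take (a sketch only; the actual test_unified_tensor.py added in this PR may differ, and the constructor signature is assumed from the example code above):

import torch
import dgl

def test_unified_tensor():
    feat = torch.rand(100, 16)
    feat_uva = dgl.contrib.UnifiedTensor(feat, device=torch.device('cuda'))

    idx = torch.randint(0, 100, (32,)).cuda()
    # gathering through the UnifiedTensor must match plain CPU indexing
    assert torch.equal(feat[idx.cpu()], feat_uva[idx].cpu())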

@classicsong (Contributor) left a comment:

Overall LGTM.
Left some comments.


tests/pytorch/test_unified_tensor.py (discussion resolved)
davidmin7 and others added 2 commits July 12, 2021 12:19
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
@davidmin7 (Contributor Author) commented Jul 12, 2021

Overall LGTM.
Left some comments.

I added a doc comment and also the asserts for IndexSelectCPUFromGPU.

@classicsong merged commit 905c0aa into dmlc:master Jul 16, 2021
@davidmin7 deleted the dgl-unified-tensor branch August 10, 2021 14:16
@maqy1995 (Contributor) commented Feb 15, 2022

Hi @davidmin7, thanks for your great work!
I built DGL from the repo https://github.com/davidmin7/dgl.git and ran the GraphSAGE examples. UnifiedTensor indeed delivers a good speedup:

(work) maqy@ubuntu:/davidfork/dgl/examples/pytorch/graphsage$ python train_sampling.py --num-epochs 10 --data-cpu
Epoch 00000 | Step 00000 | Loss 6.1428 | Train Acc 0.0300 | Speed (samples/sec) nan | GPU 265.5 MB
Epoch 00000 | Step 00020 | Loss 2.8067 | Train Acc 0.3210 | Speed (samples/sec) 15635.5011 | GPU 512.2 MB
...
Eval Acc 0.9465
Test Acc: 0.9468
...
Epoch 00009 | Step 00140 | Loss 0.3134 | Train Acc 0.9180 | Speed (samples/sec) 15294.2769 | GPU 514.7 MB
Epoch Time(s): 10.4865
Avg epoch time: 10.435481643676757
----------------------------------------------- split line ---------------------------------------------------
(work) maqy@ubuntu:/davidfork/dgl/examples/pytorch/graphsage$ python train_sampling_unified_tensor.py --num-epochs 10 --data-cpu
Epoch 00000 | Step 00000 | Loss 6.0242 | Train Acc 0.0280 | Speed (samples/sec) nan | GPU 11.1 MB
Epoch 00000 | Step 00020 | Loss 3.0486 | Train Acc 0.2370 | Speed (samples/sec) 29305.7057 | GPU 11.6 MB
...
Eval Acc 0.9467
Test Acc: 0.9458
...
Epoch 00010 | Step 00140 | Loss 0.3327 | Train Acc 0.9270 | Speed (samples/sec) 29333.4603 | GPU 351.5 MB
Epoch Time(s): 5.5860
Avg epoch time: 5.597336053848267

But I'm a little confused as to why the GPU memory usage is lower when using UnifiedTensor than with the original approach (11 MB vs 512 MB before inference; 351 MB vs 514 MB after the first inference)?

Looking forward to your reply.

@davidmin7 (Contributor Author):

Hi, @maqy1995

Thank you for your interest in using UnifiedTensor. The example you used has several aggressive optimizations I added previously, so it may not be the best application for judging a stable memory-allocation status.

Fortunately, @yaox12 rewrote the example here: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling.py. You can enable UnifiedTensor by passing uva for the --graph-device (for the graph structure, such as the CSR) and --data-device (for the node feature tensor) arguments. I hope this example gives you a better view of the memory usage. This commit should already be in the nightly version of DGL, I think.
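
For example (a hypothetical invocation; the flag names are as described above):

python train_sampling.py --num-epochs 10 --graph-device uva --data-device uva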

@yaox12 (Collaborator) commented Feb 16, 2022

@maqy1995 The GPU memory usage reported in the training script is not accurate. It only counts the GPU memory allocated by PyTorch. Use the nvidia-smi command in the shell to get an accurate number.
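
For reference, the number printed by the script comes from PyTorch's own allocator counters, which exclude the CUDA context and anything not allocated through PyTorch (a sketch; the exact counter the script reads may differ):

import torch

print(torch.cuda.memory_allocated() / 2**20)  # MB currently held by PyTorch tensors (roughly what the script reports)
print(torch.cuda.memory_reserved() / 2**20)   # MB in PyTorch's caching allocator pool, usually larger
# nvidia-smi additionally counts the CUDA context and any non-PyTorch allocations.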

@maqy1995 (Contributor):

Thanks for the quick replies @davidmin7 @yaox12.

Use the nvidia-smi command in the shell to get an accurate number.

I also used the nvidia-smi command to check the GPU memory usage yesterday; it seems the usage is still somewhat lower with UnifiedTensor than with the original approach (by a few hundred MB).

The possible reason:

The example you used has several aggressive optimizations I added previously, so it may not be the best application for judging a stable memory-allocation status.

Anyway, I'll try to use the latest example. Thanks again for your replies.

@davidmin7 (Contributor Author):

Hi @maqy1995, my guess is that PyTorch reserves some extra space in the non-UnifiedTensor method. Unfortunately, I don't know exactly what triggers it.
