
[Feature][Performance][GPU] Introducing UnifiedTensor for efficient zero-copy host memory access from GPU #3086

Merged · 20 commits · Jul 16, 2021

Conversation

@davidmin7 (Contributor) commented Jul 2, 2021

Description

Unlike conventional NN training, GNN training often involves fetching input features in a non-contiguous, fine-grained manner. This hardly matters when the whole input feature tensor fits into GPU memory, but it becomes a delicate issue with much larger input datasets.

As of now, the most common way to work around this is to use the CPU to slice the original node feature tensor and then send the resulting tensor to GPU memory. However, the volume of data moved during this "reshaping" process can be non-negligible, because the CPU first reads from the original tensor (1st memory access) and writes to a temporary buffer (2nd memory access), and finally the content of the temporary buffer is DMAed to the GPU (3rd memory access). This means that to send X bytes of data to the GPU, we end up accessing 3X bytes of memory.
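
As a rough PyTorch illustration of that conventional path (the names here are illustrative, not taken from this PR):

import torch

feat_cpu = torch.randn(1_000_000, 128)              # node features kept in host memory
idx = torch.randint(0, feat_cpu.shape[0], (1024,))  # minibatch node IDs

batch = feat_cpu[idx]                    # CPU reads the rows (1st access) and writes a temporary buffer (2nd access)
x = batch.to('cuda', non_blocking=True)  # the temporary buffer is DMAed to the GPU (3rd access)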

To overcome this issue, this PR introduces the zero-copy access capability of NVIDIA GPUs. Modern NVIDIA GPUs can access CPU memory directly. If the GPU can read CPU memory directly, all we need to send is the list of node feature indices, which is much smaller than the node features themselves. Once the indices are on the GPU, the GPU can freely access the node features without bothering the CPU.

When training GraphSAGE on the Reddit dataset with the node features kept in CPU memory, this method reduces the training epoch time from 7.4s to 4.2s.

This PR is mainly composed of two parts: first, the capability to pin CPU memory and map it into the GPU address space; second, cross-device operators/functions. From the user's perspective, both parts are packed into the new UnifiedTensor class.

Opting in to this feature is entirely optional. Unless users actively declare the UnifiedTensor class in their code, this PR should have no side effects.
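
As a minimal sketch of the intended opt-in usage (assuming, as the example code in this PR suggests, that the class lives under dgl.contrib and that the constructor takes the wrapped CPU tensor plus a target GPU device):

import torch
import dgl

feat_cpu = torch.randn(1_000_000, 128)  # node features stay in host memory
feats = dgl.contrib.UnifiedTensor(feat_cpu, device=torch.device('cuda'))

input_nodes = torch.randint(0, feat_cpu.shape[0], (1024,))
x = feats[input_nodes.cuda()]            # only the small index tensor is copied; the GPU reads the rows in place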

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

  • Feature1: Add in-place memory pinning/unpinning mechanisms to the DGL NDArray using cudaHostRegister()/cudaHostUnregister() (see the sketch after this list)
  • Feature2: Introduce a "UVM" kernel category whose kernels can operate in a cross-device manner, e.g., a GPU directly reading node features stored in CPU memory
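
As context for Feature1, a rough PyTorch-level contrast (illustrative only; it does not show the exact Python surface of the new pinning API): torch.Tensor.pin_memory() copies a tensor into a freshly allocated pinned buffer, whereas cudaHostRegister() pins an existing allocation in place, which is what lets a huge host feature tensor be exposed to the GPU without being duplicated.

import torch

feat = torch.randn(1_000_000, 128)
pinned_copy = feat.pin_memory()  # allocates a second, pinned buffer and copies into it
# In contrast, the in-place mechanism added in this PR registers feat's existing memory
# with cudaHostRegister(), so no second copy of the feature tensor is created.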

@dgl-bot (Collaborator) commented Jul 2, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@davidmin7 marked this pull request as ready for review July 2, 2021 05:11
namespace impl {

template <typename DType, typename IdType>
NDArray IndexSelectCPUFromGPU(NDArray array, IdArray index);
Collaborator:

Is this IndexSelectCPUToGPU or IndexSelectGPUFromCPU?

Contributor Author:

The rule I've used here is: in FuncXFromY, X is the memory location of the data and Y is the operator (or processor). I think you are reading it from the perspective of the actual data-movement direction. If you think IndexSelectCPUToGPU is easier to understand, we can definitely change it.

Contributor:

We only need to document it clearly.
I think IndexSelectCPUFromGPU means selecting data residing in CPU memory according to indices coming from the GPU, and returning a GPU NDArray.
Can you add a doc string and also check the contexts of array and index with assertions?
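
For illustration, the intended semantics could be written in PyTorch roughly as follows (a sketch of the behavior only, not the actual C++ implementation):

import torch

array = torch.randn(1000, 16)            # resides in (pinned) host memory
index = torch.tensor([1, 5, 7]).cuda()   # indices already on the GPU

# The slow, copy-based equivalent of what IndexSelectCPUFromGPU returns directly:
out = array[index.cpu()].to('cuda')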

#!/usr/bin/env python
# coding: utf-8

import ogb
Contributor:

ogb is not used.

Contributor Author:

In fact this training example is an exact copy of https://github.com/dmlc/dgl/blob/master/examples/pytorch/ogb_lsc/MAG240M/train.py. Based on your comments, it seems several improvements could be made to the baseline DGL code, but for consistency it would be odd to make those updates only in this pull request. So I simply created a new example based on the commonly used GraphSAGE code (https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling.py). Someone else can update the baseline MAG240M DGL code later.

Contributor:

Yeah. In this PR we only focus on this example.
We can update the baseline code later in another PR.

import torch.nn.functional as F
import argparse

class RGAT(nn.Module):
Contributor:

Can we just use from .train_multi_gpus import RGAT?

Contributor Author:

The example code has now been changed to GraphSAGE.

with tqdm.tqdm(train_dataloader) as tq:
    for i, (input_nodes, output_nodes, mfgs) in enumerate(tq):
        mfgs = [g.to('cuda') for g in mfgs]
        x = feats.gather_row(input_nodes.cuda()).float()
Contributor:

Can we use feats[input_nodes.cuda()]?
Furthermore, the dtype of feats can be inferred when we initialize the UnifiedTensor.
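
In other words, the suggestion is roughly (sketch):

# before: explicit gather plus a per-minibatch cast
x = feats.gather_row(input_nodes.cuda()).float()
# after: plain indexing, with the dtype fixed once when the UnifiedTensor is constructed
x = feats[input_nodes.cuda()]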

Contributor Author:

No problem. Both requests are reflected in the new GraphSAGE example.

for i, (input_nodes, output_nodes, mfgs) in enumerate(tqdm.tqdm(valid_dataloader)):
    with torch.no_grad():
        mfgs = [g.to('cuda') for g in mfgs]
        x = feats.gather_row(input_nodes.cuda()).float()
Contributor:

Same here.

Contributor Author:

Reflected in the newer example.

        self._array = None
        self._input = None

    def gather_row(self, index):
Contributor:

Can we merge gather_row with __getitem__ ?

Contributor Author:

No problem. It should be merged now.

src/array/cuda/uvm/array_index_select_uvm.cu (discussion resolved)
python/dgl/contrib/unified_tensor.py (discussion resolved)
@classicsong (Contributor):

In general, can you add some unit tests for UnifiedTensor?
You can add them here: https://github.com/dmlc/dgl/tree/master/tests/pytorch

@davidmin7 (Contributor Author):

In general, can you add some unit tests for UnifiedTensor?
You can add them here: https://github.com/dmlc/dgl/tree/master/tests/pytorch

I added a test_unified_tensor.py test in the directory you suggested.
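
A minimal shape such a test might take (a sketch only; the actual test_unified_tensor.py added in this PR may differ, and the constructor signature is assumed from the example code above):

import torch
import dgl

def test_unified_tensor():
    feat = torch.rand(100, 16)
    feat_uva = dgl.contrib.UnifiedTensor(feat, device=torch.device('cuda'))

    idx = torch.randint(0, 100, (32,)).cuda()
    # gathering through the UnifiedTensor must match plain CPU indexing
    assert torch.equal(feat[idx.cpu()], feat_uva[idx].cpu())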

@classicsong (Contributor) left a comment:

Overall LGTM.
Left some comments.


tests/pytorch/test_unified_tensor.py (discussion resolved)
davidmin7 and others added 2 commits July 12, 2021 12:19
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
@davidmin7 (Contributor Author) commented Jul 12, 2021

Overall LGTM.
Left some comments.

I added a doc comment and also the asserts for IndexSelectCPUFromGPU.

@classicsong merged commit 905c0aa into dmlc:master Jul 16, 2021
@davidmin7 deleted the dgl-unified-tensor branch August 10, 2021 14:16
@maqy1995 (Contributor) commented Feb 15, 2022

Hi @davidmin7, thanks for your great work!
I built DGL from the repo https://github.com/davidmin7/dgl.git and ran the GraphSAGE examples. UnifiedTensor indeed delivers a good speedup:

(work) maqy@ubuntu:/davidfork/dgl/examples/pytorch/graphsage$ python train_sampling.py --num-epochs 10 --data-cpu
Epoch 00000 | Step 00000 | Loss 6.1428 | Train Acc 0.0300 | Speed (samples/sec) nan | GPU 265.5 MB
Epoch 00000 | Step 00020 | Loss 2.8067 | Train Acc 0.3210 | Speed (samples/sec) 15635.5011 | GPU 512.2 MB
...
Eval Acc 0.9465
Test Acc: 0.9468
...
Epoch 00009 | Step 00140 | Loss 0.3134 | Train Acc 0.9180 | Speed (samples/sec) 15294.2769 | GPU 514.7 MB
Epoch Time(s): 10.4865
Avg epoch time: 10.435481643676757
----------------------------------------------- split line ---------------------------------------------------
(work) maqy@ubuntu:/davidfork/dgl/examples/pytorch/graphsage$ python train_sampling_unified_tensor.py --num-epochs 10 --data-cpu
Epoch 00000 | Step 00000 | Loss 6.0242 | Train Acc 0.0280 | Speed (samples/sec) nan | GPU 11.1 MB
Epoch 00000 | Step 00020 | Loss 3.0486 | Train Acc 0.2370 | Speed (samples/sec) 29305.7057 | GPU 11.6 MB
...
Eval Acc 0.9467
Test Acc: 0.9458
...
Epoch 00010 | Step 00140 | Loss 0.3327 | Train Acc 0.9270 | Speed (samples/sec) 29333.4603 | GPU 351.5 MB
Epoch Time(s): 5.5860
Avg epoch time: 5.597336053848267

But I'm a little confused as to why the GPU memory usage is lower when using UnifiedTensor than with the original approach (11 MB vs 512 MB before inference; 351 MB vs 514 MB after the first inference)?

Looking forward to your reply.

@davidmin7 (Contributor Author):

Hi, @maqy1995

Thank you for your interest in using UnifiedTensor. The example you used has several aggressive optimizations I added previously, so it may not be the best application for judging a stable memory-allocation status.

Fortunately, @yaox12 rewrote the example here: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling.py. You can enable UnifiedTensor by passing uva for the --graph-device (for the graph structure, such as the CSR) and --data-device (for the node feature tensor) arguments. I hope this example gives you a better view of the memory usage. This commit should already be in the nightly version of DGL, I think.
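
For example (a hypothetical invocation; the flag names are as described above):

python train_sampling.py --num-epochs 10 --graph-device uva --data-device uva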

@yaox12 (Collaborator) commented Feb 16, 2022

@maqy1995 The GPU memory usage reported in the training script is not accurate. It only counts the GPU memory allocated by PyTorch. Use the nvidia-smi command in the shell to get an accurate number.
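
For reference, the number printed by the script comes from PyTorch's own allocator counters, which exclude the CUDA context and anything not allocated through PyTorch (a sketch; the exact counter the script reads may differ):

import torch

print(torch.cuda.memory_allocated() / 2**20)  # MB currently held by PyTorch tensors (roughly what the script reports)
print(torch.cuda.memory_reserved() / 2**20)   # MB in PyTorch's caching allocator pool, usually larger
# nvidia-smi additionally counts the CUDA context and any non-PyTorch allocations.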

@maqy1995 (Contributor):

Thanks for the quick replies @davidmin7 @yaox12.

Use the nvidia-smi command in the shell to get an accurate number.

I also used the nvidia-smi command to check the GPU memory usage yesterday; it seems the usage is still somewhat lower with UnifiedTensor than with the original approach (by a few hundred MB).

The possible reason:

The example you used has several aggressive optimizations I added previously, so it may not be the best application for judging a stable memory-allocation status.

Anyway, I'll try to use the latest example. Thanks again for your replies.

@davidmin7 (Contributor Author):

Hi @maqy1995, my guess is that PyTorch reserves some extra space in the non-UnifiedTensor method. Unfortunately, I don't know exactly what triggers it.
