
Add basic support for CUDA Graph #36190

Merged
merged 7 commits into from
Sep 29, 2021

Conversation

sneaxiy
Collaborator

@sneaxiy sneaxiy commented Sep 28, 2021

PR types

New features

PR changes

APIs

Describe

Add basic support for CUDA Graph, including:

  • Memory pool for CUDA Graph capturing. CUDA Graph capture needs to cache all Tensor addresses without calling any cudaFree.
  • Basic APIs such as CUDAGraph.capture_begin/capture_end/replay/reset. Note that these APIs are experimental and may change in the future.
  • Use CUDAGraphCaptureModeGuard to switch to cudaStreamCaptureModeRelaxed, so that cudaMalloc calls that are unsupported in the default capture mode do not fail during capturing.

Usage:

from paddle.device.cuda.graphs import CUDAGraph

input = ...  # define the input tensor; it must be created before `CUDAGraph.capture_begin()` is called

graph = CUDAGraph()
graph.capture_begin()
output = ...  # do some GPU operations here
graph.capture_end()

for _ in range(BATCH_NUM):
    input.copy_(input_tensor, False)  # input_tensor holds the model input for this batch, e.g. from a DataLoader
    graph.replay()
    print(output)

graph.reset()  # not required, but calling it releases the cached memory as soon as possible

TODO when using CUDA Graph:

  • Cache CuDNN descriptors. Otherwise, errors would be raised during capturing.
  • Disable the CuDNN exhaustive search. Otherwise, errors would be raised during capturing.
  • Disable FLAGS_sync_all_reduce when using distributed training. FLAGS_sync_all_reduce calls cudaStreamSynchronize, which is not supported during capturing.
  • Modify ParallelExecutor to support CUDA Graph.

@CLAassistant

CLAassistant commented Sep 28, 2021

CLA assistant check
All committers have signed the CLA.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.


void RemoveMemoryPoolOfCUDAGraph(CUDAGraphID id) {
  auto iter = cuda_graph_allocator_map_.find(id);
  PADDLE_ENFORCE_EQ(iter != cuda_graph_allocator_map_.end(), true,
Contributor

Can use PADDLE_ENFORCE_NE directly

Collaborator Author

Done.


 public:
  explicit CUDAGraphCaptureModeGuard(cudaStreamCaptureMode new_mode) {
    old_mode_ = new_mode;
Contributor

why old_mode_ = new_mode?

Collaborator Author

Changed the variable name and added some comments for better understanding.

}

~CUDAGraphCaptureModeGuard() PADDLE_MAY_THROW {
PADDLE_ENFORCE_CUDA_SUCCESS(
Contributor

Is it OK to raise an exception in the destructor?

Collaborator Author

Yes. Although it is not recommended to raise an exception in a destructor, I think we should not hide the exception. If an exception is raised in the destructor, std::terminate would be called to stop the process immediately.

Contributor

Yes. Although it is not recommended to raise an exception in a destructor, I think we should not hide the exception. If an exception is raised in the destructor, std::terminate would be called to stop the process immediately.

Maybe we need a PADDLE_WARN_CUDA_SUCCESS, etc.

Collaborator Author

Yes. Although it is not recommended to raise an exception in a destructor, I think we should not hide the exception. If an exception is raised in the destructor, std::terminate would be called to stop the process immediately.

Maybe we need a PADDLE_WARN_CUDA_SUCCESS, etc.

That is hard to do. Suppose we have a common method void func(). It may be called anywhere, inside or outside a destructor, but we can only write one of PADDLE_ENFORCE_CUDA_SUCCESS or PADDLE_WARN_CUDA_SUCCESS inside its implementation.

@@ -557,6 +558,7 @@ class RecordedCudaMallocHelper {
#ifdef PADDLE_WITH_HIP
auto result = hipMalloc(ptr, size);
#else
CUDAGraphCaptureModeGuard capture_mode_guard{cudaStreamCaptureModeRelaxed};
Contributor

IMHO, use this guard only when IsCUDAGraphCapturing is true.

Collaborator Author

Done.

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Sep 29, 2021
@PaddlePaddle PaddlePaddle unlocked this conversation Sep 29, 2021
@sneaxiy sneaxiy closed this Sep 29, 2021
@sneaxiy sneaxiy reopened this Sep 29, 2021
Contributor

@zhiqiu zhiqiu left a comment

LGTM

@sneaxiy sneaxiy merged commit 21b93c3 into PaddlePaddle:develop Sep 29, 2021
@sneaxiy sneaxiy deleted the add_cuda_graph_basic_support branch September 29, 2021 09:12