GPU under-utilized when waiting for CPU to launch kernel #8818
Comments
@panyx0718 I believe the GPU kernel is launched asynchronously. (See Paddle/paddle/fluid/operators/sgd_op.cu, lines 76 to 78 in 78c884d.)
For many computationally intensive Ops (conv, MatMul, etc.), I find it hard to believe that the CPU, which launches the kernel, is slower than the GPU, which executes it.
The GPU kernel is launched asynchronously, but the CPU is not launching kernels fast enough to keep the GPU full. As the timeline shows, SGD and elementwise_mul are much faster on the GPU than on the CPU.
Sure. What percentage of the total time do these ops make up?
I don't have exact numbers. Looking at the timeline, I estimate we could improve throughput by ~20% if we kept the GPU always busy (not just for sgd and elementwise_mul, but for the other ops as well).
I don't think launching a kernel costs much time; the GPU is "under-utilized" because each CUDA kernel processes so little data. The two figures illustrate this. A simple description: assume the GPU has a task queue (in fact, there is indeed one). In Figure 1, with a large amount of data, each kernel takes a long time, so the GPU may not have finished kernel1 by the time kernel2 is added to the queue, and the GPU timeline is compact. In Figure 2, with a small amount of data, each kernel takes very little time, so the task queue is probably empty most of the time.
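The task-queue model above can be sketched with a toy simulation (illustration only, not framework code; `gpu_busy_fraction`, `launch_time`, and `kernel_time` are hypothetical names). The CPU enqueues one kernel every `launch_time` seconds; the GPU runs each kernel for `kernel_time` seconds. When kernels outlast the launch gap the queue stays full; when they finish sooner, the GPU idles between launches:

```python
def gpu_busy_fraction(num_kernels, launch_time, kernel_time):
    """Fraction of elapsed time the GPU spends computing rather than idling."""
    gpu_free_at = 0.0  # time at which the GPU finishes its current work
    busy = 0.0
    for i in range(num_kernels):
        enqueued_at = i * launch_time          # CPU adds kernel i to the queue
        start = max(enqueued_at, gpu_free_at)  # wait for the GPU or for the CPU
        gpu_free_at = start + kernel_time
        busy += kernel_time
    return busy / gpu_free_at

# Figure 1 regime: large inputs, kernels outlast the launch gap -> queue full.
print(gpu_busy_fraction(100, launch_time=5e-6, kernel_time=50e-6))  # ~1.0
# Figure 2 regime: small inputs, kernels finish before the next launch -> idle GPU.
print(gpu_busy_fraction(100, launch_time=50e-6, kernel_time=5e-6))  # ~0.1
```

In the second case the GPU is busy only about 10% of the time even though every individual launch is cheap, which matches the "empty task queue" picture.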
I don't mean that the "launch kernel" call itself takes too long. I mean the GPU is waiting for the CPU to do all of its computations and then launch the kernel before it can start doing anything.
Hello, this issue has had no updates for nearly a month, and we will close it within the day. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by closing it. Thank you for your support of PaddlePaddle!
Currently we run Ops one by one, synchronously. For Ops that the GPU finishes quickly, the CPU is too slow at launching GPU kernels, so in many cases the GPU is under-utilized.
To mitigate this, we need to schedule Ops in parallel based on dependency information, so that we can better utilize both CPUs and GPUs.
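A minimal sketch of what dependency-based parallel scheduling could look like (a toy wavefront scheduler, not Paddle's actual executor; `run_parallel` and the op names are hypothetical). Every op whose dependencies have all finished is launched concurrently with the other ready ops:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(deps, execute, workers=4):
    """Run ops wave by wave; ops in the same wave execute concurrently.

    deps    : dict mapping each op name to the set of ops it depends on
    execute : callable invoked once per op
    Returns the list of waves, each a sorted list of op names.
    """
    finished = set()
    pending = dict(deps)  # op -> its dependency set
    waves = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            ready = [op for op, d in pending.items() if d <= finished]
            if not ready:
                raise ValueError("dependency cycle detected")
            list(pool.map(execute, ready))  # launch the whole wave in parallel
            for op in ready:
                del pending[op]
                finished.add(op)
            waves.append(sorted(ready))
    return waves

# Toy graph: mul1 and mul2 both depend only on conv, so they form one wave.
deps = {"conv": set(), "mul1": {"conv"}, "mul2": {"conv"}, "sgd": {"mul1", "mul2"}}
waves = run_parallel(deps, execute=lambda op: None)
print(waves)  # [['conv'], ['mul1', 'mul2'], ['sgd']]
```

With independent ops launched concurrently, one CPU thread's launch latency no longer serializes the whole timeline, which is the mitigation proposed above.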