Fix NCCLBcast hang up bug in Parallel Executor #11377
Conversation
2. Check the memory usage of ALL gpus rather than the first one
…clGroupEnd blocking the exception throwing 2. NOTE the usage of NCCLGroupGuard
@@ -63,7 +64,17 @@ ParallelExecutor::ParallelExecutor(
  member_->global_scope_ = scope;
  member_->use_cuda_ = exec_strategy.use_cuda_;

- // Step 1. Bcast the params to devs.
+ // Step 1. check all memory usages of all places.
So this code is not needed?
I think these messages should be printed as well, so I did not remove them.
@@ -41,6 +41,11 @@ inline ncclDataType_t ToNCCLDataType(std::type_index type) {
  }
}

// NOTE(minqiyang): according to the ncclGroupEnd documentations:
Well, I think we can assume that people who develop PaddlePaddle in this file are familiar with the NCCL primitives.
To keep this bug from happening again, I left these comments in.
platform::dynload::ncclBcast(buffers[i], numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());
}
member_->nccl_ctxs_->WaitAll();
This line may not be needed, since ncclGroupEnd will sync all group calls.
@chengduoZH Can you please help take a look at this?
@typhoonzero line 185 is necessary. ncclGroupEnd only ensures that a single thread can invoke the NCCL functions; it does not synchronize the calls. The NCCL invocations are asynchronous, so we need WaitAll() to make sure they have completed on the GPU side.
// This step will create BuddyAllocator for each place, which will
// 1. enforce that the place is available and NOT used.
// 2. avoid ncclBcast hanging up for NOT enough memory to use.
for (size_t i = 0; i < member_->places_.size(); ++i) {
Are these lines intended to let the process fail fast?
Yes, and to print some useful messages.
@@ -145,9 +156,9 @@ void ParallelExecutor::BCastParamsToGPUs(
  auto &dims = main_tensor.dims();
  if (paddle::platform::is_gpu_place(main_tensor.place())) {
#ifdef PADDLE_WITH_CUDA
I don't think the modification of lines 159-167 is necessary: if the memory of one Place is insufficient, the program will throw an exception at this line.
Actually, the thrown exception will not be handled properly; this PR was submitted to fix that bug.
size_t usage = memory::memory_usage(member_->places_[i]);
VLOG(4) << "Memory usage of device: " << member_->places_[i] << " is "
        << usage << " bytes";
}
This check should only be done once, so maybe this is not the right place for it.
Because the BuddyAllocator array is a function-local static at this line, the check can run many times in the ParallelExecutor constructor without initializing duplicate BuddyAllocators.
@@ -25,6 +25,7 @@ limitations under the License. */
#include "paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h"
#include "paddle/fluid/framework/details/ssa_graph_builder_factory.h"
#include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h"
#include "paddle/fluid/memory/malloc.h"
Remove this line.
@@ -63,7 +64,7 @@ ParallelExecutor::ParallelExecutor(
  member_->global_scope_ = scope;
  member_->use_cuda_ = exec_strategy.use_cuda_;

- // Step 1. Bcast the params to devs.
+ // Step 2. Bcast the params to devs.
==> Step 1.
This closes #11375.