Fix NCCLBcast hang up bug in Parallel Executor #11377
Conversation
2. Check the memory usage of ALL gpus rather than the first one
…clGroupEnd blocking the exception throwing 2. NOTE the usage of NCCLGroupGuard
@@ -63,7 +64,17 @@ ParallelExecutor::ParallelExecutor(
  member_->global_scope_ = scope;
  member_->use_cuda_ = exec_strategy.use_cuda_;

- // Step 1. Bcast the params to devs.
+ // Step 1. check all memory usages of all places.
So this code is not needed?
I think these messages should be printed as well, so I did not remove them.
@@ -41,6 +41,11 @@ inline ncclDataType_t ToNCCLDataType(std::type_index type) {
  }
}

// NOTE(minqiyang): according to the ncclGroupEnd documentations:
Well, I think we can assume that people who develop PaddlePaddle in this file are familiar with the NCCL primitives.
To keep this bug from happening again, I left these comments in.
platform::dynload::ncclBcast(buffers[i], numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());
}
member_->nccl_ctxs_->WaitAll();
This line may not be needed, since ncclGroupEnd will sync all group calls.
@chengduoZH Can you please help take a look at this?
@typhoonzero line 185 is necessary. ncclGroupEnd only ensures that a single thread can invoke the NCCL functions; it does not synchronize the calls. The NCCL invocations are asynchronous, so we need WaitAll() to make sure they have completed on the GPU side.
// This step will create BuddyAllocator for each place, which will
// 1. enforce that the place is available and NOT used.
// 2. avoid ncclBcast hanging up for NOT enough memory to use.
for (size_t i = 0; i < member_->places_.size(); ++i) {
Are these lines intended to let the process fail fast?
Yes, and to print some useful messages.
@@ -145,9 +156,9 @@ void ParallelExecutor::BCastParamsToGPUs(
  auto &dims = main_tensor.dims();
  if (paddle::platform::is_gpu_place(main_tensor.place())) {
#ifdef PADDLE_WITH_CUDA
I don't think the modification of lines 159-167 is necessary: if the memory of one Place is insufficient, the program will throw an exception at this line.
Actually, the thrown exception will not be handled properly; this PR was submitted to fix that bug.
size_t usage = memory::memory_usage(member_->places_[i]);
VLOG(4) << "Memory usage of device: " << member_->places_[i] << " is "
        << usage << " bytes";
}
This check should only be done once, so maybe this is not the right place for it.
Because the BuddyAllocator array is a function-local static at this line, the check can run many times in the ParallelExecutor constructor without initializing duplicate BuddyAllocators.
@@ -25,6 +25,7 @@ limitations under the License. */
#include "paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h"
#include "paddle/fluid/framework/details/ssa_graph_builder_factory.h"
#include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h"
#include "paddle/fluid/memory/malloc.h"
Remove this line.
@@ -63,7 +64,7 @@ ParallelExecutor::ParallelExecutor(
  member_->global_scope_ = scope;
  member_->use_cuda_ = exec_strategy.use_cuda_;

- // Step 1. Bcast the params to devs.
+ // Step 2. Bcast the params to devs.
==> Step 1.
This closes #11375.