
timeline for distributed training #10380

Merged 4 commits into PaddlePaddle:develop on May 7, 2018

Conversation

panyx0718 (Contributor):

One trainer task is enabled for profiling. The trainer tells the PS servers to start profiling, and when the trainer stops profiling it informs the PS servers as well. They all generate profile files that can be converted to a timeline.
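
For context, a minimal sketch of how a trainer script might drive this (hedged: the profiler context manager, arguments, and paths here are assumptions based on the fluid profiler API of that era, not code from this PR):

import paddle.fluid.profiler as profiler

def maybe_profiled(args, run_training):
    # Only trainer 0 turns the profiler on; the on/off state piggybacks
    # on the send/recv RPCs, so every PS server starts and stops with it
    # and dumps its own /tmp/profile_ps_* file.
    if args.profile and args.task_index == 0:
        with profiler.profiler('All', 'total', '/tmp/profile_trainer'):
            run_training()
    else:
        run_training()

The resulting profile files can then be converted to a timeline JSON (viewable in chrome://tracing) with the repository's tools/timeline.py; the exact flags (--profile_path, --timeline_path) are an assumption here.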


@@ -33,7 +33,7 @@ ExternalProject_Add(
extern_grpc
DEPENDS protobuf zlib
GIT_REPOSITORY "https://github.com/grpc/grpc.git"
GIT_TAG "v1.10.x"
GIT_TAG "v1.8.x"
Contributor:

I'm still looking into issue #10153; version 1.10.x should have some performance boost compared to 1.8.x. If the machine you are using encounters this issue, can you please test on another one?

panyx0718 (Author):

Done

// If true, the ps server will start profiling. The ps server stops
// profiling and generates a profile to /tmp/profile_ps_* when profile
// switches from true to false.
bool profile = 11;
Contributor:

Can we just use an environment variable to determine whether to run the profiler?

panyx0718 (Author), May 4, 2018:

With an environment variable it would be difficult to start and stop profiling on multiple machines at the same time.

Contributor:

I see, that makes sense! Thanks for this very useful tool!
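
To make the agreed-upon design concrete, here is a hedged sketch of the edge-triggered behavior this field implies on the PS side (all names are illustrative, not from the PR):

def start_profiler():
    # Stand-in for the real profiler start hook.
    print('profiler started')

def stop_profiler(path):
    # Stand-in for the real profiler stop-and-dump hook.
    print('profile written to', path)

class ProfileListener:
    """Watches the profile bit on every request arriving from the trainer."""

    def __init__(self, listener_id):
        self.listener_id = listener_id
        self.profiling = False

    def on_request(self, profile_bit):
        if profile_bit and not self.profiling:
            # false -> true edge: the trainer started profiling.
            self.profiling = True
            start_profiler()
        elif not profile_bit and self.profiling:
            # true -> false edge: stop and dump to /tmp/profile_ps_*.
            self.profiling = False
            stop_profiler('/tmp/profile_ps_%d' % self.listener_id)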

typhoonzero previously approved these changes on May 4, 2018.

@@ -294,6 +295,8 @@ void ListenAndServOp::RunAsyncLoop(framework::Executor *executor,

void ListenAndServOp::RunImpl(const framework::Scope &scope,
const platform::Place &dev_place) const {
// Mark this as a PS, so it should decide profiling by listening to the trainer.
platform::SetProfileLisener();
Contributor:

SetProfileLisener => SetProfileListener

panyx0718 (Author):

Done

@@ -45,6 +46,13 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
void* payload = nullptr;
size_t payload_size;
ProtoEncodeHelper e(static_cast<char*>(buf), 1024);
// Note: normally the profiler is enabled in 1 trainer, hence only
Member:

How do we ensure that only one trainer has the profiler enabled?

panyx0718 (Author):

See the vgg_fluid example; we use args.task_index == 0 to ensure this.

@@ -196,14 +198,28 @@ def train_loop(exe, trainer_prog):
feed={"pixel": img_data,
"label": y_data},
fetch_list=[avg_cost, batch_acc, batch_size])
return loss, acc, b_size

if args.profile and args.task_index == 0:
Member:

Maybe we can use trainer_id == 0 here.

panyx0718 (Author):

I think task_index and trainer_id mean the same thing here? vgg_fluid.py only has task_index, but no trainer_id.

return tag;
}
meta_.set_profile(profiling);
int64_t lisner_id = platform::ListenerId();
Member:

lisner_id => listener_id

panyx0718 (Author):

Done
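
Putting the sender side in one place: a hedged Python analogue of what the serialization step above does (illustrative names and types, not the PR's C++):

from dataclasses import dataclass

@dataclass
class VariableMessageMeta:
    # Illustrative subset of the sendrecv proto message fields.
    varname: str = ''
    profile: bool = False

def attach_profile_state(meta, profiling):
    # Every outgoing variable message carries the trainer's current
    # profiler state, so no separate control RPC is needed; the PS
    # reacts to the true -> false edge (see the listener sketch above).
    meta.profile = profiling
    return meta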

jacquesqiao (Member) left a comment:

LGTM!

@panyx0718 merged commit dce0732 into PaddlePaddle:develop on May 7, 2018