timeline for distributed training #10380
cmake/external/grpc.cmake
@@ -33,7 +33,7 @@ ExternalProject_Add(
    extern_grpc
    DEPENDS protobuf zlib
    GIT_REPOSITORY "https://github.com/grpc/grpc.git"
    GIT_TAG "v1.10.x"
    GIT_TAG "v1.8.x"
I'm still looking into issue #10153. Version 1.10.x should have some performance boost compared to 1.8.x. If the machine you are using encounters this issue, can you please use another one for testing?
Done
// If true, the ps server will start profiling, the ps
// server stops profiling and generates a profile to /tmp/profile_ps_*
// when profile switches from true to false.
bool profile = 11;
Can we just use an environment variable to determine whether to run the profiler?
With an environment variable, it is difficult to start and stop profiling on multiple machines at the same time.
I see, that makes sense! Thanks for this very useful tool!
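To illustrate the behavior described in the proto comment above, here is a minimal Python sketch of the switch on the PS side: start profiling when the flag goes from false to true, and stop and write a profile when it goes from true to false. The class name, the `server_id` suffix appended to /tmp/profile_ps_, and the recorded event strings are hypothetical and not taken from this PR; the real logic lives in the C++ profiler.

```python
class PsProfileSwitch:
    """Hypothetical PS-side tracker for the `profile` flag carried in each message."""

    def __init__(self, path_prefix="/tmp/profile_ps_"):
        self.path_prefix = path_prefix
        self.profiling = False
        # Records what a real server would do, so the sketch has no side effects.
        self.events = []

    def on_message(self, profile_flag, server_id=0):
        if profile_flag and not self.profiling:
            # false -> true: begin collecting profile data.
            self.profiling = True
            self.events.append("start")
        elif not profile_flag and self.profiling:
            # true -> false: stop and write the profile file.
            # The file suffix is an assumption; the PR only gives /tmp/profile_ps_*.
            self.profiling = False
            self.events.append("dump " + self.path_prefix + str(server_id))
        # Any other combination repeats the current state, so nothing happens.


switch = PsProfileSwitch()
for flag in [False, True, True, False, False]:
    switch.on_message(flag)
print(switch.events)
```

Because the switch reacts only to changes of the flag, repeated messages with the same value are harmless, which is what lets the trainer drive several PS servers at once.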
@@ -294,6 +295,8 @@ void ListenAndServOp::RunAsyncLoop(framework::Executor *executor,

void ListenAndServOp::RunImpl(const framework::Scope &scope,
                              const platform::Place &dev_place) const {
  // Mark this as PS that it should decide profiling by listening from trainer.
  platform::SetProfileLisener();
SetProfileLisener => SetProfileListener
Done
@@ -45,6 +46,13 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
  void* payload = nullptr;
  size_t payload_size;
  ProtoEncodeHelper e(static_cast<char*>(buf), 1024);
  // Note: normally the profiler is enabled in 1 trainer, hence only
How do we ensure that the profiler is enabled in only one trainer?
See the vgg_fluid example. We can use args.task_index == 0 to verify this.
@@ -196,14 +198,28 @@ def train_loop(exe, trainer_prog):
                feed={"pixel": img_data,
                      "label": y_data},
                fetch_list=[avg_cost, batch_acc, batch_size])
            return loss, acc, b_size

        if args.profile and args.task_index == 0:
Maybe we can use trainer_id == 0 here.
I think task_index and trainer_id mean the same thing here? vgg_fluid.py only has task_index, but no trainer_id.
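As a rough illustration of the check discussed above, the following Python sketch gates profiling on the trainer whose task index is 0, so that at most one trainer turns the profiler on. The flag names mirror args.profile and args.task_index from the diff; the `parse_args` and `should_profile` helpers are hypothetical and not part of vgg_fluid.py.

```python
import argparse


def parse_args(argv):
    # Hypothetical subset of the trainer's command-line options.
    parser = argparse.ArgumentParser()
    parser.add_argument("--profile", action="store_true")
    parser.add_argument("--task_index", type=int, default=0)
    return parser.parse_args(argv)


def should_profile(args):
    # Only the trainer with task_index 0 enables the profiler.
    return bool(args.profile) and args.task_index == 0


for idx in range(3):
    args = parse_args(["--profile", "--task_index", str(idx)])
    print(idx, should_profile(args))
```

Every trainer can be launched with the same --profile flag; the task index alone decides which one actually profiles.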
    return tag;
  }
  meta_.set_profile(profiling);
  int64_t lisner_id = platform::ListenerId();
lisner_id => listener_id
Done
LGTM!
One trainer task is enabled for profiling. The trainer tells the PS servers to start profiling. When the trainer stops profiling, it also informs the PS servers. They all generate profile files that can be converted to a timeline.
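The flow in this summary can be sketched from the trainer's side as a small Python simulation: entering a profiling region tells every PS server to start, and leaving it tells them to stop and write their profiles. `FakePs` and `distributed_profile` are hypothetical stand-ins. In the PR the flag appears to travel inside the serialized messages (meta_.set_profile(profiling)) rather than through an explicit broadcast, so treat this as a simplification.

```python
from contextlib import contextmanager


class FakePs:
    """Hypothetical stand-in for a PS server that only tracks the profile flag."""

    def __init__(self, name):
        self.name = name
        self.profiling = False
        self.dumps = 0  # number of profile files this server would have written

    def receive(self, profile):
        # A true -> false transition is where a real server would write its profile.
        if self.profiling and not profile:
            self.dumps += 1
        self.profiling = profile


@contextmanager
def distributed_profile(servers):
    # Tell every PS server to start profiling.
    for ps in servers:
        ps.receive(profile=True)
    try:
        yield
    finally:
        # Tell every PS server to stop, even if the training step raised.
        for ps in servers:
            ps.receive(profile=False)


servers = [FakePs("ps0"), FakePs("ps1")]
with distributed_profile(servers):
    during = [ps.profiling for ps in servers]
after = [(ps.profiling, ps.dumps) for ps in servers]
print(during, after)
```

Driving all servers from one trainer keeps their profiling windows aligned, which is what makes the per-machine profiles comparable on a single timeline.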