Fault tolerant Paddle Cloud metrics ideas #262

helinwang · 2017-08-01T23:54:53Z

We need data points to justify that fault-tolerant training managed by k8s is the way to go for the industry. We also need to collect critical metrics for blog posts and conferences.

Here are some initial ideas.

- A real-world job / experiment showing that k8s and fault tolerance can achieve high computing resource utilization and fair resource distribution across teams.
  - Can we kill some nodes when new users submit jobs while the cluster is fully loaded?
  - Experiment: during the day, production load takes more nodes from the training job; at night, the reverse.
- Optimal minimum pserver / trainer ratio for different model sizes.
- ASGD download / upload step size vs. time to converge: a smaller download / upload step size puts less burden on networking (less time per step), but adds more noise to the gradient (more steps to converge).
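
To make these ideas measurable, here is a minimal sketch of how the numbers could be computed once raw samples are collected. All function names and the sample layout are hypothetical and not part of Paddle Cloud. Jain's fairness index is just one possible choice for "fair resource distribution"; the issue does not name a fairness measure. The last helper only encodes the trade-off stated above: total time to converge is time per step multiplied by the number of steps.

```python
from typing import Dict, List, Tuple


def utilization(samples: List[Tuple[float, float]]) -> float:
    """Mean fraction of cluster capacity in use.

    Each sample is (allocated, capacity), e.g. CPU cores or GPUs at one
    polling instant. How the samples are gathered is left open here.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return sum(allocated / capacity for allocated, capacity in samples) / len(samples)


def jain_fairness(share_per_team: Dict[str, float]) -> float:
    """Jain's fairness index over per-team resource shares.

    Equals 1.0 when every team gets the same share and 1/n when a single
    team gets everything (n = number of teams).
    """
    shares = list(share_per_team.values())
    sum_of_squares = sum(s * s for s in shares)
    if sum_of_squares == 0:
        raise ValueError("at least one team must hold a non-zero share")
    total = sum(shares)
    return total * total / (len(shares) * sum_of_squares)


def time_to_converge(seconds_per_step: float, steps_to_converge: int) -> float:
    """Wall-clock time to converge for one measured ASGD configuration."""
    return seconds_per_step * steps_to_converge


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(utilization([(50, 100), (100, 100)]))
    print(jain_fairness({"team-a": 30, "team-b": 30}))
    print(time_to_converge(0.5, 1000))
```

Running the same helpers over a day-time and a night-time window, or over several download / upload step sizes, would give directly comparable figures for the experiments listed above.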

helinwang assigned helinwang, jacquesqiao, dzhwinter, typhoonzero, Yancey1989 and gongweibao Aug 1, 2017

helinwang mentioned this issue Aug 2, 2017

Distributed PaddlePaddle Blog PaddlePaddle/Paddle#2930

Closed

Yancey1989 added need be discussed and removed need be discussed labels Aug 3, 2017
