Fault tolerant Paddle Cloud metrics ideas #262

Open
helinwang opened this issue Aug 1, 2017 · 0 comments

We need data points to justify that fault-tolerant training managed by k8s is the right direction for the industry. We also need to collect key metrics for blog posts and conference talks.

Here are some initial ideas.

  • A real-world job or experiment showing that k8s and fault tolerance can achieve high computing resource utilization and fair resource distribution across teams.
    • Can we kill some nodes when new users submit jobs while the cluster is fully loaded?
    • Experiment: during the daytime, production load takes more nodes from training jobs; at night, the reverse.
  • Optimal minimum pserver / trainer ratio for different model sizes.
  • ASGD download / upload step size vs. time to converge: a smaller download / upload step size puts less burden on the network (less time per step) but adds more noise to the gradient (more steps to converge).
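
For the first idea, a rough sketch of how "utilization" and "fair distribution across teams" could be turned into numbers. This is only one possible choice, not something decided in this issue: utilization is taken as used capacity over total capacity, and fairness uses Jain's fairness index, (Σx)² / (n · Σx²), over per-team allocations. The function names, team names, and figures are hypothetical.

```python
from typing import Dict, Iterable


def utilization(used_by_team: Dict[str, float], capacity: float) -> float:
    """Fraction of total cluster capacity currently in use (0.0 to 1.0)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return sum(used_by_team.values()) / capacity


def jain_fairness(allocations: Iterable[float]) -> float:
    """Jain's fairness index: 1.0 when all shares are equal, 1/n when one team holds everything."""
    xs = list(allocations)
    if not xs:
        raise ValueError("need at least one allocation")
    total = sum(xs)
    sum_sq = sum(x * x for x in xs)
    if sum_sq == 0:
        # Nobody holds anything, which is trivially an equal split.
        return 1.0
    return (total * total) / (len(xs) * sum_sq)


# Hypothetical snapshot: GPUs held by each team on a 100-GPU cluster.
snapshot = {"team_a": 30.0, "team_b": 30.0, "team_c": 30.0}
cluster_utilization = utilization(snapshot, capacity=100.0)
cluster_fairness = jain_fairness(snapshot.values())
```

Sampling both values periodically during the kill-node and day/night experiments would give a time series to compare against a cluster without fault-tolerant jobs.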