We need data points to justify to the industry that fault-tolerant training managed by Kubernetes is the way to go. We also need to collect critical metrics for blog posts and conferences.
Here are some initial ideas:
- A real-world job / experiment that shows Kubernetes plus fault tolerance can achieve high computing resource utilization and fair resource distribution across teams.
- Can we kill some nodes when new users submit jobs while the cluster is fully loaded? (A sketch of simulating such a node kill follows this list.)
- An experiment where the daytime production load takes nodes away from the training job, and at night the training job takes them back (see the rebalancing sketch below).
- The optimal minimum pserver / trainer ratio for different model sizes.
- ASGD download / upload step size vs. time to converge: a smaller download / upload step size puts less burden on the network (less time per step), but adds more noise to the gradient (more steps to converge). A toy simulation of this trade-off is sketched below.
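
For the node-kill experiment, here is a minimal sketch of how an experiment driver could simulate a node failure through the Kubernetes Python client, assuming the cluster is reachable via the local kubeconfig. The function name and the way the target node is chosen are placeholders, and the eviction object class name can differ across client versions:

```python
# Minimal sketch, not part of any existing tooling: cordon a node and evict
# its pods so fault-tolerant trainers have to reschedule and recover.
from kubernetes import client, config


def cordon_and_evict(node_name: str) -> None:
    """Simulate a node failure: mark the node unschedulable, then evict its pods."""
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Cordon the node so the scheduler stops placing new pods on it.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

    # Evict every pod running on the node; fault-tolerant trainers should be
    # rescheduled elsewhere and resume from the latest checkpoint.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    )
    for pod in pods.items:
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        )
        v1.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=eviction,
        )
```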
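
For the day / night experiment, here is a minimal sketch of the rebalancing step, assuming the trainers run as a Deployment named `paddle-trainer` in a `training` namespace (both hypothetical names) and that the replica counts are chosen by the experiment:

```python
# Minimal sketch: shrink the trainer Deployment during business hours so the
# production workload gets the nodes, and grow it back at night.
import datetime

from kubernetes import client, config

DAY_REPLICAS = 4     # hypothetical: production traffic takes most nodes by day
NIGHT_REPLICAS = 16  # hypothetical: the training job takes the nodes back at night


def rebalance_trainers() -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()

    hour = datetime.datetime.now().hour
    replicas = DAY_REPLICAS if 8 <= hour < 20 else NIGHT_REPLICAS

    # Scale the trainer Deployment; fault-tolerant training continues with
    # however many trainers remain, and the freed nodes go to production.
    apps.patch_namespaced_deployment_scale(
        name="paddle-trainer",
        namespace="training",
        body={"spec": {"replicas": replicas}},
    )
```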
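
For the ASGD item, here is a toy, self-contained proxy for the trade-off: it does not talk to a real pserver, and it models a smaller download / upload step size as refreshing only a fraction of the parameters each step, so less data moves per step but the local copy is staler and more steps are needed to converge:

```python
# Toy simulation only: no real pserver or network is involved. Each step the
# trainer downloads / uploads a `sync_fraction` of the parameters; a smaller
# fraction means fewer bytes per step but a staler local copy, hence a
# noisier gradient and more steps to reach the target.
import numpy as np


def steps_to_converge(sync_fraction, dim=200, lr=0.5, tol=1e-3, max_steps=50_000):
    rng = np.random.default_rng(0)
    # Well-conditioned quadratic loss 0.5 * ||A w||^2 with coupled coordinates.
    A = np.eye(dim) + 0.05 * rng.normal(size=(dim, dim)) / np.sqrt(dim)
    server = rng.normal(size=dim)   # "global" parameters on the pserver
    local = server.copy()           # possibly stale copy held by the trainer

    n_sync = max(1, int(sync_fraction * dim))
    for step in range(1, max_steps + 1):
        idx = rng.choice(dim, size=n_sync, replace=False)
        local[idx] = server[idx]          # download: refresh a subset
        grad = A.T @ (A @ local)          # gradient on the (partly stale) copy
        server[idx] -= lr * grad[idx]     # upload: update the same subset
        if np.linalg.norm(server) < tol:
            return step
    return max_steps


# Smaller sync fraction -> less network traffic per step, but more steps.
for frac in (1.0, 0.5, 0.1):
    print(f"sync_fraction={frac}: {steps_to_converge(frac)} steps")
```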