You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current situation is that when a Pod encounters an exception, it can be automatically restarted. However, as for multi-node training, all Pods need to be restarted. How do I configure this YAML file? Thanks!
kevinsummer219
changed the title
How to restart all pods when one pod fails?
How to restart the training JOB when one training process fails in cluster environment to recover the training?
Sep 25, 2024
What you would like to be added?
I use pytorchJOB.
Why is this needed?
I think, that is valid idea. So in such restart policy our controller should re-create all PyTorchJob's pods in case of single Pod failure.
Love this feature?
Give it a 👍 We prioritize the features with most 👍
The text was updated successfully, but these errors were encountered: