What is the problem you're trying to solve
Currently, when a Volcano job hits an error and needs to be rescheduled, the only way to retry is the RestartJob action. Restarting the whole job sometimes has too large an impact.
Describe the solution you'd like
It would be desirable to introduce finer-grained capabilities, such as restarting a single Pod or Task, to minimize the impact.
Furthermore, this would allow graded recovery. For example, first attempt RestartPod/RestartTask; if recovery is not achieved within a certain period, fall back to RestartJob.
Additional context
After implementing the above capability, we can use it as follows:
# other config
spec:
  policies:
    - event: PodFailed
      action: RestartPod
    - event: PodEvicted
      action: RestartJob
      timeout: 10m
# other config
When Pod A exits with an error, it triggers a PodFailed event, which in turn triggers the RestartPod action, so the Volcano controller attempts to recreate Pod A. Recreating the pod triggers a PodEvicted event, which schedules a RestartJob action to run after 10 minutes.
If Pod A returns to the Running state within that window, the delayed RestartJob action is canceled and everything returns to normal.
If Pod A still has not reached the Running state after 10 minutes, the broader recovery action, RestartJob in this case, is attempted.
> Currently, when scheduling a Volcano job, if you want to trigger rescheduling in error scenarios, the only way to retry is through the RestartJob Action. The impact range is sometimes too large.
@bibibox As far as I know, besides RestartJob, the RestartTask action is already supported in Volcano. Could you clarify your requirement?
Although RestartTask is listed among Volcano's actions, it is not actually implemented yet, so it is currently unavailable.