Support More Actions For Volcano Job Failure Scenario #3812

bibibox · 2024-11-11T07:26:50Z

What is the problem you're trying to solve

Currently, when a Volcano job hits an error and you want to trigger rescheduling, the only way to retry is the RestartJob action. Its impact range is sometimes too large, because it restarts the whole job.

Describe the solution you'd like

It would be desirable to introduce new capabilities, such as individually restarting a specific Pod/Task to minimize the impact.

Furthermore, it could allow graded execution: for example, first attempt RestartPod or RestartTask, and if the job has not recovered within a certain period, escalate to RestartJob.

Additional context

After implementing the above capability, we can use it as follows:

# other config
spec:
  policies:
  - event: PodFailed
    action: RestartPod
  - event: PodEvicted
    action: RestartJob
    timeout: 10m
# other config

When Pod A exits with an error, it triggers a PodFailed event, which in turn triggers the RestartPod action, so the Volcano controller attempts to recreate Pod A. The recreation triggers a PodEvicted event, which schedules a RestartJob action to run after 10 minutes.

If Pod A returns to the Running state, the delayed RestartJob action is canceled and everything returns to normal.
If Pod A still has not reached the Running state after 10 minutes, a broader recovery action, RestartJob in this scenario, is attempted.
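
To make the proposed flow concrete, here is a minimal Python sketch of the escalation logic described above. It is only an illustration under stated assumptions, not Volcano's Go implementation: the class `DelayedActionController`, its methods, and the explicit `now` timestamps are hypothetical names introduced for this example, and it assumes (as the description says) that recreating a pod raises a PodEvicted event.

```python
# Hypothetical model of the proposed policy table:
# event -> (action, timeout in seconds or None for "run immediately").
POLICIES = {
    "PodFailed": ("RestartPod", None),
    "PodEvicted": ("RestartJob", 600.0),  # mirrors `timeout: 10m`
}


class DelayedActionController:
    """Illustrative stand-in for the job controller (not a Volcano API)."""

    def __init__(self, policies):
        self.policies = policies
        self.pending = {}  # pod name -> (delayed action, deadline)

    def handle_event(self, event, pod, now):
        """Return the actions executed immediately for this event."""
        policy = self.policies.get(event)
        if policy is None:
            return []
        action, timeout = policy
        if timeout is not None:
            # Delayed action: arm it and wait for recovery or expiry.
            self.pending[pod] = (action, now + timeout)
            return []
        executed = [action]
        if action == "RestartPod":
            # Assumption from the issue text: recreating the pod
            # raises a PodEvicted event.
            executed += self.handle_event("PodEvicted", pod, now)
        return executed

    def pod_running(self, pod):
        """The pod recovered, so cancel its delayed action if one is armed."""
        self.pending.pop(pod, None)

    def tick(self, now):
        """Return (pod, action) pairs whose deadline has passed."""
        due = [(pod, action)
               for pod, (action, deadline) in self.pending.items()
               if now >= deadline]
        for pod, _ in due:
            del self.pending[pod]
        return due


controller = DelayedActionController(POLICIES)
controller.handle_event("PodFailed", "pod-a", now=0.0)  # RestartPod runs now
controller.pod_running("pod-a")                         # recovery cancels RestartJob
controller.handle_event("PodFailed", "pod-b", now=0.0)  # pod-b never recovers
escalations = controller.tick(now=600.0)                # only pod-b escalates
```

Passing timestamps in explicitly keeps the sketch deterministic; a real controller would more likely use timers or a delayed work queue.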

william-wang · 2024-11-12T02:11:46Z

> What is the problem you're trying to solve
>
> Currently, when scheduling a Volcano job, if you want to trigger rescheduling in error scenarios, the only way to retry is through the RestartJob Action. The impact range is sometimes too large.

@bibibox As far as I know, besides RestartJob, the RestartTask action is already supported in Volcano. Could you clarify your requirement?

bibibox · 2024-11-12T02:51:02Z

> > What is the problem you're trying to solve
> >
> > Currently, when scheduling a Volcano job, if you want to trigger rescheduling in error scenarios, the only way to retry is through the RestartJob Action. The impact range is sometimes too large.
>
> @bibibox As far as i know, besides RestartJob, RestartTask action has been already supported in volcano. So would you clarify you requirement?

Although a RestartTask action is defined among Volcano's actions, it is not actually implemented yet and is currently unavailable.

Monokaix · 2024-12-03T11:54:22Z

> - event: PodFailed
>   action: RestartTask

RestartPod?

bibibox added the kind/feature label Nov 11, 2024

bibibox changed the title from "Support Delay Action For Volcano Jobs" to "Support More Actions For Volcano Job Failure Scenario" Nov 11, 2024

This was referenced Nov 11, 2024:
- [WIP] support more actions for volcano job failure scenario #3813 (Open)
- add events and actions for task reschedule volcano-sh/apis#140 (Merged)

william-wang modified the milestones: v1.10, v2.0 Nov 12, 2024

william-wang assigned bibibox Nov 12, 2024

bibibox mentioned this issue Nov 22, 2024: Add RestartPod action for pod reschedule volcano-sh/apis#143 (Merged)
