Support More Actions For Volcano Job Failure Scenario #3812

Open
bibibox opened this issue Nov 11, 2024 · 3 comments

@bibibox
Contributor

bibibox commented Nov 11, 2024

What is the problem you're trying to solve

Currently, when a Volcano job fails and needs to be rescheduled, the only way to retry is the RestartJob action. Its impact range is sometimes too large, since it restarts the entire job.

Describe the solution you'd like

It would be useful to add finer-grained actions, such as restarting only a specific Pod or Task, to minimize the impact.

Furthermore, actions could be escalated in stages. For example, first attempt RestartPod or RestartTask; if the job does not recover within a configured period, then try RestartJob.

Additional context

After implementing the above capability, we can use it as follows:

# other config
spec:
  policies:
  - event: PodFailed
    action: RestartPod
  - event: PodEvicted
    action: RestartJob
    timeout: 10m
# other config

When Pod A exits with an error, it triggers a PodFailed event, which triggers the RestartPod action, so the Volcano controller attempts to recreate Pod A. Recreating the pod triggers a PodEvicted event, which schedules a RestartJob action to run after 10 minutes.

If Pod A returns to the running state within 10 minutes, the delayed RestartJob action is canceled and everything returns to normal.
If Pod A still cannot reach the running state after 10 minutes, a broader recovery action, RestartJob in this case, is attempted.
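
To make the proposed flow concrete, here is a minimal Python sketch of the escalation logic described above. It is only a toy model, not Volcano's controller code: the `Policy` and `Controller` classes, the `PodRunning` event name, and the `simulate` helper are all hypothetical, and time is passed in explicitly instead of using real timers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Policy:
    action: str
    timeout_s: float = 0.0  # 0 means "run the action immediately"


@dataclass
class Controller:
    """Toy stand-in for the job controller (hypothetical, for illustration)."""
    policies: Dict[str, Policy]
    # pod name -> (deadline in seconds, action) for delayed actions still pending
    pending: Dict[str, Tuple[float, str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def on_event(self, now: float, pod: str, event: str) -> None:
        if event == "PodRunning":  # assumed name for "pod is running again"
            # Recovery: cancel any delayed action still waiting for this pod.
            if self.pending.pop(pod, None) is not None:
                self.log.append(f"{now:.0f}s cancel delayed action for {pod}")
            return
        policy = self.policies.get(event)
        if policy is None:
            return
        if policy.timeout_s > 0:
            # Delayed action: remember the deadline instead of acting now.
            self.pending[pod] = (now + policy.timeout_s, policy.action)
            self.log.append(
                f"{now:.0f}s schedule {policy.action} in {policy.timeout_s:.0f}s"
            )
        else:
            self._run(now, pod, policy.action)

    def tick(self, now: float) -> None:
        """Fire every delayed action whose deadline has passed."""
        for pod, (deadline, action) in list(self.pending.items()):
            if now >= deadline:
                del self.pending[pod]
                self._run(now, pod, action)

    def _run(self, now: float, pod: str, action: str) -> None:
        self.log.append(f"{now:.0f}s {action} ({pod})")
        if action == "RestartPod":
            # Per the issue text, recreating the pod raises a PodEvicted event.
            self.on_event(now, pod, "PodEvicted")


def simulate(recovers_at: Optional[float]) -> List[str]:
    """Replay the Pod A scenario; recovers_at=None means it never recovers."""
    ctrl = Controller({
        "PodFailed": Policy("RestartPod"),
        "PodEvicted": Policy("RestartJob", timeout_s=600),  # timeout: 10m
    })
    ctrl.on_event(0, "pod-a", "PodFailed")
    if recovers_at is not None:
        ctrl.on_event(recovers_at, "pod-a", "PodRunning")
    ctrl.tick(600)  # the 10-minute mark
    return ctrl.log


if __name__ == "__main__":
    for line in simulate(recovers_at=120):
        print(line)
    for line in simulate(recovers_at=None):
        print(line)
```

In the first run the pod recovers after two minutes, so the delayed RestartJob is dropped; in the second it never recovers, so RestartJob fires at the 10-minute mark. A real implementation would also need to decide what happens if the same pod fails again while a delayed action is already pending; this sketch simply overwrites the earlier deadline.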

bibibox added the kind/feature (Categorizes issue or PR as related to a new feature) label on Nov 11, 2024
bibibox changed the title from "Support Delay Action For Volcano Jobs" to "Support More Actions For Volcano Job Failure Scenario" on Nov 11, 2024
william-wang modified the milestones: v1.10, v2.0 on Nov 12, 2024
@william-wang
Member

william-wang commented Nov 12, 2024

> What is the problem you're trying to solve
>
> Currently, when scheduling a Volcano job, if you want to trigger rescheduling in error scenarios, the only way to retry is through the RestartJob Action. The impact range is sometimes too large.

@bibibox As far as I know, besides RestartJob, the RestartTask action is already supported in Volcano. Could you clarify your requirement?

@bibibox
Contributor Author

bibibox commented Nov 12, 2024

> > What is the problem you're trying to solve
> >
> > Currently, when scheduling a Volcano job, if you want to trigger rescheduling in error scenarios, the only way to retry is through the RestartJob Action. The impact range is sometimes too large.
>
> @bibibox As far as I know, besides RestartJob, the RestartTask action is already supported in Volcano. Could you clarify your requirement?

Although RestartTask appears in Volcano's list of actions, it is not actually implemented yet, so it is currently unavailable.

@Monokaix
Member

Monokaix commented Dec 3, 2024

> - event: PodFailed
>   action: RestartTask

RestartPod?
