[Umbrella] Support Suspend in volcano #3875

Open
kerthcet opened this issue Dec 12, 2024 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

@kerthcet
Member

kerthcet commented Dec 12, 2024

What is the problem you're trying to solve

We would like to make vcjob part of the Kueue ecosystem, with Kueue working as a high-level job queueing component. In preemption scenarios, suspending the job (or whatever the term is in Volcano) is a foundational capability.

Describe the solution you'd like

Make it possible to suspend the vcjob and reclaim the owned Pods. This can be achieved via two ways:

I would like to leverage the existing function rather than reinventing the wheel if possible.

Additional context

related issues:

@kerthcet kerthcet added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 12, 2024
@kerthcet
Member Author

cc @GhangZh

@googs1025
Member

+1, this will be a very useful feature 👍

@Monokaix
Member

Hi, thanks for your contribution! I'm a little confused here: I'd like to know why Kueue needs vcjob suspension. Kueue is an independent queue management project, while Volcano has its own queue, integrated with its own scheduling. Why do we need to adapt to a separate queue project?

@kerthcet
Member Author

kerthcet commented Dec 13, 2024

Thanks @Monokaix for the input, and sorry for the incomplete context. I'll explain more clearly here.

First of all, I think we can agree that Suspend (or whatever it is called) is a necessary capability for Jobs (batch Job, vcjob, RayJob, etc.), because limited resources may not satisfy a job at the moment but can satisfy it later once resources are reclaimed. I think Volcano already supports this, as I mentioned with the CLI suspend/resume capability.

The second question, why Kueue needs this, comes from our users and community partners: they use both Volcano and other schedulers in their clusters, and they hope they can have a global job queueing system in front. They already run a forked Volcano in their clusters to make this work.

Based on this, I think it is reasonable to support Suspend better in Volcano. I have several questions before we get to the proposal, and I'm not sure of the answers; maybe someone can help me understand this better:

  • How is preemption implemented in Volcano? I think it generally invokes the evict function and updates the task status to Releasing?
  • When resources are not ready, the job is marked as Pending, right?
  • Besides CLI suspend, is the Aborted state reached in other scenarios, for example when real failures happen?
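
To make the Suspend semantics discussed in this comment concrete, here is a minimal toy sketch of a front queue that keeps jobs suspended until they fit, and that preempts by suspending a lower-priority job. Everything in it (`Job`, `ToyQueue`, the `priority` and `cpus` fields) is hypothetical and only illustrates the idea; it is not Volcano's or Kueue's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    cpus: int
    priority: int = 0
    suspended: bool = True  # every job starts suspended until it is admitted


@dataclass
class ToyQueue:
    """Hypothetical global queue sitting in front of the scheduler."""
    capacity: int
    jobs: List[Job] = field(default_factory=list)

    def used(self) -> int:
        # Only jobs that are not suspended hold resources.
        return sum(j.cpus for j in self.jobs if not j.suspended)

    def submit(self, job: Job) -> None:
        self.jobs.append(job)
        self.admit()

    def admit(self) -> None:
        # Resume suspended jobs, highest priority first, while they still fit.
        for job in sorted(self.jobs, key=lambda j: -j.priority):
            if job.suspended and self.used() + job.cpus <= self.capacity:
                job.suspended = False

    def suspend(self, name: str) -> None:
        # Suspending a job releases its resources for other jobs.
        for job in self.jobs:
            if job.name == name:
                job.suspended = True


queue = ToyQueue(capacity=4)
low = Job("low", cpus=3, priority=0)
high = Job("high", cpus=2, priority=1)
queue.submit(low)      # fits, so it is resumed immediately
queue.submit(high)     # does not fit next to "low", stays suspended
queue.suspend("low")   # preemption: "low" gives its CPUs back
queue.admit()          # "high" is resumed; "low" waits for capacity
assert (low.suspended, high.suspended) == (True, False)
```

The point of the sketch is that the same job may move between suspended and running any number of times, which is what a front queue needs from the job API.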

@kerthcet
Member Author

FYI: I tried updating the vcjob with status.state.phase = Aborted, but it doesn't work. It gets rolled back to Running.
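
For contrast with editing the status directly, here is a hedged sketch of the kind of request the CLI suspend/resume path is understood to send: a separate Command object that targets the job. The API groups, the `AbortJob`/`ResumeJob` action names, and the field layout below are written from memory and should be treated as assumptions to verify against the Volcano source; `build_command` itself is a hypothetical helper.

```python
def build_command(job_name: str, job_uid: str, namespace: str, action: str) -> dict:
    """Build a Command manifest asking the job controller to act on a vcjob.

    Assumption: the field names and API groups mirror what the CLI creates;
    they are not confirmed anywhere in this thread.
    """
    if action not in ("AbortJob", "ResumeJob"):
        raise ValueError(f"unsupported action: {action}")
    target = {
        "apiVersion": "batch.volcano.sh/v1alpha1",  # assumed vcjob API group
        "kind": "Job",
        "name": job_name,
        "uid": job_uid,
    }
    return {
        "apiVersion": "bus.volcano.sh/v1alpha1",  # assumed Command API group
        "kind": "Command",
        "metadata": {
            "generateName": f"{job_name}-{action.lower()}-",
            "namespace": namespace,
            # Owned by the job, so the command is cleaned up together with it.
            "ownerReferences": [dict(target, controller=True)],
        },
        "action": action,
        "target": target,
    }


suspend_request = build_command("demo-job", "1234-abcd", "default", "AbortJob")
```

If this reading is right, the state change is requested through the controller rather than written into status by the client, which would be consistent with a manual status edit being overwritten.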

@kerthcet
Member Author

Another problem is that once a job is resumed, the completed tasks are ignored, which means we restart an entirely new task.

@hwdef
Member

hwdef commented Dec 16, 2024

I think jobTemplate also supports this scenario.
You want the vcjob not to be processed right after it is created. This is exactly the goal of jobTemplate.
jobTemplate has the same fields as vcjob.
You can write a controller that creates a vcjob from a jobTemplate when it receives a jobTemplate creation request.

This way you don't need to change any Volcano API; you only need an external controller.
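
A rough sketch of the core step such an external controller would perform, assuming the jobTemplate spec can be copied into a vcjob unchanged (as stated above). `job_from_template` is a hypothetical helper, and the `flow.volcano.sh/v1alpha1` and `batch.volcano.sh/v1alpha1` API groups are assumptions to check against the installed CRDs; watching and submitting to the cluster are left out.

```python
import copy


def job_from_template(template: dict) -> dict:
    """Turn a JobTemplate manifest into a vcjob manifest with the same spec."""
    meta = template.get("metadata", {})
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",  # assumed vcjob API group
        "kind": "Job",
        "metadata": {
            "name": meta["name"],
            "namespace": meta.get("namespace", "default"),
            "labels": copy.deepcopy(meta.get("labels", {})),
        },
        # Deep copy so later edits to the job never leak into the template.
        "spec": copy.deepcopy(template["spec"]),
    }


template = {
    "apiVersion": "flow.volcano.sh/v1alpha1",  # assumed JobTemplate API group
    "kind": "JobTemplate",
    "metadata": {"name": "train", "namespace": "ml"},
    "spec": {"minAvailable": 1, "tasks": [{"name": "worker", "replicas": 1}]},
}
job = job_from_template(template)
```

Note that this creates the job once when the controller decides to admit it; by itself it does not cover moving an already running job back to a pending state.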

@kerthcet
Member Author

Thanks @hwdef. Is there any documentation about jobTemplate? What is it used for in Volcano? Will the vcjob controller watch this resource?

@hwdef
Member

hwdef commented Dec 17, 2024

Please check this:
https://github.com/volcano-sh/volcano/tree/master/docs/design/jobflow#introduction-2

It is the template of a vcjob and can be referenced by a jobflow.
If you don't create a jobflow that references the jobtemplate, no controller will handle the jobtemplate.

@kerthcet
Member Author

Can a jobtemplate be rolled back? Suspend means a job can be switched between running and pending an arbitrary number of times. I guess a jobtemplate cannot do that, because it's a flow.
