Suspending/resuming a vcjob 3 times causes it to fail directly #3067

Open · haiker2011 opened this issue Aug 22, 2023 · 6 comments · May be fixed by #3876
Labels: good first issue · help wanted · kind/bug

Comments


haiker2011 commented Aug 22, 2023

What happened:
Created a vcjob and waited until it was running, then suspended and resumed it 3 times; the vcjob then goes directly to Failed.
What you expected to happen:
The vcjob should resume successfully.
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

// Resume handling: the job is killed, the phase is set to Restarting,
// and RetryCount is incremented on every resume.
case v1alpha1.ResumeJobAction:
    return KillJob(as.job, PodRetainPhaseSoft, func(status *vcbatch.JobStatus) bool {
        status.State.Phase = vcbatch.Restarting
        status.RetryCount++
        return true
    })

// Restarting handling: once RetryCount reaches spec.maxRetry the job is marked Failed.
// Get the maximum number of retries.
maxRetry := ps.job.Job.Spec.MaxRetry
if status.RetryCount >= maxRetry {
    // Failed is the phase of a job whose restarts have reached the maximum number of retries.
    status.State.Phase = vcbatch.Failed
    return true
}

When the job is resumed, RetryCount is incremented, and once RetryCount >= maxRetry the job is marked Failed.
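
To make the arithmetic concrete, here is a minimal, self-contained Go model of the two snippets above (an illustration only, not the controller code itself; in the real flow the increment happens in the resume callback and the check in the restarting-state handler). Assuming the default spec.maxRetry of 3, three suspend/resume cycles are enough to fail the job:

```go
package main

import "fmt"

// jobStatus models the fields of vcbatch.JobStatus that matter here.
type jobStatus struct {
	phase      string
	retryCount int32
}

// resume models the combined effect of the quoted code: the resume callback
// sets the phase to Restarting and bumps RetryCount, and the restarting path
// then fails the job once RetryCount reaches maxRetry.
func resume(st *jobStatus, maxRetry int32) {
	st.phase = "Restarting"
	st.retryCount++
	if st.retryCount >= maxRetry {
		st.phase = "Failed"
	}
}

func main() {
	const maxRetry = 3 // assuming the default spec.maxRetry
	st := &jobStatus{phase: "Running"}
	for i := 1; i <= 3; i++ {
		resume(st, maxRetry)
		fmt.Printf("resume #%d: phase=%s retryCount=%d\n", i, st.phase, st.retryCount)
	}
	// The last line prints "resume #3: phase=Failed retryCount=3":
	// the third manual resume exhausts the retry budget even though no pod ever failed.
}
```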

Environment:

  • Volcano Version: v1.7
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
haiker2011 added the kind/bug label Aug 22, 2023
Thor-wl (Contributor) commented Aug 24, 2023

@haiker2011 Thanks for your report. So the current logic is not robust because it is not aware of why the job was restarted (manual operation vs. job failure), right?

Thor-wl added the good first issue and help wanted labels Aug 24, 2023
haiker2011 (Author) commented

@Thor-wl Yes, manual operation and job failure should be treated separately; status.RetryCount should only be incremented when the job actually fails.
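
A minimal sketch of that separation, using a hypothetical isManualResume flag to stand in for however the controller ends up distinguishing a user-issued resume from a failure-driven restart (illustrative only; the actual change tracked in #3876 may take a different approach):

```go
package main

import "fmt"

// jobStatus models the fields of vcbatch.JobStatus used in the quoted snippets.
type jobStatus struct {
	phase      string
	retryCount int32
}

// restart only counts failure-driven restarts against maxRetry; a manual
// suspend/resume restarts the job without consuming a retry.
func restart(st *jobStatus, maxRetry int32, isManualResume bool) {
	st.phase = "Restarting"
	if isManualResume {
		return
	}
	st.retryCount++
	if st.retryCount >= maxRetry {
		st.phase = "Failed"
	}
}

func main() {
	const maxRetry = 3
	st := &jobStatus{phase: "Running"}
	for i := 0; i < 3; i++ {
		restart(st, maxRetry, true) // three manual suspend/resume cycles
	}
	fmt.Printf("after 3 manual resumes: phase=%s retryCount=%d\n", st.phase, st.retryCount)
	// phase=Restarting retryCount=0: manual operations no longer fail the job.
}
```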

ashutosh887 commented

Let me work on this

@Thor-wl @haiker2011

ashutosh887 commented

/assign


shiliang commented Sep 5, 2023

/assign


shiliang commented Sep 6, 2023

> @haiker2011 Thanks for your report. So the current logic is not robust because it is not aware of why the job was restarted (manual operation vs. job failure), right?

How can the controller become aware of a manual operation in this project?
