Suspending/resuming a vcjob 3 times causes it to fail directly #3067

Open · haiker2011 opened this issue Aug 22, 2023 · 6 comments · May be fixed by #3876
Labels: good first issue · help wanted · kind/bug

Comments


haiker2011 commented Aug 22, 2023

What happened:
Created a vcjob and waited until it was running, then suspended and resumed it 3 times; the vcjob then goes directly to Failed.
What you expected to happen:
The vcjob should resume successfully.
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

// Resume handling: the job is killed, the phase is set to Restarting,
// and RetryCount is incremented on every resume.
case v1alpha1.ResumeJobAction:
    return KillJob(as.job, PodRetainPhaseSoft, func(status *vcbatch.JobStatus) bool {
        status.State.Phase = vcbatch.Restarting
        status.RetryCount++
        return true
    })

// Restarting handling: once RetryCount reaches spec.maxRetry the job is marked Failed.
// Get the maximum number of retries.
maxRetry := ps.job.Job.Spec.MaxRetry
if status.RetryCount >= maxRetry {
    // Failed is the phase of a job whose restarts have reached the maximum number of retries.
    status.State.Phase = vcbatch.Failed
    return true
}

When the job is resumed, RetryCount is incremented, and once RetryCount >= maxRetry the job is marked Failed.
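
To make the arithmetic concrete, here is a minimal, self-contained Go model of the two snippets above (an illustration only, not the controller code itself; in the real flow the increment happens in the resume callback and the check in the restarting-state handler). Assuming the default spec.maxRetry of 3, three suspend/resume cycles are enough to fail the job:

```go
package main

import "fmt"

// jobStatus models the fields of vcbatch.JobStatus that matter here.
type jobStatus struct {
	phase      string
	retryCount int32
}

// resume models the combined effect of the quoted code: the resume callback
// sets the phase to Restarting and bumps RetryCount, and the restarting path
// then fails the job once RetryCount reaches maxRetry.
func resume(st *jobStatus, maxRetry int32) {
	st.phase = "Restarting"
	st.retryCount++
	if st.retryCount >= maxRetry {
		st.phase = "Failed"
	}
}

func main() {
	const maxRetry = 3 // assuming the default spec.maxRetry
	st := &jobStatus{phase: "Running"}
	for i := 1; i <= 3; i++ {
		resume(st, maxRetry)
		fmt.Printf("resume #%d: phase=%s retryCount=%d\n", i, st.phase, st.retryCount)
	}
	// The last line prints "resume #3: phase=Failed retryCount=3":
	// the third manual resume exhausts the retry budget even though no pod ever failed.
}
```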

Environment:

  • Volcano Version: v1.7
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
haiker2011 added the kind/bug label Aug 22, 2023
Thor-wl (Contributor) commented Aug 24, 2023

@haiker2011 Thanks for your report. So the current logic is not robust because it is not aware of why the job was restarted (manual operation vs. job failure), right?

Thor-wl added the good first issue and help wanted labels Aug 24, 2023
haiker2011 (Author) commented

@Thor-wl Yes, manual operation and job failure should be treated separately; status.RetryCount should only be incremented when the job actually fails.
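
A minimal sketch of that separation, using a hypothetical isManualResume flag to stand in for however the controller ends up distinguishing a user-issued resume from a failure-driven restart (illustrative only; the actual change tracked in #3876 may take a different approach):

```go
package main

import "fmt"

// jobStatus models the fields of vcbatch.JobStatus used in the quoted snippets.
type jobStatus struct {
	phase      string
	retryCount int32
}

// restart only counts failure-driven restarts against maxRetry; a manual
// suspend/resume restarts the job without consuming a retry.
func restart(st *jobStatus, maxRetry int32, isManualResume bool) {
	st.phase = "Restarting"
	if isManualResume {
		return
	}
	st.retryCount++
	if st.retryCount >= maxRetry {
		st.phase = "Failed"
	}
}

func main() {
	const maxRetry = 3
	st := &jobStatus{phase: "Running"}
	for i := 0; i < 3; i++ {
		restart(st, maxRetry, true) // three manual suspend/resume cycles
	}
	fmt.Printf("after 3 manual resumes: phase=%s retryCount=%d\n", st.phase, st.retryCount)
	// phase=Restarting retryCount=0: manual operations no longer fail the job.
}
```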

ashutosh887 commented

Let me work on this

@Thor-wl @haiker2011

ashutosh887 commented

/assign


shiliang commented Sep 5, 2023

/assign


shiliang commented Sep 6, 2023

> @haiker2011 Thanks for your report. So the current logic is not robust because it is not aware of why the job was restarted (manual operation vs. job failure), right?

How can the controller become aware of a manual operation in this project?
