Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add task status named ReleasingFailed #2943

Closed
wants to merge 6 commits into from

Conversation

ycfnana
Copy link

@ycfnana ycfnana commented Jun 28, 2023

the question is #2922
I met a problem, pod always be Pending when one node has zombie pod and another node has enough reource, due to the relaesing status, I think waiting time larger than TerminationGracePeriodSeconds it should not be releaseing, may happen something error, its should be another status like ReleasingFailed

Signed-off-by: chenfengyu <yuxubst@126.com>
@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 28, 2023
@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 28, 2023
@lowang-bh
Copy link
Member

Thanks for your contributions!
This is a great change and will have a large effect.
Would you please provide those informations?

  1. the necessary ut tests.
  2. the scenario what happens without the change, and the result with this fix.

@ycfnana ycfnana closed this Jul 3, 2023
@ycfnana ycfnana reopened this Jul 3, 2023
@ycfnana
Copy link
Author

ycfnana commented Jul 3, 2023

Thanks for your contributions! This is a great change and will have a large effect. Would you please provide those informations?

  1. the necessary ut tests.
  2. the scenario what happens without the change, and the result with this fix.

Thanks, I'll add ut test next week.
I have an issue, there are enough resource both in nodeA and nodeB. PodB in nodeB has been deleted, but the status is still "Terminating" due to the zombie progress.
In the time, add the new job named JobC with PodD, it can be scheduled in nodeA, but status of podD is Pending. From its log, I find status of JobC is "Pipelined", waiting nodeB to releasing podB. But podB has something wrong like "ReleasingFailed".
So I want to fix this condition.

@ycfnana ycfnana marked this pull request as ready for review July 3, 2023 13:14
ycfnana added 2 commits July 13, 2023 01:11
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
@ycfnana
Copy link
Author

ycfnana commented Jul 12, 2023

Thanks for your contributions! This is a great change and will have a large effect. Would you please provide those informations?

  1. the necessary ut tests.
  2. the scenario what happens without the change, and the result with this fix.

can you help review it again?

Signed-off-by: chenfengyu <yuxubst@126.com>
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign qiankunli
You can assign the PR to them by writing /assign @qiankunli in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 13, 2023
Signed-off-by: chenfengyu <yuxubst@126.com>
@ycfnana
Copy link
Author

ycfnana commented Jul 13, 2023

/assign @qiankunli

@lowang-bh
Copy link
Member

lowang-bh commented Jul 14, 2023

Thanks for your contributions! This is a great change and will have a large effect. Would you please provide those informations?

  1. the necessary ut tests.
  2. the scenario what happens without the change, and the result with this fix.

can you help review it again?

It is clear to me now. Just a little comment, why don't put node-name "n3" at the parameter location of function buildPod? Thanks.

case2Pod3 := util.BuildPod("c2", "p3", "", v1.PodRunning, util.BuildResourceListWithGPU("2", "4G", "3"), "pg3", make(map[string]string), make(map[string]string))
case2Pod3.CreationTimestamp = metav1.Time{Time: time.Now().Add(-20 * time.Minute)}
case2Pod3.Spec.NodeName = "n3"

Hi, @wangyang0616 please help to review this pr. It has something to do with the enhancement in PR #2815 .

Signed-off-by: chenfengyu <yuxubst@126.com>
@ycfnana
Copy link
Author

ycfnana commented Jul 28, 2023

/assign @wangyang0616 @william-wang @huone1
can somebody help review it

@wangyang0616
Copy link
Member

Can this pr #2815 fix your problem.

@ycfnana
Copy link
Author

ycfnana commented Aug 1, 2023

Can this pr #2815 fix your problem.

thanks,it looks better

@Thor-wl
Copy link
Contributor

Thor-wl commented Aug 2, 2023

@ycfnana Thanks for your report. Personally speaking, I think the root reason is something goes wrong with the unexpected pod status instead of task's status. There is no enough resource to start new pods, which belong to the pipelined jobs. So I don't get the point why should add a task status and what this can help for users.
/hold

@volcano-sh-bot volcano-sh-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2023
@ycfnana
Copy link
Author

ycfnana commented Aug 2, 2023

@ycfnana Thanks for your report. Personally speaking, I think the root reason is something goes wrong with the unexpected pod status instead of task's status. There is no enough resource to start new pods, which belong to the pipelined jobs. So I don't get the point why should add a task status and what this can help for users. /hold

thanks, sometimes pod status is pending even if there are enough resources to start a new pod because of unexpected pod status, you can see my comment in my ut code. But this pr #2815 may solve my problem. I'll close this pr if test it's ok

@ycfnana ycfnana closed this Aug 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants