Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add task status named ReleasingFailed #2922

Closed
wants to merge 13 commits into from
Closed

Conversation

ycfnana
Copy link

@ycfnana ycfnana commented Jun 18, 2023

if pod being terminating for long time because of zombie process,the task will be scheduled this node,and then this job may hang until the pod with zombie process killed force

@volcano-sh-bot
Copy link
Contributor

Welcome @ycfnana! It looks like this is your first PR to volcano-sh/volcano 🎉

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 18, 2023
@ycfnana ycfnana force-pushed the master branch 2 times, most recently from e4cf4c6 to d9a757a Compare June 18, 2023 17:58
@hwdef
Copy link
Member

hwdef commented Jun 21, 2023

How does this PR fix the problems you raised?

@ycfnana
Copy link
Author

ycfnana commented Jun 21, 2023

How does this PR fix the problems you raised?

job may be status pipelined when this case happens. if job request resource is future idle, it will be pipelined status, the releasing resource will be used in the way of the future idle.
I think if the pod has something wrong with termimating process, it should be detected.  the process of pod terminating, will send signal TERM and wait pod.Spec.TerminationGracePeriodSeconds(default 30s) to send signal kill. I think pod terminating process has something wrong if wait time exceed TerminationGracePeriodSeconds.
however, I think this code will result in task status from Releasing to Running, it's not good but I have no idea to chang it to another status

@hwdef
Copy link
Member

hwdef commented Jun 21, 2023

From the code point of view, you just make the conditions for returning releasing more stringent

@ycfnana
Copy link
Author

ycfnana commented Jun 21, 2023

From the code point of view, you just make the conditions for returning releasing more stringent

my fault, its title is vague. I'll update it
I met a issue, pod always be Pending when one node has zombie pod and another node has enough reource, due to the relaesing status, I think waiting time larger than TerminationGracePeriodSeconds it should not be releaseing, may happen something error, its should be another status.

@ycfnana ycfnana changed the title avoid pod being terminating for a long time avoid pod is pending for long tme when node has terminating pod with zombie process Jun 21, 2023
return Releasing
}

return Running
case v1.PodPending:
if pod.DeletionTimestamp != nil {
if pod.DeletionTimestamp != nil &&
time.Now().Unix()-pod.DeletionTimestamp.Unix() <= gracePeriodSeconds {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time.Now().Unix()-pod.DeletionTimestamp.Unix()will keep growing, If once the time point of less than or equal to is missed, then this judgment will always be false and will never return release. Is this what we want?

Copy link
Author

@ycfnana ycfnana Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Running case, if cost time of deleting pod exceed gracePeriodSeconds . I think the pod has something error, but the request resource is allocated for scheduler, So this pod should not continue stay releasing state.
In pending case, it means this pod has been deleted but node name has already assigned , resource also cannot be freed and allocated for scheduler. although I have not met this case... just think should add it

Copy link
Author

@ycfnana ycfnana Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time.Now().Unix()-pod.DeletionTimestamp.Unix()will keep growing, If once the time point of less than or equal to is missed, then this judgment will always be false and will never return release. Is this what we want?

I think gracePeriodSeconds can set bigger, and it should be another status not "Running" status when waiting time bigger than gracePeriodSeconds, but I have no idea what status to set

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't think this is a perfect fix

Copy link
Author

@ycfnana ycfnana Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but this need to fix, I think it should add a status like "ReleaseFailed", in the code of resource calculation, it should be the same as Running. what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangyang0616
Can you help review this PR

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fix it in #2943, can you help review it again?

@ycfnana ycfnana changed the title avoid pod is pending for long tme when node has terminating pod with zombie process fix pod is always pending when node has terminating pod with zombie process Jun 21, 2023
@ycfnana ycfnana closed this Jun 24, 2023
@ycfnana ycfnana changed the title fix pod is always pending when node has terminating pod with zombie process add task status named ReleasingFailed Jun 28, 2023
@ycfnana ycfnana reopened this Jun 28, 2023
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign k82cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @k82cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign k82cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @k82cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 28, 2023
@ycfnana ycfnana closed this Jun 28, 2023
ycfnana and others added 7 commits June 29, 2023 00:47
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: kingeasternsun <kingeasternsun@gmail.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: wangyang <wangyang8126@gmail.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: rayoluo <hustly123@gmail.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
Aakcht and others added 5 commits June 29, 2023 00:47
Signed-off-by: aakcht <aakcht@gmail.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: aakcht <aakcht@gmail.com>

Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: aakcht <aakcht@gmail.com>

Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
Signed-off-by: chenfengyu <yuxubst@126.com>
@ycfnana ycfnana reopened this Jun 28, 2023
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/contains-merge-commits size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 28, 2023
@volcano-sh-bot
Copy link
Contributor

@ycfnana: Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 28, 2023
@ycfnana ycfnana closed this Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/contains-merge-commits size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants