This repository has been archived by the owner on Jul 7, 2019. It is now read-only.

Fix the scheduler panic whenever the GPU is lost on node #38

Closed
wants to merge 4 commits into from

Conversation

william-wang

What this PR does / why we need it:
The volcano-sh/scheduler panics at startup whenever a GPU is lost on a node.
The cause is that once the node's GPU resource has decreased, node.idle < task.Req, so the
call ni.Idle.Sub(ti.Resreq) panics. This PR fixes the issue.
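
For reviewers who want to see the failure mode in isolation, here is a minimal sketch. It assumes a simplified, hypothetical `Resource` type with only two dimensions and a `Sub` that panics when the request does not fit, as this description says the scheduler's does; the helper `subPanics` is invented for illustration and is not part of the scheduler code.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for the scheduler's
// resource type; only two dimensions are modelled here.
type Resource struct {
	MilliCPU float64
	GPU      float64
}

// LessEqual reports whether every dimension of r fits within rr.
func (r Resource) LessEqual(rr Resource) bool {
	return r.MilliCPU <= rr.MilliCPU && r.GPU <= rr.GPU
}

// Sub returns r minus rr and panics when rr does not fit into r,
// which is the behaviour this PR description attributes to ni.Idle.Sub.
func (r Resource) Sub(rr Resource) Resource {
	if !rr.LessEqual(r) {
		panic(fmt.Sprintf("resource is not sufficient: <%v> sub <%v>", r, rr))
	}
	return Resource{MilliCPU: r.MilliCPU - rr.MilliCPU, GPU: r.GPU - rr.GPU}
}

// subPanics is an illustrative helper: it reports whether idle.Sub(req) panics.
func subPanics(idle, req Resource) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	idle.Sub(req)
	return false
}

func main() {
	req := Resource{MilliCPU: 1000, GPU: 1}

	// The node still reports its GPU: the subtraction succeeds.
	fmt.Println(subPanics(Resource{MilliCPU: 4000, GPU: 2}, req)) // prints false

	// The GPU is lost, so idle < request and the subtraction panics.
	fmt.Println(subPanics(Resource{MilliCPU: 4000, GPU: 0}, req)) // prints true
}
```

The second call models a task that was already bound to the node being replayed at startup after the node's GPU count dropped to zero.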

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes volcano-sh/volcano#294

Special notes for your reviewer:

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 5, 2019
@volcano-sh-bot volcano-sh-bot requested review from hex108 and k82cn July 5, 2019 09:37
@volcano-sh-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: william-wang
To complete the pull request process, please assign hex108
You can assign the PR to them by writing /assign @hex108 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -161,6 +161,19 @@ func (ni *NodeInfo) SetNode(node *v1.Node) {
}
}

func (ni *NodeInfo) updateIdleRes(ti *TaskInfo) bool {

s/updateIdleRes/allocateIdleResource

and return error instead of bool
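
A rough sketch of what this suggestion could look like, assuming simplified stand-ins for `Resource`, `TaskInfo` and `NodeInfo` (the real types carry many more fields, and the actual body of `updateIdleRes` is not shown in this hunk). The check-before-subtract logic is an assumption about the intended fix, not the PR's code:

```go
package main

import "fmt"

// Resource, TaskInfo and NodeInfo are reduced, hypothetical versions of the
// scheduler's types, kept only as large as this example needs.
type Resource struct {
	MilliCPU float64
	GPU      float64
}

// LessEqual reports whether every dimension of r fits within rr.
func (r *Resource) LessEqual(rr *Resource) bool {
	return r.MilliCPU <= rr.MilliCPU && r.GPU <= rr.GPU
}

// Sub subtracts rr from r in place; the caller must check LessEqual first.
func (r *Resource) Sub(rr *Resource) *Resource {
	r.MilliCPU -= rr.MilliCPU
	r.GPU -= rr.GPU
	return r
}

type TaskInfo struct {
	Name   string
	Resreq *Resource
}

type NodeInfo struct {
	Name string
	Idle *Resource
}

// allocateIdleResource deducts the task's request from the node's idle
// resources. It returns an error, rather than panicking or returning a bool,
// when the node no longer has enough idle resources (for example after a GPU
// was lost). Idle is left untouched in that case.
func (ni *NodeInfo) allocateIdleResource(ti *TaskInfo) error {
	if !ti.Resreq.LessEqual(ni.Idle) {
		return fmt.Errorf("node <%s>: idle %+v is less than request %+v of task <%s>",
			ni.Name, *ni.Idle, *ti.Resreq, ti.Name)
	}
	ni.Idle.Sub(ti.Resreq)
	return nil
}

func main() {
	node := &NodeInfo{Name: "node-1", Idle: &Resource{MilliCPU: 4000, GPU: 0}}
	task := &TaskInfo{Name: "gpu-task", Resreq: &Resource{MilliCPU: 1000, GPU: 1}}

	// The caller decides how to react, e.g. log the error and skip the task.
	if err := node.allocateIdleResource(task); err != nil {
		fmt.Println("skipping task:", err)
	}
}
```

Returning an `error` lets the caller log which node and task were involved, which a bare `bool` cannot convey.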

@k82cn

k82cn commented Jul 5, 2019

/cc @hex108 @kinglion811

@k82cn

k82cn commented Jul 7, 2019

@william-wang , please help to move this PR to volcano-sh/volcano and cherry-pick to kube-batch later :)

/close

@volcano-sh-bot
Collaborator

@k82cn: Closed this PR.

In response to this:

@william-wang , please help to move this PR to volcano-sh/volcano and cherry-pick to kube-batch later :)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Successfully merging this pull request may close these issues.

Scheduler panic happens when the GPU is lost on node
3 participants