
The loss of GPUs problem has not been completely resolved #1818

Closed
eggiter opened this issue Nov 2, 2021 · 11 comments

Labels: kind/bug, lifecycle/stale, priority/important-soon

Comments

eggiter commented Nov 2, 2021

What happened:

  1. The allocatable GPU count of node1 is currently 6;
  2. Create pod pod1 requesting 6 GPUs;
  3. Volcano successfully schedules pod1 to node1;
  4. However, pod1 turns to Failed because of UnexpectedAdmissionError.

What you expected to happen:

  • node1 should not have any other pods scheduled onto it.

How to reproduce it (as minimally and precisely as possible):

  1. node1 has 8 allocatable GPUs;
  2. Create pod pod0 requesting 8 GPUs;
  3. A GPU error is detected and the allocatable GPU count of node1 drops to 6;
  4. Label or annotate node1 so that its resource version is updated;
  5. Create pod pod1 requesting 6 GPUs (see the example after these steps);
  6. The bug occurs: Volcano successfully schedules pod1 to node1, but pod1 turns to Failed because of UnexpectedAdmissionError.
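
For steps 2 and 5, here is a minimal client-go sketch of creating such a GPU pod. The pod name, namespace, and container image are placeholders, and it assumes GPUs are exposed as nvidia.com/gpu and that Volcano is registered under the scheduler name volcano:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// pod1 requests 6 GPUs and is scheduled by Volcano (step 5 of the repro).
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pod1", Namespace: "default"},
		Spec: corev1.PodSpec{
			SchedulerName: "volcano",
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "nvidia/cuda:11.0-base", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("6"),
					},
				},
			}},
		},
	}

	created, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created", created.Name)
}
```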

Anything else we need to know?:

Main cause:

  1. As the comment on Catch add pod out of sync error #1783 (comment) says, a node becomes Ready when its resources are updated. The problem is that when SetNode is called and the node is updated, there may be tasks that were never added back to this node because of a previous AllocateFailError.
  2. To expand on the question above (when does a node become Ready again?), there are two cases, both sketched below:
    • The node's resources are updated.
      • But in this case the node needs a further re-synchronization to take back the tasks that are missing (due to the previous AllocateFailError);
    • A task that was never added to this node (due to a previous AllocateFailError) is removed.
      • In this case the lost GPUs may not have come back to the node, but the node's state is now consistent with its actual counterpart, and it is ready to accept pods.
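
To make the two cases concrete, here is a minimal Go sketch of that state machine. The nodeInfo type and its fields and methods are hypothetical stand-ins, not the actual Volcano scheduler-cache code:

```go
package main

import "fmt"

type nodeInfo struct {
	allocatable  int      // GPUs the node currently reports
	used         int      // GPUs accounted for in the scheduler cache
	missingTasks []string // tasks bound to the node but rejected by AddTask (AllocateFailError)
	ready        bool     // false means NotReady(OutOfSync)
}

// Case 1: a resource update (e.g. relabelling the node) flips the node to Ready,
// but nothing re-adds the tasks in missingTasks, so idle GPUs are over-counted.
func (n *nodeInfo) setNode(allocatable int) {
	n.allocatable = allocatable
	n.ready = true
}

// Case 2: once every missing task is removed (its pod is gone), the cached state
// matches reality again, and Ready is legitimate even if the lost GPUs never return.
func (n *nodeInfo) removeMissingTask(task string) {
	for i, t := range n.missingTasks {
		if t == task {
			n.missingTasks = append(n.missingTasks[:i], n.missingTasks[i+1:]...)
			break
		}
	}
	n.ready = len(n.missingTasks) == 0
}

func main() {
	n := &nodeInfo{allocatable: 6, used: 0, missingTasks: []string{"pod0"}, ready: false}
	n.setNode(6) // case 1: Ready again, yet pod0's 8 GPUs are not accounted for
	fmt.Printf("ready=%v idle=%d missing=%v\n", n.ready, n.allocatable-n.used, n.missingTasks)
}
```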

Environment:

  • Volcano Version:
    • latest version.
eggiter added the kind/bug label Nov 2, 2021
eggiter commented Nov 2, 2021

/assign @k82cn @Thor-wl @hwdef @zhiyuone

volcano-sh-bot commented:

@eggiter: GitHub didn't allow me to assign the following users: zhiyuone.

Note that only volcano-sh members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @k82cn @Thor-wl @hwdef @zhiyuone

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

eggiter commented Nov 2, 2021

#1783 is not in v1.3.0; can you try the master branch?

stale bot commented Feb 2, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need to!).

stale bot added the lifecycle/stale label Feb 2, 2022
Thor-wl added the priority/important-soon label and removed the lifecycle/stale label Feb 7, 2022
Thor-wl assigned william-wang and shinytang6 and unassigned k82cn Feb 7, 2022
xkd045 commented Apr 12, 2022

I am confused about whether you delete pod0:

  • If you delete pod0, then node1 really has 6 idle GPUs;
  • If you don't delete pod0, then node1's state will be OutOfSync and it will not join the scheduling session, so pod1 cannot be scheduled to node1.

Could the reproduction description be more precise? Thanks @eggiter

eggiter commented Apr 14, 2022

I am confused about whether you delete pod0

@xkd045

  • It's not about whether you delete pod0 or not; the point is that if the node has lost GPUs that were already assigned to some pod, the state of this node should be OutOfSync.
  • Let's say you deleted pod0: then it's fine for the node to be scheduled, as long as the idle GPUs are calculated correctly. If you don't delete pod0, then the actual state of this node needs to be OutOfSync, because the idle GPU count of this node cannot be -2 (6 allocatable minus the 8 already assigned to pod0).

xkd045 commented Apr 14, 2022

  • It's not about whether you delete pod0 or not; the point is that if the node has lost GPUs that were already assigned to some pod, the state of this node should be OutOfSync.

That's right. What confuses me is how the pod0 task could be missing in your description. In my opinion, the pod0 task info goes missing only when the scheduler restarts.
Could the reproduction description be more precise? Thanks @eggiter

kenoung commented Apr 19, 2022

I think the reproduction instructions are based on an older version of Volcano (before #1685 was merged), when the scheduler would panic and restart if the amount of allocatable resources fell below the amount already in use. After the fix in #1685, the node is marked NotReady instead (see the sketch after these steps).

How to reproduce it (as minimally and precisely as possible):

  1. node1 has 8 allocatable GPUs;
  2. Create pod pod0 requesting 8 GPUs -> it gets allocated on node1;
  3. A GPU error is detected and the allocatable GPU count of node1 becomes 6 -> this triggers SetNode and causes node1 to be set to NotReady;
  4. Label or annotate the node so that its resource version is updated;
  5. Create pod pod1 requesting 6 GPUs;
  6. The bug occurred: Volcano successfully schedules pod1 to node1, but pod1 turns to Failed because of UnexpectedAdmissionError; -> node1 is NotReady, so pod1 does not get scheduled.
    ...
  7. pod0 crashes or gets terminated, and node1 is now free.
  8. pod1 still cannot get allocated to node1 because it is NotReady.
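
To illustrate the post-#1685 behaviour, here is a rough Go sketch of the check described above; setNodeState and its fields are illustrative names, not the real implementation:

```go
package main

import "fmt"

type nodeState struct {
	phase  string // "Ready" or "NotReady"
	reason string
}

// setNodeState decides the node phase from the reported allocatable GPUs and
// the GPUs already granted to tasks in the cache.
func setNodeState(allocatableGPUs, usedGPUs int) nodeState {
	if allocatableGPUs < usedGPUs {
		// Before #1685 this situation caused a panic/restart; after it the
		// node is simply marked NotReady and kept out of the session.
		return nodeState{phase: "NotReady", reason: "OutOfSync"}
	}
	return nodeState{phase: "Ready"}
}

func main() {
	fmt.Println(setNodeState(8, 8)) // {Ready }
	fmt.Println(setNodeState(6, 8)) // {NotReady OutOfSync}: GPUs were lost while pod0 still holds 8
}
```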

eggiter commented Apr 19, 2022

What confuses me is how the pod0 task could be missing in your description

@xkd045 Sorry, my bad. There is a KEY action missing between step 3 and step 4: restart the scheduler. The updated reproduction steps are (step 4's failure path is sketched after the table):

| No. | Step | Result |
| --- | ---- | ------ |
| 1 | node1 has 8 allocatable GPUs | |
| 2 | Create pod pod0 requesting 8 GPUs | pod0 assigned to node1 |
| 3 | GPU error detected and the allocatable GPU count of node1 becomes 6 | node1 became NotReady(OutOfSync) |
| 4 | Restart the scheduler | node.AddTask(pod0) failed due to AllocateFailError (node1 now has only 6 GPUs) and node1 became NotReady(OutOfSync); this is where the TaskInfo of pod0 was not added to node1 |
| 5 | Label or annotate the node so that its resource version is updated | this triggers SetNode and causes node1 to recover from NotReady(OutOfSync) |
| 6 | Create pod pod1 requesting 6 GPUs | node1 is ready to accept new pods |
| 7 | The bug occurs: Volcano successfully schedules pod1 to node1, but pod1 turns to Failed because of UnexpectedAdmissionError | |
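
A simplified sketch of step 4's failure path, using a toy node/addTask pair rather than the real scheduler cache (the TaskInfo of pod0 is dropped exactly where addTask returns the error):

```go
package main

import (
	"errors"
	"fmt"
)

var errAllocateFail = errors.New("AllocateFailError: request exceeds idle resources")

type node struct {
	idleGPUs int
	tasks    []string
}

// addTask mimics rebuilding the cache on scheduler start-up.
func (n *node) addTask(name string, gpus int) error {
	if gpus > n.idleGPUs {
		return errAllocateFail // pod0 is dropped here and never tracked again
	}
	n.idleGPUs -= gpus
	n.tasks = append(n.tasks, name)
	return nil
}

func main() {
	node1 := &node{idleGPUs: 6} // 2 GPUs were lost while pod0 still holds 8
	if err := node1.addTask("pod0", 8); err != nil {
		fmt.Println("step 4:", err)
	}
	// Step 5 (label/annotate node1) later marks the node Ready even though pod0
	// is missing from node1.tasks, which is what lets pod1 through in step 7.
}
```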

/cc @kenoung

stale bot commented Jul 30, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need to!).

stale bot added the lifecycle/stale label Jul 30, 2022
stale bot commented Oct 1, 2022

Closing for now as there was no activity for the last 60 days after it was marked as stale; let us know if you need this to be reopened! 🤗
