
The loss of GPUs problem has not been completely resolved #1818

Closed
eggiter opened this issue Nov 2, 2021 · 11 comments

Labels: kind/bug, lifecycle/stale, priority/important-soon

Comments

eggiter commented Nov 2, 2021

What happened:

  1. The allocatable GPU count of node1 is currently 6;
  2. Create pod pod1 requesting 6 GPUs;
  3. Volcano successfully schedules pod1 to node1;
  4. However, pod1 turns to Failed because of UnexpectedAdmissionError.

What you expected to happen:

  • node1 should not have any other pods scheduled onto it.

How to reproduce it (as minimally and precisely as possible):

  1. node1 has 8 allocatable GPUs;
  2. Create pod pod0 requesting 8 GPUs;
  3. A GPU error is detected and the allocatable GPU count of node1 drops to 6;
  4. Label or annotate node1 so that its resource version is updated;
  5. Create pod pod1 requesting 6 GPUs (see the example after these steps);
  6. The bug occurs: Volcano successfully schedules pod1 to node1, but pod1 turns to Failed because of UnexpectedAdmissionError.
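
For steps 2 and 5, here is a minimal client-go sketch of creating such a GPU pod. The pod name, namespace, and container image are placeholders, and it assumes GPUs are exposed as nvidia.com/gpu and that Volcano is registered under the scheduler name volcano:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// pod1 requests 6 GPUs and is scheduled by Volcano (step 5 of the repro).
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pod1", Namespace: "default"},
		Spec: corev1.PodSpec{
			SchedulerName: "volcano",
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "nvidia/cuda:11.0-base", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("6"),
					},
				},
			}},
		},
	}

	created, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created", created.Name)
}
```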

Anything else we need to know?:

Main cause:

  1. As the comment on Catch add pod out of sync error #1783 (comment) says, a node becomes Ready when its resources are updated. The problem is that when SetNode is called and the node is updated, there may be tasks that were never added back to this node because of a previous AllocateFailError.
  2. To expand on the question above (when does a node become Ready again?), there are two cases, both sketched below:
    • The node's resources are updated.
      • But in this case the node needs a further re-synchronization to take back the tasks that are missing (due to the previous AllocateFailError);
    • A task that was never added to this node (due to a previous AllocateFailError) is removed.
      • In this case the lost GPUs may not have come back to the node, but the node's state is now consistent with its actual counterpart, and it is ready to accept pods.
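
To make the two cases concrete, here is a minimal Go sketch of that state machine. The nodeInfo type and its fields and methods are hypothetical stand-ins, not the actual Volcano scheduler-cache code:

```go
package main

import "fmt"

type nodeInfo struct {
	allocatable  int      // GPUs the node currently reports
	used         int      // GPUs accounted for in the scheduler cache
	missingTasks []string // tasks bound to the node but rejected by AddTask (AllocateFailError)
	ready        bool     // false means NotReady(OutOfSync)
}

// Case 1: a resource update (e.g. relabelling the node) flips the node to Ready,
// but nothing re-adds the tasks in missingTasks, so idle GPUs are over-counted.
func (n *nodeInfo) setNode(allocatable int) {
	n.allocatable = allocatable
	n.ready = true
}

// Case 2: once every missing task is removed (its pod is gone), the cached state
// matches reality again, and Ready is legitimate even if the lost GPUs never return.
func (n *nodeInfo) removeMissingTask(task string) {
	for i, t := range n.missingTasks {
		if t == task {
			n.missingTasks = append(n.missingTasks[:i], n.missingTasks[i+1:]...)
			break
		}
	}
	n.ready = len(n.missingTasks) == 0
}

func main() {
	n := &nodeInfo{allocatable: 6, used: 0, missingTasks: []string{"pod0"}, ready: false}
	n.setNode(6) // case 1: Ready again, yet pod0's 8 GPUs are not accounted for
	fmt.Printf("ready=%v idle=%d missing=%v\n", n.ready, n.allocatable-n.used, n.missingTasks)
}
```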

Environment:

  • Volcano Version:
    • latest version.
eggiter added the kind/bug label Nov 2, 2021
eggiter commented Nov 2, 2021

/assign @k82cn @Thor-wl @hwdef @zhiyuone

volcano-sh-bot commented:

@eggiter: GitHub didn't allow me to assign the following users: zhiyuone.

Note that only volcano-sh members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @k82cn @Thor-wl @hwdef @zhiyuone

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

eggiter commented Nov 2, 2021

#1783 is not in v1.3.0; can you try the master branch?

stale bot commented Feb 2, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need to!).

stale bot added the lifecycle/stale label Feb 2, 2022
Thor-wl added the priority/important-soon label and removed the lifecycle/stale label Feb 7, 2022
Thor-wl assigned william-wang and shinytang6 and unassigned k82cn Feb 7, 2022
xkd045 commented Apr 12, 2022

I am confused about whether you delete pod0:

  • If you delete pod0, then node1 really has 6 idle GPUs;
  • If you don't delete pod0, then node1's state will be OutOfSync and it will not join the scheduling session, so pod1 cannot be scheduled to node1.

Could the reproduction description be more precise? Thanks @eggiter

eggiter commented Apr 14, 2022

I am confused about whether you delete pod0

@xkd045

  • It's not about whether you delete pod0 or not; the point is that if the node has lost GPUs that were already assigned to some pod, the state of this node should be OutOfSync.
  • Let's say you deleted pod0: then it's fine for the node to be scheduled, as long as the idle GPUs are calculated correctly. If you don't delete pod0, then the actual state of this node needs to be OutOfSync, because the idle GPU count of this node cannot be -2 (6 allocatable minus the 8 already assigned to pod0).

xkd045 commented Apr 14, 2022

  • It's not about whether you delete pod0 or not; the point is that if the node has lost GPUs that were already assigned to some pod, the state of this node should be OutOfSync.

That's right. What confuses me is how the pod0 task could be missing in your description. In my opinion, the pod0 task info goes missing only when the scheduler restarts.
Could the reproduction description be more precise? Thanks @eggiter

kenoung commented Apr 19, 2022

I think the reproduction instructions are based on an older version of Volcano (before #1685 was merged), when the scheduler would panic and restart if the amount of allocatable resources fell below the amount already in use. After the fix in #1685, the node is marked NotReady instead (see the sketch after these steps).

How to reproduce it (as minimally and precisely as possible):

  1. node1 has 8 allocatable GPUs;
  2. Create pod pod0 requesting 8 GPUs -> it gets allocated on node1;
  3. A GPU error is detected and the allocatable GPU count of node1 becomes 6 -> this triggers SetNode and causes node1 to be set to NotReady;
  4. Label or annotate the node so that its resource version is updated;
  5. Create pod pod1 requesting 6 GPUs;
  6. The bug occurred: Volcano successfully schedules pod1 to node1, but pod1 turns to Failed because of UnexpectedAdmissionError; -> node1 is NotReady, so pod1 does not get scheduled.
    ...
  7. pod0 crashes or gets terminated, and node1 is now free.
  8. pod1 still cannot get allocated to node1 because it is NotReady.
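
To illustrate the post-#1685 behaviour, here is a rough Go sketch of the check described above; setNodeState and its fields are illustrative names, not the real implementation:

```go
package main

import "fmt"

type nodeState struct {
	phase  string // "Ready" or "NotReady"
	reason string
}

// setNodeState decides the node phase from the reported allocatable GPUs and
// the GPUs already granted to tasks in the cache.
func setNodeState(allocatableGPUs, usedGPUs int) nodeState {
	if allocatableGPUs < usedGPUs {
		// Before #1685 this situation caused a panic/restart; after it the
		// node is simply marked NotReady and kept out of the session.
		return nodeState{phase: "NotReady", reason: "OutOfSync"}
	}
	return nodeState{phase: "Ready"}
}

func main() {
	fmt.Println(setNodeState(8, 8)) // {Ready }
	fmt.Println(setNodeState(6, 8)) // {NotReady OutOfSync}: GPUs were lost while pod0 still holds 8
}
```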

eggiter commented Apr 19, 2022

What confuses me is how the pod0 task could be missing in your description

@xkd045 Sorry, my bad. There is a KEY action missing between step 3 and step 4: restart the scheduler. The updated reproduction steps are (step 4's failure path is sketched after the table):

| No. | Step | Result |
| --- | ---- | ------ |
| 1 | node1 has 8 allocatable GPUs | |
| 2 | Create pod pod0 requesting 8 GPUs | pod0 assigned to node1 |
| 3 | GPU error detected and the allocatable GPU count of node1 becomes 6 | node1 became NotReady(OutOfSync) |
| 4 | Restart the scheduler | node.AddTask(pod0) failed due to AllocateFailError (node1 now has only 6 GPUs) and node1 became NotReady(OutOfSync); this is where the TaskInfo of pod0 was not added to node1 |
| 5 | Label or annotate the node so that its resource version is updated | this triggers SetNode and causes node1 to recover from NotReady(OutOfSync) |
| 6 | Create pod pod1 requesting 6 GPUs | node1 is ready to accept new pods |
| 7 | The bug occurs: Volcano successfully schedules pod1 to node1, but pod1 turns to Failed because of UnexpectedAdmissionError | |
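
A simplified sketch of step 4's failure path, using a toy node/addTask pair rather than the real scheduler cache (the TaskInfo of pod0 is dropped exactly where addTask returns the error):

```go
package main

import (
	"errors"
	"fmt"
)

var errAllocateFail = errors.New("AllocateFailError: request exceeds idle resources")

type node struct {
	idleGPUs int
	tasks    []string
}

// addTask mimics rebuilding the cache on scheduler start-up.
func (n *node) addTask(name string, gpus int) error {
	if gpus > n.idleGPUs {
		return errAllocateFail // pod0 is dropped here and never tracked again
	}
	n.idleGPUs -= gpus
	n.tasks = append(n.tasks, name)
	return nil
}

func main() {
	node1 := &node{idleGPUs: 6} // 2 GPUs were lost while pod0 still holds 8
	if err := node1.addTask("pod0", 8); err != nil {
		fmt.Println("step 4:", err)
	}
	// Step 5 (label/annotate node1) later marks the node Ready even though pod0
	// is missing from node1.tasks, which is what lets pod1 through in step 7.
}
```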

/cc @kenoung

stale bot commented Jul 30, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need to!).

stale bot added the lifecycle/stale label Jul 30, 2022
stale bot commented Oct 1, 2022

Closing for now as there was no activity for the last 60 days after it was marked as stale; let us know if you need this to be reopened! 🤗
