
proposal: add limit-aware scheduling #1495

Closed

Conversation

ZiMengSheng
Contributor

Ⅰ. Describe what this PR does

This PR proposes a resource limit-aware plugin to mitigate the risk of overcommitment.

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@koordinator-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hormes after the PR has been reviewed.
You can assign the PR to them by writing /assign @hormes in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eahydra eahydra changed the title docs: add limit aware scheduling proposal: add limit aware scheduling Jul 25, 2023
@ZiMengSheng
Contributor Author

/cc @eahydra

@koordinator-bot koordinator-bot bot requested a review from eahydra July 25, 2023 09:46
@codecov

codecov bot commented Jul 25, 2023

Codecov Report

Patch coverage has no change; project coverage changed by +0.67% 🎉

Comparison is base (667b3ff) 64.09% compared to head (516e2bf) 64.76%.
Report is 37 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1495      +/-   ##
==========================================
+ Coverage   64.09%   64.76%   +0.67%     
==========================================
  Files         341      347       +6     
  Lines       34943    35376     +433     
==========================================
+ Hits        22396    22913     +517     
+ Misses      10899    10780     -119     
- Partials     1648     1683      +35     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 64.76% <ø> (+0.67%) ⬆️ |

Flags with carried forward coverage won't be shown.

see 102 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@ZiMengSheng ZiMengSheng force-pushed the limitawareproposal branch 3 times, most recently from a4b552b to 123276d Compare July 25, 2023 09:49
@eahydra eahydra changed the title proposal: add limit aware scheduling proposal: add limit-aware scheduling Aug 8, 2023
@ZiMengSheng ZiMengSheng force-pushed the limitawareproposal branch 2 times, most recently from 0c01996 to 23f65ab Compare August 8, 2023 03:30
@eahydra
Member

eahydra commented Aug 8, 2023

There are several issues to consider:

  1. Whether DaemonSets should be handled specially during the Filter stage. A DaemonSet may set a Limit without setting a Request, and DaemonSets are rather special: they generally run basic components, and it is common to extend a node's functionality through DaemonSets.
  2. In the scoring stage, BestEffort Pods also need to be considered; a default limit can be assumed for these Pods (see the sketch below).
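
As a rough sketch of the second point (the helper name and default values below are assumptions for illustration, not the plugin's actual API), scoring could fall back to an assumed default limit for pods that declare no limits at all:

```go
// Illustrative sketch only: fall back to an assumed default limit when a pod
// declares no container limits (i.e. BestEffort pods).
package limitaware

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// defaultBestEffortLimit is a hypothetical, configurable fallback limit.
var defaultBestEffortLimit = corev1.ResourceList{
	corev1.ResourceCPU:    resource.MustParse("1"),
	corev1.ResourceMemory: resource.MustParse("2Gi"),
}

// effectiveLimits sums the declared container limits of a pod; if the pod
// declares none, the assumed default is used for scoring instead.
func effectiveLimits(pod *corev1.Pod) corev1.ResourceList {
	total := corev1.ResourceList{}
	for _, c := range pod.Spec.Containers {
		for name, qty := range c.Resources.Limits {
			sum := total[name]
			sum.Add(qty)
			total[name] = sum
		}
	}
	if len(total) == 0 {
		return defaultBestEffortLimit
	}
	return total
}
```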

@ZiMengSheng ZiMengSheng force-pushed the limitawareproposal branch 13 times, most recently from 0564dfc to 1846c64 Compare August 9, 2023 12:06
Signed-off-by: wangjianyu.wjy <wangjianyu.wjy@alibaba-inc.com>

To schedule a new pod pod5 (request: 1, limit: 4): although node1's total request/allocatable ratio is smaller than node2's, our `LimitAware` plugin considers node1's resource limits already oversubscribed and therefore places pod5 on node2 instead.
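
As a minimal illustration of the comparison Story 1 describes (the node figures below are made up, not taken from the proposal), the decision can be sketched as:

```go
// Made-up numbers illustrating Story 1: node1 wins on the request ratio but
// is already oversubscribed on limits, so a limit-aware score prefers node2.
package main

import "fmt"

type nodeState struct {
	name                       string
	allocatable                float64 // e.g. CPU cores
	totalRequested, totalLimit float64
}

func main() {
	pod5 := struct{ request, limit float64 }{request: 1, limit: 4}
	nodes := []nodeState{
		{name: "node1", allocatable: 10, totalRequested: 2, totalLimit: 12},
		{name: "node2", allocatable: 10, totalRequested: 5, totalLimit: 4},
	}
	for _, n := range nodes {
		reqRatio := (n.totalRequested + pod5.request) / n.allocatable
		limitRatio := (n.totalLimit + pod5.limit) / n.allocatable
		fmt.Printf("%s: request ratio %.2f, limit ratio %.2f\n", n.name, reqRatio, limitRatio)
	}
	// Prints:
	// node1: request ratio 0.30, limit ratio 1.60  (limits oversubscribed)
	// node2: request ratio 0.60, limit ratio 0.80  (preferred for pod5)
}
```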

#### Story 2: Use limit aware filter plugin to filter out nodes with high risk of resource over-subscription

@Huang-Wei Huang-Wei Aug 18, 2023


I'd say we should be extra cautious on introducing a Filter plugin. Several reasons:

  • Users, in my experience, may prefer that their pods get scheduled (and possibly get OOMed / CPU throttled afterwards) rather than not being schedulable in the first place.
  • It'd introduce new semantics on top of KRM, which has a large blast radius: CA has to be aware of this, and so do other integrators.

Member


@Huang-Wei Thanks for reviewing this PR.
Indeed, as you said, introducing a new Filter would cause many problems.
Specific to the limit-aware scheduling scenario: considering a node's runtime stability, it is risky if the node's total limit is too high. IMO there are two ways to control this risk:

  1. Lower the score of such nodes through the scoring algorithm;
  2. Introduce a Filter to strictly prevent the total Limit from exceeding a threshold.

The first method is a must. The second can be offered to users as an option, so that they can enable it when necessary; a rough sketch of both is below.
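
A rough sketch of what both options could look like (function names, the scoring curve, and the threshold semantics are assumptions for illustration, not the actual plugin design):

```go
// Illustrative sketch of the two risk controls discussed above.
package limitaware

const maxNodeScore = 100

// scoreByLimit (option 1, always on): lower a node's score as its total
// limit approaches and exceeds its allocatable capacity.
func scoreByLimit(totalLimit, allocatable float64) int64 {
	if allocatable <= 0 {
		return 0
	}
	ratio := totalLimit / allocatable
	if ratio >= 2 { // assumed cap: heavily oversubscribed nodes score 0
		return 0
	}
	return int64((2 - ratio) / 2 * maxNodeScore)
}

// filterByLimit (option 2, disabled unless the user sets a threshold):
// reject nodes whose total limit would exceed allocatable * threshold%.
func filterByLimit(totalLimit, allocatable, thresholdPercent float64) bool {
	if thresholdPercent <= 0 { // threshold unset: filter disabled
		return true
	}
	return totalLimit <= allocatable*thresholdPercent/100
}
```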

WDYT? @Huang-Wei


The 2nd option is not necessarily a big no; it depends on your use-case. It's just that when offering it, it breaks KRM (the Kubernetes resource model), so you have to document the restrictions clearly, e.g., that vanilla CA is unable to handle this kind of filter error (unless it gets re-compiled), etc.


stale bot commented Nov 28, 2023

This issue has been automatically marked as stale because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:
  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close

Thank you for your contributions.


stale bot commented Dec 28, 2023

This issue has been automatically closed because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:
  • Reopen this PR with /reopen

Thank you for your contributions.

@stale stale bot closed this Dec 28, 2023