proposal: add limit-aware scheduling #1495
Closed
docs/proposals/20230712-resource-limit-aware-scheduling.md
---
title: Limit Aware Scheduling
authors:
  - "@ZiMengSheng"
reviewers:
  - "@hormes"
  - "@eahydra"
creation-date: 2023-07-12
last-updated: 2023-07-12
---

# Limit Aware Scheduling

## Motivation

Kubernetes divides pod QoS into Guaranteed, Burstable, and BestEffort. Among them, Burstable pods have limits greater than their requests. However, the current scheduler only considers pod requests when scheduling pods. As a result, if a large number of Burstable pods are scheduled onto a node, the total resource usage of the pods on that node may exceed the node's allocatable capacity, i.e., the node's resources are over-subscribed. Resource over-subscription may lead to resource contention, performance degradation, and even pod failures (e.g., OOM kills).

This proposal uses a node's limit-to-allocatable ratio to measure its degree of resource over-subscription, and proposes a scheduling plugin named `LimitAware` to control the limit/allocatable ratio of each node.

### Goals

1. Introduce a filter plugin that determines whether a pod can be scheduled on a node according to the node's limit-to-allocatable ratio.
2. Introduce a score plugin that mitigates the resource over-subscription caused by Burstable pods by "spreading" or "balancing" pods' resource limits across nodes.

### Non-Goals

Replace the existing request-based resource allocated plugins.

## Proposal

### User Story

#### Story 1: Use limit aware score plugin to mitigate resource over-subscription

Consider two nodes with the same allocatable resources (8 CPU cores), each running two pods.

- node1: pod1 (request: 2, limit: 6), pod2 (request: 2, limit: 4); total request/allocatable: 4/8, total limit/allocatable: 10/8
- node2: pod3 (request: 3, limit: 3), pod4 (request: 2, limit: 2); total request/allocatable: 5/8, total limit/allocatable: 5/8

When scheduling a new pod pod5 (request: 1, limit: 4), although node1's total request/allocatable ratio is smaller than node2's, the `LimitAware` plugin considers node1's resource limits already over-subscribed and therefore places pod5 on node2 instead.

#### Story 2: Use limit aware filter plugin to filter out nodes with high risk of resource over-subscription

Consider a node with 8 CPU cores allocatable and two pods running. The user expects limit/allocatable to stay less than or equal to 125%.

- node1: pod1 (request: 2, limit: 6), pod2 (request: 2, limit: 4); total request/allocatable: 4/8, total limit/allocatable: 10/8

When scheduling a new pod pod5 (request: 1, limit: 4), the `LimitAware` plugin considers that pod5 does not fit node1, because node1's total limit would exceed allocatable * (expected limit/allocatable): 14 > 8 x 125% = 10.

## Design Details

### Allocatable and Allocated Limit of a Node

This proposal essentially provides a limit-aware ResourceFit filter, so the allocatable limit and allocated limit of a node must be defined. The allocated limit of a node is the sum of the limits of all pods on the node. The allocatable limit of a node is determined by the user according to the desired degree of resource over-subscription: a plugin configuration sets the limit-to-allocatable ratio at the cluster level, and a node annotation overrides it at the node level.

The node-level protocol is as follows:

```go
const (
	// AnnotationLimitToAllocatable is the limit-to-allocatable ratio used by the limit aware
	// scheduler plugin, represented as a percentage per resource.
	AnnotationLimitToAllocatable = NodeDomainPrefix + "/limit-to-allocatable"
)

type LimitToAllocatableRatio map[corev1.ResourceName]intstr.IntOrString

func GetNodeLimitToAllocatableRatio(annotations map[string]string, defaultLimitToAllocatableRatio LimitToAllocatableRatio) (LimitToAllocatableRatio, error) {
	data, ok := annotations[AnnotationLimitToAllocatable]
	if !ok {
		return defaultLimitToAllocatableRatio, nil
	}
	limitToAllocatableRatio := LimitToAllocatableRatio{}
	err := json.Unmarshal([]byte(data), &limitToAllocatableRatio)
	if err != nil {
		return nil, err
	}
	return limitToAllocatableRatio, nil
}
```

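For illustration, the following minimal sketch shows how a node annotation could be parsed with the helper above. The annotation key assumes `NodeDomainPrefix` resolves to `node.koordinator.sh`, and the function and variable names are hypothetical; it also assumes the `corev1` (`k8s.io/api/core/v1`) and `intstr` (`k8s.io/apimachinery/pkg/util/intstr`) packages.

```go
// Illustrative usage only; not part of the proposed API.
func exampleNodeRatio() {
	annotations := map[string]string{
		// Assumed key; the concrete value of NodeDomainPrefix may differ.
		"node.koordinator.sh/limit-to-allocatable": `{"cpu": 120, "memory": 150}`,
	}
	// Cluster-wide default: no over-subscription (100%).
	defaultRatio := LimitToAllocatableRatio{
		corev1.ResourceCPU:    intstr.FromInt(100),
		corev1.ResourceMemory: intstr.FromInt(100),
	}
	ratio, err := GetNodeLimitToAllocatableRatio(annotations, defaultRatio)
	if err != nil {
		// A malformed annotation yields an error; fall back to the default ratio.
		ratio = defaultRatio
	}
	_ = ratio // ratio[corev1.ResourceCPU].IntValue() == 120, ratio[corev1.ResourceMemory].IntValue() == 150
}
```
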
The cluster-level protocol is as follows:

```go
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// LimitAwareArgs defines the parameters for the LimitAware plugin.
type LimitAwareArgs struct {
	metav1.TypeMeta

	// Resources is a list of <resource, weight> pairs to be considered while scoring;
	// allowed weights start from 1.
	Resources []schedconfig.ResourceSpec

	// DefaultLimitToAllocatableRatio is the cluster-level default limit-to-allocatable ratio.
	DefaultLimitToAllocatableRatio extension.LimitToAllocatableRatio
}
```

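As a minimal, illustrative sketch (the helper name and signature are assumptions, not part of this proposal's API), the allocatable limit of a node could then be derived by scaling its allocatable resources by the configured percentage. The sketch assumes the scheduler framework package (`k8s.io/kubernetes/pkg/scheduler/framework`).

```go
// getNodeAllocatableLimit is an illustrative helper (not defined by this proposal).
// It scales each allocatable resource of the node by the configured
// limit-to-allocatable percentage to obtain the node's allocatable limit.
func getNodeAllocatableLimit(node *corev1.Node, ratio LimitToAllocatableRatio) *framework.Resource {
	allocatableLimit := framework.NewResource(node.Status.Allocatable)
	if percent, ok := ratio[corev1.ResourceCPU]; ok {
		allocatableLimit.MilliCPU = allocatableLimit.MilliCPU * int64(percent.IntValue()) / 100
	}
	if percent, ok := ratio[corev1.ResourceMemory]; ok {
		allocatableLimit.Memory = allocatableLimit.Memory * int64(percent.IntValue()) / 100
	}
	return allocatableLimit
}
```
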
### Requested Limit of Pod

Similar to calculating the resource request of a pod, we use the following function to calculate the requested limit of a pod.

```go
func getPodResourceLimit(pod *corev1.Pod, nonZero bool) (result *framework.Resource) {
	result = &framework.Resource{}
	non0CPU, non0Mem := int64(0), int64(0)
	for _, container := range pod.Spec.Containers {
		limit := quotav1.Max(container.Resources.Limits, container.Resources.Requests)
		result.Add(limit)
		non0CPUReq, non0MemReq := schedutil.GetNonzeroRequests(&limit)
		non0CPU += non0CPUReq
		non0Mem += non0MemReq
	}
	// take max_resource(sum_pod, any_init_container)
	for _, container := range pod.Spec.InitContainers {
		limit := quotav1.Max(container.Resources.Limits, container.Resources.Requests)
		result.SetMaxResource(limit)
		non0CPUReq, non0MemReq := schedutil.GetNonzeroRequests(&limit)
		non0CPU = max(non0CPU, non0CPUReq)
		non0Mem = max(non0Mem, non0MemReq)
	}
	// If Overhead is being utilized, add it to the total limits for the pod
	if pod.Spec.Overhead != nil {
		result.Add(pod.Spec.Overhead)
		if _, found := pod.Spec.Overhead[corev1.ResourceCPU]; found {
			non0CPU += pod.Spec.Overhead.Cpu().MilliValue()
		}
		if _, found := pod.Spec.Overhead[corev1.ResourceMemory]; found {
			non0Mem += pod.Spec.Overhead.Memory().Value()
		}
	}
	if nonZero {
		result.MilliCPU = non0CPU
		result.Memory = non0Mem
	}
	return
}
```

When a container does not specify limits, its requests are used as its limits. For CPU and memory, a pod that doesn't explicitly specify either requests or limits is treated as having the non-zero default limit applied by `schedutil.GetNonzeroRequests`, for the purpose of computing priority only. This ensures that when scheduling zero-limit pods, such pods will not all be scheduled onto the node with the smallest in-use limit, and that when scheduling regular pods, zero-limit pods are not seen as consuming no resources whatsoever.

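For reference, the upstream defaults used by `GetNonzeroRequests` are, at the time of writing, roughly the following (they may change across Kubernetes versions, so check the version in use):

```go
// Illustrative only: upstream constants from k8s.io/kubernetes/pkg/scheduler/util
// at the time of writing:
//   DefaultMilliCPURequest int64 = 100               // 0.1 core
//   DefaultMemoryRequest   int64 = 200 * 1024 * 1024 // 200 MB
func exampleNonzeroDefaults() {
	var none corev1.ResourceList // a container with neither requests nor limits set
	milliCPU, memory := schedutil.GetNonzeroRequests(&none)
	_, _ = milliCPU, memory // milliCPU == 100, memory == 200*1024*1024
}
```
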
### Filter out nodes with high risk of resource over-subscription

With the definitions above, we can determine the allocatable limit and the allocated limit of a node. When scheduling a new pod, any node for which requested limit + allocated limit > allocatable limit is filtered out.

A DaemonSet pod may set limits without setting requests. DaemonSets are also special: they generally run basic node-level components and are often used to extend node functionality. Therefore, DaemonSet pods are skipped during the Filter stage.

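A minimal sketch of this Filter logic is shown below. It is illustrative only: `isDaemonSetPod`, `allocatedLimitOf`, and `ratioOf` are assumed helpers (e.g., an owner-reference check and plugin-maintained caches), `getNodeAllocatableLimit` is the illustrative helper sketched earlier, and the `context` and scheduler framework packages are assumed imports.

```go
// fitsLimit checks whether a pod's requested limit plus the node's allocated
// limit stays within the node's allocatable limit for CPU and memory.
func fitsLimit(podLimit, allocatedLimit, allocatableLimit *framework.Resource) bool {
	if allocatedLimit.MilliCPU+podLimit.MilliCPU > allocatableLimit.MilliCPU {
		return false
	}
	if allocatedLimit.Memory+podLimit.Memory > allocatableLimit.Memory {
		return false
	}
	return true
}

// Filter sketches how the LimitAware plugin could reject over-subscribed nodes.
func (p *LimitAware) Filter(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if isDaemonSetPod(pod) {
		// DaemonSet pods may set limits without requests and typically run basic
		// node-level components, so they are skipped here.
		return nil
	}
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	podLimit := getPodResourceLimit(pod, false)
	allocatedLimit := p.allocatedLimitOf(node.Name)                    // assumed cache lookup
	allocatableLimit := getNodeAllocatableLimit(node, p.ratioOf(node)) // ratio helper sketched above
	if !fitsLimit(podLimit, allocatedLimit, allocatableLimit) {
		return framework.NewStatus(framework.Unschedulable, "limit to allocatable ratio exceeded")
	}
	return nil
}
```
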
### Score nodes to favor less risk of resource over-subscription

This proposal borrows [the node-scoring idea from kubernetes-sigs](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/217-resource-limit-aware-scoring/README.md#design-details) but replaces the node's allocatable resources with our allocatable limit.

Since the main purpose of the limit aware score plugin is to reduce the risk of resource over-subscription through spreading or balancing, this proposal uses the least-allocated-limit strategy to score nodes, as sketched after the list below.

1. Calculate each node's raw score. A node's raw score will be negative if its allocated limit is greater than its allocatable limit.
   - `Score = (allocatable limit - allocated limit) * MaxNodeScore / allocatable limit`
   - For multiple resources, the raw score is the weighted sum over all resources. The resources and their weights are configurable as plugin arguments. Like the other `NodeResources` scoring plugins, the plugin supports the standard resource types (CPU, memory, ephemeral storage) plus any scalar resources.
2. Normalize the raw scores to `[MinNodeScore, MaxNodeScore]` (e.g., [0, 100]) after all nodes are scored.
   - `NormalizedScore = MinNodeScore + (RawScore - LowestRawScore) / (HighestRawScore - LowestRawScore) * (MaxNodeScore - MinNodeScore)`
3. Choose the node with the highest score.

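The following sketch illustrates the raw-score formula for a single resource and the normalization step above (illustrative only; the per-resource weights and multi-resource aggregation are omitted for brevity, and the function names are not part of the proposal):

```go
// rawScore computes a node's raw least-allocated-limit score for one resource.
// It goes negative when the allocated limit already exceeds the allocatable limit.
func rawScore(allocatableLimit, allocatedLimit int64) int64 {
	if allocatableLimit <= 0 {
		return 0
	}
	return (allocatableLimit - allocatedLimit) * framework.MaxNodeScore / allocatableLimit
}

// normalizeScores rescales raw scores into [MinNodeScore, MaxNodeScore] after
// every node has been scored.
func normalizeScores(scores framework.NodeScoreList) {
	if len(scores) == 0 {
		return
	}
	lowest, highest := scores[0].Score, scores[0].Score
	for _, s := range scores {
		if s.Score < lowest {
			lowest = s.Score
		}
		if s.Score > highest {
			highest = s.Score
		}
	}
	for i := range scores {
		if highest == lowest {
			scores[i].Score = framework.MaxNodeScore
			continue
		}
		scores[i].Score = framework.MinNodeScore +
			(scores[i].Score-lowest)*(framework.MaxNodeScore-framework.MinNodeScore)/(highest-lowest)
	}
}
```
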
## Alternatives

The kubernetes-sigs scheduler-plugins community also noticed this problem and proposed a [resource limit aware scoring](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/217-resource-limit-aware-scoring/README.md) plugin. However, it does not support configuring the limit-to-allocatable ratio per node. A configurable allocatable limit is useful when different nodes tolerate different degrees of resource contention according to prior knowledge.
I'd say we should be extra cautious on introducing a Filter plugin. Several reasons:
@Huang-Wei Thanks for reviewing the PR.
Indeed, as you said, introducing a new Filter will cause many problems.
Specific to the Limit-aware scheduling scenario, considering the runtime stability of the node, it is risky if the total limit of a node is too high. IMO there are two ways to control this risk:
The first method is a must. As for the second method, it can be given to users as an option, so that they have the opportunity to use it when necessary.
WDYT? @Huang-Wei
The 2nd option is not necessarily a big no; it depends on your use case. It's just that, when offering it, it'd break the KRM (Kubernetes resource model), so you have to document the restrictions clearly, e.g. that vanilla CA is unable to handle this kind of filter error (unless it gets re-compiled), etc.