proposal: add limit-aware scheduling #1495

docs/proposals/20230712-resource-limit-aware-scheduling.md
---
title: Limit Aware Scheduling
authors:
- "@ZiMengSheng"
reviewers:
- "@hormes"
- "@eahydra"
creation-date: 2023-07-12
last-updated: 2023-07-12
---

# Limit Aware Scheduling

## Motivation

Kubernetes divides pod QoS into Guaranteed, Burstable, and BestEffort. Among them, the limits of Burstable pods are greater than their requests. However, the current scheduler only considers pod requests when scheduling pods. As a result, if a large number of Burstable pods are scheduled onto a node, the total resource usage of the pods on the node may exceed the node's allocatable capacity, i.e., the node's resources become over-subscribed. Resource over-subscription may lead to resource contention, performance degradation, and even pod failure (e.g., OOM).

This proposal uses a node's limit-to-allocatable ratio to measure its degree of resource over-subscription, and proposes a scheduling plugin named LimitAware to control the limit/allocatable ratio of nodes.

### Goals

1. Introduce a filter plugin that determines whether a Pod can be scheduled onto a node according to the node's limit-to-allocatable ratio.
2. Introduce a score plugin that mitigates the resource over-subscription issue caused by Burstable pods by "spreading" or "balancing" pods' resource limits across nodes.

### Non-Goals

Replace the existing request-based (allocated) resource plugins.

## Proposal

### User Stories

#### Story 1: Use limit aware score plugin to mitigate resource over-subscription

Consider two nodes with the same resource allocatable (8 CPU cores), each running two pods.

- node1: pod1(request: 2, limit: 6), pod2(request: 2, limit: 4); total request/allocatable: 4/8, total limit/allocatable: 10/8
- node2: pod3(request: 3, limit: 3), pod4(request: 2, limit: 2); total request/allocatable: 5/8, total limit/allocatable: 5/8

To schedule a new pod, pod5 (request: 1, limit: 4): although node1's total request/allocatable ratio is smaller than node2's, the `LimitAware` plugin considers node1's resource limits already over-subscribed and hence places pod5 on node2 instead.

#### Story 2: Use limit aware filter plugin to filter out nodes with high risk of resource over-subscription
> **Review comment from @Huang-Wei** (Aug 18, 2023):
>
> I'd say we should be extra cautious on introducing a Filter plugin, for several reasons:
>
> - users, in my experience, may prefer their pods getting scheduled (and possibly OOMed / CPU-throttled afterwards) over not being schedulable in the first place
> - it would introduce new semantics on KRM, which has a large blast radius: CA has to be aware of this, and so do other integrators

> **Reply** (Member):
>
> @Huang-Wei Thanks for reviewing this PR. Indeed, as you said, introducing a new Filter will cause many problems. Specific to the limit-aware scheduling scenario, considering the runtime stability of the node, it is risky if the total limit of a node is too high. IMO there are two ways to control this risk:
>
> 1. Lower the weight of such nodes based on the scoring algorithm;
> 2. Introduce a Filter to strictly keep the total limit from exceeding the threshold.
>
> The first method is a must. As for the second, it can be offered to users as an option, so that they have the opportunity to use it when necessary. WDYT? @Huang-Wei

> **Reviewer follow-up:**
>
> The 2nd option is not necessarily a big no; it depends on your use case. It's just that when offering it, it would break KRM (the Kubernetes resource model), so you have to document the restrictions clearly, e.g. that the vanilla CA is unable to handle this kind of filter error (unless it gets re-compiled).

Consider a node with 8 allocatable CPU cores running two pods. The user expects the limit/allocatable ratio to be less than or equal to 125%.

- node1: pod1(request: 2, limit: 6), pod2(request: 2, limit: 4); total request/allocatable: 4/8, total limit/allocatable: 10/8

To schedule a new pod, pod5 (request: 1, limit: 4): the `LimitAware` plugin considers that pod5 does not fit node1, because node1's total limit would exceed allocatable * (expected limit/allocatable), i.e., 14 > 8 x 125% = 10.

## Design Details

### Allocatable and Allocated Limit of a Node

This proposal essentially provides a limit-aware ResourceFit filter, so it is necessary to define the allocatable limit and the allocated limit of a node. The allocated limit of a node is the sum of the limits of all pods on the node, while the allocatable limit of a node is determined by the user according to the desired degree of resource over-subscription. This proposal provides a plugin configuration and a node annotation so that users can configure the limit-to-allocatable ratio at the cluster level and at the node level, respectively.

The node-level protocol is as follows:

```go
const (
    // AnnotationLimitToAllocatable is the limit-to-allocatable ratio used by the
    // limit aware scheduler plugin, represented as a percentage.
    AnnotationLimitToAllocatable = NodeDomainPrefix + "/limit-to-allocatable"
)

type LimitToAllocatableRatio map[corev1.ResourceName]intstr.IntOrString

func GetNodeLimitToAllocatableRatio(annotations map[string]string, defaultLimitToAllocatableRatio LimitToAllocatableRatio) (LimitToAllocatableRatio, error) {
    data, ok := annotations[AnnotationLimitToAllocatable]
    if !ok {
        return defaultLimitToAllocatableRatio, nil
    }
    limitToAllocatableRatio := LimitToAllocatableRatio{}
    err := json.Unmarshal([]byte(data), &limitToAllocatableRatio)
    if err != nil {
        return nil, err
    }
    return limitToAllocatableRatio, nil
}
```
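
For illustration, the following is a minimal sketch, not part of the proposed API, of how a node's allocatable limit could be derived from its allocatable resources and the configured ratio. The helper name `getNodeAllocatableLimit` and its details are assumptions of this sketch.

```go
// Sketch: derive a node's allocatable limit from its allocatable resources and
// the configured limit-to-allocatable ratio (expressed as a percentage).
// Milli-CPU precision is ignored here for brevity.
func getNodeAllocatableLimit(allocatable corev1.ResourceList, ratio LimitToAllocatableRatio) corev1.ResourceList {
    allocatableLimit := corev1.ResourceList{}
    for resourceName, quantity := range allocatable {
        r, ok := ratio[resourceName]
        if !ok {
            // No ratio configured for this resource: keep the raw allocatable.
            allocatableLimit[resourceName] = quantity
            continue
        }
        // With total=100, an integer value such as 125 and a percent string
        // such as "125%" both evaluate to 125.
        percent, err := intstr.GetScaledValueFromIntOrPercent(&r, 100, false)
        if err != nil {
            allocatableLimit[resourceName] = quantity
            continue
        }
        allocatableLimit[resourceName] = *resource.NewQuantity(quantity.Value()*int64(percent)/100, quantity.Format)
    }
    return allocatableLimit
}
```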

The cluster-level protocol is as follows:

```go
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// LimitAwareArgs defines the parameters for the LimitAware plugin.
type LimitAwareArgs struct {
    metav1.TypeMeta

    // Resources is a list of <resource, weight> pairs to be considered while scoring;
    // allowed weights start from 1.
    Resources []schedconfig.ResourceSpec

    // DefaultLimitToAllocatableRatio is the cluster-level default limit-to-allocatable
    // ratio, overridable per node via the annotation above.
    DefaultLimitToAllocatableRatio extension.LimitToAllocatableRatio
}
```
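
For clarity, here is a hypothetical instantiation of these arguments; the concrete weights and ratios are illustrative only:

```go
// Hypothetical example configuration; the values shown are illustrative.
args := &LimitAwareArgs{
    Resources: []schedconfig.ResourceSpec{
        {Name: string(corev1.ResourceCPU), Weight: 1},
        {Name: string(corev1.ResourceMemory), Weight: 1},
    },
    DefaultLimitToAllocatableRatio: extension.LimitToAllocatableRatio{
        corev1.ResourceCPU:    intstr.FromInt(125), // allow CPU limits up to 125% of allocatable
        corev1.ResourceMemory: intstr.FromInt(100), // do not over-subscribe memory limits
    },
}
```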

### Requested Limit of a Pod

Similar to calculating the resource request of a pod, we use the following function to calculate the requested limit of a Pod.

```go
func getPodResourceLimit(pod *corev1.Pod, nonZero bool) (result *framework.Resource) {
    result = &framework.Resource{}
    non0CPU, non0Mem := int64(0), int64(0)
    for _, container := range pod.Spec.Containers {
        // When a container does not specify limits, fall back to its requests.
        limit := quotav1.Max(container.Resources.Limits, container.Resources.Requests)
        result.Add(limit)
        non0CPUReq, non0MemReq := schedutil.GetNonzeroRequests(&limit)
        non0CPU += non0CPUReq
        non0Mem += non0MemReq
    }
    // take max_resource(sum_pod, any_init_container)
    for _, container := range pod.Spec.InitContainers {
        limit := quotav1.Max(container.Resources.Limits, container.Resources.Requests)
        result.SetMaxResource(limit)
        non0CPUReq, non0MemReq := schedutil.GetNonzeroRequests(&limit)
        non0CPU = max(non0CPU, non0CPUReq)
        non0Mem = max(non0Mem, non0MemReq)
    }
    // If Overhead is being utilized, add it to the total limits for the pod.
    if pod.Spec.Overhead != nil {
        result.Add(pod.Spec.Overhead)
        if _, found := pod.Spec.Overhead[corev1.ResourceCPU]; found {
            non0CPU += pod.Spec.Overhead.Cpu().MilliValue()
        }
        if _, found := pod.Spec.Overhead[corev1.ResourceMemory]; found {
            non0Mem += pod.Spec.Overhead.Memory().Value()
        }
    }
    if nonZero {
        result.MilliCPU = non0CPU
        result.Memory = non0Mem
    }
    return
}
```

When a container does not specify limits, its requests are used as its limits.
For CPU and memory, a pod that does not explicitly specify requests and limits is treated as having the default limits shown below, for the purpose of computing priority only.
This ensures that zero-limit pods are not all scheduled onto the machine with the smallest in-use limit, and that regular pods do not see zero-limit pods as consuming no resources at all.
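
As an assumption for this sketch, the defaults below mirror the upstream kube-scheduler constants used by `schedutil.GetNonzeroRequests` (`DefaultMilliCPURequest` and `DefaultMemoryRequest`) at the time of writing:

```go
// Assumed non-zero defaults for containers that specify neither requests nor limits.
const (
    // defaultMilliCPULimit is the CPU limit assumed for such containers (0.1 core).
    defaultMilliCPULimit int64 = 100
    // defaultMemoryLimit is the memory limit assumed for such containers (200 MiB).
    defaultMemoryLimit int64 = 200 * 1024 * 1024
)
```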

### Filter out nodes with high risk of resource over-subscription

From the above, we can determine a node's allocatable limit and allocated limit. When scheduling a new pod, all nodes for which requested limit + allocated limit > allocatable limit are filtered out.

A DaemonSet pod may set limits without setting requests; moreover, DaemonSets are special in that they usually run basic node components and may be used to extend a node's functionality.
Therefore, DaemonSet pods are skipped during the Filter stage, as sketched below.
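
A rough sketch of this Filter logic is shown below; the receiver type and the helpers `isDaemonSetPod`, `getNodeAllocatableLimit`, `getAllocatedLimit`, and `exceedsAllocatableLimit` are illustrative assumptions rather than the final implementation.

```go
// Sketch of the Filter extension point; helper functions are assumed.
func (p *Plugin) Filter(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // DaemonSet pods are skipped, as explained above.
    if isDaemonSetPod(pod.OwnerReferences) {
        return nil
    }
    node := nodeInfo.Node()
    ratio, err := GetNodeLimitToAllocatableRatio(node.Annotations, p.defaultLimitToAllocatableRatio)
    if err != nil {
        return framework.AsStatus(err)
    }
    allocatableLimit := getNodeAllocatableLimit(node.Status.Allocatable, ratio)
    allocatedLimit := getAllocatedLimit(nodeInfo)     // sum of the limits of pods already on the node
    requestedLimit := getPodResourceLimit(pod, false) // limit requested by the incoming pod
    if exceedsAllocatableLimit(requestedLimit, allocatedLimit, allocatableLimit) {
        return framework.NewStatus(framework.Unschedulable, "node(s) limit-to-allocatable ratio exceeded")
    }
    return nil
}
```
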
### Score nodes to favor less risk of resource over-subscription

This proposal borrows [the node-scoring idea from kubernetes-sigs scheduler-plugins](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/217-resource-limit-aware-scoring/README.md#design-details), but replaces the node's allocatable resources with our allocatable limit.

Since the main purpose of the limit-aware score plugin is to reduce the risk of resource over-subscription through spreading or balancing, this proposal uses a least-allocated-limit strategy to score nodes; a minimal sketch of the scoring follows the steps below.

1. Calculate each node's raw score. A node's raw score will be negative for a node whose allocated limit is greater than its allocatable limit.
   - `Score = (allocatable limit - allocated limit) * MaxNodeScore / allocatable limit`
   - For multiple resources, the raw score is the weighted sum of the per-resource scores. The resources and their weights are configurable as plugin arguments. Like the other `NodeResources` scoring plugins, the plugin supports the standard resource types CPU, memory, and ephemeral-storage, plus any scalar resources.
2. Normalize the raw scores to `[MinNodeScore, MaxNodeScore]` (e.g., [0, 100]) after all nodes are scored.
   - `NormalizedScore = MinNodeScore + (RawScore - LowestRawScore) / (HighestRawScore - LowestRawScore) * (MaxNodeScore - MinNodeScore)`
3. Choose the node with the highest score.
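
Below is a minimal sketch of the raw-score and normalization steps for a single resource; the function names, the per-resource weighting, and the equal-score fallback are illustrative assumptions.

```go
// Sketch: raw score for one resource; the result is negative when the node's
// allocated limit already exceeds its allocatable limit.
func rawResourceScore(allocatableLimit, allocatedLimit, weight int64) int64 {
    if allocatableLimit <= 0 {
        return 0
    }
    return (allocatableLimit - allocatedLimit) * framework.MaxNodeScore * weight / allocatableLimit
}

// Sketch: normalize the raw scores of all nodes into [MinNodeScore, MaxNodeScore].
func normalizeScores(scores framework.NodeScoreList) {
    if len(scores) == 0 {
        return
    }
    lowest, highest := scores[0].Score, scores[0].Score
    for _, s := range scores {
        if s.Score < lowest {
            lowest = s.Score
        }
        if s.Score > highest {
            highest = s.Score
        }
    }
    for i := range scores {
        if highest == lowest {
            // All nodes scored equally; give every node the maximum score.
            scores[i].Score = framework.MaxNodeScore
            continue
        }
        scores[i].Score = framework.MinNodeScore +
            (scores[i].Score-lowest)*(framework.MaxNodeScore-framework.MinNodeScore)/(highest-lowest)
    }
}
```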

## Alternatives

The kubernetes-sigs scheduler-plugins project also noticed this problem and proposed a [resource limit aware scoring](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/217-resource-limit-aware-scoring/README.md) plugin. However, it does not support configuring the limit-to-allocatable ratio of nodes. A configurable allocatable limit is useful when, based on prior knowledge, different nodes tolerate different degrees of resource contention.