Add a design proposal for adding priority to Kubernetes API
bsalamat committed May 5, 2017
1 parent efed6b2 commit 92cbbde
Showing 1 changed file with 187 additions and 0 deletions.
187 changes: 187 additions & 0 deletions contributors/design-proposals/pod-priority-api.md
@@ -0,0 +1,187 @@
# Priority in Kubernetes API

@bsalamat

May 2017
* [Objective](#objective)
* [Non-Goals](#non-goals)
* [Background](#background)
* [Overview](#overview)
* [Detailed Design](#detailed-design)
* [Effect of priority on scheduling](#effect-of-priority-on-scheduling)
* [Effect of priority on preemption](#effect-of-priority-on-preemption)
* [Priority in PodSpec](#priority-in-podspec)
* [Priority names](#priority-names)
* [Resolving priority names](#resolving-priority-names)
* [Modifying priority names](#modifying-priority-names)
* [Ordering of priorities](#ordering-of-priorities)
* [Dynamic change of priority names](#dynamic-change-of-priority-names)
* [Drawbacks of changing priority names](#drawbacks-of-changing-priority-names)
* [Priority and QoS classes](#priority-and-qos-classes)


## Objective



* Define how to specify priority for workloads in the Kubernetes API.
* Define how the order of these priorities is specified.
* Define how new priority levels are added.
* Describe the effect of priority on scheduling and preemption.

### Non-Goals



* How preemption works in Kubernetes.
* How quota allocation and accounting work for each priority.

## Background

It is fairly common for a cluster to have more tasks than its resources can
handle. Often the workload is a mix of high priority critical tasks and
non-urgent tasks that can wait. Cluster management should be able to
distinguish these workloads in order to decide which ones should acquire
resources sooner and which ones can wait. The priority of a workload is one of
the key metrics that provides this information to the cluster. This document is a
more detailed design proposal for part of the high-level architecture described
in [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA).

## Overview

This design doc introduces the concept of priorities for pods in Kubernetes and
describes how priority impacts scheduling and preemption of pods when the
cluster runs out of resources. A pod can specify a priority at creation time.
The priority must be one of the valid values, and there is a total order on
the values. The priority of a pod is independent of its workload type. The
priority is global and not specific to a particular namespace.

## Detailed Design

### Effect of priority on scheduling

One could generally expect a pod with higher priority to have a higher chance
of getting scheduled than the same pod with lower priority. However, many
other parameters affect scheduling decisions, so a high priority pod may or
may not be scheduled sooner than lower priority pods. The details of what
determines the order in which pods are scheduled are beyond the scope of this
document.

### Effect of priority on preemption

Generally, lower priority pods are more likely to get preempted by higher
priority pods when the cluster is out of resources. In such a case, the
scheduler may decide to preempt lower priority pods to release enough resources
for higher priority pending pods. As mentioned before, many other parameters
affect scheduling decisions, such as affinity and anti-affinity. If the
scheduler determines that a high priority pod cannot be scheduled even if lower
priority pods are preempted, it will not preempt lower priority pods. The
scheduler may have other restrictions on preempting pods; for example, it may
refuse to preempt a pod if doing so would violate a PodDisruptionBudget. The
details of scheduling and preemption decisions are beyond the scope of this
document.
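
The rule that lower priority pods are preempted only when doing so actually
makes room for the pending pod can be illustrated with a small sketch. This is
not the scheduler's algorithm: the `runningPod` type, the `preemptionWouldHelp`
function, and the single CPU dimension are assumptions made for illustration,
and the sketch ignores affinity, anti-affinity, and PodDisruptionBudget.

```go
package main

// runningPod is a hypothetical, trimmed-down view of a pod on a node.
type runningPod struct {
	Priority int32
	CPU      int64 // requested millicores; one resource keeps the sketch small
}

// preemptionWouldHelp reports whether the pending pod would fit on the node
// if every running pod with a strictly lower priority were preempted.
// When it returns false, preempting lower priority pods is pointless.
func preemptionWouldHelp(pendingPriority int32, pendingCPU, freeCPU int64, running []runningPod) bool {
	reclaimable := freeCPU
	for _, p := range running {
		if p.Priority < pendingPriority {
			reclaimable += p.CPU
		}
	}
	return reclaimable >= pendingCPU
}
```

A real scheduler would evaluate all resources and constraints per node; the
point here is only that pods of equal or higher priority never count as
reclaimable.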

### Priority in PodSpec

Pods may have priority in their pod spec. PodSpec has a field called
"PriorityName", which is specified by the user. PodStatus has a corresponding
field called "Priority". The user-specified priority (PriorityName) is a string,
and all of the valid priority names are defined by a system-wide mapping that
maps each string to an integer. The PriorityName specified in a pod spec must
be found in this map or the pod creation request will be rejected. If
PriorityName is empty, "tier3" (see below) is assumed. I am open to other
suggestions about what the default priority should be. Once the PriorityName is
resolved to an integer, it is placed in the "Priority" field of PodStatus.


```
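type PodSpec struct {
	...
	// PriorityName is specified by the user and must be a valid priority name.
	PriorityName string
}

type PodStatus struct {
	...
	// Priority is the integer value that PriorityName resolves to.
	Priority int // Populated by Admission Controller.
}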
```

### Priority names

A Kubernetes cluster has four predefined (built-in) priority names. The top
priority is reserved for system pods and its value is the largest signed 32-bit
integer. The following list is an example of what the priorities and their
values may look like. I am open to suggestions for the names and values of the
priorities.


```
system  2147483647 (int_max)
tier1   4000
tier2   2000
tier3   1000
```


### Resolving priority names

User requests sent to Kubernetes may have "PriorityName" in their PodSpec.
The admission controller resolves a PriorityName to its corresponding number
and populates the "Priority" field of the pod status. The rest of the
Kubernetes components look at the "Priority" field of the pod status and work
with the integer value. In other words, "PriorityName" will be ignored by the
rest of the system.
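
The resolution step could look roughly like the following. This is a minimal
sketch, not the admission controller's actual code: `builtinPriorities`,
`defaultPriorityName`, and `resolvePriority` are hypothetical names, the map
values come from the example list above, and `int32` is assumed because the
top priority is described as the largest 32-bit integer.

```go
package main

import (
	"fmt"
	"math"
)

// builtinPriorities mirrors the example priority list in this proposal.
// The names and values are illustrative and not final.
var builtinPriorities = map[string]int32{
	"system": math.MaxInt32,
	"tier1":  4000,
	"tier2":  2000,
	"tier3":  1000,
}

// defaultPriorityName is assumed when a pod leaves PriorityName empty.
const defaultPriorityName = "tier3"

// resolvePriority maps a PriorityName to its integer value.
// An empty name falls back to the default; an unknown name is an error,
// which corresponds to rejecting the pod creation request.
func resolvePriority(priorities map[string]int32, name string) (int32, error) {
	if name == "" {
		name = defaultPriorityName
	}
	value, ok := priorities[name]
	if !ok {
		return 0, fmt.Errorf("unknown priority name %q", name)
	}
	return value, nil
}
```

The returned integer is what would be written to the "Priority" field of the
pod status; every other component would read only that field.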

### Modifying priority names

The admission controller can optionally receive a ConfigMap from the API server
that defines a list of new priority names and their numbers or changes the
integer value of existing ones. New configurations cannot introduce priority
names mapped to a value higher than "system" and cannot change the value of
the "system" priority. The reason for this restriction is that Kubernetes
critical system pods will have "system" priority and no pod should be able to
preempt them.
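
The two restrictions above can be expressed as a validation pass over a
proposed configuration. The sketch below is an assumption about how such a
check might be written: `validatePriorityUpdate` is a hypothetical function,
and it takes values already parsed into `int64` (rather than raw ConfigMap
strings) so that an out-of-range value can be detected.

```go
package main

import (
	"fmt"
	"math"
)

const (
	systemPriorityName        = "system"
	systemPriorityValue int64 = math.MaxInt32
)

// validatePriorityUpdate checks a proposed name-to-value mapping against the
// two rules in this section: no value may exceed the "system" value, and the
// value of "system" itself may not change.
func validatePriorityUpdate(update map[string]int64) error {
	for name, value := range update {
		if name == systemPriorityName && value != systemPriorityValue {
			return fmt.Errorf("the value of %q cannot be changed", systemPriorityName)
		}
		if value > systemPriorityValue {
			return fmt.Errorf("priority %q has value %d, which is higher than %q",
				name, value, systemPriorityName)
		}
	}
	return nil
}
```

Whether a user-defined name may share the exact value of "system" is left open
by the text; this sketch permits it because only higher values are ruled out.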

Alternatively, we may ship Kubernetes with only the highest priority level
hard-coded. The rest of the priority names would then be defined in a
ConfigMap, which could be shipped with Kubernetes.

### Ordering of priorities

As mentioned earlier, a PriorityName is resolved by the admission controller to
its integral value and Kubernetes components use the integral value. The higher
the value, the higher the priority.

### Dynamic change of priority names

Kubernetes supports changing priority names at runtime; however, the changes
do not alter the priority of existing pods. As mentioned earlier, the admission
controller resolves priority names and populates the "Priority" field of the
pod status. A new ConfigMap may be fed to the admission controller to define
new priority names or change the integer value of existing names (other than
"system"). These changes affect only the pods created after the change;
existing pods keep their existing Priority.
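
Existing pods are unaffected because the integer is copied into the pod at
admission time rather than looked up on every use. The following sketch shows
that behavior with a hypothetical `pod` type and `admit` function; neither is
part of the Kubernetes API.

```go
package main

// pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
type pod struct {
	PriorityName string // user-specified, from the pod spec
	Priority     int32  // resolved once at admission, kept in the pod status
}

// admit resolves name against the current mapping and stores the resulting
// integer in the pod. It returns false if the name is unknown.
// Later changes to the mapping do not touch pods that were already admitted.
func admit(priorities map[string]int32, name string) (pod, bool) {
	value, ok := priorities[name]
	if !ok {
		return pod{}, false
	}
	return pod{PriorityName: name, Priority: value}, true
}
```

If "tier2" is remapped from 2000 to 2500 after a pod is admitted, that pod
still carries 2000, while a pod admitted afterwards with the same name
carries 2500.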

#### Drawbacks of changing priority names

While Kubernetes allows changing priority names, it should be done only when
absolutely needed. Changing priority names has the following disadvantages:



* May remove config portability: pod specs written for one cluster are no
longer guaranteed to work on a different cluster if the same priority names
do not exist on the new cluster.
* If quota is specified for existing priorities (at the time of this writing,
we don't have this feature in Kubernetes), adding or changing priorities
will require reconfiguration of quota allocations.

### Priority and QoS classes

Kubernetes has [three QoS
classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
which are derived from the requests and limits of pods. Priority is introduced
as an independent concept, meaning that any QoS class may have any valid
priority. When a node is out of resources and pods need to be preempted, we may
give priority a higher weight than QoS classes. In other words, we may decide
to preempt the lowest priority pod and break ties with QoS class. This is not
finalized yet. We will discuss and finalize preemption in a separate doc.
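
As a sketch of the ordering under discussion, victims could be sorted by
priority first and QoS class second. Everything here is an assumption for
illustration: the `candidate` type, the `sortVictims` function, and in
particular the tie-break order (BestEffort before Burstable before
Guaranteed), which this document does not specify.

```go
package main

import "sort"

// qosClass orders the three QoS classes for tie-breaking.
// Lower values are assumed to be preempted first among equal priorities.
type qosClass int

const (
	bestEffort qosClass = iota
	burstable
	guaranteed
)

// candidate is a hypothetical view of a pod considered for preemption.
type candidate struct {
	Name     string
	Priority int32
	QoS      qosClass
}

// sortVictims orders pods so that the first element is preempted first:
// lowest priority wins, and QoS class breaks ties.
func sortVictims(pods []candidate) {
	sort.SliceStable(pods, func(i, j int) bool {
		if pods[i].Priority != pods[j].Priority {
			return pods[i].Priority < pods[j].Priority
		}
		return pods[i].QoS < pods[j].QoS
	})
}
```

With this ordering, a Guaranteed pod at priority 1000 is still preempted
before a BestEffort pod at priority 2000, which is what giving priority the
higher weight means.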
