Skip to content

Commit

Permalink
Design proposal for adding priority to Kubernetes API (kubernetes#604)
Browse files Browse the repository at this point in the history
* Add a design proposal for adding priority to Kubernetes API
  • Loading branch information
bsalamat authored and Erin A Boyd committed Oct 23, 2017
1 parent 8a221e8 commit 5dc8166
Showing 1 changed file with 243 additions and 0 deletions.
243 changes: 243 additions & 0 deletions contributors/design-proposals/pod-priority-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,243 @@
# Priority in Kubernetes API

@bsalamat

May 2017
* [Objective](#objective)
* [Non-Goals](#non-goals)
* [Background](#background)
* [Overview](#overview)
* [Detailed Design](#detailed-design)
* [Effect of priority on scheduling](#effect-of-priority-on-scheduling)
* [Effect of priority on preemption](#effect-of-priority-on-preemption)
* [Priority in PodSpec](#priority-in-podspec)
* [Priority Classes](#priority-classes)
* [Resolving priority class names](#resolving-priority-class-names)
* [Ordering of priorities](#ordering-of-priorities)
* [System Priority Class Names](#system-priority-class-names)
* [Modifying Priority Classes](#modifying-priority-classes)
* [Drawbacks of changing priority names](#drawbacks-of-changing-priority-classes)
* [Priority and QoS classes](#priority-and-qos-classes)


## Objective



* How to specify priority for workloads in Kubernetes API.
* Define how the order of these priorities are specified.
* Define how new priority levels are added.
* Effect of priority on scheduling and preemption.

### Non-Goals



* How preemption works in Kubernetes.
* How quota allocation and accounting works for each priority.

## Background

It is fairly common in clusters to have more tasks than what the cluster
resources can handle. Often times the workload is a mix of high priority
critical tasks, and non-urgent tasks that can wait. Cluster management should be
able to distinguish these workloads in order to decide which ones should acquire
the resources sooner and which ones can wait. Priority of the workload is one of
the key metrics that provides the information to the cluster. This document is a
more detailed design proposal for part of the high-level architecture described
in [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA).

## Overview

This design doc introduces the concept of priorities for pods in Kubernetes and
how the priority impacts scheduling and preemption of pods when the cluster
runs out of resources. A pod can specify a priority at the creation time. The
priority must be one of the valid values and there is a total order on the
values. The priority of a pod is independent of its workload type. The priority
is global and not specific to a particular namespace.

## Detailed Design

### Effect of priority on scheduling

One could generally expect a pod with higher priority has a higher chance of
getting scheduled than the same pod with lower priority. However, there are
many other parameters that affect scheduling decisions. So, a high priority pod
may or may not be scheduled before lower priority pods. The details of
what determines the order at which pods are scheduled are beyond the scope of
this document.

### Effect of priority on preemption

Generally, lower priority pods are more likely to get preempted by higher
priority pods when cluster has reached a threshold. In such a case, scheduler
may decide to preempt lower priority pods to release enough resources for higher
priority pending pods. As mentioned before, there are many other parameters
that affect scheduling decisions, such as affinity and anti-affinity. If
scheduler determines that a high priority pod cannot be scheduled even if lower
priority pods are preempted, it will not preempt lower priority pods. Scheduler
may have other restrictions on preempting pods, for example, it may refuse to
preempt a pod if PodDisruptionBudget is violated. The details of scheduling and
preemption decisions are beyond the scope of this document.

### Priority in PodSpec

Pods may have priority in their pod spec. PodSpec will have two new fields
called "PriorityClassName" which is specified by user, and "Priority" which will
be populated by Kubernetes. User-specified priority (PriorityClassName) is a
string and all of the valid priority classes are defined by a system wide
mapping that maps each string to an integer. The PriorityClassName specified in
a pod spec must be found in this map or the pod creation request will be
rejected. If PriorityClassName is empty, it will resolve to the default
priority (See below for more info on name resolution). Once the
PriorityClassName is resolved to an integer, it is placed in "Priority" field of
PodSpec.


```
type PodSpec struct {
...
PriorityClassName string
Priority *int32 // Populated by Admission Controller. Users are not allowed to set it directly.
}
```

### Priority Classes

The cluster may have many user defined priority classes for
various use cases. The following list is an example of how the priorities and
their values may look like.
Kubernetes will also have special priority class names reserved for critical system
pods. Please see [System Priority Class Names](#system-priority-class-names) for
more information. Any priority value above 1 billion is reserved for system use.
Aside from those system priority classes, Kubernetes is not shipped with predefined
priority classes usable by user pods. The main goal of having no built-in
priority classes for user pods is to avoid creating defacto standard names which
may be hard to change in the future.

```
system 2147483647 (int_max)
tier1 4000
tier2 2000
tier3 1000
```

The following shows a list of example workloads in a Kubernetes cluster in decreasing order of priority:

* Kubernetes system daemons (per-node like fluentd, and cluster-level like
Heapster)
* Critical user infrastructure (e.g. storage servers, monitoring system like
Prometheus, etc.)
* Components that are in the user-facing request serving path and must be able
to scale up arbitrarily in response to load spikes (web servers, middleware,
etc.)
* Important interruptible workloads that need strong guarantee of
schedulability and of not being interrupted
* Less important interruptible workloads that need a less strong guarantee of
schedulability and of not being interrupted
* Best effort / opportunistic

### Resolving priority class names

User requests sent to Kubernetes may have `PriorityClassName` in their PodSpec.
Admission controller resolves a PriorityClassName to its corresponding number
and populates the "Priority" field of the pod spec. The rest of Kubernetes
components look at the "Priority" field of pod status and work with the integer
value. In other words, `PriorityClassName` will be ignored by the rest of the
system.

We are going to add a new API object called PriorityClass. The priority class
defines the mapping between the priority name and its value. It can have an
optional description. It is an arbitrary string and is provided
only as a guideline for users.

A priority class can be marked as "Global Default" by setting its
`GlobalDefault` field to true. If a pod does not specify any `PriorityClassName`,
the system resolves it to the value of the global default priority class if
exists. If there is no global default, the pod's priority will be resolved to
zero. Priority admission controller ensures that there is only one global
default priority class.

```
type PriorityClass struct {
metav1.TypeMeta
// +optional
metav1.ObjectMeta
// The value of this priority class. This is the actual priority that pods
// recieve when they have the above name in their pod spec.
Value int32
GlobalDefault bool
Description string
}
```

### Ordering of priorities

As mentioned earlier, a PriorityClassName is resolved by the admission controller to
its integral value and Kubernetes components use the integral value. The higher
the value, the higher the priority.

### System Priority Class Names
There will be special priority class names reserved for system use only. These
classes have a value larger than one billion.
Priority admission controller ensures that new priority classes will be not
created with those names. They are used for critical system pods that must not
be preempted. We set default policies that deny creation of pods with
PriorityClassNames corresponding to these priorities. Cluster admins can
authorize users or service accounts to create pods with these priorities. When
non-authorized users set PriorityClassName to one of these priority classes in
their pod spec, their pod creation request will be rejected. For pods created by
controllers, the service account must be authorized by cluster admins.

### Modifying priority classes

Priority classes can be added or removed, but their name and value cannot be
updated. We allow updating `GlobalDefault` and `Description` as long as there is
a maximum of one global default. While
Kubernetes can work fine if priority classes are changed at run-time, the change
can be confusing to users as pods with a priority class which were created
before the change will have a different priority value than those created after
the change. Deletion of priority classes is allowed, despite the fact that there
may be existing pods that have specified such priority class names in their pod
spec. In other words, there will be no referential integrity for priority
classes. This is another reason that all system components should only work with
the integer value of the priority and not with the `PriorityClassName`.

One could delete an existing priority class and create another one with the same
name and a different value. By doing so, they can achieve the same effect as
updating a priority class, but we still do not allow updating priority classes
to prevent accidental changes.

Newly added priority classes cannot have a value higher than what is reserved
for "system". The reason for this restriction
is that Kubernetes critical system pods will have one of the "system" priorities
and no pod should be able to preempt them.

#### Drawbacks of changing priority classes

While Kubernetes effectively allows changing priority classes (by deleting and
adding them with a different value), it should be done only when
absolutely needed. Changing priority classes has the following disadvantages:


* May remove config portability: pod specs written for one cluster are no
longer guaranteed to work on a different cluster if the same priority classes
do not exist in the second cluster.
* If quota is specified for existing priority classes (at the time of this writing,
we don't have this feature in Kubernetes), adding or deleting priority classes
will require reconfiguration of quota allocations.
* An existing pods may have an integer value of priority that does not reflect
the current value of its PriorityClass.

### Priority and QoS classes

Kubernetes has [three QoS
classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
which are derived from request and limit of pods. Priority is introduced as an
independent concept; meaning that any QoS class may have any valid priority.
When a node is out of resources and pods needs to be preempted, we give
priority a higher weight over QoS classes. In other words, we preempt the lowest
priority pod and break ties with some other metrics, such as, QoS class, usage
above request, etc. This is not finalized yet. We will discuss and finalize
preemption in a separate doc.

0 comments on commit 5dc8166

Please sign in to comment.