Add a design proposal for adding priority to Kubernetes API
bsalamat committed Jun 14, 2017
1 parent e3d4ac0 commit 08e3a31
Showing 1 changed file with 244 additions and 0 deletions.
244 changes: 244 additions & 0 deletions contributors/design-proposals/pod-priority-api.md
@@ -0,0 +1,244 @@
# Priority in Kubernetes API

@bsalamat

May 2017
* [Objective](#objective)
* [Non-Goals](#non-goals)
* [Background](#background)
* [Overview](#overview)
* [Detailed Design](#detailed-design)
* [Effect of priority on scheduling](#effect-of-priority-on-scheduling)
* [Effect of priority on preemption](#effect-of-priority-on-preemption)
* [Priority in PodSpec](#priority-in-podspec)
* [Priority Classes](#priority-classes)
* [Resolving priority class names](#resolving-priority-class-names)
* [Ordering of priorities](#ordering-of-priorities)
* [system PriorityClassName](#system-priorityclassname)
* [Modifying Priority Classes](#modifying-priority-classes)
* [Drawbacks of changing priority classes](#drawbacks-of-changing-priority-classes)
* [Priority and QoS classes](#priority-and-qos-classes)


## Objective



* Define how priority is specified for workloads in the Kubernetes API.
* Define how the order of these priorities is specified.
* Define how new priority levels are added.
* Describe the effect of priority on scheduling and preemption.

### Non-Goals



* How preemption works in Kubernetes.
* How quota allocation and accounting work for each priority.

## Background

It is fairly common for a cluster to have more tasks than its resources can
handle. Often the workload is a mix of high priority critical tasks and
non-urgent tasks that can wait. Cluster management should be able to
distinguish these workloads in order to decide which ones should acquire
resources sooner and which ones can wait. The priority of a workload is one of
the key metrics that provide this information to the cluster. This document is a
more detailed design proposal for part of the high-level architecture described
in [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA).

## Overview

This design doc introduces the concept of priorities for pods in Kubernetes and
describes how priority affects scheduling and preemption of pods when the
cluster runs out of resources. A pod can specify a priority at creation time.
The priority must be one of the valid values, and there is a total order on the
values. The priority of a pod is independent of its workload type. The priority
is global and not specific to a particular namespace.

## Detailed Design

### Effect of priority on scheduling

One could generally expect that a pod with higher priority has a higher chance
of getting scheduled than the same pod with lower priority. However, there are
many other parameters that affect scheduling decisions, so a high priority pod
may or may not be scheduled before lower priority pods. The details of
what determines the order in which pods are scheduled are beyond the scope of
this document.

### Effect of priority on preemption

Generally, lower priority pods are more likely to get preempted by higher
priority pods when the cluster has reached a threshold. In such a case, the
scheduler may decide to preempt lower priority pods to release enough resources
for higher priority pending pods. As mentioned before, there are many other
parameters that affect scheduling decisions, such as affinity and
anti-affinity. If the scheduler determines that a high priority pod cannot be
scheduled even if lower priority pods are preempted, it will not preempt lower
priority pods. The scheduler may have other restrictions on preempting pods.
For example, it may refuse to preempt a pod if doing so would violate a
PodDisruptionBudget. The details of scheduling and preemption decisions are
beyond the scope of this document.
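
The rule that nothing is preempted unless preemption would actually let the pending pod run can be illustrated with a deliberately simplified sketch. This is not the scheduler's algorithm, which this document leaves out of scope. The single `CPU` resource, the single-node view, and the names `Pod` and `SelectVictims` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, simplified pod: one integer priority and one resource.
type Pod struct {
	Name     string
	Priority int32
	CPU      int64
}

// SelectVictims returns the lower-priority pods that would have to be preempted
// for pending to fit, starting with the lowest priority. It returns ok=false,
// and no victims, when even preempting every lower-priority pod is not enough.
func SelectVictims(pending Pod, running []Pod, freeCPU int64) (victims []Pod, ok bool) {
	if pending.CPU <= freeCPU {
		return nil, true // fits already, nothing to preempt
	}
	// Only pods with strictly lower priority are candidates.
	candidates := make([]Pod, 0, len(running))
	for _, p := range running {
		if p.Priority < pending.Priority {
			candidates = append(candidates, p)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})
	for _, p := range candidates {
		victims = append(victims, p)
		freeCPU += p.CPU
		if freeCPU >= pending.CPU {
			return victims, true
		}
	}
	return nil, false // preemption would not help, so preempt nothing
}

func main() {
	running := []Pod{{"a", 1000, 2}, {"b", 2000, 2}, {"c", 5000, 4}}
	victims, ok := SelectVictims(Pod{"pending", 4000, 3}, running, 0)
	fmt.Println(victims, ok)
}
```

A real scheduler would also weigh affinity, anti-affinity, and PodDisruptionBudget, as the paragraph above notes.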

### Priority in PodSpec

Pods may have priority in their pod spec. PodSpec will have two new fields:
"PriorityClassName", which is specified by the user, and "Priority", which is
populated by Kubernetes. The user-specified priority (PriorityClassName) is a
string, and all valid priority classes are defined by a system-wide mapping
from each string to an integer. The PriorityClassName specified in a pod spec
must be found in this map or the pod creation request will be rejected. If
PriorityClassName is empty, it will resolve to the default priority (see below
for more info on name resolution). Once the PriorityClassName is resolved to an
integer, it is placed in the "Priority" field of PodSpec.


```
type PodSpec struct {
	// ... existing fields elided ...
	PriorityClassName string
	// Populated by the admission controller. Users are not allowed to set it directly.
	Priority int32
}
```

### Priority Classes

A Kubernetes cluster has one predefined (built-in) priority class, "system",
which is reserved for system pods. Its value is the largest 32-bit signed
integer. We decided to ship Kubernetes with no predefined priority classes
other than "system". The main goal of having no other built-in priority classes
is to avoid creating de facto standard names that may be hard to change in the
future. The cluster may have many more user-defined priority classes for
various use cases. The following list is an example of how the priorities and
their values may look.

```
system 2147483647 (int_max)
tier1 4000
tier2 2000
tier3 1000
```

The following shows a list of example workloads in a Kubernetes cluster in decreasing order of priority:

* Kubernetes system daemons (per-node like fluentd, and cluster-level like
Heapster)
* Critical user infrastructure (e.g. storage servers, monitoring system like
Prometheus, etc.)
* Components that are in the user-facing request serving path and must be able
to scale up arbitrarily in response to load spikes (web servers, middleware,
etc.)
* Important interruptible workloads that need a strong guarantee of
schedulability and of not being interrupted
* Less important interruptible workloads that need a less strong guarantee of
schedulability and of not being interrupted
* Best effort / opportunistic

### Resolving priority class names

User requests sent to Kubernetes may have `PriorityClassName` in their PodSpec.
An admission controller resolves the `PriorityClassName` to its corresponding
integer and populates the "Priority" field of the pod spec. The rest of the
Kubernetes components look at the "Priority" field of the pod spec and work with
the integer value. In other words, `PriorityClassName` will be ignored by the
rest of the system.

We are going to add a new API object called PriorityClass. The priority class
defines the mapping between the priority name and its value. It can have an
optional description. The description is provided as an annotation with the
`kubernetes.io/description` key. It is an arbitrary string and is provided
only as a guideline for users.

If a pod does not specify a `PriorityClassName`, the system resolves its
priority to zero in the first version. Later, we will add support for
annotating one of the `PriorityClass` objects to mark it as the default. When
there is a default priority class, a pod with no `PriorityClassName` will get
the value of this default class (instead of zero).

```
type PriorityClass struct {
	metav1.TypeMeta
	// +optional
	metav1.ObjectMeta
	// The value of this priority class. This is the actual priority that pods
	// receive when they have the name of this class in their pod spec.
	Value int32
}
```

And the annotation for the default class will be something like:

```yaml
...
annotations:
  priorityclass.alpha.kubernetes.io/admissionPolicy: "default"
...
```

### Ordering of priorities

As mentioned earlier, a PriorityClassName is resolved by the admission controller to
its integral value and Kubernetes components use the integral value. The higher
the value, the higher the priority.
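
As a small illustration of this ordering, the sketch below compares pods by their resolved integer only. The `Pod` type and the function names are hypothetical. Sorting a queue this way is not a statement about how the scheduler orders pods, since other parameters also affect scheduling.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod carries only the resolved integer priority (hypothetical, simplified type).
type Pod struct {
	Name     string
	Priority int32
}

// HigherPriority reports whether a outranks b: the larger integer wins.
func HigherPriority(a, b Pod) bool {
	return a.Priority > b.Priority
}

// SortByPriority orders pods from highest to lowest priority and keeps the
// original relative order of pods with equal priority.
func SortByPriority(pods []Pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		return HigherPriority(pods[i], pods[j])
	})
}

func main() {
	pods := []Pod{{"batch", 1000}, {"web", 4000}, {"dns", 2147483647}, {"cron", 1000}}
	SortByPriority(pods)
	for _, p := range pods {
		fmt.Println(p.Name, p.Priority)
	}
}
```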

### `system` PriorityClassName
`system` is a special priority name which is reserved and cannot be changed. It
is used for critical system pods that must never be preempted. We set default
policies that deny creation of pods with `system` priority. Cluster admins can
authorize users or service accounts to create pods with this priority. When
non-authorized users set PriorityClassName to `system` in their pod spec, their
pod creation request will be rejected. For pods created by controllers, the
service account must be authorized by cluster admins.
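
The deny-by-default policy for `system` might be sketched as below. The proposal does not say which authorization mechanism enforces it, so the allow-list map and the name `CanUsePriorityClass` are purely illustrative assumptions.

```go
package main

import "fmt"

// SystemClassName is the reserved priority class name from this proposal.
const SystemClassName = "system"

// CanUsePriorityClass reports whether account may create pods with className.
// Every class except "system" is open to all accounts in this sketch; "system"
// requires that a cluster admin has explicitly authorized the account.
func CanUsePriorityClass(account, className string, systemAllowed map[string]bool) bool {
	if className != SystemClassName {
		return true
	}
	return systemAllowed[account]
}

func main() {
	// Hypothetical allow-list maintained by cluster admins.
	allowed := map[string]bool{"system:serviceaccount:kube-system:daemon-set-controller": true}
	fmt.Println(CanUsePriorityClass("alice", "tier1", allowed))
	fmt.Println(CanUsePriorityClass("alice", "system", allowed))
}
```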

### Modifying priority classes

Priority classes can be added or removed, but they cannot be updated. While
Kubernetes can work fine if priority classes are changed at run-time, the change
can be confusing to users, because pods that were created with a priority class
before the change will have a different priority value than those created after
the change. Deletion of priority classes is allowed, even though there may be
existing pods that specify the deleted priority class name in their pod
spec. In other words, there will be no referential integrity for priority
classes. This is another reason that all system components should only work with
the integer value of the priority and not with the `PriorityClassName`.

One could delete an existing priority class and create another one with the same
name and a different value. By doing so, they can achieve the same effect as
updating a priority class, but we still do not allow updating priority classes
to prevent accidental changes.

Newly added priority classes cannot have a value higher than "system", and no
one can change the value of the "system" priority class. The reason for this
restriction is that Kubernetes critical system pods will have "system" priority
and no pod should be able to preempt them.

#### Drawbacks of changing priority classes

While Kubernetes effectively allows changing priority classes (by deleting and
adding them with a different value), it should be done only when
absolutely needed. Changing priority classes has the following disadvantages:


* May remove config portability: pod specs written for one cluster are no
longer guaranteed to work on a different cluster if the same priority classes
do not exist in the second cluster.
* If quota is specified for existing priority classes (at the time of this writing,
we don't have this feature in Kubernetes), adding or deleting priority classes
will require reconfiguration of quota allocations.
* Existing pods may have an integer priority value that does not reflect
the current value of their PriorityClass.
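
The last drawback follows from resolving the name once at admission time. A minimal sketch, assuming a hypothetical in-memory `Registry` and `Admit` function, shows two pods with the same class name ending up with different integers after the class is deleted and re-created:

```go
package main

import "fmt"

// Pod keeps both the user-supplied class name and the integer resolved at admission.
type Pod struct {
	PriorityClassName string
	Priority          int32
}

// Registry is a hypothetical in-memory map from class name to value.
type Registry map[string]int32

// Admit resolves the class name once and stores the integer in the pod.
func (r Registry) Admit(className string) (Pod, error) {
	value, ok := r[className]
	if !ok {
		return Pod{}, fmt.Errorf("priority class %q not found", className)
	}
	return Pod{PriorityClassName: className, Priority: value}, nil
}

func main() {
	classes := Registry{"tier2": 2000}
	before, _ := classes.Admit("tier2")

	// Delete the class and re-create it with a different value.
	delete(classes, "tier2")
	classes["tier2"] = 3000
	after, _ := classes.Admit("tier2")

	// The earlier pod still carries the value it was admitted with.
	fmt.Println(before.Priority, after.Priority)
}
```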

### Priority and QoS classes

Kubernetes has [three QoS
classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
which are derived from the requests and limits of pods. Priority is introduced
as an independent concept, meaning that a pod in any QoS class may have any
valid priority. When a node is out of resources and pods need to be preempted,
we give priority a higher weight than QoS class. In other words, we preempt the
lowest priority pod and break ties with other metrics, such as QoS class and
usage above request. This is not finalized yet. We will discuss and finalize
preemption in a separate doc.
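
Since the ordering is not finalized, the following is only one possible reading of "priority first, QoS class as a tie-breaker". The `Pod` type, the function name `SortVictims`, and in particular the tie-break direction (BestEffort before Burstable before Guaranteed) are assumptions of this sketch. Usage above request is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// QoSClass is ordered so that smaller values are preempted first on a tie.
// This direction is an assumption; the proposal does not fix it.
type QoSClass int

const (
	BestEffort QoSClass = iota
	Burstable
	Guaranteed
)

// Pod is a hypothetical, simplified pod with a resolved priority and a QoS class.
type Pod struct {
	Name     string
	Priority int32
	QoS      QoSClass
}

// SortVictims orders pods so that the first element is the first preemption
// candidate: lowest priority first, with QoS class breaking ties.
func SortVictims(pods []Pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		if pods[i].Priority != pods[j].Priority {
			return pods[i].Priority < pods[j].Priority
		}
		return pods[i].QoS < pods[j].QoS
	})
}

func main() {
	pods := []Pod{
		{"a", 1000, Guaranteed},
		{"b", 1000, BestEffort},
		{"c", 500, Guaranteed},
		{"d", 1000, Burstable},
	}
	SortVictims(pods)
	for _, p := range pods {
		fmt.Println(p.Name, p.Priority, p.QoS)
	}
}
```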
