Add a design proposal for adding priority to Kubernetes API

kubernetes · Jun 14, 2017 · 08e3a31 · 08e3a31
1 parent e3d4ac0
commit 08e3a31
Showing 1 changed file with 244 additions and 0 deletions.
diff --git a/contributors/design-proposals/pod-priority-api.md b/contributors/design-proposals/pod-priority-api.md
@@ -0,0 +1,244 @@
+# Priority in Kubernetes API
+
+@bsalamat
+
+May 2017
+  * [Objective](#objective)
+    * [Non-Goals](#non-Goals)
+  * [Background](#background)
+  * [Overview](#overview)
+  * [Detailed Design](#detailed-design)
+    * [Effect of priority on scheduling](#effect-of-priority-on-scheduling)
+    * [Effect of priority on preemption](#effect-of-priority-on-preemption)
+    * [Priority in PodSpec](#priority-in-podspec)
+    * [Priority Classes](#priority-classes)
+    * [Resolving priority class names](#resolving-priority-class-names)
+    * [Ordering of priorities](#ordering-of-priorities)
+    * [system PriorityClassName](#system-priorityclassname)
+    * [Modifying Priority Classes](#modifying-priority-classes)
+    * [Drawbacks of changing priority names](#drawbacks-of-changing-priority-classes)
+    * [Priority and QoS classes](#priority-and-qos-classes)
+
+
+## Objective
+
+
+
+*   How to specify priority for workloads in Kubernetes API.
+*   Define how the order of these priorities are specified. 
+*   Define how new priority levels are added.
+*   Effect of priority on scheduling and preemption.
+
+### Non-Goals 
+
+
+
+*   How preemption works in Kubernetes.
+*   How quota allocation and accounting works for each priority.
+
+## Background 
+
+It is fairly common in clusters to have more tasks than what the cluster
+resources can handle. Often times the workload is a mix of high priority
+critical tasks, and non-urgent tasks that can wait. Cluster management should be
+able to distinguish these workloads in order to decide which ones should acquire
+the resources sooner and which ones can wait. Priority of the workload is one of
+the key metrics that provides the information to the cluster. This document is a
+more detailed design proposal for part of the high-level architecture described
+in [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA).
+
+## Overview 
+
+This design doc introduces the concept of priorities for pods in Kubernetes and
+how the priority impacts scheduling and preemption of pods when the cluster
+runs out of resources. A pod can specify a priority at the creation time. The
+priority must be one of the valid values and there is a total order on the
+values. The priority of a pod is independent of its workload type. The priority
+is global and not specific to a particular namespace.
+
+## Detailed Design 
+
+### Effect of priority on scheduling 
+
+One could generally expect a pod with higher priority has a higher chance of
+getting scheduled than the same pod with lower priority. However, there are
+many other parameters that affect scheduling decisions. So, a high priority pod
+may or may not be scheduled before lower priority pods. The details of
+what determines the order at which pods are scheduled are beyond the scope of
+this document.
+
+### Effect of priority on preemption 
+
+Generally, lower priority pods are more likely to get preempted by higher
+priority pods when cluster has reached a threshold. In such a case, scheduler
+may decide to preempt lower priority pods to release enough resources for higher
+priority pending pods. As mentioned before, there are many other parameters
+that affect scheduling decisions, such as affinity and anti-affinity. If
+scheduler determines that a high priority pod cannot be scheduled even if lower
+priority pods are preempted, it will not preempt lower priority pods. Scheduler
+may have other restrictions on preempting pods, for example, it may refuse to
+preempt a pod if PodDisruptionBudget is violated. The details of scheduling and
+preemption decisions are beyond the scope of this document.
+
+### Priority in PodSpec 
+
+Pods may have priority in their pod spec. PodSpec will have two new fields
+called "PriorityClassName" which is specified by user, and "Priority" which will
+be populated by Kubernetes. User-specified priority (PriorityClassName) is a 
+string and all of the valid priority classes are defined by a system wide
+mapping that maps each string to an integer. The PriorityClassName specified in
+a pod spec must be found in this map or the pod creation request will be
+rejected. If PriorityClassName is empty, it will resolve to the default
+priority (See below for more info on name resolution). Once the 
+PriorityClassName is resolved to an integer, it is placed in "Priority" field of
+PodSpec.
+
+
+```
+type PodSpec struct {
+  ...
+  PriorityClassName string
+  Priority          int32  // Populated by Admission Controller. Users are not allowed to set it directly.
+}
+```
+
+### Priority Classes 
+
+A Kubernetes cluster has one predefined (built-in) priority class which is 
+reserved for the system pods and its value is the largest 32-bit
+integer. The cluster may have many more user defined priority classes for
+various use cases. The following list is an example of how the priorities and
+their values may look like.
+We decided to ship Kubernetes with no predefined priority classes, except
+for the "system" priority which will be built in Kubernetes. The main goal of
+of having no built-in priority classes is to avoid creating defacto standard
+names which may be hard to change in the future.
+
+```
+system  2147483647 (int_max)
+tier1   4000
+tier2   2000
+tier3   1000
+```
+
+The following shows a list of example workloads in a Kubernetes cluster in decreasing order of priority:
+
+* Kubernetes system daemons (per-node like fluentd, and cluster-level like 
+  Heapster)
+* Critical user infrastructure (e.g. storage servers, monitoring system like
+  Prometheus, etc.)
+* Components that are in the user-facing request serving path and must be able 
+  to scale up arbitrarily in response to load spikes (web servers, middleware,
+  etc.)
+* Important interruptible workloads that need strong guarantee of
+  schedulability and of not being interrupted
+* Less important interruptible workloads that need a less strong guarantee of
+  schedulability and of not being interrupted
+* Best effort / opportunistic
+
+### Resolving priority class names 
+
+User requests sent to Kubernetes may have `PriorityClassName` in their PodSpec.
+Admission controller resolves a PriorityClassName to its corresponding number
+and populates the "Priority" field of the pod spec. The rest of Kubernetes
+components look at the "Priority" field of pod status and work with the integer
+value. In other words, `PriorityClassName` will be ignored by the rest of the
+system.
+
+We are going to add a new API object called PriorityClass. The priority class
+defines the mapping between the priority name and its value. It can have an
+optional description. The description is provided as an annotation with the 
+`kubernetes.io/description` key. It is an arbitrary string and is provided
+only as a guideline for users.
+
+If a pod does not specify any `PriorityClassName`, the system resolves it to 
+zero in the first version, but we will add support to annotate one of the
+`PriorityClass` objects that tells the system that it is the default one. When
+there is a default priority class, a pod with no `PriorityClassName` will get
+the value of this default class (instead of zero).
+
+```
+type PriorityClass struct {
+  metav1.TypeMeta
+  // +optional
+  metav1.ObjectMeta
+  
+  // The value of this priority class. This is the actual priority that pods
+  // recieve when they have the above name in their pod spec.
+  Value        int32
+}
+```
+
+And the annotation for the default class will be something like:
+
+```yaml
+...
+annotations:
+  priorityclass.alpha.kubernetes.io/admissionPolicy: "default"
+...
+```
+
+### Ordering of priorities 
+
+As mentioned earlier, a PriorityClassName is resolved by the admission controller to
+its integral value and Kubernetes components use the integral value. The higher
+the value, the higher the priority.
+
+### `system` PriorityClassName
+`system` is a special priority name which is reserved and cannot be changed. It
+is used for critical system pods that must never be preempted. We set default
+policies that deny creation of pods with `system` priority. Cluster admins can
+authorize users or service accounts to create pods with this priority. When
+non-authorized users set PriorityClassName to `system` in their pod spec, their
+pod creation request will be rejected. For pods created by controllers, the
+service account must be authorized by cluster admins.
+
+### Modifying priority classes 
+
+Priority classes can be added or removed, but they cannot be updated. While
+Kubernetes can work fine if priority classes are changed at run-time, the change
+can be confusing to users as pods with a priority class which were created
+before the change will have a different priority value than those created after
+the change. Deletion of priority classes is allowed, despite the fact that there
+may be existing pods that have specified such priority class names in their pod
+spec. In other words, there will be no referential integrity for priority
+classes. This is another reason that all system components should only work with
+the integer value of the priority and not with the `PriorityClassName`.
+
+One could delete an existing priority class and create another one with the same
+name and a different value. By doing so, they can achieve the same effect as
+updating a priority class, but we still do not allow updating priority classes
+to prevent accidental changes.
+
+Newly added priority classes cannot have a value higher than "system" and no one
+can change the value of "system" priority class. The reason for this restriction
+is that kubernetes critical system pods will have "system" priority and no pod 
+should be able to preempt them.
+
+#### Drawbacks of changing priority classes 
+
+While Kubernetes effectively allows changing priority classes (by deleting and
+adding them with a different value), it should be done only when
+absolutely needed. Changing priority classes has the following disadvantages:
+
+
+*   May remove config portability: pod specs written for one cluster are no
+    longer guaranteed to work on a different cluster if the same priority classes
+    do not exist in the second cluster. 
+*   If quota is specified for existing priority classes (at the time of this writing,
+    we don't have this feature in Kubernetes), adding or deleting priority classes
+    will require reconfiguration of quota allocations.
+*   An existing pods may have an integer value of priority that does not reflect
+    the current value of its PriorityClass.
+
+### Priority and QoS classes 
+
+Kubernetes has [three QoS
+classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
+which are derived from request and limit of pods. Priority is introduced as an
+independent concept; meaning that any QoS class may have any valid priority.
+When a node is out of resources and pods needs to be preempted, we give
+priority a higher weight over QoS classes. In other words, we preempt the lowest
+priority pod and break ties with some other metrics, such as, QoS class, usage
+above request, etc. This is not finalized yet. We will discuss and finalize
+preemption in a separate doc.