diff --git a/contributors/design-proposals/pod-priority-api.md b/contributors/design-proposals/pod-priority-api.md
new file mode 100644
index 00000000000..11f2b6e29d8
--- /dev/null
+++ b/contributors/design-proposals/pod-priority-api.md
@@ -0,0 +1,187 @@
+# Priority in Kubernetes API
+
+@bsalamat
+
+May 2017
+  * [Objective](#objective)
+    * [Non-Goals](#non-goals)
+  * [Background](#background)
+  * [Overview](#overview)
+  * [Detailed Design](#detailed-design)
+    * [Effect of priority on scheduling](#effect-of-priority-on-scheduling)
+    * [Effect of priority on preemption](#effect-of-priority-on-preemption)
+    * [Priority in PodSpec](#priority-in-podspec)
+    * [Priority names](#priority-names)
+    * [Resolving priority names](#resolving-priority-names)
+    * [Modifying priority names](#modifying-priority-names)
+    * [Ordering of priorities](#ordering-of-priorities)
+    * [Dynamic change of priority names](#dynamic-change-of-priority-names)
+      * [Drawbacks of changing priority names](#drawbacks-of-changing-priority-names)
+    * [Priority and QoS classes](#priority-and-qos-classes)
+
+
+## Objective
+
+
+
+* Define how to specify priority for workloads in the Kubernetes API.
+* Define how the order of these priorities is specified.
+* Define how new priority levels are added.
+* Define the effect of priority on scheduling and preemption.
+
+### Non-Goals
+
+
+
+* How preemption works in Kubernetes.
+* How quota allocation and accounting work for each priority.
+
+## Background
+
+It is fairly common in clusters to have more tasks than the cluster resources
+can handle. Often the workload is a mix of high priority critical tasks and
+non-urgent tasks that can wait. Cluster management should be able to distinguish
+these workloads in order to decide which ones should acquire resources sooner
+and which ones can wait. Priority of the workload is one of the key metrics
+that provide this information to the cluster.
+This document is a more detailed design proposal for part of the high-level
+architecture described in [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA).
+
+## Overview
+
+This design doc introduces the concept of priorities for pods in Kubernetes and
+how the priority impacts scheduling and preemption of pods when the cluster
+runs out of resources. A pod can specify a priority at creation time. The
+priority must be one of the valid values, and there is a total order on the
+values. The priority of a pod is independent of its workload type. The priority
+is global and not specific to a particular namespace.
+
+## Detailed Design
+
+### Effect of priority on scheduling
+
+One can generally expect a pod with higher priority to have a higher chance of
+getting scheduled than the same pod with lower priority. However, there are
+many other parameters that affect scheduling decisions. So, a high priority pod
+may or may not be scheduled sooner than lower priority pods. The details of
+what determines the order in which pods are scheduled are beyond the scope of
+this document.
+
+### Effect of priority on preemption
+
+Generally, lower priority pods are more likely to get preempted by higher
+priority pods when the cluster is out of resources. In such a case, the
+scheduler may decide to preempt lower priority pods to release enough resources
+for higher priority pending pods. As mentioned before, there are many other
+parameters that affect scheduling decisions, such as affinity and anti-affinity.
+If the scheduler determines that a high priority pod cannot be scheduled even if
+lower priority pods are preempted, it will not preempt lower priority pods. The
+scheduler may have other restrictions on preempting pods; for example, it may
+refuse to preempt a pod if doing so would violate a PodDisruptionBudget.
+The details of scheduling and preemption decisions are beyond the scope of this document.
+
+### Priority in PodSpec
+
+Pods may have priority in their pod spec. PodSpec has a field called
+"PriorityName" which is specified by the user. PodStatus has a corresponding
+field called "Priority". User-specified priority (PriorityName) is a string, and
+all of the valid priority names are defined by a system-wide mapping that maps
+each string to an integer. The PriorityName specified in a pod spec must be found
+in this map or the pod creation request will be rejected. If PriorityName is
+empty, "tier3" (see below) is assumed. I am open to other suggestions about
+what the default priority should be. Once the PriorityName is resolved to an
+integer, it is placed in the "Priority" field of PodStatus.
+
+
+```
+type PodSpec struct {
+	// ... (existing fields elided)
+	PriorityName string // Specified by the user; must be a valid priority name.
+}
+
+type PodStatus struct {
+	// ... (existing fields elided)
+	Priority int // Populated by Admission Controller.
+}
+```
+
+### Priority names
+
+A Kubernetes cluster has four predefined (built-in) priority names. The top
+priority is reserved for the system pods and its value is the largest 32-bit
+integer. The following list is an example of what the priorities and their
+values may look like. I am open to suggestions for the names and values of the
+priorities.
+
+
+```
+system  2147483647 (int_max)
+tier1   4000
+tier2   2000
+tier3   1000
+```
+
+
+### Resolving priority names
+
+User requests sent to Kubernetes may have "PriorityName" in their PodSpec.
+The admission controller resolves a PriorityName to its corresponding number and
+populates the "Priority" field of the pod status. The rest of the Kubernetes
+components look at the "Priority" field of pod status and work with the integer
+value. In other words, "PriorityName" will be ignored by the rest of the
+system.
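The resolution step described above can be sketched roughly as follows. This is a minimal illustration, not the admission controller's actual code: `priorityMap`, `defaultPriorityName`, and `resolvePriority` are hypothetical names, and the map values are the example values from this proposal, which are themselves not finalized.

```go
package main

import "fmt"

// priorityMap mirrors the example system-wide mapping from this proposal.
// The names and values are illustrative only.
var priorityMap = map[string]int{
	"system": 2147483647, // largest 32-bit integer, reserved for system pods
	"tier1":  4000,
	"tier2":  2000,
	"tier3":  1000,
}

// defaultPriorityName is the name assumed when PriorityName is empty.
const defaultPriorityName = "tier3"

// resolvePriority is a hypothetical helper showing what the admission
// controller would do: map a PriorityName to its integer value, fall back
// to the default for an empty name, and reject unknown names.
func resolvePriority(priorityName string) (int, error) {
	if priorityName == "" {
		priorityName = defaultPriorityName
	}
	value, ok := priorityMap[priorityName]
	if !ok {
		return 0, fmt.Errorf("unknown priority name %q", priorityName)
	}
	return value, nil
}

func main() {
	for _, name := range []string{"tier1", "", "gold"} {
		value, err := resolvePriority(name)
		if err != nil {
			fmt.Printf("%q: pod rejected: %v\n", name, err)
			continue
		}
		fmt.Printf("%q: Priority = %d\n", name, value)
	}
}
```

In this sketch an empty name resolves to the value of "tier3", and a name that is missing from the map causes the request to be rejected, matching the rules in "Priority in PodSpec".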
+
+### Modifying priority names
+
+The admission controller can optionally receive a ConfigMap from the API server
+that defines a list of new priority names and their numbers or changes the
+integral value of existing ones. New configurations cannot introduce priority
+names mapped to a value higher than "system" and cannot change the value of
+"system" priority. The reason for this restriction is that Kubernetes critical
+system pods will have "system" priority and no pod should be able to preempt
+them.
+
+Alternatively, we may ship Kubernetes with only the highest priority level
+hard-coded. The rest of the priority names would be defined in a ConfigMap,
+which could be shipped with Kubernetes.
+
+### Ordering of priorities
+
+As mentioned earlier, a PriorityName is resolved by the admission controller to
+its integral value and Kubernetes components use the integral value. The higher
+the value, the higher the priority.
+
+### Dynamic change of priority names
+
+Kubernetes supports changing priority names at runtime; however, the changes
+do not alter the priority of existing pods. As mentioned earlier, the admission
+controller resolves priority names and populates the "Priority" field of pod
+status. A new ConfigMap may be fed to the admission controller to define new
+priority names or change the integer value of existing names (other than
+"system"). These changes affect only the pods created after the change;
+existing pods keep their existing Priority.
+
+#### Drawbacks of changing priority names
+
+While Kubernetes allows changing priority names, it should be done only when
+absolutely needed. Changing priority names has the following disadvantages:
+
+
+
+* May remove config portability: pod specs written for one cluster are no
+  longer guaranteed to work on a different cluster if the same priority names
+  do not exist on the new cluster.
+* If quota is specified for existing priorities (at the time of this writing,
+  we don't have this feature in Kubernetes), adding or changing priorities
+  will require reconfiguration of quota allocations.
+
+### Priority and QoS classes
+
+Kubernetes has [three QoS
+classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
+which are derived from the resource requests and limits of pods. Priority is
+introduced as an independent concept, meaning that a pod of any QoS class may
+have any valid priority. When a node is out of resources and pods need to be
+preempted, we may give priority a higher weight than QoS class. In other words,
+we may decide to preempt the lowest priority pod and break ties with QoS class.
+This is not finalized yet. We will discuss and finalize preemption in a separate doc.
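The tentative rule above (preempt the lowest priority pod first, break ties with QoS class) can be illustrated with a small sort. This is only a sketch under stated assumptions: the `pod` type and `preemptionOrder` function are hypothetical, and the tie-break direction (BestEffort before Burstable before Guaranteed) is an assumption, since the proposal does not specify it.

```go
package main

import (
	"fmt"
	"sort"
)

// qosClass orders the three QoS classes for tie-breaking. The direction is
// an assumption: lower values are preempted first among equal priorities.
type qosClass int

const (
	bestEffort qosClass = iota
	burstable
	guaranteed
)

// pod is a hypothetical, simplified stand-in for a scheduled pod.
type pod struct {
	name     string
	priority int // resolved integer priority; higher means more important
	qos      qosClass
}

// preemptionOrder returns a copy of pods sorted so that the first element
// is the first preemption candidate: lowest priority first, with ties
// broken by QoS class.
func preemptionOrder(pods []pod) []pod {
	ordered := append([]pod(nil), pods...)
	sort.SliceStable(ordered, func(i, j int) bool {
		if ordered[i].priority != ordered[j].priority {
			return ordered[i].priority < ordered[j].priority
		}
		return ordered[i].qos < ordered[j].qos
	})
	return ordered
}

func main() {
	pods := []pod{
		{"web", 4000, guaranteed},
		{"batch-a", 1000, burstable},
		{"batch-b", 1000, bestEffort},
	}
	for _, p := range preemptionOrder(pods) {
		fmt.Println(p.name)
	}
}
```

With these inputs the two priority-1000 pods come before the priority-4000 pod, and `batch-b` precedes `batch-a` because of the assumed QoS tie-break.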