From e2af94c0f7e9791871ec10686bdf28153a4192b7 Mon Sep 17 00:00:00 2001
From: "Da K. Ma"
Date: Tue, 3 Jul 2018 09:57:49 +0800
Subject: [PATCH] Gang scheduling.

Signed-off-by: Da K. Ma
---
 .../scheduling/gang-scheduling.md | 188 ++++++++++++++++++
 keps/NEXT_KEP_NUMBER              |   2 +-
 2 files changed, 189 insertions(+), 1 deletion(-)
 create mode 100644 contributors/design-proposals/scheduling/gang-scheduling.md

diff --git a/contributors/design-proposals/scheduling/gang-scheduling.md b/contributors/design-proposals/scheduling/gang-scheduling.md
new file mode 100644
index 00000000000..011076dd4b6
--- /dev/null
+++ b/contributors/design-proposals/scheduling/gang-scheduling.md
@@ -0,0 +1,188 @@
---
kep-number: 19
title: Gang Scheduling
authors:
  - "@k82cn"
owning-sig: sig-scheduling, machine-learning WG
reviewers:
  - "@bsalamat"
  - "@vishh"
approvers:
  - "@bsalamat"
  - "@vishh"
editor: TBD
creation-date: 2018-07-03
last-updated: 2018-07-25
status: provisional
---

# Gang Scheduling

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Motivation](#motivation)
* [Function Detail](#function-detail)
  * [API Definition](#api-definition)
  * [Lifecycle Policy](#lifecycle-policy)
  * [kube-arbitrator](#kube-arbitrator)
  * [Customized Controller](#customized-controller)
* [Feature Interaction](#feature-interaction)
  * [Multi-scheduler](#multi-scheduler)
  * [Priority/Preemption](#prioritypreemption)
  * [Pod RestartPolicy](#pod-restartpolicy)
  * [Admission Controller](#admission-controller)
  * [Kubectl](#kubectl)
* [References](#references)

## Motivation

Following the discussion of the [Gang-scheduling](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv) proposal, we decided to define the gang-scheduling API in core Kubernetes and implement it in [kube-arbitrator](https://github.com/kubernetes-incubator/kube-arbitrator). kube-arbitrator focuses on "batch" workloads in Kubernetes, and will share the same [scheduling framework](https://github.com/kubernetes/community/pull/2281) when it is ready. This document defines the API object and the scheduler behaviour of gang-scheduling.

## Function Detail

### API Definition

Although lifecycle requirements differ across workloads, e.g. MPI vs. Spark, their requirements on the scheduler are similar. To meet these requirements, the following **Kind** is introduced in core under the `scheduling.k8s.io/v1alpha1` **Group**/**Version**.

```go
// PodSchedulingGroup defines the scheduling requirements of a pod group.
type PodSchedulingGroup struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec   PodSchedulingGroupTemplate
	Status PodSchedulingGroupStatus
}

// LifeCyclePolicy represents the lifecycle policy of a PodSchedulingGroup
// according to the phase of its Pods.
type LifeCyclePolicy struct {
	// The action that will be taken on the PodSchedulingGroup according to
	// the Pod's phase. One of "Restart", "None".
	// Defaults to "None".
	Action Action

	// The event observed for the pod; the controller takes action according
	// to this event. One of "PodFailed", "Unschedulable".
	Event Event
}
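
// Action and Event are not defined elsewhere in this proposal; the
// definitions below are an illustrative sketch only. The type shape,
// constant names, and string values are assumptions derived from the
// field comments above, not part of the proposed API.
type Action string

const (
	RestartAction Action = "Restart"
	NoneAction    Action = "None"
)

type Event string

const (
	PodFailedEvent     Event = "PodFailed"
	UnschedulableEvent Event = "Unschedulable"
)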

// PodSchedulingGroupTemplate represents the template of a pod group.
type PodSchedulingGroupTemplate struct {
	// MinAvailable defines the minimal number of available tasks required to
	// run the Job; if there are not enough resources to start all tasks, the
	// scheduler will not start any of them.
	MinAvailable int

	// Policy defines the lifecycle policy of the PodSchedulingGroup.
	// Defaults to 'Action: None, Event: PodFailed'.
	// +optional
	Policy []LifeCyclePolicy
}

// PodSchedulingGroupStatus represents the current state of a pod group.
type PodSchedulingGroupStatus struct {
	// The number of actively running pods.
	// +optional
	Running int32

	// The number of pods which reached phase Succeeded.
	// +optional
	Succeeded int32

	// The number of pods which reached phase Failed.
	// +optional
	Failed int32
}
```

`PodSchedulingGroup` is a namespaced object that specifies the scheduling constraints of a pod group, e.g. the minimal number of available pods of the group and its lifecycle policy. To define which pods are members of a `PodSchedulingGroup`, the following field is introduced to `PodSpec`.

```go
// PodSpec is a description of a pod.
type PodSpec struct {
	...

	// The name of the PodSchedulingGroup that the pod belongs to; the pod
	// must belong to a PodSchedulingGroup in the same namespace.
	// +optional
	GroupName string

	...
}
```

`.spec.GroupName` specifies the `PodSchedulingGroup` that the pod belongs to; a pod can only belong to a `PodSchedulingGroup` in its own namespace. Pods managed by different collections (controllers) can belong to the same `PodSchedulingGroup`. Because of performance concerns, a `LabelSelector` is not used to build the relationship between `PodSchedulingGroup` and `Pod`.

### Lifecycle Policy

A new controller, named `PodSchedulingGroupController`, is introduced to manage `PodSchedulingGroup`. It updates the status of each `PodSchedulingGroup` accordingly, and takes action according to the `LifeCyclePolicy` of each `PodSchedulingGroup`. The following policies are supported for now:

| Event         | Action  | Version | Behaviour |
| ------------- | ------- | ------- | --------- |
| PodFailed     | Restart | 1.12    | The `PodSchedulingGroupController` deletes all `Running` and `Failed` Pods of the `PodSchedulingGroup` with the default grace period; it also records an event for the `PodSchedulingGroup` when restarting. It is up to the pods' controller to re-create the pods. |
| PodFailed     | None    | 1.12    | The `PodSchedulingGroupController` does nothing. |
| Unschedulable | Restart | 1.12    | The `PodSchedulingGroupController` deletes all `Running` Pods of the `PodSchedulingGroup` with the default grace period; it also records an event for the `PodSchedulingGroup` when restarting. It is up to the pods' controller to re-create the pods. |
| Unschedulable | None    | 1.12    | The `PodSchedulingGroupController` does nothing. |

There is a limitation: with the policy `Action: Restart, Event: PodFailed`, the `PodSchedulingGroup` may keep restarting in some cases, e.g. because of a bug in the application running in the pod. A solution, e.g. a `RetryCount`, will be proposed in a coming release.

Updating a `PodSchedulingGroup` is not supported for now; deleting a `PodSchedulingGroup` does not impact the status of its Pods.
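
To make the table above concrete, the following is a minimal sketch of how the `PodSchedulingGroupController` could react to these events. It is illustrative only: the helper methods (`groupPods`, `recordRestartEvent`, `deletePods`) and the classification of pods into running/failed/unschedulable are assumptions, not part of the proposed API.

```go
// syncGroup is an illustrative reconcile step for one PodSchedulingGroup.
func (c *PodSchedulingGroupController) syncGroup(grp *PodSchedulingGroup) error {
	// groupPods is a hypothetical helper that lists all Pods in grp.Namespace
	// whose .spec.GroupName equals grp.Name, split by their current state.
	running, failed, unschedulable, err := c.groupPods(grp)
	if err != nil {
		return err
	}

	for _, p := range grp.Spec.Policy {
		switch {
		case p.Event == "PodFailed" && p.Action == "Restart" && len(failed) > 0:
			// Delete all Running and Failed pods with the default grace period;
			// the pods' own controller is expected to re-create them.
			c.recordRestartEvent(grp, "PodFailed")
			return c.deletePods(append(running, failed...))

		case p.Event == "Unschedulable" && p.Action == "Restart" && len(unschedulable) > 0:
			// Delete only the Running pods; re-creation is again left to the
			// pods' controller.
			c.recordRestartEvent(grp, "Unschedulable")
			return c.deletePods(running)

		default:
			// Action "None" (the default): nothing to do for this entry.
		}
	}
	return nil
}
```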
### kube-arbitrator

`kube-arbitrator` watches only `PodSchedulingGroup` and `Pod`. It reconstructs a 'Job' from the `GroupName` of the Pods and the corresponding `PodSchedulingGroup`; the `Pod`s are treated as 'Tasks' of the 'Job'. If `GroupName` is empty, `kube-arbitrator` records an unschedulable event for the pod to ask the user/controller to resubmit it to `kube-scheduler`. `kube-arbitrator` does not schedule pods until their `PodSchedulingGroup` is ready.

As `kube-arbitrator` and `kube-scheduler` may run in parallel, `kube-arbitrator` follows the multi-scheduler feature and only handles the `Pod`s that were submitted to `kube-arbitrator`. `kube-arbitrator` performs scheduling as follows:

1. Reconstruct the 'Job' from the `.spec.GroupName` of the `Pod`s and the `PodSchedulingGroup`.
2. If the `PodSchedulingGroup` is not ready, the 'Job' is not scheduled, and an unschedulable event is recorded for its Pods.
3. In the `allocate` phase, `kube-arbitrator` will:
   * record an `Unschedulable` event for the `PodSchedulingGroup` if some pods are running but `succeeded + pending + running < minAvailable`; the `PodSchedulingGroupController` then takes action according to the `LifeCyclePolicy`,
   * allocate (but not bind) resources to Pods according to the Pod's spec, e.g. `NodeAffinity`,
   * bind all Pods to hosts only once the job is ready (`minAvailable <= allocated Pods + succeeded Pods`).
4. If not enough resources can be allocated to the job, its pods stay pending, and the already allocated resources are not given to another job.

This may leave resources (less than the job's resource request) idle for a while, e.g. for a huge job. A solution, e.g. backfilling smaller jobs to improve resource utilization, will be proposed in a coming release. In the `allocate` phase, only the pod's `NodeAffinity` takes effect; the other predicates/priorities will be included on demand in a coming release.

### Customized Controller

A typical example of a customized controller is [kubeflow/tf-operator](https://github.com/kubeflow/tf-operator), which manages the Pods for TensorFlow on Kubernetes and has requested `gang-scheduling` upstream. The following is an example of a customized controller that demonstrates the usage of `gang-scheduling` in `kube-arbitrator`; a code sketch follows the list.

Usually, the CRD ([CustomResourceDefinitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)) feature is used to introduce a customized **Kind**, named `CRDJob` in this example. The customized controller, named `CRDJobController`, watches it and manages its lifecycle:

1. For each `CRDJob`, `CRDJobController` creates a `PodSchedulingGroup` (one `PodSchedulingGroup` per `CRDJob` in this example). The attributes of the `PodSchedulingGroup` should be set accordingly, e.g. `minAvailable`; it is up to the customized controller how to manage the relationship between the `PodSchedulingGroup` and the `CRDJob`, e.g. via `metadata.name`.
2. When `CRDJobController` creates Pods, their `.spec.GroupName` should be set accordingly. `kube-arbitrator` then follows the gang-scheduling logic to schedule those pods in batch.
3. When pods fail, are deleted, or are unschedulable, it is up to `CRDJobController` how to manage the `CRDJob`'s lifecycle. For example, if `CRDJobController` manages the lifecycle itself, it sets `.spec.Policy` of the `PodSchedulingGroup` to nil; otherwise, `PodSchedulingGroupController` will manage the lifecycle.
4. If the `CRDJob` is deleted, the `PodSchedulingGroup` must be deleted accordingly.
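
The sketch below illustrates steps 1 and 2 above. It is a sketch under stated assumptions: `CRDJob` and its `Replicas` field, the `createGroup`/`newWorkerPod`/`createPod` helpers, and the scheduler name `"kube-arbitrator"` are hypothetical, and the code assumes the proposed `GroupName` field exists in `PodSpec`.

```go
// syncCRDJob is an illustrative reconcile step for a single CRDJob.
func (c *CRDJobController) syncCRDJob(job *CRDJob) error {
	// Step 1: create one PodSchedulingGroup per CRDJob, linked by metadata.name.
	group := &PodSchedulingGroup{
		ObjectMeta: metav1.ObjectMeta{Name: job.Name, Namespace: job.Namespace},
		Spec: PodSchedulingGroupTemplate{
			// Do not start any pod unless all replicas can be scheduled together.
			MinAvailable: job.Spec.Replicas,
			// nil Policy: this controller manages the lifecycle itself instead
			// of delegating to the PodSchedulingGroupController.
			Policy: nil,
		},
	}
	if err := c.createGroup(group); err != nil { // hypothetical helper
		return err
	}

	// Step 2: member pods declare the group by name and target kube-arbitrator.
	for i := 0; i < job.Spec.Replicas; i++ {
		pod := c.newWorkerPod(job, i) // hypothetical helper returning *corev1.Pod
		pod.Spec.GroupName = job.Name              // proposed PodSpec field
		pod.Spec.SchedulerName = "kube-arbitrator" // assumed scheduler name
		// "Never" avoids an endless restart loop for run-to-complete tasks.
		pod.Spec.RestartPolicy = corev1.RestartPolicyNever
		if err := c.createPod(pod); err != nil { // hypothetical helper
			return err
		}
	}
	return nil
}
```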
## Feature Interaction

### Multi-scheduler

When multiple schedulers are enabled, there may be decision conflicts between them, and the kubelet rejects (fails) a pod on conflict. The controller handles rejected pods based on the `LifeCyclePolicy` for failed pods. It is up to the cluster admin to avoid such conflicts, e.g. by using node affinity, which is supported by kube-arbitrator in 1.12.

### Priority/Preemption

As described in the `Multi-scheduler` section, it is better to run batch and service workloads in different zones to avoid conflicts. Rejecting a batch/run-to-complete pod may trigger a restart of the whole `PodSchedulingGroup`, which impacts performance. A solution for handling such conflicts better will be proposed in a coming release.

The default scheduler should also consider `PodSchedulingGroup` when preempting pods, similar to `PodDisruptionBudget`s.

### Pod RestartPolicy

`LifeCyclePolicy`, defined above, only represents the lifecycle policy of the `PodSchedulingGroup`; the Pod's `RestartPolicy` still works as before. For batch/run-to-complete workloads, it is better to set `RestartPolicy` to `Never` to avoid an endless restart loop.

### Admission Controller

Because of admission controllers, e.g. `ResourceQuota`, a few pods of a gang may be created while the rest fail admission and are not created. For batch or run-to-complete workloads, it is useful to keep some pending jobs queued, so that pending jobs can start immediately when running jobs finish, instead of waiting for the client to submit new jobs; a kind of pipeline. The difference is that the pods' controller has to make sure not to introduce a deadlock (e.g. by starting only part of every pod group); only the last pod group should be impacted. A solution to handle groups impacted by admission controllers in this way will be proposed in a coming release.

### Kubectl

kubectl is enhanced to support `PodSchedulingGroup`, including its status.

## References

* [Gang scheduling in Kubernetes](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv)
* [Indexed Job](https://github.com/kubernetes/kubernetes/issues/14188)
* [Schedule a group of pods all at once](https://github.com/kubernetes/kubernetes/issues/16845)
* [kubeflow/tf-operator: Prevent scheduling deadlocks](https://github.com/kubeflow/tf-operator/issues/165)

diff --git a/keps/NEXT_KEP_NUMBER b/keps/NEXT_KEP_NUMBER
index 3c032078a4a..d6b24041cf0 100644
--- a/keps/NEXT_KEP_NUMBER
+++ b/keps/NEXT_KEP_NUMBER
@@ -1 +1 @@
-18
+19