Commit
Gang scheduling.
Signed-off-by: Da K. Ma <klaus1982.cn@gmail.com>
k82cn committed Jul 10, 2018
1 parent d7dee02 commit c6a1d38
Showing 3 changed files with 174 additions and 0 deletions.
174 changes: 174 additions & 0 deletions contributors/design-proposals/scheduling/gang-scheduling.md
@@ -0,0 +1,174 @@
# Kubernetes Gang-Scheduling

**Authors**: @k82cn , **Shepherds**: @vishh, @bsalamat

**SIG/WG**: sig-scheduling, ML-WG

## Motivation

After the discussion in the [Gang-scheduling](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv) proposal, we decided to keep the gang-scheduling API out of core by defining it as a CRD and implementing it in [kube-arbitrator](https://github.com/kubernetes-incubator/kube-arbitrator). kube-arbitrator focuses on "batch" workloads in Kubernetes, and will share the same [scheduling frameworks](https://github.com/kubernetes/community/pull/2281) when they are ready. This document provides the CRD definitions for gang-scheduling and suggestions for customized job controllers.

## Function Detail

### CRD Definition

Although workload lifecycle requirements may differ, e.g. MPI vs. Spark, the requirements on the scheduler are similar. To meet both scheduling and lifecycle requirements, two **Kinds** are defined. The first new **Kind**, named `SchedulingSpec`, is defined as follows for the scheduler:

```go
// SchedulingSpec defines the scheduling requirements of a 'Job'.
type SchedulingSpec struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    Spec SchedulingSpecTemplate
}

// SchedulingSpecTemplate represents the template of SchedulingSpec.
type SchedulingSpecTemplate struct {
    // If specified, indicates the Job's priority. "system-node-critical" and
    // "system-cluster-critical" are two special keywords which indicate the
    // highest priorities with the former being the highest priority. Any other
    // name must be defined by creating a PriorityClass object with that name.
    // If not specified, the Job's priority will be the default, or zero if
    // there is no default.
    // +optional
    PriorityClassName string

    // The priority value. Various system components use this field to find the
    // priority of the Job. When the Priority Admission Controller is enabled,
    // it prevents users from setting this field. The admission controller
    // populates this field from PriorityClassName.
    // The higher the value, the higher the priority.
    // +optional
    Priority *int32

    // NodeSelector is a selector which must be true for the pods of the 'Job'
    // to fit on a node, i.e. it must match a node's labels for the pods of the
    // 'Job' to be scheduled on that node.
    // +optional
    NodeSelector map[string]string

    // MinAvailable defines the minimal number of available tasks to run the
    // Job; if there are not enough resources to start all tasks, the scheduler
    // will not start any of them.
    MinAvailable int
}
```

`SchedulingSpec` defines the parameters of a Job that the scheduler needs, e.g. the Job's priority and its minimal number of available tasks. `SchedulingSpec` does not include lifecycle parameters; those are managed by the following new **Kind**, named `QueueJob`:

```go
// QueueJob is a Kind for batch workloads.
type QueueJob struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    // Specification of the desired behavior of the QueueJob,
    // including the minAvailable.
    Specs QueueJobSpec

    // Current status of the QueueJob.
    Status QueueJobStatus
}

// QueueJobSpec describes how the job execution will look and
// when it will actually run.
type QueueJobSpec struct {
    // SchedulerName is the default value of `taskSpecs.template.spec.schedulerName`.
    SchedulerName string

    // SchedSpec specifies the parameters for scheduling.
    SchedSpec SchedulingSpecTemplate

    // TaskSpecs specifies the task specifications of the QueueJob.
    TaskSpecs []TaskSpec
}

// QueueJobStatus represents the current state of a QueueJob.
type QueueJobStatus struct {
    // The number of actively running pods.
    // +optional
    Running int32

    // The number of pods which reached phase Succeeded.
    // +optional
    Succeeded int32

    // The number of pods which reached phase Failed.
    // +optional
    Failed int32

    // The minimal number of available pods to run for this QueueJob.
    // +optional
    MinAvailable int32
}

// TaskSpec specifies the task specification of a QueueJob.
type TaskSpec struct {
    // A label query over pods that should match the pod count.
    // Normally, the system sets this field for you.
    // +optional
    Selector *metav1.LabelSelector

    // Replicas specifies the number of replicas of this TaskSpec in the QueueJob.
    Replicas int32

    // Specifies the pod that will be created for this TaskSpec
    // when executing a QueueJob.
    Template v1.PodTemplateSpec
}
```

`QueueJob` is the default **Kind** for batch jobs; it includes multiple types of tasks to meet the requirements of ML frameworks such as TensorFlow. `QueueJob` also embeds a `SchedulingSpecTemplate` to describe its requirements to the scheduler. A `QueueJob` is managed by the `QueueJobController` in `kar-controllers`; its functions are described in detail in the following sections.
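
As an illustration, a TensorFlow-style job with parameter servers and workers could be expressed as a `QueueJob` with two `TaskSpec`s. This is a minimal sketch using the Go types above; the job name, labels, image, replica counts, and the `"kar-scheduler"` scheduler name are hypothetical values, and the snippet assumes the usual `k8s.io/api/core/v1` (`v1`) and `k8s.io/apimachinery/pkg/apis/meta/v1` (`metav1`) imports:

```go
// newTFQueueJob builds an illustrative QueueJob for a TensorFlow-style job
// with 2 parameter servers and 4 workers; MinAvailable = 6 means all tasks
// must be placed before any of them is bound.
func newTFQueueJob() *QueueJob {
    return &QueueJob{
        ObjectMeta: metav1.ObjectMeta{Name: "tf-example"},
        Specs: QueueJobSpec{
            // Route the pods to kar-scheduler instead of the default scheduler
            // (scheduler name assumed for illustration).
            SchedulerName: "kar-scheduler",
            SchedSpec: SchedulingSpecTemplate{
                MinAvailable: 6,
            },
            TaskSpecs: []TaskSpec{
                {
                    Replicas: 2,
                    Template: v1.PodTemplateSpec{
                        ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"role": "ps"}},
                        Spec: v1.PodSpec{
                            Containers: []v1.Container{{Name: "ps", Image: "tensorflow/tensorflow:1.8.0"}},
                        },
                    },
                },
                {
                    Replicas: 4,
                    Template: v1.PodTemplateSpec{
                        ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"role": "worker"}},
                        Spec: v1.PodSpec{
                            Containers: []v1.Container{{Name: "worker", Image: "tensorflow/tensorflow:1.8.0"}},
                        },
                    },
                },
            },
        },
    }
}
```

With `MinAvailable: 6`, the scheduler will not bind any of the six pods until all of them can be placed.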

### kar-scheduler (scheduler component of kube-arbitrator)

The scheduler only watches `SchedulingSpec` and `Pod` objects; it reconstructs the 'Job' from the controller `OwnerReference` (`OwnerReference.Controller == true`), and the `Pod`s are considered the 'Tasks' of the 'Job'. That requires the controller to set the `OwnerReference` correctly. Unlike other components, `kar-scheduler` does not use a `LabelSelector` to reconstruct the job, because of performance concerns.
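
A minimal sketch of that reconstruction, assuming the scheduler keeps the watched `Pod`s and `SchedulingSpec`s in local caches; the `jobInfo` type and `groupByController` function below are illustrative, not the actual kar-scheduler code (assumed imports: `v1`, `metav1`, and `k8s.io/apimachinery/pkg/types`):

```go
// jobInfo is an illustrative in-memory 'Job' reconstructed by the scheduler.
type jobInfo struct {
    spec *SchedulingSpec // scheduling requirements, e.g. MinAvailable
    pods []*v1.Pod       // the 'Tasks' of the 'Job'
}

// groupByController rebuilds jobs from the controller OwnerReference; objects
// sharing the same controller owner UID belong to the same job.
func groupByController(pods []*v1.Pod, specs []*SchedulingSpec) map[types.UID]*jobInfo {
    jobs := map[types.UID]*jobInfo{}

    get := func(uid types.UID) *jobInfo {
        if _, ok := jobs[uid]; !ok {
            jobs[uid] = &jobInfo{}
        }
        return jobs[uid]
    }

    for _, p := range pods {
        // GetControllerOf returns the OwnerReference with Controller == true, if any.
        if owner := metav1.GetControllerOf(p); owner != nil {
            j := get(owner.UID)
            j.pods = append(j.pods, p)
        }
    }
    for _, s := range specs {
        if owner := metav1.GetControllerOf(s); owner != nil {
            get(owner.UID).spec = s
        }
    }
    return jobs
}
```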

As `kar-scheduler` and `kube-scheduler` may run in parallel, `kar-scheduler` follows the multi-scheduler feature and only handles the `Pod`s and `SchedulingSpec`s that are submitted to it. `kar-scheduler` schedules jobs as follows:

1. Reconstruct the Job from the `OwnerReference` of the `Pod`s and `SchedulingSpec`; objects (`Pod`s and `SchedulingSpec`) with the same `OwnerReference` are considered to belong to the same job
2. If the `SchedulingSpec` has not been created, the job will not be scheduled
3. In the `predicate` phase, filter nodes for each job; if there is no `NodeSelector`, all nodes are candidates
4. In the `allocate` phase, `kar-scheduler` allocates resources to the job, and only binds `Pod`s to hosts once the job is ready (`minAvailable` <= allocated `Pod`s); see the sketch after the next paragraph
5. If enough resources cannot be allocated to the job, those resources cannot be allocated to another job until the next scheduling cycle

That may leave resources (less than the job's resource request) idle for a while, e.g. for a huge job. One solution is to backfill other, smaller jobs to improve resource utilization.
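
The gang gate in step 4 can be summarized in a few lines. Building on the `jobInfo` sketch above, this is purely illustrative (the `fit` and `bind` callbacks stand in for the real predicate and binding logic; `fmt` is assumed to be imported):

```go
// allocation is a tentative pod-to-node assignment made during the allocate phase.
type allocation struct {
    pod  *v1.Pod
    node string
}

// allocateJob binds a job's pods only if at least MinAvailable of them can be
// placed; otherwise nothing is bound and the tentative assignments are dropped,
// freeing the resources for the next scheduling cycle.
func allocateJob(job *jobInfo, fit func(*v1.Pod) (string, bool), bind func(*v1.Pod, string) error) error {
    // Step 2: a job without a SchedulingSpec is not scheduled at all.
    if job.spec == nil {
        return fmt.Errorf("job has no SchedulingSpec yet")
    }

    var allocs []allocation
    for _, p := range job.pods {
        if node, ok := fit(p); ok {
            allocs = append(allocs, allocation{pod: p, node: node})
        }
    }

    // Gang gate (step 4): the job is ready only when minAvailable <= allocated pods.
    if len(allocs) < job.spec.Spec.MinAvailable {
        return fmt.Errorf("job not ready: %d/%d tasks allocated", len(allocs), job.spec.Spec.MinAvailable)
    }

    for _, a := range allocs {
        if err := bind(a.pod, a.node); err != nil {
            return err
        }
    }
    return nil
}
```

The key point is that binding is all-or-nothing within a scheduling cycle: either at least `MinAvailable` pods are bound together, or none are.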

### kar-controllers (controllers component of kube-arbitrator)

Similar to `kube-controller-manager`, `kar-controllers` is a collection of controllers, including `QueueJobController` and, in the future, `QueueController`. The `QueueJobController` manages the lifecycle of a `QueueJob` as follows:

* **Pod/QueueJob Creation**: The `QueueJobController` creates pods up to `.spec.SchedSpec.minAvailable`, ordered by the priority in `spec.TaskSpecs.Template`, and waits for `kar-scheduler` to schedule the pods in batch. The `QueueJobController` handles `QueueJob`s by priority: it first creates `.spec.SchedSpec.minAvailable` pods for each `QueueJob`, and then creates the remaining pods (those exceeding `.spec.SchedSpec.minAvailable`) in round robin. When a `ResourceQuota` is exceeded, this ensures that only the `QueueJob` with the lowest priority is impacted. After creation, a `QueueJob`'s `.spec.SchedSpec.minAvailable` may be scaled up; if the `Quota` is exceeded, the scheduler will not schedule the `QueueJob` against the new pods until all pods have been created.

* **Pod Failures/Deletion/Unschedulable**: When a pod fails, is deleted, or becomes unschedulable, the `QueueJobController` manages the `QueueJob`'s lifecycle according to its status (a sketch of this decision follows the list):

    1. `Number of Running pods` < `.spec.SchedSpec.minAvailable`: If the number of running pods is less than `.spec.SchedSpec.minAvailable`, the `QueueJobController` will kill the whole `QueueJob` and recreate its pods so they can be rescheduled in batch.

    2. `Number of Running pods` >= `.spec.SchedSpec.minAvailable`: If there are still enough running pods for the `QueueJob`, the `QueueJobController` will only recreate the failed pods and wait for the scheduler to bind the pending pods.

* **QueueJob Update/Delete**: There is no rolling update for `QueueJob`; if the pod template is updated, the whole `QueueJob` is re-created. If a `QueueJob` is deleted, all its pods and its `SchedulingSpec` are cleaned up.
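
The failure-handling rule above can be expressed compactly. This is an illustrative sketch, assuming the `QueueJob` types defined earlier; the real controller also takes priorities and quota into account:

```go
// syncOnPodFailure decides how the QueueJobController reacts when a pod of the
// QueueJob has failed, been deleted, or stayed unschedulable.
func syncOnPodFailure(qj *QueueJob) string {
    if int(qj.Status.Running) < qj.Specs.SchedSpec.MinAvailable {
        // Not enough running pods: kill the whole QueueJob and recreate its
        // pods so that the gang can be rescheduled in batch.
        return "restart-queuejob"
    }
    // Still enough running pods: only recreate the failed pods and wait for
    // the scheduler to bind the pending ones.
    return "recreate-failed-pods"
}
```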


![queuejob](images/gang-scheduling-queuejob.png)


### Customized controller

A typical example of a customized controller is [kubeflow/tf-operator](https://github.com/kubeflow/tf-operator), which manages the Pods for TensorFlow on Kubernetes and requires `gang-scheduling` support upstream. Here is an example of a customized controller that demonstrates the usage of `gang-scheduling` in `kube-arbitrator`.

Usually, the CRD ([CustomResourceDefinitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)) feature is used to introduce a customized **Kind**, named `CRDJob` in this example. The customized controller, named `CRDJobController`, watches it and manages its lifecycle:

1. For each `CRDJob`, the `CRDJobController` creates a `SchedulingSpec` whose `OwnerReference` is the `CRDJob`. The attributes of the `SchedulingSpec` should be set accordingly, e.g. `minAvailable`; it is up to the customized controller how to manage the relationship between the `SchedulingSpec` and the `CRDJob`, e.g. via `metadata.name`
2. When the `CRDJobController` creates Pods, it also needs to set the Pods' `OwnerReference` to the `CRDJob`. `kar-scheduler` then follows the gang-scheduling logic to schedule those pods in batch (see the sketch after this list)
3. When pods fail, are deleted, or become unschedulable, it is up to the `CRDJobController` how to manage the `CRDJob`'s lifecycle. If the `CRDJob` is deleted, the `SchedulingSpec` must be deleted accordingly
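
The essential wiring for steps 1 and 2 is the controller `OwnerReference` and the scheduler name. The sketch below assumes a hypothetical `CRDJob` type that embeds `metav1.ObjectMeta`, a hypothetical `crdJobGVK`, and the usual `metav1`, `v1`, and `k8s.io/apimachinery/pkg/runtime/schema` imports; it is not tf-operator code:

```go
// crdJobGVK identifies the hypothetical CRDJob Kind; group and version are placeholders.
var crdJobGVK = schema.GroupVersionKind{Group: "example.com", Version: "v1alpha1", Kind: "CRDJob"}

// newSchedulingSpecFor creates the SchedulingSpec that makes kar-scheduler
// treat all pods of this CRDJob as one gang of at least minAvailable tasks.
func newSchedulingSpecFor(job *CRDJob, minAvailable int) *SchedulingSpec {
    return &SchedulingSpec{
        ObjectMeta: metav1.ObjectMeta{
            // Reusing the CRDJob's name is one way to track the relationship.
            Name:      job.Name,
            Namespace: job.Namespace,
            // The controller OwnerReference is what kar-scheduler uses to
            // group the SchedulingSpec and the pods into one job.
            OwnerReferences: []metav1.OwnerReference{
                *metav1.NewControllerRef(job, crdJobGVK),
            },
        },
        Spec: SchedulingSpecTemplate{
            MinAvailable: minAvailable,
        },
    }
}

// decoratePod sets the same controller OwnerReference on the pod and routes it
// to kar-scheduler via the multi-scheduler feature.
func decoratePod(job *CRDJob, pod *v1.Pod) {
    pod.OwnerReferences = []metav1.OwnerReference{*metav1.NewControllerRef(job, crdJobGVK)}
    // Scheduler name assumed for illustration.
    pod.Spec.SchedulerName = "kar-scheduler"
}
```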

![crdjob](images/gang-scheduling-crdjob.png)


## References

* [Gang scheduling in Kubernetes](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv)
* [Indexed Job](https://github.com/kubernetes/kubernetes/issues/14188)
* [Schedule a group of pods all at once](https://github.com/kubernetes/kubernetes/issues/16845)
* [kubeflow/tf-operator: Prevent scheduling deadlocks](https://github.com/kubeflow/tf-operator/issues/165)
2 binary files changed (not displayed).
