Added Queue design doc. #95

Merged 1 commit on Apr 27, 2019
113 changes: 113 additions & 0 deletions docs/design/queue.md
@@ -0,0 +1,113 @@
# Queue

[@k82cn](http://github.com/k82cn); April 17, 2019

## Motivation

`Queue` was introduced in [kube-batch](http://github.com/kubernetes-sigs/kube-batch) long ago as an internal feature, under which all jobs are submitted to the same queue, named `default`. As more and more users want to share resources with each other through queues, this proposal covers the primary queue features needed to achieve that.

## Function Specification

A queue is a cluster-level object, so users from different namespaces can share resources within a `Queue`. The following section defines the queue API.

### API

```go
type Queue struct {
	metav1.TypeMeta `json:",inline"`

	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

	// Specification of the desired behavior of a queue
	// +optional
	Spec QueueSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`

	// Current status of Queue
	// +optional
	Status QueueStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
}

type QueueSpec struct {
	// Weight is the relative weight of the queue when resources are shared
	// between queues: the proportion plugin gives each queue
	// (weight / total-weight) * total-resource.
	Weight int32 `json:"weight,omitempty" protobuf:"bytes,1,opt,name=weight"`
	// Review comment (Collaborator): What's the meaning of the value? Should have a clear desc.
}

type QueueStatus struct {
	// The number of jobs in Unknown status
	Unknown int32 `json:"unknown,omitempty" protobuf:"bytes,1,opt,name=unknown"`
	// The number of jobs in Running status
	Running int32 `json:"running,omitempty" protobuf:"bytes,2,opt,name=running"`
	// The number of jobs in Pending status
	Pending int32 `json:"pending,omitempty" protobuf:"bytes,3,opt,name=pending"`
	// The number of jobs in Completed status
	Completed int32 `json:"completed,omitempty" protobuf:"bytes,4,opt,name=completed"`
	// The number of jobs in Failed status
	Failed int32 `json:"failed,omitempty" protobuf:"bytes,5,opt,name=failed"`
	// The number of jobs in Aborted status
	Aborted int32 `json:"aborted,omitempty" protobuf:"bytes,6,opt,name=aborted"`
}
```

### QueueController

The `QueueController` manages the lifecycle of a queue:

1. It watches `PodGroup`/`Job` objects to track their status.
2. If a `Queue` is deleted, it also deletes all related `PodGroup`/`Job` objects in that queue.
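
The status tracking in step 1 can be sketched as a simple aggregation. This is a minimal illustration, not the controller's actual code: the `Phase` and `Job` types and the `aggregate` function are hypothetical, and `QueueStatus` is reduced to its counters. Jobs without a recognised phase are counted as `Unknown`, which matches how customized `PodGroup`s are treated in the Feature Interaction section.

```go
package main

import "fmt"

// Phase is a hypothetical job phase; the names mirror the QueueStatus fields.
type Phase string

const (
	Running   Phase = "Running"
	Pending   Phase = "Pending"
	Completed Phase = "Completed"
	Failed    Phase = "Failed"
	Aborted   Phase = "Aborted"
)

// QueueStatus is a reduced copy of the counters from the API section.
type QueueStatus struct {
	Unknown, Running, Pending, Completed, Failed, Aborted int32
}

// Job is a hypothetical, simplified view of a PodGroup/Job: the queue it
// belongs to and its phase (empty when the status is not known).
type Job struct {
	Queue string
	Phase Phase
}

// aggregate counts the jobs of one queue per phase. Any job without a
// recognised phase is counted as Unknown.
func aggregate(queue string, jobs []Job) QueueStatus {
	var s QueueStatus
	for _, j := range jobs {
		if j.Queue != queue {
			continue
		}
		switch j.Phase {
		case Running:
			s.Running++
		case Pending:
			s.Pending++
		case Completed:
			s.Completed++
		case Failed:
			s.Failed++
		case Aborted:
			s.Aborted++
		default:
			s.Unknown++
		}
	}
	return s
}

func main() {
	jobs := []Job{
		{Queue: "myqueue", Phase: Running},
		{Queue: "myqueue", Phase: Pending},
		{Queue: "myqueue"}, // e.g. a PodGroup from a customized controller
		{Queue: "other", Phase: Running},
	}
	fmt.Printf("%+v\n", aggregate("myqueue", jobs))
}
```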

### Admission Controller

The admission controller checks the queue of a `PodGroup`/`Job` at creation time:

1. If the queue does not exist, the creation is rejected.

   [Review comment, Contributor] Can we inject the default queue?

   [Reply, Member Author] Maybe reject it in the first version; we can refer to how PriorityClass handles this kind of default value.

2. If the queue is releasing, the creation is also rejected.

   [Review comment, Contributor] How about the default weight in the admission hook?
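
The two admission rules can be sketched as a plain validation function. This is only an illustration under assumptions: the `QueueState` type, the state names, the error values, and `validateQueue` are hypothetical (the document names only the "releasing" state), and a real admission webhook would look the queue up through the API server rather than in a map.

```go
package main

import (
	"errors"
	"fmt"
)

// QueueState is a hypothetical lifecycle state of a queue.
type QueueState string

const (
	StateOpen      QueueState = "Open"      // assumed name for a usable queue
	StateReleasing QueueState = "Releasing" // the queue is being released
)

var (
	ErrQueueNotFound  = errors.New("queue does not exist")
	ErrQueueReleasing = errors.New("queue is releasing")
)

// validateQueue applies both rules to a create request for a PodGroup/Job
// that targets the queue called name. queues maps queue names to states.
func validateQueue(queues map[string]QueueState, name string) error {
	state, ok := queues[name]
	if !ok {
		// Rule 1: unknown queue, reject the creation.
		return fmt.Errorf("%w: %q", ErrQueueNotFound, name)
	}
	if state == StateReleasing {
		// Rule 2: releasing queue, reject the creation.
		return fmt.Errorf("%w: %q", ErrQueueReleasing, name)
	}
	return nil
}

func main() {
	queues := map[string]QueueState{"default": StateOpen, "old": StateReleasing}
	for _, name := range []string{"default", "old", "missing"} {
		fmt.Println(name, "->", validateQueue(queues, name))
	}
}
```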


### Feature Interaction

#### Customized Job/PodGroup

If a `PodGroup` is created by a customized controller, the `QueueController` counts it under the `Unknown` status, because a `PodGroup` focuses on the scheduling specification, which does not include the customized job's status.

#### cli

The command line is also enhanced for operations engineers. Three sub-commands are introduced as follows:

__create__:

The `create` command creates a queue with a weight; for example, the following command creates a queue named `myqueue` with weight 10.

```shell
$ vkctl queue create --name myqueue --weight 10
```

__view__:

The `view` command shows the details of a queue, e.g. its creation time; the following command shows the details of queue `myqueue`.

```shell
$ vkctl queue view myqueue
```

__list__:

The `list` command shows all queues available to the current user.

```shell
$ vkctl queue list
Name     Weight  Total  Pending  Running  ...
myqueue  10      10     5        5
```

#### Scheduler

* Proportion plugin:

The proportion plugin shares resources between `Queue`s by weight. The deserved resource of a queue is `(weight/total-weight) * total-resource`. When allocating resources, the plugin does not give a queue more than its deserved resources.

[Review comment, Collaborator] IIUC, total-weight is the sum of the weights of all queues, so (weight/total-weight) changes as queues are added. How does the scheduler adjust the proportion of resources an existing queue occupies? If the queue's resources are all used up, will this trigger a preemption?

[Review comment, Collaborator] Another case: cluster resources may also change.

[Reply, Contributor] Kube-batch will continue executing those actions every X period.
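
The deserved-share formula can be worked through with a small sketch. It assumes a single scalar resource (for example CPU in arbitrary units); the `deserved` function is hypothetical and not the plugin's code. It also shows the effect raised in the review comments: adding a queue changes `total-weight`, so every queue's share changes the next time it is computed.

```go
package main

import "fmt"

// deserved returns (weight / total-weight) * total-resource for each queue.
// Resources are modelled as one number; this is an assumption made for the
// example, since real resources have several dimensions.
func deserved(weights map[string]int32, total float64) map[string]float64 {
	var sum int64
	for _, w := range weights {
		sum += int64(w)
	}
	out := make(map[string]float64, len(weights))
	if sum == 0 {
		return out // no weights, nothing to share
	}
	for name, w := range weights {
		out[name] = float64(w) / float64(sum) * total
	}
	return out
}

func main() {
	weights := map[string]int32{"q1": 1, "q2": 3}
	fmt.Println(deserved(weights, 100)) // q1 gets 1/4 and q2 gets 3/4 of 100

	// A new queue raises total-weight from 4 to 8, so q1 and q2 shrink.
	weights["q3"] = 4
	fmt.Println(deserved(weights, 100))
}
```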


* Reclaim action:

The `reclaim` action goes through all queues and reclaims resources from the other queues according to the return value of `ReclaimableFn`; the time complexity is `O(n^2)`. In `ReclaimableFn`, both `proportion` and `gang` take effect: 1. `proportion` makes sure the queue will not be under-used after reclaim; 2. `gang` makes sure the job will not be reclaimed if its `minAvailable` > 1.
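
The nested loop behind the `O(n^2)` bound can be sketched as follows. This is a heavily simplified model and not the scheduler's code: it uses one scalar resource, the `Queue` type and the `reclaimable` and `reclaim` functions are hypothetical, and only the `proportion` rule is modelled (the `gang` rule is left out).

```go
package main

import "fmt"

// Queue is a hypothetical, single-resource view of a queue.
type Queue struct {
	Name      string
	Deserved  int // share computed by the proportion plugin
	Allocated int // resources currently in use
	Pending   int // resources requested by pending jobs
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// reclaimable stands in for ReclaimableFn, proportion rule only: a victim
// may give up what it holds above its deserved share, so it is never
// under-used after reclaim.
func reclaimable(victim Queue) int {
	if victim.Allocated > victim.Deserved {
		return victim.Allocated - victim.Deserved
	}
	return 0
}

// reclaim lets every queue that is below its deserved share take resources
// back from the other queues. The nested loop over queues is the O(n^2) part.
func reclaim(queues []Queue) {
	for i := range queues {
		r := &queues[i]
		for j := range queues {
			// r may claim up to its pending demand, capped by its deserved share.
			want := minInt(r.Pending, r.Deserved-r.Allocated)
			if want <= 0 {
				break
			}
			if i == j {
				continue
			}
			v := &queues[j]
			n := minInt(want, reclaimable(*v))
			v.Allocated -= n
			r.Allocated += n
			r.Pending -= n
		}
	}
}

func main() {
	queues := []Queue{
		{Name: "a", Deserved: 5, Allocated: 2, Pending: 4},
		{Name: "b", Deserved: 5, Allocated: 8, Pending: 0},
	}
	reclaim(queues)
	for _, q := range queues {
		fmt.Printf("%s: allocated=%d pending=%d\n", q.Name, q.Allocated, q.Pending)
	}
}
```

In this model queue `a` only reclaims up to its own deserved share, and queue `b` is never pushed below its deserved share.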

* Backfill action:

When the `allocate` action assigns resources to each queue, there is a case ([kube-batch#492](<https://github.com/kubernetes-sigs/kube-batch/issues/492>)) in which resources may sit unnecessarily idle because of the `proportion` plugin: two queues each have one pending job, and the deserved resources of each queue cannot meet the requirements of its job. In such a case, the `backfill` action ignores the deserved guarantee of the queue and fills idle resources as much as possible. This introduces another potential case in which an incoming smaller job is blocked; that case will be handled by the reserved resources of each queue in another project.
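
The idle-resource case can be reproduced with a toy model. The sketch below is an assumption-laden illustration, not the scheduler's code: resources are a single number, and the `Cluster` and `Job` types and their methods are hypothetical. Two queues each deserve 5 units and each has one pending job that requests 6, so `allocate` places nothing, while `backfill` ignores the deserved share and places one job.

```go
package main

import "fmt"

// Job is a hypothetical pending job with a single-number resource request.
type Job struct {
	Name    string
	Queue   string
	Request int
}

// Cluster is a hypothetical single-resource model of the cluster.
type Cluster struct {
	Idle      int
	Deserved  map[string]int // deserved share per queue
	Allocated map[string]int // current allocation per queue
}

func (c *Cluster) place(j Job) {
	c.Idle -= j.Request
	c.Allocated[j.Queue] += j.Request
}

// allocate places a job only if it fits into the idle resources and keeps
// its queue within the deserved share. It returns the jobs left pending.
func (c *Cluster) allocate(jobs []Job) []Job {
	var pending []Job
	for _, j := range jobs {
		if j.Request <= c.Idle && c.Allocated[j.Queue]+j.Request <= c.Deserved[j.Queue] {
			c.place(j)
		} else {
			pending = append(pending, j)
		}
	}
	return pending
}

// backfill ignores the deserved share and only checks idle resources.
func (c *Cluster) backfill(jobs []Job) []Job {
	var pending []Job
	for _, j := range jobs {
		if j.Request <= c.Idle {
			c.place(j)
		} else {
			pending = append(pending, j)
		}
	}
	return pending
}

func main() {
	c := &Cluster{
		Idle:      10,
		Deserved:  map[string]int{"q1": 5, "q2": 5},
		Allocated: map[string]int{},
	}
	jobs := []Job{{"j1", "q1", 6}, {"j2", "q2", 6}}

	pending := c.allocate(jobs)
	fmt.Println("after allocate:", len(pending), "pending, idle =", c.Idle)

	pending = c.backfill(pending)
	fmt.Println("after backfill:", len(pending), "pending, idle =", c.Idle)
}
```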