Documentation Refresh #244

Merged (5 commits) on Mar 4, 2020
61 changes: 33 additions & 28 deletions README.md
@@ -9,9 +9,9 @@ Scheduling in Kubernetes is the process of binding pending pods to nodes, and is
a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a
pod can or cannot be scheduled, are guided by its configurable policy, which comprises a set of
rules called predicates and priorities. The scheduler's decisions are influenced by its view of
the Kubernetes cluster at the point in time when a new pod appears for scheduling.
Because Kubernetes clusters are very dynamic and their state changes over time, it may be desirable
to move already running pods to some other nodes for various reasons:

* Some nodes are under- or over-utilized.
* The original scheduling decision does not hold true any more, as taints or labels are added to
@@ -54,7 +54,7 @@ being able to be run multiple times without needing user intervention.
The descheduler pod is run as a critical pod to avoid being evicted by itself,
or by the kubelet due to an eviction event. Since critical pods are created in the
`kube-system` namespace, the descheduler job and its pod will also be created
in the `kube-system` namespace.

### Setup RBAC

@@ -84,20 +84,21 @@ $ kubectl create -f kubernetes/cronjob.yaml

## Policy and Strategies

Descheduler's policy is configurable and includes strategies that can be enabled or disabled.
Five strategies, `RemoveDuplicates`, `LowNodeUtilization`, `RemovePodsViolatingInterPodAntiAffinity`,
`RemovePodsViolatingNodeAffinity`, and `RemovePodsViolatingNodeTaints`, are currently implemented.
As part of the policy, the parameters associated with the strategies can be configured too.
By default, all strategies are enabled.
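
As a rough sketch of the overall shape of a policy file passed to the descheduler (typically via its `--policy-config-file` flag): the `kind` line is collapsed out of this diff and is assumed here from the standard `DeschedulerPolicy` format, and the strategy selection is purely illustrative.

```
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  # enable or disable individual strategies by name
  "RemoveDuplicates":
    enabled: true
  "RemovePodsViolatingNodeTaints":
    enabled: true
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: false
```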

### RemoveDuplicates

This strategy makes sure that there is only one pod associated with a Replica Set (RS),
Replication Controller (RC), Deployment, or Job running on the same node. If there are more,
those duplicate pods are evicted for better spreading of pods in a cluster. This issue could happen
if some nodes went down for any reason and the pods on them were moved to other nodes, leading to
more than one pod associated with an RS or RC, for example, running on the same node. Once the failed nodes
are ready again, this strategy can be enabled to evict those duplicate pods. Currently, there are no
parameters associated with this strategy. To disable this strategy, the policy should look like this:

```
apiVersion: "descheduler/v1alpha1"
```
### LowNodeUtilization

@@ -116,15 +117,15 @@ parameters of this strategy are configured under `nodeResourceUtilizationThresho
The underutilization of nodes is determined by a configurable threshold, `thresholds`. The threshold
`thresholds` can be configured for cpu, memory, and number of pods, in terms of percentage. If a node's
usage is below the threshold for all of them (cpu, memory, and number of pods), the node is considered underutilized.
Currently, pods' resource requests are considered when computing node resource utilization.

There is another configurable threshold, `targetThresholds`, that is used to compute the potential nodes
from which pods could be evicted. Any node whose usage falls between `thresholds` and `targetThresholds` is
considered appropriately utilized and is not considered for eviction. Like `thresholds`, `targetThresholds`
can be configured for cpu, memory, and number of pods, in terms of percentage.

These thresholds, `thresholds` and `targetThresholds`, can be tuned to suit your cluster's requirements.
Here is an example of a policy for this strategy:

```
apiVersion: "descheduler/v1alpha1"
@@ -144,14 +145,14 @@ strategies:
"pods": 50
```
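
Because the diff collapses most of the example above, here is a fuller sketch of the same policy. The threshold values are illustrative, and the `kind` line is assumed from the standard policy format:

```
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below all of these are considered underutilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:    # nodes above these are candidates for eviction
          "cpu": 50
          "memory": 50
          "pods": 50
```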

There is another parameter associated with the `LowNodeUtilization` strategy, called `numberOfNodes`.
This parameter can be configured so that the strategy is activated only when the number of underutilized nodes
is above the configured value. This could be helpful in large clusters where a few nodes could go
underutilized frequently or for a short period of time. By default, `numberOfNodes` is set to zero.
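
A sketch of how `numberOfNodes` fits into the configuration, assuming it sits alongside `thresholds` under `nodeResourceUtilizationThresholds` (the values are illustrative):

```
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        numberOfNodes: 3     # only act when at least 3 nodes are underutilized
        thresholds:
          "memory": 20
        targetThresholds:
          "memory": 50
```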

### RemovePodsViolatingInterPodAntiAffinity

This strategy makes sure that pods violating inter-pod anti-affinity are removed from nodes. For example, if podA is on a node, and podB and podC (running on the same node) have anti-affinity rules that prohibit them from running on the same node as podA, then podA will be evicted from the node so that podB and podC can run. This issue can happen when the anti-affinity rules for podB and podC are created after they are already running on the node. Currently, there are no parameters associated with this strategy. To disable this strategy, the policy should look like this:

```
apiVersion: "descheduler/v1alpha1"
```
@@ -163,7 +164,7 @@ strategies:
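The example above is truncated by the collapsed diff; assuming the same policy format as the other strategies, disabling it would look roughly like this:

```
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: false   # turn the strategy off
```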

### RemovePodsViolatingNodeAffinity

This strategy makes sure that pods violating node affinity are removed from nodes. For example, suppose podA was scheduled on nodeA, which satisfied the node affinity rule `requiredDuringSchedulingIgnoredDuringExecution` at the time of scheduling, but over time nodeA stops satisfying the rule. If another node, nodeB, is available that satisfies the node affinity rule, podA will be evicted from nodeA. The policy file should look like this:

```
apiVersion: "descheduler/v1alpha1"
@@ -175,9 +176,10 @@ strategies:
nodeAffinityType:
- "requiredDuringSchedulingIgnoredDuringExecution"
```
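
The visible part of the example only shows the `nodeAffinityType` parameter; a fuller sketch, with the strategy name and `enabled` field assumed from the common policy format, would be:

```
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"
```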

### RemovePodsViolatingNodeTaints

This strategy makes sure that pods violating NoSchedule taints on nodes are removed. For example, suppose a pod "podA" has a toleration for the taint ``key=value:NoSchedule`` and is scheduled and running on the tainted node. If the node's taint is subsequently updated or removed, the taint is no longer satisfied by the pod's tolerations and the pod will be evicted. The policy file should look like this:

````
apiVersion: "descheduler/v1alpha1"
@@ -186,23 +188,25 @@ strategies:
"RemovePodsViolatingNodeTaints":
enabled: true
````

## Pod Evictions

When the descheduler decides to evict pods from a node, it employs the following general mechanism:

* [Critical pods](https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/) (with priorityClassName set to system-cluster-critical or system-node-critical) are never evicted.
* Pods (static or mirrored pods or standalone pods) not part of an RC, RS, Deployment or Job are
never evicted because these pods won't be recreated.
* Pods associated with DaemonSets are never evicted.
* Pods with local storage are never evicted.
* Best effort pods are evicted before burstable and guaranteed pods.
* All types of pods with the annotation `descheduler.alpha.kubernetes.io/evict` are evicted (see the example
below). This annotation is used to override checks which prevent eviction, and users can select which pod is evicted.
Users should know how and if the pod will be recreated.
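
For illustration, a pod opting in to eviction via that annotation might carry metadata like the following. The pod and container names are hypothetical, and the annotation value is shown only for illustration; per the text above, the annotation itself is what overrides the eviction checks.

```
apiVersion: v1
kind: Pod
metadata:
  name: sample-app            # hypothetical name
  annotations:
    descheduler.alpha.kubernetes.io/evict: "true"   # value shown for illustration
spec:
  containers:
  - name: app                 # hypothetical container
    image: nginx
```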

### Pod Disruption Budget (PDB)

Pods subject to a Pod Disruption Budget (PDB) are not evicted if descheduling would violate the PDB. The pods
are evicted by using the eviction subresource, which takes PDBs into account.
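
For reference, a minimal PodDisruptionBudget protecting an application might look like this; the names and labels are hypothetical, and on Kubernetes v1.21+ the API group is `policy/v1`:

```
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: sample-app-pdb        # hypothetical name
spec:
  minAvailable: 2             # keep at least 2 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: sample-app         # hypothetical label
```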

## Roadmap

@@ -216,12 +220,13 @@ This roadmap is not in any particular order.
* Consideration of the Kubernetes scheduler's predicates


## Compatibility Matrix

Descheduler | Supported Kubernetes version
------------|-----------------------------
v0.10       | v1.17
v0.4-v0.9   | v1.9+
v0.1-v0.3   | v1.7-v1.8

## Community, discussion, contribution, and support
