feat(leaderelection): impl leader election
Signed-off-by: Furkan <furkan.turkal@trendyol.com>
Signed-off-by: eminaktas <eminaktas34@gmail.com>
Co-authored-by: Emin <emin.aktas@trendyol.com>
Co-authored-by: Yasin <yasintaha.erol@trendyol.com>
3 people authored and eminaktas committed Mar 24, 2022
1 parent 54c50c5 commit b2b7e11
Showing 33 changed files with 1,882 additions and 79 deletions.
40 changes: 29 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -50,6 +50,8 @@ Table of Contents
- [Node Fit filtering](#node-fit-filtering)
- [Pod Evictions](#pod-evictions)
- [Pod Disruption Budget (PDB)](#pod-disruption-budget-pdb)
- [High Availability](#high-availability)
- [Configure HA Mode](#configure-ha-mode)
- [Metrics](#metrics)
- [Compatibility Matrix](#compatibility-matrix)
- [Getting Involved and Contributing](#getting-involved-and-contributing)
@@ -773,6 +775,23 @@ Setting `--v=4` or greater on the Descheduler will log all reasons why any pod i
Pods subject to a Pod Disruption Budget(PDB) are not evicted if descheduling violates its PDB. The pods
are evicted by using the eviction subresource to handle PDB.

## High Availability

In High Availability mode, Descheduler starts a [leader election](https://github.com/kubernetes/client-go/tree/master/tools/leaderelection) process in Kubernetes. You can activate HA mode
if you deploy Descheduler as a Deployment.

The Deployment starts with 1 replica by default. If you want to run more than 1 replica, you must
enable High Availability mode so that only one descheduler pod is active at a time.

### Configure HA Mode

The leader election process can be enabled by setting `--leader-elect` in the CLI. If you are using Helm, you can
pass the `--set=leaderElection.enabled=true` flag instead.

To get the best results from HA mode, some additional configuration might be required:
* Configure a [podAntiAffinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node) rule if you want to keep descheduler replicas from being scheduled onto the same node
* Set the replica count greater than 1

## Metrics

| name | type | description |
@@ -792,17 +811,16 @@ v0.18 should work with k8s v1.18, v1.17, and v1.16.
Starting with descheduler release v0.18 the minor version of descheduler matches the minor version of the k8s client
packages that it is compiled with.

Descheduler | Supported Kubernetes Version
-------------|-----------------------------
v0.22 | v1.22
v0.21 | v1.21
v0.20 | v1.20
v0.19 | v1.19
v0.18 | v1.18
v0.10 | v1.17
v0.4-v0.9 | v1.9+
v0.1-v0.3 | v1.7-v1.8

| Descheduler | Supported Kubernetes Version |
|-------------|------------------------------|
| v0.22 | v1.22 |
| v0.21 | v1.21 |
| v0.20 | v1.20 |
| v0.19 | v1.19 |
| v0.18 | v1.18 |
| v0.10 | v1.17 |
| v0.4-v0.9 | v1.9+ |
| v0.1-v0.3 | v1.7-v1.8 |

## Getting Involved and Contributing

4 changes: 3 additions & 1 deletion charts/descheduler/README.md
@@ -44,7 +44,7 @@ The command removes all the Kubernetes components associated with the chart and
The following table lists the configurable parameters of the _descheduler_ chart and their default values.

| Parameter | Description | Default |
| ------------------------------ | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------ |
|--------------------------------|-----------------------------------------------------------------------------------------------------------------------|--------------------------------------|
| `kind` | Use as CronJob or Deployment | `CronJob` |
| `image.repository` | Docker repository to use | `k8s.gcr.io/descheduler/descheduler` |
| `image.tag` | Docker tag to use | `v[chart appVersion]` |
@@ -58,6 +58,8 @@ The following table lists the configurable parameters of the _descheduler_ chart
| `successfulJobsHistoryLimit` | If set, configure `successfulJobsHistoryLimit` for the _descheduler_ job | `nil` |
| `failedJobsHistoryLimit` | If set, configure `failedJobsHistoryLimit` for the _descheduler_ job | `nil` |
| `deschedulingInterval` | If using kind:Deployment, sets time between consecutive descheduler executions. | `5m` |
| `replicas` | The replica count for Deployment | `1` |
| `leaderElection` | The options for high availability when running replicated components | _see values.yaml_ |
| `cmdOptions` | The options to pass to the _descheduler_ command | _see values.yaml_ |
| `deschedulerPolicy.strategies` | The _descheduler_ strategies to apply | _see values.yaml_ |
| `priorityClassName` | The name of the priority class to add to pods | `system-cluster-critical` |
8 changes: 7 additions & 1 deletion charts/descheduler/templates/NOTES.txt
@@ -1 +1,7 @@
Descheduler installed as a {{ .Values.kind }} .
Descheduler installed as a {{ .Values.kind }}.

{{- if eq .Values.kind "Deployment" }}
{{- if and (eq .Values.replicas 1.0) (not .Values.leaderElection.enabled) }}
WARNING: The workload kind is Deployment with a replica count of 1, and leaderElection is not enabled. Consider enabling leader election for HA mode.
{{- end}}
{{- end}}
34 changes: 34 additions & 0 deletions charts/descheduler/templates/_helpers.tpl
@@ -65,3 +65,37 @@ Create the name of the service account to use
{{ default "default" .Values.serviceAccount.name }}
{{- end -}}
{{- end -}}

{{/*
Leader Election
*/}}
{{- define "descheduler.leaderElection"}}
{{- if .Values.leaderElection -}}
- --leader-elect={{ default false .Values.leaderElection.enabled }}
{{- if .Values.leaderElection.leaseDuration }}
- --leader-elect-lease-duration
- {{ .Values.leaderElection.leaseDuration }}
{{- end }}
{{- if .Values.leaderElection.renewDeadline }}
- --leader-elect-renew-deadline
- {{ .Values.leaderElection.renewDeadline }}
{{- end }}
{{- if .Values.leaderElection.retryPeriod }}
- --leader-elect-retry-period
- {{ .Values.leaderElection.retryPeriod }}
{{- end }}
{{- if .Values.leaderElection.resourceLock }}
- --leader-elect-resource-lock
- {{ .Values.leaderElection.resourceLock }}
{{- end }}
{{- if .Values.leaderElection.resourceName }}
- --leader-elect-resource-name
- {{ .Values.leaderElection.resourceName }}
{{- end }}
{{- if .Values.leaderElection.resourceNamescape }}
- --leader-elect-resource-namespace
- {{ .Values.leaderElection.resourceNamescape }}
{{- end -}}
{{- end }}
{{- end }}
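
A boolean flag followed by a separate value is parsed differently from the `--flag=value` form, which matters when a template renders `--leader-elect` and its value as container arguments. The sketch below uses the standard library `flag` package as a stand-in; it assumes that pflag, which the descheduler uses, treats boolean flags in a comparable way. `parseLeaderElect` is a hypothetical helper.

```go
package main

import (
	"flag"
	"fmt"
)

// parseLeaderElect parses args with a single boolean flag and returns the
// flag value plus any leftover positional arguments.
func parseLeaderElect(args []string) (bool, []string, error) {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	leaderElect := fs.Bool("leader-elect", false, "enable leader election")
	if err := fs.Parse(args); err != nil {
		return false, nil, err
	}
	return *leaderElect, fs.Args(), nil
}

func main() {
	// Separate value: the bare flag is set to true and "false" is left over
	// as a positional argument.
	v, rest, _ := parseLeaderElect([]string{"--leader-elect", "false"})
	fmt.Println(v, rest)

	// Joined with "=": the value is applied to the flag itself.
	v, rest, _ = parseLeaderElect([]string{"--leader-elect=false"})
	fmt.Println(v, rest)
}
```

Non-boolean flags such as `--leader-elect-lease-duration` accept the value as a separate argument, so only the boolean flag needs the `=` form.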
9 changes: 9 additions & 0 deletions charts/descheduler/templates/clusterrole.yaml
@@ -24,6 +24,15 @@ rules:
- apiGroups: ["scheduling.k8s.io"]
resources: ["priorityclasses"]
verbs: ["get", "watch", "list"]
{{- if .Values.leaderElection.enabled }}
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
resourceNames: ["descheduler"]
verbs: ["get", "update", "patch", "delete"]
{{- end }}
{{- if .Values.podSecurityPolicy.create }}
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
8 changes: 8 additions & 0 deletions charts/descheduler/templates/deployment.yaml
@@ -6,7 +6,14 @@ metadata:
labels:
{{- include "descheduler.labels" . | nindent 4 }}
spec:
{{- if gt .Values.replicas 1.0}}
{{- if not .Values.leaderElection.enabled }}
{{- fail "You must set leaderElection to use more than 1 replica"}}
{{- end}}
replicas: {{ required "leaderElection required for running more than one replica" .Values.replicas }}
{{- else }}
replicas: 1
{{- end }}
selector:
matchLabels:
{{- include "descheduler.selectorLabels" . | nindent 6 }}
@@ -48,6 +55,7 @@ spec:
- {{ $value | quote }}
{{- end }}
{{- end }}
{{- include "descheduler.leaderElection" . | nindent 12 }}
ports:
- containerPort: 10258
protocol: TCP
27 changes: 26 additions & 1 deletion charts/descheduler/values.yaml
@@ -38,6 +38,23 @@ suspend: false
# Required when running as a Deployment
deschedulingInterval: 5m

# Specifies the replica count for Deployment
# Enable leaderElection if you want to use more than 1 replica
# Set an affinity.podAntiAffinity rule if you want to keep descheduler
# replicas from being scheduled onto the same node
replicas: 1

# Specifies the leader election options
# Required when running as a Deployment with more than 1 replica
leaderElection: {}
# enabled: true
# leaseDuration: 15s
# renewDeadline: 10s
# retryPeriod: 2s
# resourceLock: "leases"
# resourceName: "descheduler"
# resourceNamescape: "kube-system"

cmdOptions:
v: 3

@@ -86,7 +103,15 @@ affinity: {}
# values:
# - e2e-az1
# - e2e-az2

# podAntiAffinity:
# requiredDuringSchedulingIgnoredDuringExecution:
# - labelSelector:
# matchExpressions:
# - key: app.kubernetes.io/name
# operator: In
# values:
# - descheduler
# topologyKey: "kubernetes.io/hostname"
tolerations: []
# - key: 'management'
# operator: 'Equal'
20 changes: 17 additions & 3 deletions cmd/descheduler/app/options/options.go
@@ -19,13 +19,15 @@ package options

import (
"github.com/spf13/pflag"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
apiserveroptions "k8s.io/apiserver/pkg/server/options"
clientset "k8s.io/client-go/kubernetes"

componentbaseconfig "k8s.io/component-base/config"
componentbaseoptions "k8s.io/component-base/config/options"
"sigs.k8s.io/descheduler/pkg/apis/componentconfig"
"sigs.k8s.io/descheduler/pkg/apis/componentconfig/v1alpha1"
deschedulerscheme "sigs.k8s.io/descheduler/pkg/descheduler/scheme"
"time"
)

const (
@@ -58,7 +60,17 @@ func NewDeschedulerServer() (*DeschedulerServer, error) {
}

func newDefaultComponentConfig() (*componentconfig.DeschedulerConfiguration, error) {
versionedCfg := v1alpha1.DeschedulerConfiguration{}
versionedCfg := v1alpha1.DeschedulerConfiguration{
LeaderElection: componentbaseconfig.LeaderElectionConfiguration{
LeaderElect: false,
LeaseDuration: metav1.Duration{Duration: 15 * time.Second},
RenewDeadline: metav1.Duration{Duration: 10 * time.Second},
RetryPeriod: metav1.Duration{Duration: 2 * time.Second},
ResourceLock: "leases",
ResourceName: "descheduler",
ResourceNamespace: "kube-system",
},
}
deschedulerscheme.Scheme.Default(&versionedCfg)
cfg := componentconfig.DeschedulerConfiguration{}
if err := deschedulerscheme.Scheme.Convert(&versionedCfg, &cfg, nil); err != nil {
@@ -76,5 +88,7 @@ func (rs *DeschedulerServer) AddFlags(fs *pflag.FlagSet) {
fs.BoolVar(&rs.DryRun, "dry-run", rs.DryRun, "execute descheduler in dry run mode.")
fs.BoolVar(&rs.DisableMetrics, "disable-metrics", rs.DisableMetrics, "Disables metrics. The metrics are by default served through https://localhost:10258/metrics. Secure address, resp. port can be changed through --bind-address, resp. --secure-port flags.")

componentbaseoptions.BindLeaderElectionFlags(&rs.LeaderElection, fs)

rs.SecureServing.AddFlags(fs)
}