Reschedules CrashLoopBackOff
Pod
to fix permanent crashes caused by stale init-container/sidecar/configmap
- Listens to Pod update events and does a Pod list
- Looks for containers in CrashLoopBackOff with
restartCount
> 5 (failureThreshold
config) - Ignores Pods with annotation
kube-remediator/CrashLoopBackOffRemediator: "false"
- Can work in a single namespace, default is all namespaces
""
(namespace
config) - Ignores Pods without
ownerReferences
(Avoid deleting something which does not come back)
Deletes Pods
with label kube-remediator/OldPodDeleter=true
older than 24h
Reschedules Failed
Pods
by deleting them, since they are not automatically cleaned up.
- Listens to Pod update events and does a Pod list
- Finds pods in Failed status with reason
OutOfCpu
,OutofMemory
. - Ignores Pods without
ownerReferences
(Avoid deleting something which does not come back) - Ignores Pods for Jobs because they can be automatically cleaned up.
- Deletes the pods in failed status after 5 mins to have time to debug
Deletes Pods
that in Completed
status for more than 24h.
Deletes PersistentVolumeClaim
left behind by deleted StatefulSet
, that are not automatically cleaned up otherwise
- Waits for 7 days(configurable) before deleting
- Ignores if
PersistentVolume
haspersistentVolumeReclaimPolicy
set toRetain
You can define a remediator policy to control the following options:
disabled_remediators
: All remediators are enabled by default unless listed in this option. example:This can also be set via an environment variable:{ "disabled_remediators": ["FailedPodRescheduler"] }
DISABLED_REMEDIATORS=OldPodDeleter,FailedPodRescheduler
kubectl apply -f kubernetes/rbac.yaml
kubectl apply -f kubernetes/app-server.yml
Configuration options:
- Deploy provided image to use defaults under
config/*
- Make a new image
FROM
the provided image and add/removeconfig/*
- Overwrite
config/*
with a mountedConfigMap
Run in local kubernetes with docker-for-mac
rake server
Run against local kubernetes cluster with go:
unset GOPATH
go mod vendor # install into local directory instead of global path
make dev # run on cluster from $KUBECONFIG (defaults to ~/.kube/config)
- Run unit tests:
make test
- Run a single suite:
go test -run TestSuiteFailedPodRescheduler github.com/aksgithub/kube_remediator/pkg/remediator
- Run a single test: comment out all other test in the suite and run the suite. TODO: improve.
# CrashLoopBackOffRemediator: pod is rescheduled after restarting 5 times ?
kubectl apply -f examples/crashloop_pod.yml
# OldPodDeleter: pod is deleted when it gets 24h old ? (best change the 24h in the code to 1min)
kubectl apply -f examples/old_pod.yml
Note: failed expectation in one test can lead to other tests failing. Only run one test when debugging.