ebs-csi-controller blocks updating EKS managed node group #758
Comments
Experienced the same and this seems to be done on purpose. "tolerateAllTaints" was added to the helm chart with a default value of true. There is some discussion in #594. I am installing with Kustomize, pulling in github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-0.9, so I will patch it after the fact, but will follow this issue to understand why the default of tolerating everything is the right behavior.
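For anyone patching the same way, a minimal sketch of that kind of after-the-fact Kustomize patch (the file names and the exact replacement tolerations here are assumptions, not what the commenter used):

```yaml
# kustomization.yaml (illustrative sketch; paths and ref are assumptions)
resources:
  - github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-0.9
patchesStrategicMerge:
  - ebs-csi-controller-tolerations.yaml
```

```yaml
# ebs-csi-controller-tolerations.yaml (illustrative sketch)
# Strategic-merge patch: the tolerations list has no merge key, so this
# replaces the upstream tolerate-everything list with a narrower one.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ebs-csi-controller
  namespace: kube-system
spec:
  template:
    spec:
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
```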
Let's tone it down to:
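For illustration, control-plane-style tolerations along these lines (the exact keys, effects, and timeouts here are assumptions, not the original snippet):

```yaml
tolerations:
  # Allow scheduling onto control-plane / critical-addon nodes...
  - key: CriticalAddonsOnly
    operator: Exists
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  # ...but only ride out node problems briefly instead of tolerating everything.
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```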
Thoughts? These are the tolerations I see from a default kops install for kube control plane components and they seem reasonable to me. The csi controller is basically taking/replacing some of kube-controller-manager's responsibility so it makes sense for it to have the same uptime guarantees. Clearly tolerating all taints is excessive.
/kind bug
What happened?
While updating a managed node group by changing the version of its launch template, I observed that if one of the nodes from the older node group is running the ebs-csi-controller pod, the update fails with PodEvictionFailure after it hits the maximum number of retries to evict the ebs-csi-controller pod.

Looking into the issue, it appeared that kube-scheduler kept rescheduling the ebs-csi-controller pod onto the node right after the pod was evicted. Looking at the ebs-csi-controller's manifest (ebs-csi-controller deployment: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/deploy/kubernetes/base/controller.yaml), I could see that the tolerations set on the ebs-csi-controller do not prevent it from being rescheduled even though the node has been tainted with eks.amazonaws.com/nodegroup=unschedulable:NoSchedule. According to the Kubernetes documentation, "An empty key with operator Exists matches all keys, values and effects which means this will tolerate everything."
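In manifest terms, that tolerate-everything setting looks like the following (a minimal sketch of the toleration the quoted sentence describes):

```yaml
tolerations:
  # An empty key with operator Exists matches all keys, values, and effects,
  # so the pod keeps getting scheduled back onto the tainted node.
  - operator: Exists
```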
What you expected to happen?
I expected the ebs-csi-controller pod to be evicted like the other pods, so that the managed node group update would complete successfully.
To resolve this, I had to remove the tolerations from the ebs-csi-controller, or modify the tolerations so that they apply only to specific effects, as sketched below.
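For example, a narrowed toleration could look like this (an illustrative sketch; the exact effect and timeout are assumptions, not the precise patch that was applied):

```yaml
tolerations:
  # Only tolerate NoExecute taints, and only briefly, so the pod does not
  # tolerate the NoSchedule taint applied while the node group is updated.
  - operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```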
How to reproduce it (as minimally and precisely as possible)?

Update a managed node group (e.g. by bumping the launch template version) while the ebs-csi-controller pod is running on one of the nodes being replaced; the update fails with PodEvictionFailure.

Anything else we need to know?:
Environment
Kubernetes version (use kubectl version): v1.18