
ebs-csi-controller blocks updating EKS managed node group #758

Closed
imbohyun1 opened this issue Feb 22, 2021 · 2 comments · Fixed by #856
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


imbohyun1 commented Feb 22, 2021

/kind bug

What happened?
While updating a managed node group by changing the version of its launch template, I observed that if one of the nodes from the older node group is running ebs-csi-controller, the update fails with PodEvictionFailure after hitting the max retries while trying to evict the ebs-csi-controller pod.

"Errors": [
  {
    "ErrorCode": "PodEvictionFailure",
    "ErrorMessage": "Reached max retries while trying to evict pods from nodes in node group ng-upgrade",
    "ResourceIds": [
      "ip-192-168-48-49.ap-northeast-2.compute.internal"
    ]
  }
]

Looking into the issue, it appeared that kube-scheduler kept rescheduling the ebs-csi-controller pod onto the node right after the pod was evicted. Looking at the ebs-csi-controller manifest (ebs-csi-controller deployment: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/deploy/kubernetes/base/controller.yaml), I could see that the tolerations setting for the ebs-csi-controller does not prevent the pod from being rescheduled even though the node has been tainted with eks.amazonaws.com/nodegroup=unschedulable:NoSchedule.

  tolerations:
    - operator: Exists

According to the Kubernetes documentation, "An empty key with operator Exists matches all keys, values and effects, which means this will tolerate everything."
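
For context, here is roughly what that taint looks like in the old node's spec during the update. This is reconstructed from the taint string above, so treat it as illustrative:

  spec:
    taints:
      - key: eks.amazonaws.com/nodegroup
        value: unschedulable
        effect: NoSchedule

A toleration with only operator: Exists matches this taint (and any other), which is why the scheduler keeps placing the controller pod back on the tainted node.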

What you expected to happen?
I expected the ebs-csi-controller pod to be evicted like the other pods so that the managed node group update would complete successfully.
To resolve this, I had to either remove the tolerations from the ebs-csi-controller deployment or restrict them to specific effects, as below.

  tolerations:
    - operator: Exists
      effect: NoExecute
      tolerationSeconds: 300

How to reproduce it (as minimally and precisely as possible)?

  1. Create a node group with a launch template (see the sketch after these steps).
  2. Deploy the EBS CSI Driver on the cluster by following the official documentation.
  3. Update the managed node group with a new version of the launch template.
  4. The node running ebs-csi-controller will not be drained, and the update will fail with PodEvictionFailure.
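
For step 1, a rough eksctl-style sketch of a managed node group backed by a launch template; the cluster name and launch template ID are placeholders rather than values from my setup:

  apiVersion: eksctl.io/v1alpha5
  kind: ClusterConfig
  metadata:
    name: my-cluster
    region: ap-northeast-2
  managedNodeGroups:
    - name: ng-upgrade
      launchTemplate:
        id: lt-0123456789abcdef0
        version: "1"

Bumping the launch template version referenced here and re-running the node group update (step 3) is enough to hit the PodEvictionFailure above.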

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): v1.18
  • Driver version: v0.9.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 22, 2021

andrewgeller commented Feb 26, 2021

I experienced the same thing, and it seems to be intentional: "tolerateAllTaints" was added to the Helm chart with a default value of true. There is some discussion in #594.

I am installing with Kustomize, pulling in github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-0.9, so I will patch it after the fact, but I will follow this issue to understand why tolerating everything is the right default behavior.
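
For reference, a rough sketch of the kind of post-install patch I have in mind; the file names and the kube-system namespace are my assumptions rather than something from the upstream docs:

  # kustomization.yaml
  resources:
    - github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-0.9
  patchesStrategicMerge:
    - controller-tolerations.yaml

  # controller-tolerations.yaml
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: ebs-csi-controller
    namespace: kube-system
  spec:
    template:
      spec:
        # Replaces the catch-all toleration so the pod no longer tolerates
        # the NoSchedule taint EKS applies during node group updates.
        tolerations:
          - operator: Exists
            effect: NoExecute
            tolerationSeconds: 300

Since tolerations has no merge key, the strategic merge patch should replace the list as a whole, dropping the blanket toleration without forking the upstream manifests.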

wongma7 (Contributor) commented Feb 26, 2021

Let's tone it down to:

  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoExecute
    operator: Exists

Thoughts?

These are the tolerations I see for the kube control plane components in a default kops install, and they seem reasonable to me. The CSI controller is basically taking over some of kube-controller-manager's responsibilities, so it makes sense for it to have the same uptime guarantees. Clearly, tolerating all taints is excessive.

  • CriticalAddonsOnly:
    Since the CSI controller is as critical as kube-controller-manager, this makes sense to me; it is doing cluster-wide volume operations.

  • NoExecute:
    This one I am not too sure about. You could argue that, when draining nodes, it is important that the controller stay up to detach volumes instead of being instantly evicted. We can set tolerationSeconds here so that the controller pod will at least eventually be evicted (see the sketch below).
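
For concreteness, a sketch of what that could look like with a bounded tolerationSeconds; the 300-second value is just an example, not something settled here:

  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
    - effect: NoExecute
      operator: Exists
      tolerationSeconds: 300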
