tolerationSeconds for tolerations for controller #588
Labels: good first issue, help wanted, lifecycle/frozen
Is your feature request related to a problem?/Why is this needed
At the moment, tolerations for the controller deployment (https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/aws-ebs-csi-driver/templates/controller.yaml) effectively allow it to run everywhere, which is fine. The problem is that with this toleration Kubernetes won't reschedule the controller if the node where it runs suddenly becomes unavailable.
See https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
(highlight is mine)
So, you run it with the default replica count of 2, then you get unlucky and both nodes where those replicas run die. Now you end up with no EBS controller whatsoever, and it won't recover without manual intervention.
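To make the failure mode concrete, here is a minimal sketch of a catch-all toleration. It assumes the chart uses a toleration of this shape, which is what "run everywhere" suggests; it is not copied from the chart.

```yaml
# Assumed shape of the controller toleration (illustrative, not taken from the chart).
tolerations:
  # No key and no effect: with operator Exists this matches every taint,
  # including the NoExecute taints node.kubernetes.io/not-ready and
  # node.kubernetes.io/unreachable that are placed on a failed node.
  - operator: Exists
    # tolerationSeconds is absent, so NoExecute taints are tolerated
    # indefinitely and the pod is never evicted from the dead node.
```

Without an explicit toleration for those two taints, Kubernetes would add one with tolerationSeconds=300 by default, so an ordinary pod is evicted after about five minutes. The catch-all entry already covers them, so that default does not apply here.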
/feature
Describe the solution you'd like in detail
I think a default of tolerationSeconds: 300 (configurable) should be added as well.
Describe alternatives you've considered
There are none; this is how Kubernetes works.
Additional context
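A minimal sketch of what the proposed tolerations could look like, under two assumptions: that the controller should keep tolerating scheduling taints everywhere, and that the API accepts tolerationSeconds only on tolerations with effect NoExecute (so the catch-all entry has to be split by effect). The layout is illustrative and the chart value that would make the 300 configurable is not shown.

```yaml
# Illustrative only: one possible way to bound eviction time
# while still allowing the controller to be scheduled anywhere.
tolerations:
  # Still tolerate every scheduling taint.
  - operator: Exists
    effect: NoSchedule
  - operator: Exists
    effect: PreferNoSchedule
  # Tolerate NoExecute taints (for example node.kubernetes.io/unreachable)
  # for at most 300 seconds; after that the pod is evicted and its
  # Deployment creates a replacement that can land on a healthy node.
  - operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```

With something like this, losing both nodes would leave the controller unavailable for roughly five minutes rather than until someone intervenes manually.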