Mitigate unavailability of instance type in AZ #92
Comments
Post-grooming discussion
This is already provided by Cluster Autoscaler, which backs off from the failing node group and tries another node group. (Currently, in some cases, issues have been seen with the backoff; see #154.)
This needs more discussion internally and doesn't seem to be a required solution. We'll also have to look into whether the autoscaler supports node groups with fallback instance types.
@anbrsap if you want to achieve the fallback to another instance type in the same zone, then with the current logic in place you can define multiple node groups in the same zone with different machine types. You can then use the priority expander to define the order of fallback.
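For reference, a minimal sketch of such a priority expander configuration, assuming the autoscaler runs with `--expander=priority` and that `worker-main` / `worker-fallback` are placeholders for the actual node group (worker pool) names:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system      # same namespace as the cluster-autoscaler deployment
data:
  priorities: |-
    # Higher value = higher priority; each entry is a list of regexes
    # matched against node group names. The node group names below are
    # hypothetical examples, not actual Gardener-generated names.
    50:
      - .*worker-main.*       # preferred instance type
    10:
      - .*worker-fallback.*   # fallback instance type in the same zone
```

With this in place, the autoscaler prefers the higher-priority node group and only expands the lower-priority one when the preferred group is not a viable candidate (e.g. after it has backed off).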
/close as the second solution could be implemented using the already existing first solution with the help of the priority expander.
@himanshu-kun Please explain how the priority expander would solve this! To my understanding, expanders are codified strategies for deciding which node group out of the possible candidates to expand. They cannot do any fallback, i.e. choose another node group in case expanding one node group failed (e.g. due to instance type unavailability).
Every expander type (
What would you like to be added:
The cluster autoscaler should mitigate the unavailability of instances of certain types in an availability zone (AZ).
Currently Gardener just reports an error in such a case and retries:
I propose to add the following two fallback mechanisms applied in the given order:
Why is this needed:
The current mitigation of retrying, i.e. waiting for the cloud provider to make instances available again, can lead to very long delays in node provisioning, with the consequence that pods remain unscheduled for that time. Depending on the type of workload, this could lead to SLA violations.
Mitigating the problem by over-provisioning cluster nodes, for instance via low-priority spacer pods, is possible but increases costs. In particular for low-utilization clusters the extra costs can be significant, especially when expensive instance types such as AWS metal instances are required.
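For illustration, a hedged sketch of the spacer-pod approach mentioned above; the names, priority value, replica count, and resource requests are placeholders to be tuned per cluster, not part of this proposal:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning        # hypothetical name
value: -10                      # below the default priority 0, so real workloads preempt spacer pods
globalDefault: false
description: "Placeholder pods that reserve spare node capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-spacer         # hypothetical name
spec:
  replicas: 2                   # tune to the amount of headroom required
  selector:
    matchLabels:
      app: capacity-spacer
  template:
    metadata:
      labels:
        app: capacity-spacer
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"            # sized so evicting one spacer frees room for a real pod
            memory: 2Gi
```

When a real pod becomes unschedulable, the scheduler evicts a spacer pod to make room immediately, and the autoscaler then provisions a replacement node in the background; the cost of the idle headroom is the trade-off described above.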