
Mitigate unavailability of instance type in AZ #92

Closed
anbrsap opened this issue Aug 3, 2021 · 5 comments
Labels
  • kind/discussion: Discussion (engaging others in deciding about multiple options)
  • kind/enhancement: Enhancement, improvement, extension
  • lifecycle/icebox: Temporarily on hold (will not age; may have dependencies, lack priority, miss feedback, etc.)
  • status/closed: Issue is closed (either delivered or triaged)

Comments

@anbrsap

anbrsap commented Aug 3, 2021

What would you like to be added:

The cluster autoscaler should mitigate the unavailability of instances of certain types in an availability zone (AZ).

Currently, Gardener just reports an error in such a case and retries:

Action required

There is a problem with your secret secret1: The underlying infrastructure provider proclaimed that it does not have enough resources to fulfill your request at this point in time. You might want to wait or change your shoot configuration.

Worker extension (shoot--group1--cluster1/cluster1) reports failing health check: machine "shoot--group1--cluster1-worker1-z1-11111-11111" failed: Cloud provider message - machine codes error: code = [Internal] message = [InsufficientInstanceCapacity: We currently do not have sufficient m5zn.metal capacity in the Availability Zone you requested (eu-central-1a). Our system will be working on provisioning additional capacity. You can currently get m5zn.metal capacity by not specifying an Availability Zone in your request or choosing eu-central-1b.
status code: 500, request id: edb75363-7217-4f15-912b-5b11bcd89a85].

I propose to add the following two fallback mechanisms applied in the given order:

  • Try to create a new instance in another AZ if the respective worker group has been configured for more than one AZ.
  • Try to get an instance of a fallback instance type (which may be more expensive). Fallback instance types should be configurable for worker groups (see the configuration sketch after this list).
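
For illustration, the first mechanism maps to a worker pool that spans several zones, which the Gardener Shoot spec already supports; the fallbackMachineTypes field in the sketch below is purely hypothetical and only illustrates how the second mechanism could be configured:

```yaml
# Sketch of a Shoot worker pool (spec excerpt).
# `zones` with more than one entry enables the first mechanism today.
# `fallbackMachineTypes` is a HYPOTHETICAL field illustrating the second
# mechanism; it is not part of the current Gardener API.
spec:
  provider:
    workers:
      - name: worker1
        machine:
          type: m5zn.metal
        minimum: 1
        maximum: 5
        zones:
          - eu-central-1a
          - eu-central-1b          # second AZ as fallback for capacity issues
        # hypothetical, for illustration only:
        fallbackMachineTypes:
          - m5zn.12xlarge
          - m5.metal
```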

Why is this needed:

The current mitigation of retrying, i.e. waiting for the cloud provider to make instances available again, can lead to very long delays in node provisioning, with the consequence that pods remain unscheduled for that time. Depending on the type of workload, this could lead to SLA violations.

Mitigating the problem by over-provisioning cluster nodes, for instance via low-priority spacer pods, is possible but increases costs. In particular, for clusters with low utilization the extra costs can be significant, especially when expensive instance types like AWS metal instances are required.
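
For reference, that over-provisioning pattern is typically implemented with a negative-priority PriorityClass and a deployment of pause containers that are evicted as soon as real workloads need the capacity. A minimal sketch, with placeholder names and resource figures:

```yaml
# Minimal sketch of over-provisioning with low-priority "spacer" pods.
# All names, replica counts, and resource requests are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                        # lower than any real workload priority
globalDefault: false
description: "Spacer pods that reserve capacity and are evicted first."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-spacer
spec:
  replicas: 2                     # amount of headroom to keep warm
  selector:
    matchLabels:
      app: capacity-spacer
  template:
    metadata:
      labels:
        app: capacity-spacer
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```

The headroom (replica count and requests) has to be sized to the expected burst, which is exactly the cost trade-off described above.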

@anbrsap anbrsap added the kind/enhancement Enhancement, improvement, extension label Aug 3, 2021
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jan 31, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 30, 2022
@himanshu-kun

Post-grooming discussion

Try to create a new instance in another AZ if the respective worker group has been configured for more than one AZ.

This is already provided by Cluster Autoscaler, which backs off from the failing node group and tries another node group. (Currently, in some cases, issues have been seen with the backoff; see #154.)

Try to get an instance of a fallback instance type (which may be more expensive). Fallback instance types should be configurable for worker groups.

This needs more discussion internally and doesn't seem to be a required solution. We'll also have to check whether the autoscaler supports node groups with fallback instance types.

@himanshu-kun himanshu-kun added kind/discussion Discussion (engaging others in deciding about multiple options) lifecycle/icebox Temporarily on hold (will not age; may have dependencies, lack priority, miss feedback, etc.) and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 28, 2023
@himanshu-kun

@anbrsap If you want to achieve the fallback to another instance type in the same zone, then with the current logic in place you can define multiple node groups in the same zone with different machine types. Then you use the priority expander to define the order of fallback.
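
A sketch of that setup, assuming two worker pools pinned to the same zone plus the upstream Cluster Autoscaler priority-expander ConfigMap; all pool names, machine types, and regexes are illustrative:

```yaml
# Two worker pools in the same zone with different machine types (Shoot spec excerpt).
spec:
  provider:
    workers:
      - name: worker-primary
        machine:
          type: m5zn.metal
        minimum: 0
        maximum: 5
        zones: [eu-central-1a]
      - name: worker-fallback
        machine:
          type: m5.metal          # fallback machine type in the same zone
        minimum: 0
        maximum: 5
        zones: [eu-central-1a]
---
# Upstream priority-expander configuration: higher number = higher priority,
# regexes match node group names. The ConfigMap must live in the namespace
# the autoscaler watches, which in Gardener-managed clusters may differ from
# the shoot's kube-system.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    20:
      - .*worker-primary.*
    10:
      - .*worker-fallback.*
```

This relies on the autoscaler being configured to use the priority expander; once the primary node group backs off due to missing capacity, the next-highest priority group is expanded instead.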

@himanshu-kun

/close as the second solution could be implemented using the already existing first solution with the help of the priority expander.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Feb 28, 2023
@anbrsap

anbrsap commented Feb 28, 2023

@anbrsap If you want to achieve the fallback to another instance type in the same zone, then with the current logic in place you can define multiple node groups in the same zone with different machine types. Then you use the priority expander to define the order of fallback.

@himanshu-kun Please explain how the priority expander should solve this! To my understanding, expanders are codified strategies to decide which node group out of the possible candidates to expand. But they cannot do any fallback, i.e., choose another node group if expanding one node group fails (e.g. due to instance type unavailability).

@himanshu-kun

Every expander type (priority, least-waste, most-pods, and others) is designed to ignore backed-off node groups and then pick the next one from the remaining node groups.
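
For completeness, the strategy in use is selected via the autoscaler's --expander startup flag; a hedged container-spec excerpt (only the flag is the relevant part, the image tag and surrounding fields are illustrative):

```yaml
# Excerpt of a Cluster Autoscaler container spec showing expander selection.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3
    command:
      - ./cluster-autoscaler
      - --expander=priority        # e.g. consult the priority ConfigMap sketched above
```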
