
Early abort/backoff support for Gardener nodegroups a.k.a machinedeployments #154

Closed · Tracked by #724 · Fixed by #253
himanshu-kun opened this issue Sep 16, 2022 · 6 comments
Labels: area/auto-scaling · area/high-availability · area/performance · kind/enhancement · needs/planning · priority/1 · status/closed
Milestone: 2023-Q3

Comments

himanshu-kun commented Sep 16, 2022

What would you like to be added:

Usage of ErrorClass is needed in the MCM implementation of the NodeGroup.Nodes() interface method. The currently supported error classes are OutOfResourcesErrorClass and OtherErrorClass.
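For reference, here is a minimal sketch of the relevant upstream cluster-autoscaler types (paraphrased from cloudprovider/cloud_provider.go in kubernetes/autoscaler; exact constant values and field sets may differ between CA versions):

```go
// Paraphrased sketch of the CA cloudprovider error-classification types;
// consult the upstream repository for the authoritative definitions.
package cloudprovider

// InstanceErrorClass categorizes why an instance failed to come up.
type InstanceErrorClass int

const (
	// OutOfResourcesErrorClass signals capacity/quota exhaustion on the provider.
	OutOfResourcesErrorClass InstanceErrorClass = 1
	// OtherErrorClass covers all remaining provisioning failures.
	OtherErrorClass InstanceErrorClass = 99
)

// InstanceErrorInfo is attached to an instance's status; when set, CA core
// can back off the owning node group without waiting for
// max-node-provision-time to elapse.
type InstanceErrorInfo struct {
	ErrorClass   InstanceErrorClass
	ErrorCode    string
	ErrorMessage string
}
```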

Why is this needed:
Currently our CA backs off from a node group only when max-node-provision-time (20min) has elapsed since the last scale-up request for that node group. This has drawbacks in cases:

  • where multiple pods keep requesting scale-up, so the 20min timer keeps getting reset, but the node doesn't join because we have capacity issues (example);
  • where node groups consist of spot instances, we are out of resources for spot instances, and there are many such spot-instance groups (example).

In these kinds of situations, the autoscaler should ideally learn why the node join is delayed, and CA should back off quickly.
CA core provides ErrorClass for this: for any node (registered or unregistered) we can assign an ErrorClass (or leave it empty), and based on that, CA can back off quickly. See this comment for details. The comment is a little old; backoff is now supported for OtherErrorClass as well.

This also needs changes in the mcm-providers, so that we can detect OutOfQuota issues on the provider side and record them somewhere in the machine object, allowing CA's NodeGroup.Nodes() to assign the proper error class. See the sketch below.
We would also need to distinguish nodes that do not join due to networking issues into a separate error class, e.g. OtherErrorClass.
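A rough sketch of what the provider-side detection could look like, assuming MCM's machinecodes packages are used to classify the failure. The quota-matching strings and the classifyCreateError helper below are hypothetical placeholders; a real provider would inspect its SDK's typed errors:

```go
package provider // hypothetical mcm-provider package

import (
	"strings"

	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/status"
)

// classifyCreateError (hypothetical helper) maps a VM-creation failure to a
// structured machine error code, so the failure reason lands on the machine
// object instead of staying buried in a provider-specific message.
func classifyCreateError(err error) error {
	// Placeholder string match; real providers should check the SDK's typed errors.
	if strings.Contains(err.Error(), "QuotaExceeded") ||
		strings.Contains(err.Error(), "InsufficientInstanceCapacity") {
		// ResourceExhausted marks an out-of-capacity condition that the CA
		// NodeGroup implementation can translate to OutOfResourcesErrorClass.
		return status.Error(codes.ResourceExhausted, err.Error())
	}
	return status.Error(codes.Internal, err.Error())
}
```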

Live/Canary issues demanding this solution:

Canary # 3510
Similar upstream issues: kubernetes#3490, kubernetes#4900

himanshu-kun added the kind/enhancement label Sep 16, 2022
gardener-robot commented

@himanshu-kun You have mentioned internal references in the public. Please check.

himanshu-kun added the area/performance, kind/discussion, area/high-availability, important-soon, and area/auto-scaling labels Sep 16, 2022

sunir1 commented Oct 12, 2022

@himanshu-kun As discussed, in HANA-Cloud we rely on the quick-fallback capability to address capacity issues with cloud providers by having multiple WorkerGroups: we start with the least expensive one and fall back to more expensive ones only if needed due to limited capacity.

In addition, a significant flux of pods is a very relevant scenario, which can happen during cluster updates, when draining nodes, etc.

Encountering this issue may cause significant downtime for pods that need a new node scale-out.

Hence we request raising the priority of a solution for this issue.

himanshu-kun (Author) commented

Currently we are occupied with other higher-priority issues across MCM and CA.
This particular issue also requires some design decisions, so we will only be able to pick it up once we have finished the current urgent tasks and have time to plan.
We will update this issue once it has been planned.


elankath commented Mar 1, 2023

Grooming Decisions

himanshu-kun added the priority/1 and needs/planning labels and removed the kind/discussion and important-soon labels Mar 1, 2023
himanshu-kun pinned this issue May 12, 2023
himanshu-kun (Author) commented

#58 implemented the currently available slow backoff mechanism.

himanshu-kun added this to the 2023-Q3 milestone Jul 13, 2023
himanshu-kun (Author) commented

Grooming discussion results:

  • It is essential to first improve the error code handling in MCM, so that at the CA level only the error code is relevant and no provider-specific error-message regular expressions have to be maintained. An issue for this is already open: Add support for error codes machine-controller-manager#590. (A sketch follows this list.)

  • The current re-pushing to the queue for reconciliation will be kept when a machine enters a CrashLoopBackoff state on OutOfCapacity. This can increase the time until the first-tried node group is scaled back down (an example case is where capacity is back in the next reconcile and the wait for VM creation starts), but it won't delay CA backoff: CA simply scales down the first-tried machineDeployment, marks it for backoff, and then tries the next machineDeployment; it does not wait for the actual removal of the machine object.
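As a sketch of the first point: once MCM reports structured error codes, the CA side could reduce to a simple code-to-class mapping. The function name and the code string matched here are illustrative assumptions, not the agreed design:

```go
package mcm // hypothetical CA cloudprovider package for MCM

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// errorClassForCode (hypothetical helper) maps a structured MCM error code to
// a CA error class; with codes coming from MCM, no provider-specific
// error-message regular expressions are needed on the CA side.
func errorClassForCode(code string) cloudprovider.InstanceErrorClass {
	switch code {
	case "ResourceExhausted": // e.g. quota or capacity exhausted at the provider
		return cloudprovider.OutOfResourcesErrorClass
	default:
		return cloudprovider.OtherErrorClass
	}
}
```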
