
Early abort/backoff support for Gardener nodegroups a.k.a machinedeployments #154

Closed · Tracked by #724 · Fixed by #253
himanshu-kun opened this issue Sep 16, 2022 · 6 comments
Labels: area/auto-scaling · area/high-availability · area/performance · kind/enhancement · needs/planning · priority/1 · status/closed
Milestone: 2023-Q3

Comments

himanshu-kun commented Sep 16, 2022

What would you like to be added:

Usage of ErrorClass is needed in the MCM implementation of the NodeGroup.Nodes() interface method. The currently supported error classes are OutOfResourcesErrorClass and OtherErrorClass.
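For reference, here is a minimal sketch of the relevant upstream cluster-autoscaler types (paraphrased from cloudprovider/cloud_provider.go in kubernetes/autoscaler; exact constant values and field sets may differ between CA versions):

```go
// Paraphrased sketch of the CA cloudprovider error-classification types;
// consult the upstream repository for the authoritative definitions.
package cloudprovider

// InstanceErrorClass categorizes why an instance failed to come up.
type InstanceErrorClass int

const (
	// OutOfResourcesErrorClass signals capacity/quota exhaustion on the provider.
	OutOfResourcesErrorClass InstanceErrorClass = 1
	// OtherErrorClass covers all remaining provisioning failures.
	OtherErrorClass InstanceErrorClass = 99
)

// InstanceErrorInfo is attached to an instance's status; when set, CA core
// can back off the owning node group without waiting for
// max-node-provision-time to elapse.
type InstanceErrorInfo struct {
	ErrorClass   InstanceErrorClass
	ErrorCode    string
	ErrorMessage string
}
```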

Why is this needed:
Currently our CA backs off from a node group only when max-node-provision-time (20min) has elapsed since the last scale-up request for that node group. This has drawbacks in cases:

  • where multiple pods keep requesting scale-up, so the 20min timer keeps getting reset, but the node doesn't join because we have capacity issues (example);
  • where node groups consist of spot instances, we are out of resources for spot instances, and there are many such spot-instance groups (example).

In these kinds of situations, the autoscaler should ideally learn why the node join is delayed, and CA should back off quickly.
CA core provides ErrorClass for this: for any node (registered or unregistered) we can assign an ErrorClass (or leave it empty), and based on that, CA can back off quickly. See this comment for details. The comment is a little old; backoff is now supported for OtherErrorClass as well.

This also needs changes in the mcm-providers, so that we can detect OutOfQuota issues on the provider side and record them somewhere in the machine object, allowing CA's NodeGroup.Nodes() to assign the proper error class. See the sketch below.
We would also need to distinguish nodes that do not join due to networking issues into a separate error class, e.g. OtherErrorClass.
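A rough sketch of what the provider-side detection could look like, assuming MCM's machinecodes packages are used to classify the failure. The quota-matching strings and the classifyCreateError helper below are hypothetical placeholders; a real provider would inspect its SDK's typed errors:

```go
package provider // hypothetical mcm-provider package

import (
	"strings"

	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/status"
)

// classifyCreateError (hypothetical helper) maps a VM-creation failure to a
// structured machine error code, so the failure reason lands on the machine
// object instead of staying buried in a provider-specific message.
func classifyCreateError(err error) error {
	// Placeholder string match; real providers should check the SDK's typed errors.
	if strings.Contains(err.Error(), "QuotaExceeded") ||
		strings.Contains(err.Error(), "InsufficientInstanceCapacity") {
		// ResourceExhausted marks an out-of-capacity condition that the CA
		// NodeGroup implementation can translate to OutOfResourcesErrorClass.
		return status.Error(codes.ResourceExhausted, err.Error())
	}
	return status.Error(codes.Internal, err.Error())
}
```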

Live/Canary issues demanding this solution:

Canary # 3510
Similar upstream issues: kubernetes#3490, kubernetes#4900

himanshu-kun added the kind/enhancement label Sep 16, 2022
gardener-robot commented

@himanshu-kun You have mentioned internal references in the public. Please check.

himanshu-kun added the area/performance, kind/discussion, area/high-availability, important-soon, and area/auto-scaling labels Sep 16, 2022

sunir1 commented Oct 12, 2022

@himanshu-kun As discussed, in HANA-Cloud we rely on the quick-fallback capability to address capacity issues with cloud providers by having multiple WorkerGroups: we start with the least expensive one and fall back to more expensive ones only if needed due to limited capacity.

In addition, a significant flux of pods is a very relevant scenario, which can happen during cluster updates, when draining nodes, etc.

Encountering this issue may cause significant downtime for pods that need a new node scale-out.

Hence we request raising the priority of a solution for this issue.

himanshu-kun (Author) commented

Currently we are occupied with other higher-priority issues across MCM and CA.
This particular issue also requires some design decisions, so we will only be able to pick it up once we have finished the current urgent tasks and have time to plan.
We will update this issue once it has been planned.


elankath commented Mar 1, 2023

Grooming Decisions

himanshu-kun added the priority/1 and needs/planning labels and removed the kind/discussion and important-soon labels Mar 1, 2023
himanshu-kun pinned this issue May 12, 2023
himanshu-kun (Author) commented

#58 implemented the currently available slow backoff mechanism.

himanshu-kun added this to the 2023-Q3 milestone Jul 13, 2023
himanshu-kun (Author) commented

Grooming discussion results:

  • It is essential to first improve the error code handling in MCM, so that at the CA level only the error code is relevant and no provider-specific error-message regular expressions have to be maintained. An issue for this is already open: Add support for error codes machine-controller-manager#590. (A sketch follows this list.)

  • The current re-pushing to the queue for reconciliation will be kept when a machine enters a CrashLoopBackoff state on OutOfCapacity. This can increase the time until the first-tried node group is scaled back down (an example case is where capacity is back in the next reconcile and the wait for VM creation starts), but it won't delay CA backoff: CA simply scales down the first-tried machineDeployment, marks it for backoff, and then tries the next machineDeployment; it does not wait for the actual removal of the machine object.
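As a sketch of the first point: once MCM reports structured error codes, the CA side could reduce to a simple code-to-class mapping. The function name and the code string matched here are illustrative assumptions, not the agreed design:

```go
package mcm // hypothetical CA cloudprovider package for MCM

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// errorClassForCode (hypothetical helper) maps a structured MCM error code to
// a CA error class; with codes coming from MCM, no provider-specific
// error-message regular expressions are needed on the CA side.
func errorClassForCode(code string) cloudprovider.InstanceErrorClass {
	switch code {
	case "ResourceExhausted": // e.g. quota or capacity exhausted at the provider
		return cloudprovider.OutOfResourcesErrorClass
	default:
		return cloudprovider.OtherErrorClass
	}
}
```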
