Early abort/backoff support for Gardener nodegroups a.k.a machinedeployments #154
Comments
@himanshu-kun You have mentioned internal references in public. Please check.
@himanshu-kun As discussed, in HANA-Cloud we rely on the quick-fallback capability to address capacity issues with cloud providers by having multiple WorkerGroups: we start with the least expensive one and, only if needed due to limited capacity, fall back to more expensive ones. In addition, a significant flux of pods is a very relevant scenario, which can happen during a cluster update and/or when draining nodes. Encountering this issue may cause significant downtime for pods needing a new node scale-out. Hence we request raising the priority of providing a solution for this issue.
Currently we are involved with other higher-priority issues across MCM and CA.
Grooming Decisions
#58 -> implemented the currently available slow backoff mechanism
Grooming discussion results:
What would you like to be added:
Usage of `ErrorClass` is needed in the MCM implementation of the `NodeGroup.Node()` interface method. Currently the error types supported are `OutOfResourcesErrorClass` and `OtherErrorClass`.
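For reference, below is a paraphrased sketch of the upstream CA `cloudprovider` types this refers to (field and constant names follow cluster-autoscaler's `cloud_provider.go`, but exact definitions and constant values may differ by CA version; this is not verbatim upstream code):

```go
// Paraphrased sketch of the CA cloudprovider error-classification types;
// not verbatim upstream code.
package sketch

// InstanceErrorClass categorizes why an instance failed to be created or to register.
type InstanceErrorClass int

const (
	// OutOfResourcesErrorClass: the provider ran out of capacity or quota.
	OutOfResourcesErrorClass InstanceErrorClass = iota + 1
	// OtherErrorClass: any other provisioning failure.
	OtherErrorClass
)

// InstanceErrorInfo, when attached to an instance reported by NodeGroup.Nodes(),
// tells CA why provisioning failed so it can back off without waiting for the
// max-node-provision-time timeout.
type InstanceErrorInfo struct {
	ErrorClass   InstanceErrorClass
	ErrorCode    string
	ErrorMessage string
}
```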
Why is this needed:
Currently our CA backs off on a node group only after `max-node-provision-time=20min` has elapsed since the last scale-up request for that node group. This has a drawback in cases where the node can never join, for example when the provider is out of capacity or quota. In these kinds of situations the autoscaler should ideally learn why the node is delayed in joining and back off quickly.
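To make the current behaviour concrete, here is a toy illustration of the timeout-only backoff described above (names are made up; this is not the real CA implementation):

```go
package sketch

import "time"

// shouldBackOffSlow illustrates the current behaviour: CA only gives up on a
// node group once max-node-provision-time has elapsed since its last
// scale-up request, regardless of why the node has not joined.
func shouldBackOffSlow(lastScaleUpRequest time.Time, maxNodeProvisionTime time.Duration) bool {
	return time.Since(lastScaleUpRequest) > maxNodeProvisionTime
}
```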
CA core provides `ErrorClass` for this, so that for any node (registered or unregistered) we can assign (or leave empty) an `ErrorClass`, and based on that the CA can back off quickly. See this comment for details. The comment is a little old; backoff is now supported for `OtherErrorClass` as well.
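As a rough, non-authoritative sketch of the requested change: the `machineView` type and its fields below are hypothetical stand-ins for whatever state the MCM provider can derive from the machine objects; only the `cloudprovider` types come from CA core.

```go
package sketch

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// machineView is a hypothetical, simplified view of a machine object's state;
// field names are illustrative only, not MCM API fields.
type machineView struct {
	ProviderID   string
	FailedToJoin bool
	OutOfQuota   bool   // hypothetical signal derived from the provider error
	FailureMsg   string // e.g. a last-operation description from the machine object
}

// nodesWithErrorInfo sketches how a NodeGroup.Nodes() implementation could
// attach ErrorInfo to failing instances so that CA backs off immediately
// instead of waiting for max-node-provision-time.
func nodesWithErrorInfo(machines []machineView) []cloudprovider.Instance {
	instances := make([]cloudprovider.Instance, 0, len(machines))
	for _, m := range machines {
		instance := cloudprovider.Instance{Id: m.ProviderID}
		if m.FailedToJoin {
			errorClass := cloudprovider.OtherErrorClass
			if m.OutOfQuota {
				errorClass = cloudprovider.OutOfResourcesErrorClass
			}
			instance.Status = &cloudprovider.InstanceStatus{
				State: cloudprovider.InstanceCreating,
				ErrorInfo: &cloudprovider.InstanceErrorInfo{
					ErrorClass:   errorClass,
					ErrorMessage: m.FailureMsg,
				},
			}
		}
		instances = append(instances, instance)
	}
	return instances
}
```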
This needs changes in the mcm-providers as well, so that we can detect `OutOfQuota` issues from the provider side and record them somewhere in the machine object, so that the CA's `NodeGroup.Node()` can assign the proper error class. We would also need to distinguish nodes that fail to join due to networking issues into a separate error class like `OtherErrorClass`.
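One purely illustrative way the mcm-provider side could map a failure description recorded on the machine object to an error class (the helper name and the keyword matching are made up for this sketch and are not an MCM or provider contract):

```go
package sketch

import (
	"strings"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// classifyMachineFailure is a hypothetical helper that maps a failure
// description recorded on the machine object to a CA error class.
func classifyMachineFailure(description string) cloudprovider.InstanceErrorClass {
	d := strings.ToLower(description)
	switch {
	case strings.Contains(d, "quota") || strings.Contains(d, "insufficient capacity"):
		// Provider-side out-of-quota / out-of-capacity -> OutOfResources, so CA
		// can back off this node group quickly and try another one.
		return cloudprovider.OutOfResourcesErrorClass
	default:
		// Networking problems, misconfiguration, etc. -> generic error class.
		return cloudprovider.OtherErrorClass
	}
}
```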
Live/Canary issues demanding this solution:
Canary # 3510
Similar upstream issues -> kubernetes#3490, kubernetes#4900