Modify GetMachineDeploymentNodes to return unregistered nodes #58
This is needed so the autoscaler can track nodes that fail to provision. With this change the autoscaler will correctly try other node groups if a node group is failing, for example due to InsufficientInstanceCapacity.
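To illustrate the idea of returning unregistered nodes, here is a minimal sketch, not the code from this PR. The `Machine` type, the `nodeIDsForDeployment` helper, and the `requested://` placeholder prefix are all hypothetical names chosen for the example; the only point is that machines without a registered node still appear in the result, so the caller can notice that they never joined.

```go
package main

import "fmt"

// Machine is a hypothetical, simplified stand-in for a machine object.
type Machine struct {
	Name       string
	NodeName   string // empty until the backing node registers with the cluster
	Deployment string
}

// nodeIDsForDeployment returns one ID per machine in the given deployment.
// Registered machines are reported by node name. Unregistered machines are
// reported with a placeholder ID (hypothetical "requested://" prefix) instead
// of being silently dropped.
func nodeIDsForDeployment(machines []Machine, deployment string) []string {
	ids := []string{}
	for _, m := range machines {
		if m.Deployment != deployment {
			continue
		}
		if m.NodeName != "" {
			ids = append(ids, m.NodeName)
		} else {
			ids = append(ids, "requested://"+m.Name)
		}
	}
	return ids
}

func main() {
	machines := []Machine{
		{Name: "m1", NodeName: "node-1", Deployment: "pool-a"},
		{Name: "m2", Deployment: "pool-a"}, // failed to provision so far
	}
	fmt.Println(nodeIDsForDeployment(machines, "pool-a"))
}
```

With the unregistered machine included, a caller that compares the returned IDs against the nodes actually present in the cluster can tell that `m2` was requested but never showed up.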
@jsravn Thank you for your contribution.
Thank you @jsravn for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.
I've been testing this a bunch, and it seems to fix the main issue I was experiencing where the autoscaler would get stuck if any MachineDeployment fails to scale. By returning the unregistered nodes, the autoscaler will eventually time out the failing node pools (using the ...
Thanks a lot for the PR, @jsravn.
I have a minor doubt/query, but it looks great anyway.
I gave it a quick try, and it solves the fundamental issue: another machine-deployment is chosen after ~15 mins. Basically, scale-up is backed off for the failed machine-deployment.
@prashanth26 Would you like to take a quick look at the PR?
I think it's possible, but it will need to be done on the MCM side. MCM needs to report the machine deployment as unable to scale further, and then the MCM autoscaler cloud provider should adjust its max size appropriately (freezing it if it can't scale up). Something like this.
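A rough sketch of the max-size freezing idea, under stated assumptions: `DeploymentStatus`, its `ScaleUpBlocked` field, and `effectiveMaxSize` are hypothetical and do not correspond to any existing MCM or autoscaler API. The sketch only shows how a provider could cap the reported maximum at the current size while a deployment is flagged as unable to scale.

```go
package main

import "fmt"

// DeploymentStatus is a hypothetical summary of a machine deployment.
type DeploymentStatus struct {
	CurrentSize    int
	MaxSize        int
	ScaleUpBlocked bool // hypothetical flag: deployment reported it cannot scale further
}

// effectiveMaxSize returns the max size a provider could report.
// While scale-up is blocked, the maximum is frozen at the current size,
// so no further scale-up is attempted on this deployment.
func effectiveMaxSize(s DeploymentStatus) int {
	if s.ScaleUpBlocked && s.CurrentSize < s.MaxSize {
		return s.CurrentSize
	}
	return s.MaxSize
}

func main() {
	blocked := DeploymentStatus{CurrentSize: 3, MaxSize: 10, ScaleUpBlocked: true}
	healthy := DeploymentStatus{CurrentSize: 3, MaxSize: 10}
	fmt.Println(effectiveMaxSize(blocked), effectiveMaxSize(healthy))
}
```

The trade-off is that the flag has to be cleared somewhere once capacity returns, which is part of the extra complexity this approach would bring.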
We implemented something very similar in MCM in the past, but the PR didn't make it through. We felt it added complexity, but it may be worth reconsidering. See: gardener/machine-controller-manager#454
/assign
/lgtm
/lgtm Apologies for the delay in review.
Related to #37 and #35.
Release note: