
Modify GetMachineDeploymentNodes to return unregistered nodes #58

Merged (1 commit) on Oct 7, 2020
Conversation

jsravn commented Oct 2, 2020

This is needed so the autoscaler can track nodes that fail to provision. With this change, the autoscaler will correctly try other node groups if a node group is failing, for example due to InsufficientInstanceCapacity.

Related to #37 and #35.

Release note:

The MCM provider now also returns unregistered nodes to the autoscaler. This change enables the autoscaler to pick an alternate worker pool if the chosen one can't be scaled up.
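For context, here is a minimal sketch of the mechanism described above, assuming a simplified Machine type and a synthetic requested://<machine-name> placeholder scheme (illustrative only, not the actual provider code):

```go
package sketch

import "fmt"

// Machine is a simplified stand-in for an MCM Machine object.
type Machine struct {
	Name       string
	ProviderID string // empty until the backing instance joins the cluster
}

// machineDeploymentNodes returns one entry per machine in a MachineDeployment.
// Machines whose node has registered are reported by their real provider ID;
// machines that never came up (e.g. due to InsufficientInstanceCapacity) are
// reported with a synthetic placeholder ID, so the autoscaler still "sees"
// the pending scale-up and can time it out instead of waiting forever.
func machineDeploymentNodes(machines []Machine) []string {
	ids := make([]string, 0, len(machines))
	for _, m := range machines {
		if m.ProviderID != "" {
			ids = append(ids, m.ProviderID)
			continue
		}
		// Placeholder for an unregistered machine (assumed scheme).
		ids = append(ids, fmt.Sprintf("requested://%s", m.Name))
	}
	return ids
}
```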

@gardener-robot

@jsravn Thank you for your contribution.


CLAassistant commented Oct 2, 2020

CLA assistant check
All committers have signed the CLA.

@gardener-robot-ci-2

Thank you @jsravn for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

jsravn (Author) commented Oct 2, 2020

I've been testing this a bunch, and it seems to fix the main issue I was experiencing, where the autoscaler would get stuck if any MachineDeployment failed to scale. By returning the unregistered nodes, the autoscaler will eventually time out the failing node pools (using --max-node-provision-time) and scale up other node pools instead. This should be a good step toward supporting spot instance pools, which have worse availability than on-demand instance types.
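As a rough illustration of that timeout behaviour (not the cluster-autoscaler implementation; the types and names here are invented for the sketch):

```go
package sketch

import "time"

// pendingNode represents an entry returned by the cloud provider that has no
// registered Kubernetes node yet.
type pendingNode struct {
	ProviderID string
	SeenSince  time.Time
}

// longUnregistered returns the pending nodes that have been unregistered for
// longer than maxNodeProvisionTime. The autoscaler would treat the owning
// node group as failing, back it off, and try other node groups instead.
func longUnregistered(pending []pendingNode, maxNodeProvisionTime time.Duration, now time.Time) []pendingNode {
	var timedOut []pendingNode
	for _, p := range pending {
		if now.Sub(p.SeenSince) > maxNodeProvisionTime {
			timedOut = append(timedOut, p)
		}
	}
	return timedOut
}
```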

@hardikdr added the reviewed/ok-to-test label Oct 4, 2020
@gardener-robot-ci-3 added the needs/ok-to-test label and removed the reviewed/ok-to-test label Oct 4, 2020
hardikdr (Member) left a comment

Thanks a lot for the PR @jsravn.
A minor doubt/query, but it looks great anyway.

hardikdr (Member) commented Oct 5, 2020

Gave it a quick try, and it solves the fundamental issue: another machine-deployment is chosen after ~15 mins [MaxNodeProvisionTime].

Basically, scale-up is backed off for the failed machine-deployment (see the rough sketch after this comment). The failed machine-deployment is reconsidered for scale-up after ~30 mins, and then it can again be used for the usual workload.

  • I am curious if there could be an elegant solution, where we could fail faster when insufficient-capacity is seen. But that'll definitely bring a bit of complexity.
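A rough sketch of the per-machine-deployment back-off behaviour described above (illustrative only; the 5-minute initial value and 30-minute cap are assumptions based on the timings mentioned in this thread, not the cluster-autoscaler code):

```go
package sketch

import "time"

// groupBackoff tracks an exponential back-off per machine-deployment after a
// failed scale-up, so a failing group is skipped while others are tried.
type groupBackoff struct {
	initial, max time.Duration
	until        map[string]time.Time
	current      map[string]time.Duration
}

func newGroupBackoff() *groupBackoff {
	return &groupBackoff{
		initial: 5 * time.Minute,  // assumed initial back-off
		max:     30 * time.Minute, // assumed cap, matching the ~30 min observed above
		until:   map[string]time.Time{},
		current: map[string]time.Duration{},
	}
}

// recordFailure doubles the back-off for the group, up to the cap.
func (b *groupBackoff) recordFailure(group string, now time.Time) {
	d := b.current[group]
	if d == 0 {
		d = b.initial
	} else {
		d *= 2
		if d > b.max {
			d = b.max
		}
	}
	b.current[group] = d
	b.until[group] = now.Add(d)
}

// backedOff reports whether the group should be skipped for scale-up right now.
func (b *groupBackoff) backedOff(group string, now time.Time) bool {
	return now.Before(b.until[group])
}
```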

hardikdr (Member) commented Oct 5, 2020

@prashanth26 Would you want to take a quick look at the PR?

jsravn (Author) commented Oct 5, 2020

> I am curious if there could be an elegant solution, where we could fail faster when insufficient-capacity is seen. But that'll definitely bring a bit of complexity.

I think it's possible, but it will need to be done on the MCM side. It needs to report the machine deployment as unable to scale further, and then the MCM autoscaler cloud provider should adjust its max size appropriately (freezing it if it can't scale up). Something like that.
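If MCM exposed such a signal, the provider-side handling might look roughly like this (purely hypothetical sketch; the ScaleUpBlocked field and the types are invented for illustration):

```go
package sketch

// machineDeploymentStatus is a simplified, hypothetical view of what MCM
// could report back to the autoscaler's cloud provider.
type machineDeploymentStatus struct {
	Replicas       int
	ScaleUpBlocked bool // e.g. set by MCM on repeated InsufficientInstanceCapacity
}

// effectiveMaxSize freezes the node group's max size at the current replica
// count while MCM reports that the deployment cannot scale further, so the
// autoscaler fails over to other groups immediately instead of waiting for
// the provisioning timeout.
func effectiveMaxSize(configuredMax int, s machineDeploymentStatus) int {
	if s.ScaleUpBlocked && s.Replicas < configuredMax {
		return s.Replicas
	}
	return configuredMax
}
```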

hardikdr (Member) commented Oct 5, 2020

> I think it's possible but will need to be done on the MCM side. It needs to report the machine deployment as unable to scale further and then the MCM autoscaler cloud provider should adjust its max size appropriately (freezing it if it can't scale up). Something like this.

We implemented something very similar in MCM in the past, but the PR couldn't make it through. We felt it was bringing in complexity, but maybe it's something to reconsider on that front. See gardener/machine-controller-manager#454, where @prashanth26 had a few concerns and a proposal for another approach.

  • MCM also posts a FailedMachineSummary in the Status, which can be parsed by the Autoscaler, but that summary is short-lived and will be gone once the machines are healthy. An annotation that stays for a long time [~5-6 hours] could actually be useful (a rough parsing sketch follows below).
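As a rough illustration of that second idea (hypothetical, simplified types; the real MachineDeployment status fields in MCM may differ), an autoscaler-side check could look like:

```go
package sketch

import "time"

// machineSummary is a hypothetical, simplified view of an entry in the
// MachineDeployment status describing a machine that failed to come up.
type machineSummary struct {
	Name          string
	ErrorCode     string // e.g. "InsufficientInstanceCapacity"
	LastOperation time.Time
}

// recentCapacityFailures counts failed machines reported within the given
// window; a non-zero count could be used to back off the machine-deployment
// sooner than the provisioning timeout. Because the status summary is
// short-lived, a longer-lived annotation (as suggested above) would make
// this signal more reliable.
func recentCapacityFailures(failed []machineSummary, window time.Duration, now time.Time) int {
	count := 0
	for _, f := range failed {
		if f.ErrorCode == "InsufficientInstanceCapacity" && now.Sub(f.LastOperation) <= window {
			count++
		}
	}
	return count
}
```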

hardikdr (Member) left a comment

/assign
/lgtm

@gardener-robot added the reviewed/lgtm label Oct 7, 2020
prashanth26 left a comment

/lgtm
Apologies for the delay in review.

Labels
needs/ok-to-test Needs approval for testing
reviewed/lgtm Has approval for merging

7 participants