
Modify GetMachineDeploymentNodes to return unregistered nodes #58

Merged (1 commit) on Oct 7, 2020
Conversation

jsravn commented Oct 2, 2020

This is needed so the autoscaler can track nodes that fail to provision. With this change, the autoscaler will correctly try other node groups if a node group is failing, for example due to InsufficientInstanceCapacity.

Related to #37 and #35.

Release note:

The MCM provider now also returns unregistered nodes to the autoscaler. This change enables the autoscaler to pick an alternate worker pool if the chosen one can't be scaled up.
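For context, here is a minimal sketch of the mechanism described above, assuming a simplified Machine type and a synthetic requested://<machine-name> placeholder scheme (illustrative only, not the actual provider code):

```go
package sketch

import "fmt"

// Machine is a simplified stand-in for an MCM Machine object.
type Machine struct {
	Name       string
	ProviderID string // empty until the backing instance joins the cluster
}

// machineDeploymentNodes returns one entry per machine in a MachineDeployment.
// Machines whose node has registered are reported by their real provider ID;
// machines that never came up (e.g. due to InsufficientInstanceCapacity) are
// reported with a synthetic placeholder ID, so the autoscaler still "sees"
// the pending scale-up and can time it out instead of waiting forever.
func machineDeploymentNodes(machines []Machine) []string {
	ids := make([]string, 0, len(machines))
	for _, m := range machines {
		if m.ProviderID != "" {
			ids = append(ids, m.ProviderID)
			continue
		}
		// Placeholder for an unregistered machine (assumed scheme).
		ids = append(ids, fmt.Sprintf("requested://%s", m.Name))
	}
	return ids
}
```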

@gardener-robot

@jsravn Thank you for your contribution.


CLAassistant commented Oct 2, 2020

CLA assistant check
All committers have signed the CLA.

@gardener-robot-ci-2

Thank you @jsravn for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

jsravn (Author) commented Oct 2, 2020

I've been testing this a bunch, and it seems to fix the main issue I was experiencing, where the autoscaler would get stuck if any MachineDeployment failed to scale. By returning the unregistered nodes, the autoscaler will eventually time out the failing node pools (using --max-node-provision-time) and scale up other node pools instead. This should be a good step toward supporting spot instance pools, which have worse availability than on-demand instance types.
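As a rough illustration of that timeout behaviour (not the cluster-autoscaler implementation; the types and names here are invented for the sketch):

```go
package sketch

import "time"

// pendingNode represents an entry returned by the cloud provider that has no
// registered Kubernetes node yet.
type pendingNode struct {
	ProviderID string
	SeenSince  time.Time
}

// longUnregistered returns the pending nodes that have been unregistered for
// longer than maxNodeProvisionTime. The autoscaler would treat the owning
// node group as failing, back it off, and try other node groups instead.
func longUnregistered(pending []pendingNode, maxNodeProvisionTime time.Duration, now time.Time) []pendingNode {
	var timedOut []pendingNode
	for _, p := range pending {
		if now.Sub(p.SeenSince) > maxNodeProvisionTime {
			timedOut = append(timedOut, p)
		}
	}
	return timedOut
}
```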

@hardikdr added the reviewed/ok-to-test label Oct 4, 2020
@gardener-robot-ci-3 added the needs/ok-to-test label and removed the reviewed/ok-to-test label Oct 4, 2020
hardikdr (Member) left a comment

Thanks a lot for the PR @jsravn.
A minor doubt/query, but it looks great anyway.

hardikdr (Member) commented Oct 5, 2020

Gave it a quick try, and it solves the fundamental issue: another machine-deployment is chosen after ~15 mins [MaxNodeProvisionTime].

Basically, scale-up is backed off for the failed machine-deployment (see the rough sketch after this comment). The failed machine-deployment is reconsidered for scale-up after ~30 mins, and then it can again be used for the usual workload.

  • I am curious if there could be an elegant solution, where we could fail faster when insufficient-capacity is seen. But that'll definitely bring a bit of complexity.
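A rough sketch of the per-machine-deployment back-off behaviour described above (illustrative only; the 5-minute initial value and 30-minute cap are assumptions based on the timings mentioned in this thread, not the cluster-autoscaler code):

```go
package sketch

import "time"

// groupBackoff tracks an exponential back-off per machine-deployment after a
// failed scale-up, so a failing group is skipped while others are tried.
type groupBackoff struct {
	initial, max time.Duration
	until        map[string]time.Time
	current      map[string]time.Duration
}

func newGroupBackoff() *groupBackoff {
	return &groupBackoff{
		initial: 5 * time.Minute,  // assumed initial back-off
		max:     30 * time.Minute, // assumed cap, matching the ~30 min observed above
		until:   map[string]time.Time{},
		current: map[string]time.Duration{},
	}
}

// recordFailure doubles the back-off for the group, up to the cap.
func (b *groupBackoff) recordFailure(group string, now time.Time) {
	d := b.current[group]
	if d == 0 {
		d = b.initial
	} else {
		d *= 2
		if d > b.max {
			d = b.max
		}
	}
	b.current[group] = d
	b.until[group] = now.Add(d)
}

// backedOff reports whether the group should be skipped for scale-up right now.
func (b *groupBackoff) backedOff(group string, now time.Time) bool {
	return now.Before(b.until[group])
}
```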

hardikdr (Member) commented Oct 5, 2020

@prashanth26 Would you want to take a quick look at the PR?

jsravn (Author) commented Oct 5, 2020

> I am curious if there could be an elegant solution, where we could fail faster when insufficient-capacity is seen. But that'll definitely bring a bit of complexity.

I think it's possible, but it will need to be done on the MCM side. It needs to report the machine deployment as unable to scale further, and then the MCM autoscaler cloud provider should adjust its max size appropriately (freezing it if it can't scale up). Something like that.
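If MCM exposed such a signal, the provider-side handling might look roughly like this (purely hypothetical sketch; the ScaleUpBlocked field and the types are invented for illustration):

```go
package sketch

// machineDeploymentStatus is a simplified, hypothetical view of what MCM
// could report back to the autoscaler's cloud provider.
type machineDeploymentStatus struct {
	Replicas       int
	ScaleUpBlocked bool // e.g. set by MCM on repeated InsufficientInstanceCapacity
}

// effectiveMaxSize freezes the node group's max size at the current replica
// count while MCM reports that the deployment cannot scale further, so the
// autoscaler fails over to other groups immediately instead of waiting for
// the provisioning timeout.
func effectiveMaxSize(configuredMax int, s machineDeploymentStatus) int {
	if s.ScaleUpBlocked && s.Replicas < configuredMax {
		return s.Replicas
	}
	return configuredMax
}
```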

hardikdr (Member) commented Oct 5, 2020

> I think it's possible but will need to be done on the MCM side. It needs to report the machine deployment as unable to scale further and then the MCM autoscaler cloud provider should adjust its max size appropriately (freezing it if it can't scale up). Something like this.

We implemented something very similar in MCM in the past, but the PR couldn't make it through. We felt it was bringing in complexity, but maybe it's something to reconsider on that front. See gardener/machine-controller-manager#454, where @prashanth26 had a few concerns and a proposal for another approach.

  • MCM also posts a FailedMachineSummary in the Status, which can be parsed by the Autoscaler, but that summary is short-lived and will be gone once the machines are healthy. An annotation that stays for a long time [~5-6 hours] could actually be useful (a rough parsing sketch follows below).
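As a rough illustration of that second idea (hypothetical, simplified types; the real MachineDeployment status fields in MCM may differ), an autoscaler-side check could look like:

```go
package sketch

import "time"

// machineSummary is a hypothetical, simplified view of an entry in the
// MachineDeployment status describing a machine that failed to come up.
type machineSummary struct {
	Name          string
	ErrorCode     string // e.g. "InsufficientInstanceCapacity"
	LastOperation time.Time
}

// recentCapacityFailures counts failed machines reported within the given
// window; a non-zero count could be used to back off the machine-deployment
// sooner than the provisioning timeout. Because the status summary is
// short-lived, a longer-lived annotation (as suggested above) would make
// this signal more reliable.
func recentCapacityFailures(failed []machineSummary, window time.Duration, now time.Time) int {
	count := 0
	for _, f := range failed {
		if f.ErrorCode == "InsufficientInstanceCapacity" && now.Sub(f.LastOperation) <= window {
			count++
		}
	}
	return count
}
```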

hardikdr (Member) left a comment

/assign
/lgtm

@gardener-robot added the reviewed/lgtm label Oct 7, 2020
prashanth26 left a comment

/lgtm
Apologies for the delay in review.

Labels
needs/ok-to-test Needs approval for testing
reviewed/lgtm Has approval for merging

7 participants