
An option to throttle deletion of Failed Machines #482

Closed · wants to merge 9 commits

Conversation

zuzzas
Contributor

@zuzzas zuzzas commented Jun 29, 2020

What this PR does / why we need it:

This PR adds the ability to throttle the deletion of Failed Machines. Why? It's time for a story!

Imagine a Kubernetes cluster, custom-built for a customer. Virtual machines are created via MCM; masters are static. A tragically under-tested infrastructure platform release severs the routing between the Master and Worker Nodes via improperly configured routing tables in a CNI provider.

After losing the connection, Machines transition to the Unknown state, then, after a timeout, to the Failed state, and then MCM deletes them all. All of them at the same time. The recovery time of 80 Nodes in a vSphere environment is abysmal: disks are slow to detach and attach, and the hypervisors become overloaded by the many new machines springing to life.

Moral of the story: if Nodes become detached from the Master, the workload does not immediately falter, so it is not healthy to remove all Machines from the cluster the moment they become unavailable. They can still work well, since properly written microservices cache Service Discovery and other information upon startup.

Special notes for your reviewer:

If you agree with the overall approach, I'd like to direct your attention to the new table tests. Please, verify the exhaustiveness of the provided test cases.

Release note:

A new option "--failed-machine-deletion-ratio" configures throttling of the Failed Machine deletion process until newly created Machines transition into the Running state.
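
For illustration, here is a minimal sketch of how such a ratio could translate into a per-batch deletion budget. The helper name and exact wiring are assumptions, not the PR's actual code; only the flag name comes from the release note above.

package main

import (
	"fmt"
	"math"
)

// deletionBudget is a hypothetical helper illustrating the flag's semantics:
// it returns how many Failed machines may be deleted in one batch, given the
// configured --failed-machine-deletion-ratio.
func deletionBudget(failedMachines int, ratio float64) int {
	if failedMachines == 0 || ratio <= 0 {
		return 0
	}
	budget := int(math.Ceil(ratio * float64(failedMachines)))
	if budget > failedMachines {
		budget = failedMachines
	}
	return budget
}

func main() {
	// With 80 Failed machines and a ratio of 0.25, only 20 are deleted per batch.
	fmt.Println(deletionBudget(80, 0.25))
}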

@zuzzas zuzzas requested review from ggaurav10 and a team as code owners June 29, 2020 17:57
@gardener-robot

@zuzzas Thank you for your contribution.

@hardikdr hardikdr added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 4, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 4, 2020
@hardikdr
Member

hardikdr commented Jul 4, 2020

I had a first-cut review; overall it looks good, and thanks for the table-based tests.

I'll try to test this PR in our environment in the middle of next week.

@hardikdr
Member

hardikdr commented Aug 4, 2020

/assign

@zuzzas zuzzas force-pushed the upstreaming-smart-deletion branch from ce81101 to 323d4ca Compare August 10, 2020 21:31
Comment on lines 364 to 366
if len(staleMachines) > 0 && diff > machineDeletionWindow {
	diff = machineDeletionWindow
}
Contributor


It will block scale-up if .spec.replicas was increased while there were also stale machines.
Should we instead throttle machine creation here by the number of stale machines that were not deleted?

Trying to explain my thought through code and comments below:

Suggested change:

-	if len(staleMachines) > 0 && diff > machineDeletionWindow {
-		diff = machineDeletionWindow
-	}
+	if len(staleMachines) > 0 && machineDeletionWindow > 0 {
+		// Count the leftover stale machines against .spec.replicas to prevent a sudden surge
+		// in machines, which can happen if a large number of stale machines are left over.
+		diff = diff - (len(staleMachines) - machineDeletionWindow)
+		if diff < 0 {
+			// Typically diff >= len(staleMachines), but this can happen when there are a lot of
+			// failed machines and a scale-down happens. E.g. allMachines is 12, failed machines = 10,
+			// deletionWindow = 3, and spec.replicas is changed to 6, so active machines = 12 - 10 = 2.
+			// Here diff before throttling = -(2 - 6) = 4 and diff after throttling = 4 - (10 - 3) = -3.
+			// Here we should ideally scale down; it would be better to adjust for the leftover
+			// machines before entering the top-level if block.
+			return nil
+		}
+	}

Similarly, when scaling down (the diff > 0 case), we delete only from the list of activeMachines, while some of the staleMachines will also be left over. We should include the stale machines in the list of machinesToDelete when scaling down as well.
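
A minimal sketch of that scale-down idea, assuming hypothetical names (selectMachinesToDelete, plain string slices) rather than the PR's actual types:

package main

import "fmt"

// selectMachinesToDelete is a sketch only: during scale-down, count stale (Failed)
// machines against the scale-down first, then fill the remaining quota from active
// machines. Names and types are illustrative.
func selectMachinesToDelete(activeMachines, staleMachines []string, diff int) []string {
	machinesToDelete := make([]string, 0, diff)

	// Stale machines are going away anyway, so delete them first.
	for _, m := range staleMachines {
		if len(machinesToDelete) == diff {
			return machinesToDelete
		}
		machinesToDelete = append(machinesToDelete, m)
	}

	// Top up with active machines only if the stale ones did not cover the whole diff.
	for _, m := range activeMachines {
		if len(machinesToDelete) == diff {
			break
		}
		machinesToDelete = append(machinesToDelete, m)
	}
	return machinesToDelete
}

func main() {
	// Scale down by 3 while 2 stale machines are left over: only 1 active machine is deleted.
	fmt.Println(selectMachinesToDelete([]string{"a1", "a2", "a3"}, []string{"s1", "s2"}, 3))
}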

Contributor Author


Thanks. This and other suggestions should be encoded into unit tests. I'll get right to it.

Member


Agreed, we should recompute the diff with diff = diff - (len(staleMachines) - machineDeletionWindow) to avoid throttled machine creations.
This is clearly a bit mind-boggling; thanks to @ggaurav10 for catching it and to @zuzzas for implementing it.


Member


Thanks. @ggaurav10, can you please take a quick look at this aspect?

Member


I see, now we are recomputing the diff at the start itself, and the new diff takes the leftover stale machines into account. At first sight it looks good; I'd like to test this aspect separately.

@hardikdr
Member

hardikdr commented Aug 24, 2020

@ggaurav10 @prashanth26 Can you please resolve the comments if you think they have been taken care of? Otherwise, it would be great if you could follow up :)

@hardikdr hardikdr added this to the v0.35.0 milestone Aug 25, 2020
@zuzzas zuzzas force-pushed the upstreaming-smart-deletion branch from cacd376 to e24e0a0 Compare September 7, 2020 07:33
Contributor Author

@zuzzas zuzzas left a comment


I'd like to direct the reviewers' attention to the set of test cases. If I have forgotten a case, please remind me of it. I've refactored this part into table form.

https://github.com/gardener/machine-controller-manager/pull/482/files/e24e0a033587548a9cd56b4dfb023ccf63e2f1dc#diff-fd309453fe2270abc5c24d3dde7e7a8fR299-R311

@ggaurav10 @prashanth26 @hardikdr

@hardikdr
Member

I'd like to direct the reviewers' attention to the set of test cases. If I have forgotten a case, please remind me of it. I've refactored this part into table form.

Thanks for moving to the table format; it looks sufficient to me.


diff := len(activeMachines) - int(machineSet.Spec.Replicas)
if staleMachinesLeft > 0 {
	// Count the leftover stale machines against .spec.replicas to prevent sudden surge in machines which can happen if large number of stale machines are left over
	diff -= staleMachinesLeft
Contributor


Suggested change:

-	diff -= staleMachinesLeft
+	diff += staleMachinesLeft

The idea is to count the leftover machines as active machines. In #482 (comment) it was subtracted because there was a diff *= -1 statement a few lines earlier.

Contributor Author


But that won't work, because it will panic in getMachinesToDelete(), since diff will be bigger than the activeMachines count.

It's my fault for moving this calculation outside of the if branch. This condition should suffice.


diff := len(activeMachines) - int(machineSet.Spec.Replicas)
if staleMachinesLeft > 0 {
	// Count the leftover stale machines against .spec.replicas to prevent sudden surge in machines which can happen if large number of stale machines are left over
	diff -= staleMachinesLeft
Member


If diff is already negative, the further subtraction will only increase the number of machines to be created. This step should either be moved down into the if block, or some other alternative is needed.
A simple example from Gaurav:

active machines = 12, spec = 15, 2 deleted, 1 left over;
we should now be creating only 2 machines, and not 4.

@zuzzas Can you please take care of this part?
credits @ggaurav10
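
To make the arithmetic concrete, a small illustrative calculation of the example above (the variable names and the breakdown are my reading of the discussion, not the PR's exact code):

package main

import "fmt"

func main() {
	// Numbers from the example above; names are illustrative only.
	activeMachines := 12      // machines currently counted as active
	specReplicas := 15        // machineSet.Spec.Replicas after the scale-up
	deletedStaleMachines := 2 // stale machines deleted in this reconciliation
	staleMachinesLeft := 1    // stale machines still awaiting deletion

	diff := activeMachines - specReplicas // -3: three machines short of the spec

	// Subtracting the leftover stale machines from an already negative diff
	// inflates the number of creations: -3 - 1 = -4, i.e. 4 new machines.
	naive := diff - staleMachinesLeft

	// The intended throttling creates only as many machines as were just deleted.
	throttled := -deletedStaleMachines

	fmt.Printf("naive creations: %d, throttled creations: %d\n", -naive, -throttled) // 4 vs. 2
}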

Contributor Author


Taken care of.

zuzzas and others added 2 commits September 11, 2020 09:34
@CLAassistant

CLAassistant commented Sep 16, 2020

CLA assistant check
All committers have signed the CLA.

@hardikdr
Member

hardikdr commented Sep 21, 2020

@aylei Would you like to take a brief look as well? This seems to be related to what you proposed in #449. Sorry for the late invite.

@zuzzas Did you eventually get any experience from running these changes in your environment?


// clamp machine creation to the count of recently deleted stale machines, so that we don't overshoot
if staleMachinesLeft > 0 && diff > deletedStaleMachines {
	diff = deletedStaleMachines
Member

@hardikdr hardikdr Sep 21, 2020


Hm, this seems to break the scale-up situation.
E.g. total machines are 10, the desired count is 13, there are 4 stale machines, and the threshold is 50%.
Then I would want to create 2 + 3 = 5 machines, and not 2.
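
Working through those numbers (the breakdown below is my reading of the example, not the PR's exact variables):

package main

import "fmt"

func main() {
	// Numbers from the example above; names are illustrative only.
	totalMachines := 10   // current machines, including the stale ones
	staleMachines := 4    // machines in the Failed state
	desiredReplicas := 13 // spec after the scale-up
	deletionRatio := 0.5  // the 50% threshold

	deletedNow := int(float64(staleMachines) * deletionRatio) // 2 stale machines deleted in this batch
	scaleUp := desiredReplicas - totalMachines                // 3 extra machines requested by the scale-up

	wanted := deletedNow + scaleUp // 5 creations expected
	clamped := deletedNow          // only 2 creations if diff is clamped to deletedStaleMachines

	fmt.Println(wanted, clamped) // 5 2
}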

@hardikdr
Member

I have opened flant#1 with help from @ggaurav10; I have tried to convey the core idea there.

@hardikdr
Member

hardikdr commented Oct 4, 2020

Trying to put down the overall approach for the feature, based on what we have learned so far:

  • A flag should be introduced to determine the expected portion of the Failed machines to be removed.
  • The ratio should be calculated based on the current number of inactive machines and not on Spec.Replicas. The calculation should be redone with every reconciliation of the machine set, as the number of Failed machines can change over time.
  • There should be a lower threshold below which we simply delete all of them instead of recalculating based on the ratio, e.g. the last 3 machines should simply be deleted altogether.
  • The next iteration of the deletion of Failed machines should wait until the previous batch has joined the cluster.
  • Overall, the logic should also work with ongoing scale-up or scale-down scenarios, e.g. the machine deployment is scaled up/down while a few machines are already Failed.

Most of the above requirements have already been taken care of in the PR (a rough sketch of the combined logic follows below).
@zuzzas Wdyt, does the description above work for you?
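
A minimal sketch of how these points could fit together, assuming hypothetical names (lowerThreshold, pendingReplacements, the simplified Machine type) rather than the PR's actual implementation; only the ratio flag itself comes from the release note:

package main

import (
	"fmt"
	"math"
)

// Machine is a simplified stand-in for the MCM Machine object; only the fields
// needed for this illustration are modeled.
type Machine struct {
	Name  string
	Phase string // "Running", "Pending", "Failed", ...
}

// failedMachinesToDelete sketches the approach from the list above:
//   - the deletion budget is a ratio of the currently Failed (inactive) machines,
//     recomputed on every reconciliation;
//   - below a lower threshold, all Failed machines are deleted at once;
//   - no new batch is deleted while replacements from the previous batch have
//     not yet joined the cluster (pendingReplacements > 0).
func failedMachinesToDelete(machines []Machine, deletionRatio float64, lowerThreshold, pendingReplacements int) []Machine {
	var failed []Machine
	for _, m := range machines {
		if m.Phase == "Failed" {
			failed = append(failed, m)
		}
	}

	if pendingReplacements > 0 {
		return nil // wait for the previous batch to join the cluster
	}
	if len(failed) <= lowerThreshold {
		return failed // few enough left: delete them altogether
	}

	budget := int(math.Ceil(deletionRatio * float64(len(failed))))
	if budget > len(failed) {
		budget = len(failed)
	}
	return failed[:budget]
}

func main() {
	machines := []Machine{
		{"m1", "Failed"}, {"m2", "Failed"}, {"m3", "Failed"},
		{"m4", "Failed"}, {"m5", "Running"},
	}
	// With a ratio of 0.5 and a lower threshold of 3, two of the four Failed
	// machines are deleted in this reconciliation.
	fmt.Println(len(failedMachinesToDelete(machines, 0.5, 3, 0))) // 2
}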

@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 4, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 4, 2020
@hardikdr hardikdr modified the milestones: v0.35.0, v0.36.0 Oct 10, 2020
@gardener-robot gardener-robot added lifecycle/stale Nobody worked on this for 6 months (will further age) needs/changes Needs (more) changes labels Dec 10, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels May 19, 2021
@prashanth26
Contributor

/close due to inactivity. Please reopen if/when required.

Labels
lifecycle/rotten Nobody worked on this for 12 months (final aging stage)
needs/changes Needs (more) changes
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD)

8 participants