
An option to throttle deletion of Failed Machines #482

Closed · wants to merge 9 commits

Conversation

zuzzas
Contributor

@zuzzas zuzzas commented Jun 29, 2020

What this PR does / why we need it:

This PR adds the ability to throttle the deletion of Failed Machines. Why? It's time for a story!

Imagine a Kubernetes cluster, custom-built for a customer. Virtual machines are created via MCM; masters are static. A tragically under-tested infrastructure platform release severs the routing between the Master and Worker Nodes via improperly configured routing tables in a CNI provider.

After losing the connection, Machines transition to the Unknown state, then, after a timeout, to the Failed state, and then MCM deletes them all. All of them at the same time. The recovery time of 80 Nodes in a vSphere environment is abysmal: disks are slow to detach and attach, and the hypervisors become overloaded by the many new machines springing to life.

Moral of the story: if Nodes become detached from the Master, the workload does not immediately falter, so it is not healthy to remove all Machines from the cluster the moment they become unavailable. They can still work well, since properly written microservices cache Service Discovery and other information upon startup.

Special notes for your reviewer:

If you agree with the overall approach, I'd like to direct your attention to the new table tests. Please, verify the exhaustiveness of the provided test cases.

Release note:

A new option "--failed-machine-deletion-ratio" configures throttling of the Failed Machine deletion process until newly created Machines transition into the Running state.
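
For illustration, here is a minimal sketch of how such a ratio could translate into a per-batch deletion budget. The helper name and exact wiring are assumptions, not the PR's actual code; only the flag name comes from the release note above.

package main

import (
	"fmt"
	"math"
)

// deletionBudget is a hypothetical helper illustrating the flag's semantics:
// it returns how many Failed machines may be deleted in one batch, given the
// configured --failed-machine-deletion-ratio.
func deletionBudget(failedMachines int, ratio float64) int {
	if failedMachines == 0 || ratio <= 0 {
		return 0
	}
	budget := int(math.Ceil(ratio * float64(failedMachines)))
	if budget > failedMachines {
		budget = failedMachines
	}
	return budget
}

func main() {
	// With 80 Failed machines and a ratio of 0.25, only 20 are deleted per batch.
	fmt.Println(deletionBudget(80, 0.25))
}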

@zuzzas zuzzas requested review from ggaurav10 and a team as code owners June 29, 2020 17:57
@gardener-robot

@zuzzas Thank you for your contribution.

@hardikdr hardikdr added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 4, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 4, 2020
@hardikdr
Member

hardikdr commented Jul 4, 2020

I had a first-cut review; overall it looks good, and thanks for the table-based tests.

I'll try to test this PR in our environment in the middle of next week.

@hardikdr
Member

hardikdr commented Aug 4, 2020

/assign

@zuzzas zuzzas force-pushed the upstreaming-smart-deletion branch from ce81101 to 323d4ca Compare August 10, 2020 21:31
Comment on lines 364 to 366
if len(staleMachines) > 0 && diff > machineDeletionWindow {
	diff = machineDeletionWindow
}
Contributor


It will block scale-up if .spec.replicas was increased while there were also stale machines.
Should we instead throttle machine creation here by the number of stale machines that were not deleted?

Trying to explain my thought through code and comments below:

Suggested change:

-	if len(staleMachines) > 0 && diff > machineDeletionWindow {
-		diff = machineDeletionWindow
-	}
+	if len(staleMachines) > 0 && machineDeletionWindow > 0 {
+		// Count the leftover stale machines against .spec.replicas to prevent a sudden surge
+		// in machines, which can happen if a large number of stale machines are left over.
+		diff = diff - (len(staleMachines) - machineDeletionWindow)
+		if diff < 0 {
+			// Typically diff >= len(staleMachines), but this can happen when there are a lot of
+			// failed machines and a scale-down happens. E.g. allMachines is 12, failed machines = 10,
+			// deletionWindow = 3, and spec.replicas is changed to 6, so active machines = 12 - 10 = 2.
+			// Here diff before throttling = -(2 - 6) = 4 and diff after throttling = 4 - (10 - 3) = -3.
+			// Here we should ideally scale down; it would be better to adjust for the leftover
+			// machines before entering the top-level if block.
+			return nil
+		}
+	}

Similarly, when scaling down (the diff > 0 case), we delete only from the list of activeMachines, while some of the staleMachines will also be left over. We should include the stale machines in the list of machinesToDelete when scaling down as well.
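
A minimal sketch of that scale-down idea, assuming hypothetical names (selectMachinesToDelete, plain string slices) rather than the PR's actual types:

package main

import "fmt"

// selectMachinesToDelete is a sketch only: during scale-down, count stale (Failed)
// machines against the scale-down first, then fill the remaining quota from active
// machines. Names and types are illustrative.
func selectMachinesToDelete(activeMachines, staleMachines []string, diff int) []string {
	machinesToDelete := make([]string, 0, diff)

	// Stale machines are going away anyway, so delete them first.
	for _, m := range staleMachines {
		if len(machinesToDelete) == diff {
			return machinesToDelete
		}
		machinesToDelete = append(machinesToDelete, m)
	}

	// Top up with active machines only if the stale ones did not cover the whole diff.
	for _, m := range activeMachines {
		if len(machinesToDelete) == diff {
			break
		}
		machinesToDelete = append(machinesToDelete, m)
	}
	return machinesToDelete
}

func main() {
	// Scale down by 3 while 2 stale machines are left over: only 1 active machine is deleted.
	fmt.Println(selectMachinesToDelete([]string{"a1", "a2", "a3"}, []string{"s1", "s2"}, 3))
}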

Contributor Author


Thanks. This and other suggestions should be encoded into unit tests. I'll get right to it.

Member


Agreed, we should recompute the diff with diff = diff - (len(staleMachines) - machineDeletionWindow) to avoid throttled machine creations.
This is clearly a bit mind-boggling; thanks to @ggaurav10 for catching it and to @zuzzas for implementing it.


Member


Thanks. @ggaurav10, can you please take a quick look at this aspect?

Member


I see, now we are recomputing the diff at the start itself, and the new diff takes the leftover stale machines into account. At first sight it looks good; I'd like to test this aspect separately.

@hardikdr
Member

hardikdr commented Aug 24, 2020

@ggaurav10 @prashanth26 Can you please resolve the comments if you think they have been taken care of? Otherwise, it would be great if you could follow up :)

@hardikdr hardikdr added this to the v0.35.0 milestone Aug 25, 2020
@zuzzas zuzzas force-pushed the upstreaming-smart-deletion branch from cacd376 to e24e0a0 Compare September 7, 2020 07:33
Contributor Author

@zuzzas zuzzas left a comment


I'd like to direct the reviewers' attention to the set of test cases. If I have forgotten a case, please remind me of it. I've refactored this part into table form.

https://github.com/gardener/machine-controller-manager/pull/482/files/e24e0a033587548a9cd56b4dfb023ccf63e2f1dc#diff-fd309453fe2270abc5c24d3dde7e7a8fR299-R311

@ggaurav10 @prashanth26 @hardikdr

@hardikdr
Member

I'd like to direct the reviewers' attention to the set of test cases. If I have forgotten a case, please remind me of it. I've refactored this part into table form.

Thanks for moving to the table format; it looks sufficient to me.


diff := len(activeMachines) - int(machineSet.Spec.Replicas)
if staleMachinesLeft > 0 {
	// Count the leftover stale machines against .spec.replicas to prevent sudden surge in machines which can happen if large number of stale machines are left over
	diff -= staleMachinesLeft
Contributor


Suggested change:

-	diff -= staleMachinesLeft
+	diff += staleMachinesLeft

The idea is to count the leftover machines as active machines. In #482 (comment) it was subtracted because there was a diff *= -1 statement a few lines earlier.

Contributor Author


But that won't work, because it will panic in getMachinesToDelete(), since diff will be bigger than the activeMachines count.

It's my fault for moving this calculation outside of the if branch. This condition should suffice.


diff := len(activeMachines) - int(machineSet.Spec.Replicas)
if staleMachinesLeft > 0 {
	// Count the leftover stale machines against .spec.replicas to prevent sudden surge in machines which can happen if large number of stale machines are left over
	diff -= staleMachinesLeft
Member


If diff is already negative, the further subtraction will only increase the number of machines to be created. This step should either be moved down into the if block, or some other alternative is needed.
A simple example from Gaurav:

active machines = 12, spec = 15, 2 deleted, 1 left over;
we should now be creating only 2 machines, and not 4.

@zuzzas Can you please take care of this part?
credits @ggaurav10
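
To make the arithmetic concrete, a small illustrative calculation of the example above (the variable names and the breakdown are my reading of the discussion, not the PR's exact code):

package main

import "fmt"

func main() {
	// Numbers from the example above; names are illustrative only.
	activeMachines := 12      // machines currently counted as active
	specReplicas := 15        // machineSet.Spec.Replicas after the scale-up
	deletedStaleMachines := 2 // stale machines deleted in this reconciliation
	staleMachinesLeft := 1    // stale machines still awaiting deletion

	diff := activeMachines - specReplicas // -3: three machines short of the spec

	// Subtracting the leftover stale machines from an already negative diff
	// inflates the number of creations: -3 - 1 = -4, i.e. 4 new machines.
	naive := diff - staleMachinesLeft

	// The intended throttling creates only as many machines as were just deleted.
	throttled := -deletedStaleMachines

	fmt.Printf("naive creations: %d, throttled creations: %d\n", -naive, -throttled) // 4 vs. 2
}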

Contributor Author


Taken care of.

zuzzas and others added 2 commits September 11, 2020 09:34
@CLAassistant

CLAassistant commented Sep 16, 2020

CLA assistant check
All committers have signed the CLA.

@hardikdr
Member

hardikdr commented Sep 21, 2020

@aylei Would you like to take a brief look as well? This seems to be related to what you proposed in #449. Sorry for the late invite.

@zuzzas Did you eventually get any experience from running these changes in your environment?


// clamp machine creation to the count of recently deleted stale machines, so that we don't overshoot
if staleMachinesLeft > 0 && diff > deletedStaleMachines {
	diff = deletedStaleMachines
Member

@hardikdr hardikdr Sep 21, 2020


Hm, this seems to break the scale-up situation.
E.g. total machines are 10, the desired count is 13, there are 4 stale machines, and the threshold is 50%.
Then I would want to create 2 + 3 = 5 machines, and not 2.
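
Working through those numbers (the breakdown below is my reading of the example, not the PR's exact variables):

package main

import "fmt"

func main() {
	// Numbers from the example above; names are illustrative only.
	totalMachines := 10   // current machines, including the stale ones
	staleMachines := 4    // machines in the Failed state
	desiredReplicas := 13 // spec after the scale-up
	deletionRatio := 0.5  // the 50% threshold

	deletedNow := int(float64(staleMachines) * deletionRatio) // 2 stale machines deleted in this batch
	scaleUp := desiredReplicas - totalMachines                // 3 extra machines requested by the scale-up

	wanted := deletedNow + scaleUp // 5 creations expected
	clamped := deletedNow          // only 2 creations if diff is clamped to deletedStaleMachines

	fmt.Println(wanted, clamped) // 5 2
}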

@hardikdr
Member

I have opened flant#1 with help from @ggaurav10; I have tried to convey the core idea there.

@hardikdr
Member

hardikdr commented Oct 4, 2020

Trying to put down the overall approach for the feature, based on what we have learned so far:

  • A flag should be introduced to determine the expected portion of the Failed machines to be removed.
  • The ratio should be calculated based on the current number of inactive machines and not on Spec.Replicas. The calculation should be redone with every reconciliation of the machine set, as the number of Failed machines can change over time.
  • There should be a lower threshold below which we simply delete all of them instead of recalculating based on the ratio, e.g. the last 3 machines should simply be deleted altogether.
  • The next iteration of the deletion of Failed machines should wait until the previous batch has joined the cluster.
  • Overall, the logic should also work with ongoing scale-up or scale-down scenarios, e.g. the machine deployment is scaled up/down while a few machines are already Failed.

Most of the above requirements have already been taken care of in the PR (a rough sketch of the combined logic follows below).
@zuzzas Wdyt, does the description above work for you?
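
A minimal sketch of how these points could fit together, assuming hypothetical names (lowerThreshold, pendingReplacements, the simplified Machine type) rather than the PR's actual implementation; only the ratio flag itself comes from the release note:

package main

import (
	"fmt"
	"math"
)

// Machine is a simplified stand-in for the MCM Machine object; only the fields
// needed for this illustration are modeled.
type Machine struct {
	Name  string
	Phase string // "Running", "Pending", "Failed", ...
}

// failedMachinesToDelete sketches the approach from the list above:
//   - the deletion budget is a ratio of the currently Failed (inactive) machines,
//     recomputed on every reconciliation;
//   - below a lower threshold, all Failed machines are deleted at once;
//   - no new batch is deleted while replacements from the previous batch have
//     not yet joined the cluster (pendingReplacements > 0).
func failedMachinesToDelete(machines []Machine, deletionRatio float64, lowerThreshold, pendingReplacements int) []Machine {
	var failed []Machine
	for _, m := range machines {
		if m.Phase == "Failed" {
			failed = append(failed, m)
		}
	}

	if pendingReplacements > 0 {
		return nil // wait for the previous batch to join the cluster
	}
	if len(failed) <= lowerThreshold {
		return failed // few enough left: delete them altogether
	}

	budget := int(math.Ceil(deletionRatio * float64(len(failed))))
	if budget > len(failed) {
		budget = len(failed)
	}
	return failed[:budget]
}

func main() {
	machines := []Machine{
		{"m1", "Failed"}, {"m2", "Failed"}, {"m3", "Failed"},
		{"m4", "Failed"}, {"m5", "Running"},
	}
	// With a ratio of 0.5 and a lower threshold of 3, two of the four Failed
	// machines are deleted in this reconciliation.
	fmt.Println(len(failedMachinesToDelete(machines, 0.5, 3, 0))) // 2
}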

@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 4, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 4, 2020
@hardikdr hardikdr modified the milestones: v0.35.0, v0.36.0 Oct 10, 2020
@gardener-robot gardener-robot added lifecycle/stale Nobody worked on this for 6 months (will further age) needs/changes Needs (more) changes labels Dec 10, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels May 19, 2021
@prashanth26
Contributor

/close due to inactivity. Please reopen if/when required.

Labels
lifecycle/rotten Nobody worked on this for 12 months (final aging stage)
needs/changes Needs (more) changes
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD)

8 participants