An option to throttle deletion of Failed Machines #482
Conversation
@zuzzas Thank you for your contribution.
I had a first-cut review; overall it looks good, and thanks for the table-based tests. I'll try to test this PR in the middle of next week on our environment.
/assign
Force-pushed from ce81101 to 323d4ca.
pkg/controller/machineset.go
Outdated
```go
if len(staleMachines) > 0 && diff > machineDeletionWindow {
	diff = machineDeletionWindow
}
```
It will block scale-up if .spec.replicas was increased while there were stale machines as well.
Should we instead throttle machine creation by the number of stale machines that were not deleted?
Trying to explain my thought through code and comments below:
Suggested change, replacing the block quoted above:
```go
if len(staleMachines) > 0 && machineDeletionWindow > 0 {
	// Count the leftover stale machines against .spec.replicas to prevent a sudden surge
	// in machines, which can happen if a large number of stale machines are left over.
	diff = diff - (len(staleMachines) - machineDeletionWindow)
	if diff < 0 {
		// Typically diff >= len(staleMachines), but diff can go negative when there are a
		// lot of failed machines and a scale-down happens. E.g.: the allMachines list is 12,
		// failed machines = 10, deletionWindow = 3, and spec.replicas is changed to 6.
		// Active machines = 12-10 = 2, so diff before throttling = -(2-6) = 4, and diff
		// after throttling = 4 - (10-3) = -3.
		// Here we should ideally scale down. It would be better to adjust for the leftover
		// machines before entering the top-level if block.
		return nil
	}
}
```
Similarly, when scaling down (the diff > 0 case), we delete only from the list of activeMachines, whereas we will also be left with some of the staleMachines. We should include the stale machines in the list of machinesToDelete when scaling down, along the lines of the sketch below.
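A minimal sketch of that idea, assuming a helper like getMachinesToDelete that picks diff machines from a list (the name appears later in this thread; Machine and the helper body here are illustrative stand-ins, not the PR's actual code):

```go
// Hypothetical sketch. Machine stands in for the controller's machine type;
// getMachinesToDelete is assumed to select `diff` machines from the list.
type Machine struct{ Name string }

func getMachinesToDelete(machines []*Machine, diff int) []*Machine {
	return machines[:diff] // simplified stand-in for the real selection logic
}

func machinesToDeleteOnScaleDown(active, stale []*Machine, diff int) []*Machine {
	// Pick the usual `diff` machines from the active set...
	toDelete := getMachinesToDelete(active, diff)
	// ...and additionally drain every leftover stale machine, so the
	// scale-down does not leave Failed machines behind.
	return append(toDelete, stale...)
}
```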
Thanks. This and other suggestions should be encoded into unit tests. I'll get right to it.
Agreed, we should recompute the diff, with diff = diff - (len(staleMachines) - machineDeletionWindow), to avoid throttling machine creation.
This is clearly a bit mind-boggling; thanks to @ggaurav10 for catching it and to @zuzzas for implementing.
Thanks. @ggaurav10, can you please take a quick look at this aspect?
I see: now we are recomputing the diff at the start itself, and the new diff considers the leftover stale machines. At first sight it looks good; I'd like to test this aspect separately.
@ggaurav10 @prashanth26 Can you please resolve the comments if you think they are taken care of? Otherwise, it would be great if you could follow up :)
Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
2. Take MachineSet's scale-up into account Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
Force-pushed from cacd376 to e24e0a0.
I'd like to direct the attention of reviewers to the set of test cases. If I have forgotten a case, please remind me of it. I've refactored this part into a table form.
Thanks for moving to the table format; it looks sufficient to me.
pkg/controller/machineset.go
Outdated
```go
diff := len(activeMachines) - int(machineSet.Spec.Replicas)
if staleMachinesLeft > 0 {
	// Count the leftover stale machines against .spec.replicas to prevent sudden surge
	// in machines which can happen if large number of stale machines are left over
	diff -= staleMachinesLeft
}
```
Suggested change: replace diff -= staleMachinesLeft with diff += staleMachinesLeft.
The idea is to count the leftover machines as active machines. In #482 (comment) it was subtracted because there was a diff *= -1 statement a few lines above.
But that won't work, because it'll panic in getMachinesToDelete(), since diff will be bigger than the activeMachines count. It's my fault for moving this calculation outside of the if branch. This condition should suffice.
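To illustrate the panic concern: if the leftover stale machines are simply added to diff, the scale-down branch can ask getMachinesToDelete for more machines than activeMachines holds. A hypothetical guard (a sketch only, not what the PR does) would be:

```go
diff := len(activeMachines) - int(machineSet.Spec.Replicas)
if staleMachinesLeft > 0 {
	// Count leftover stale machines as if they were still active...
	diff += staleMachinesLeft
	// ...but never request more deletions than there are active machines,
	// since getMachinesToDelete panics on an out-of-range diff.
	if diff > len(activeMachines) {
		diff = len(activeMachines)
	}
}
```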
pkg/controller/machineset.go
Outdated
```go
diff := len(activeMachines) - int(machineSet.Spec.Replicas)
if staleMachinesLeft > 0 {
	// Count the leftover stale machines against .spec.replicas to prevent sudden surge
	// in machines which can happen if large number of stale machines are left over
	diff -= staleMachinesLeft
}
```
If diff is already negative, the further subtraction will increase the number of machines to be created. This step should either be moved below, into the if block, or handled in some other way.
A simple example from Gaurav: active machines = 12, spec = 15, deleted = 2, left over = 1; we should now be creating only 2 machines, and not 4.
@zuzzas Can you please take care of this part?
Credits: @ggaurav10.
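Spelling out the arithmetic with the values from Gaurav's example (all numbers are taken from the comment above; variable names are illustrative):

```go
package main

import "fmt"

func main() {
	// Values from the review comment: active machines = 12, spec = 15,
	// two stale machines already deleted, one stale machine left over.
	active, replicas, staleLeft := 12, 15, 1

	buggy := (active - replicas) - staleLeft // -4, i.e. 4 machines created
	fixed := (active - replicas) + staleLeft // -2, i.e. 2 machines created
	fmt.Println(-buggy, "vs", -fixed)        // prints: 4 vs 2
}
```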
Taken care of.
Co-authored-by: Gaurav Gupta <gaurav.gupta07@sap.com>
Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
pkg/controller/machineset.go
Outdated
```go
// clamp machine creation to the count of recently deleted stale machines, so that we don't overshoot
if staleMachinesLeft > 0 && diff > deletedStaleMachines {
	diff = deletedStaleMachines
}
```
Hm, this seems to break the scale-up situation.
E.g. total machines are 10, the desired count is 13, stale machines are 4, and the threshold is 50%. Then I would want to create 2+3=5 machines, and not 2.
I have opened flant#1 with help from @ggaurav10, where I have tried to convey the core idea.
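Plugging the comment's numbers in shows the under-provisioning (a self-contained restatement under the stated assumptions; variable names are illustrative):

```go
package main

import "fmt"

func main() {
	// Numbers from the comment: 10 machines total, 4 of them stale (so 6
	// active), desired replicas = 13, 50% threshold: 2 stale machines are
	// deleted now, leaving staleMachinesLeft = 2, deletedStaleMachines = 2.
	active, replicas, staleLeft, deletedStale := 6, 13, 2, 2

	// Counting leftover stale machines as active: create 5 machines
	// (2 replacements for the deleted stale ones + 3 for the scale-up).
	creations := -((active - replicas) + staleLeft)

	// The clamp under discussion caps creations at deletedStaleMachines,
	// which under-provisions by 3 machines in this scenario.
	if creations > deletedStale {
		creations = deletedStale
	}
	fmt.Println(creations) // prints 2, while 5 is desired
}
```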
Trying to put down the overall approach for the feature, from what we have learned so far.
Most of the above requirements have already been taken care of in the PR.
/close due to inactivity. Please reopen if/when required.
What this PR does / why we need it:
This PR adds the ability to throttle deletion of Failed Machines. Why? It's time for a story!
Imagine a Kubernetes cluster, custom-built for a customer. Virtual machines are created via MCM; masters are static. A tragically under-tested infrastructure platform release severs the routing between Master and Worker Nodes through improperly configured routing tables in a CNI provider.
After losing connection, Machines transition to the Unknown state and, after a timeout, to the Failed state, and then MCM deletes them all. All of them, at the same time. The recovery time of 80 Nodes in a vSphere environment is abysmal: disks are slow to detach and attach, and the hypervisors become overloaded by the many new machines springing to life.
Moral of the story: if Nodes become detached from the Master, the workload does not immediately falter, so it's not healthy to remove all Machines from the cluster the moment they become unavailable. They can still work well, since properly written microservices cache Service Discovery and other information upon startup.
Special notes for your reviewer:
If you agree with the overall approach, I'd like to direct your attention to the new table tests. Please, verify the exhaustiveness of the provided test cases.
Release note: