Controlled Eviction of Pods during Cluster Rollout #470
Using pause/unpause is one option. There could be others (e.g. slowing down the machine rollout by dynamically tweaking various timeout configurations). But I think the more important part is to have a decoupling between the component that detects the issue and the component that takes corrective action. Having the same component detect and take action has two limitations.
I think it is better to decouple the issue detection and action taking into separate components (as I mentioned in the yet-to-be-reopened gardener/gardener#1797). We might want to slow down/pause machine rollouts for more reasons than just evicted pods taking too long to come up elsewhere. Also, someone other than MCM (gardenlet for starters) might be interested in the fact that there are infra issues in a shoot/seed.
Thanks @hardikdr @amshuman-kr. As a quick remedy, would it make sense to already change the behaviour (from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr, haven't you reported (somewhere else) that you saw erratic behaviour, so it wasn't possible to even do that properly, right?
Also, overall it sounds good to have separate components for detection and for taking action. Can you please elaborate a little more on where you think they could be hosted later?
@hardikdr I fixed the link above.
@vlerenc Please bear with me if I have already said this before. But there are two parts to the eviction and waiting.
@hardikdr @prashanth26 Apart from this, are there any conflicts in timeout values for drain and machine rollout?
@amshuman-kr All good, that's what I meant. I wrote my question before the out-of-band sync, I believe. As for the reason for this erratic behaviour: I tend to believe it's not an IaaS issue (that would surprise me), but an update problem/race condition in K8s. Anyhow, during the out-of-band sync we discussed whether we can consider a volume detached only if the detached status was observed for a given length of time (e.g. 5s or whatever you have observed as "stable"). Would it then make sense to include this check earlier? As for a low number of volumes, we have seen that Azure does not cope well/is very slow even with a few volumes. We ran into this practically whenever we touched an Azure cluster. Then again, it depends on how often/when we do that.
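To make the "detached and stable for a given length of time" idea a bit more concrete, here is a minimal Go sketch. It is not MCM's actual drain code; the isAttachedFunc abstraction and the function name are assumptions for illustration only. The point is simply that any re-attachment observation resets the stability window, which guards against the flapping status described above.

```go
package drain

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// isAttachedFunc reports whether the volume is currently attached to the node.
// In MCM this would be backed by a lookup of the node/volume status; here it
// is deliberately left abstract (hypothetical helper).
type isAttachedFunc func(ctx context.Context) (bool, error)

// waitForStableDetach returns nil only once the volume has been observed as
// detached continuously for at least stableFor, polling every interval.
func waitForStableDetach(ctx context.Context, isAttached isAttachedFunc, interval, stableFor time.Duration) error {
	var detachedSince *time.Time
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		attached, err := isAttached(ctx)
		if err != nil {
			return fmt.Errorf("checking attachment: %w", err)
		}
		switch {
		case attached:
			// Any observation of re-attachment resets the stability window.
			detachedSince = nil
		case detachedSince == nil:
			now := time.Now()
			detachedSince = &now
		case time.Since(*detachedSince) >= stableFor:
			return nil // detached and stable long enough, e.g. 5s
		}

		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for stable volume detachment")
		case <-ticker.C:
		}
	}
}
```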
/priority critical
See also gardener/gardener#87 for a similar problem on the Gardener level. We should develop something like a scoring function that tells us whether we have destabilised a landscape or shoot with an update or not (beyond a certain threshold). Threshold, because we cannot expect all shoots/pods to come up again (what this even means is not clear), so there needs to be some fuzziness. For example: "the number of pods that are not running (pending, crash-looping, container-creating, etc.) hasn't gone up by more than 30% of the number of pods that were drained from the affected nodes". Only then do we continue; otherwise we brace ourselves (pause) and watch whether the situation improves. It's more complicated than that (pods are not guaranteed to come up in the same number or may fail for many other reasons), but I believe we need to develop this notion/formula. It's also what we as human operators do. We don't expect all shoots/pods to be running again after an update, but we watch the tendency, and in case of issues we then intervene, pause, and analyse what's going on. Automation can't analyse, but it can pause and alert.
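As a rough illustration of such a scoring function: the 30% threshold and the pod-state buckets are taken from the comment above, while the type and function names below are purely hypothetical, not an existing MCM or Gardener API.

```go
package rollout

// RolloutHealth captures the counts needed for the fuzzy
// "did we destabilise the shoot" check sketched above.
type RolloutHealth struct {
	NotRunningBefore int // pods not running before the rollout step (pending, crash-looping, container-creating, ...)
	NotRunningNow    int // pods not running at the moment of evaluation
	DrainedPods      int // pods that were drained from the affected nodes
}

// ShouldPause returns true if the number of not-running pods has grown by more
// than the given fraction (e.g. 0.3 for 30%) of the drained pods, i.e. the
// rollout should be paused and an operator alerted rather than continuing.
func (h RolloutHealth) ShouldPause(threshold float64) bool {
	if h.DrainedPods == 0 {
		return false // nothing was drained, so nothing to attribute to the rollout
	}
	increase := h.NotRunningNow - h.NotRunningBefore
	return float64(increase) > threshold*float64(h.DrainedPods)
}
```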
Just cross-linking gardener/gardener#87 (comment) here. It is just a limited point about how to propagate the information calculated by MCM, because it might be relevant at higher levels (extensions, gardenlet).
Summary of grooming discussion: Portions of this have been addressed through #561, but we still need to deal with scenarios where we end up with a lot of unschedulable pods during rollout and would wish to dynamically slow down the drain on noticing such a case.
What would you like to be added:
During rolling-updates, we decrease the replicas of the old machine-set and gradually increase the ones in the new machine-set, respecting maxSurge/maxUnavailable. Currently, when the newer machines join, we don't wait till either of the following happens:

This behavior can pose certain issues for infrastructures with slow volume-detachments, where in-flight workload may pile up over time. Considerable issues could be:
Probable solution:
A probable solution to tackle such a situation could be to control the flow of machines being drained. MCM already supports a pause field, which allows us to pause the on-going rolling-update.

We could think of a small sub-controller in MCM which does the following (see the sketch after this list):
- Pause the machine-deployment if the count goes beyond a certain configurable number.
- Unpause the machine-deployment when the count goes below the threshold number.

Open for further discussions around possible ideas.
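A very rough sketch of what such a sub-controller loop could look like, under stated assumptions: the MachineDeploymentPauser interface, the count callback (e.g. counting unschedulable pods), and all names below are illustrative placeholders, not MCM's actual API; in MCM the pause/unpause calls would translate to toggling the existing pause mechanism mentioned above.

```go
package rolloutpause

import (
	"context"
	"log"
	"time"
)

// MachineDeploymentPauser abstracts the ability to pause/unpause a rolling
// update of a machine-deployment (hypothetical interface for this sketch).
type MachineDeploymentPauser interface {
	Pause(ctx context.Context, machineDeployment string) error
	Unpause(ctx context.Context, machineDeployment string) error
}

// ReconcilePauseState pauses the rollout when the observed count (e.g. of
// unschedulable pods) exceeds the configured threshold, and unpauses it once
// the count drops back to or below the threshold.
func ReconcilePauseState(ctx context.Context, p MachineDeploymentPauser, machineDeployment string,
	count func(ctx context.Context) (int, error), threshold int, interval time.Duration) {

	paused := false
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		n, err := count(ctx)
		if err != nil {
			log.Printf("could not determine count: %v", err)
			continue
		}

		switch {
		case n > threshold && !paused:
			if err := p.Pause(ctx, machineDeployment); err != nil {
				log.Printf("pausing %s failed: %v", machineDeployment, err)
				continue
			}
			paused = true
		case n <= threshold && paused:
			if err := p.Unpause(ctx, machineDeployment); err != nil {
				log.Printf("unpausing %s failed: %v", machineDeployment, err)
				continue
			}
			paused = false
		}
	}
}
```

The loop is deliberately level-triggered rather than edge-triggered: it re-reads the count on every tick and only issues a Pause/Unpause call when the desired state differs from the last known one, so transient errors simply get retried on the next tick.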