Controlled Eviction of Pods during Cluster Rollout #470

Open
hardikdr opened this issue Jun 8, 2020 · 11 comments
Labels
  • area/high-availability: High availability related
  • kind/enhancement: Enhancement, improvement, extension
  • lifecycle/rotten: Nobody worked on this for 12 months (final aging stage)
  • needs/planning: Needs (more) planning with other MCM maintainers
  • priority/4: Priority (lower number equals higher priority)

Comments

@hardikdr (Member) commented Jun 8, 2020

What would you like to be added:
During rolling updates, we decrease the replicas of the old MachineSet and gradually increase those of the new MachineSet, respecting maxSurge/maxUnavailable (a minimal sketch of these settings follows the list below). Currently, when the newer machines join, we do not wait until both of the following have happened:

  • the older machines are completely terminated, and
  • the workload on them has been evicted and is placed/running successfully on the newer machines.
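For concreteness, here is a minimal sketch of how these rollout knobs might be expressed against MCM's machine/v1alpha1 types. The type and field names mirror the upstream Deployment API as I recall them from the MCM package and should be verified against the actual code; the values are purely illustrative.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"

	// Assumed import path of MCM's machine API types; verify before use.
	v1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

func main() {
	// Allow at most one surplus machine during the rollout and zero unavailable machines.
	maxSurge := intstr.FromInt(1)
	maxUnavailable := intstr.FromInt(0)

	strategy := v1alpha1.MachineDeploymentStrategy{
		Type: v1alpha1.RollingUpdateMachineDeploymentStrategyType,
		RollingUpdate: &v1alpha1.RollingUpdateMachineDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
	fmt.Printf("rolling-update strategy: %+v\n", strategy)
}
```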

This behavior can pose issues for infrastructures with slow volume detachments, where in-flight workload may pile up over time. Notable issues include:

  • Many critical pods may be unavailable at the same time due to slow volume detachments.
  • Cloud API rate limits may be hit because the CCM constantly retries the operations.

Probable solution:
One way to tackle such a situation would be to control the flow of machines being drained. MCM already supports a pause field, which allows us to pause an ongoing rolling update.

We could think of a small sub-controller in MCM that does the following (a rough sketch follows the list):

  • Check the number of pods stuck in a certain condition, indicating slow infra operations.
  • Pause the machine deployment if the count goes beyond a certain configurable number.
  • Unpause the machine deployment when the count drops back below the threshold.
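A rough sketch of such a sub-controller loop, assuming client-go plus MCM's generated clientset. The stuck-pod heuristic, the 5-minute grace period, the threshold wiring, and the clientset import path are all assumptions for illustration, not existing MCM code.

```go
package rolloutguard

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"

	// Import path of MCM's generated clientset as I recall it; verify before use.
	mcmclientset "github.com/gardener/machine-controller-manager/pkg/client/clientset/versioned"
)

// reconcile is a sketch of the proposed sub-controller: count pods that look stuck,
// pause the MachineDeployment above a configurable threshold, unpause below it.
// The "Pending for more than 5 minutes" heuristic is purely illustrative.
func reconcile(ctx context.Context, kube kubernetes.Interface, mcm mcmclientset.Interface,
	namespace, mdName string, pauseThreshold int) error {

	// Pods that have not been placed yet are a rough proxy for workload that could
	// not be rescheduled after eviction (e.g. due to slow volume detachments).
	pods, err := kube.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		return err
	}

	stuck := 0
	for _, p := range pods.Items {
		if time.Since(p.CreationTimestamp.Time) > 5*time.Minute {
			stuck++
		}
	}

	// Toggle spec.paused on the MachineDeployment based on the threshold.
	paused := stuck > pauseThreshold
	patch := []byte(fmt.Sprintf(`{"spec":{"paused":%t}}`, paused))
	_, err = mcm.MachineV1alpha1().MachineDeployments(namespace).Patch(
		ctx, mdName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```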

Open for further discussion around possible ideas.

@hardikdr added the kind/enhancement label on Jun 8, 2020

@amshuman-kr commented Jun 9, 2020

Using pause/unpause is one option. There could be others (e.g. slowing down the machine rollout by dynamically tweaking the various timeout configurations).

But I think the more important part is to decouple the component that detects the issue from the component that takes corrective action. Having the same component both detect and act has two limitations.

  1. For a given issue/condition, it restricts us to a small, fixed set of actions hard-coded into the component. Other components cannot take any action unless they also detect the issue themselves.
  2. For a given set of fixed actions, the set of issues/conditions that trigger those actions is also fixed/hard-coded into the component.

I think it is better to decouple the issue detection and action taking into separate components (as I mentioned in the yet-to-be-reopened gardener/gardener#1797).

We might want to slow down/pause machine rollouts for more reasons than just evicted pods taking too long to come up elsewhere. Also, someone other than MCM (gardenlet for starters) might be interested in the fact that there are infra issues in a shoot/seed.
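Purely as an illustration of this decoupling idea (everything below is hypothetical and not part of MCM's API), the detector could limit itself to publishing what it observed, e.g. as a condition on the MachineDeployment status or on a dedicated resource, and leave it to any interested actor (MCM itself, gardenlet, an extension) to decide independently what to do about it:

```go
package detector

// ConditionStatus and RolloutCondition are hypothetical shapes, sketched here only
// to show the separation between observing and acting.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// RolloutCondition is an action-free observation about the rollout.
type RolloutCondition struct {
	Type    string // e.g. "RolloutDegraded"
	Status  ConditionStatus
	Reason  string // e.g. "SlowVolumeDetach"
	Message string
}

// Detector only publishes observations.
type Detector interface {
	Observe() []RolloutCondition
}

// Actor consumes observations; different actors can react differently to the same
// condition (pause the rollout, slow it down, raise an alert, ...).
type Actor interface {
	React(conditions []RolloutCondition) error
}
```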

@vlerenc (Member) commented Jun 9, 2020

Thanks @hardikdr @amshuman-kr. As a quick remedy, would it make sense to already change the behaviour (away from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr, haven't you reported (somewhere else) that you saw erratic behaviour, so it wasn't possible to even do that properly?

@hardikdr (Member, Author)

> as I mentioned in the yet-to-be-reopened #1797
@amshuman-kr the link is not clickable, can you please point to the correct one?

Also, overall it sounds good to have separate components for detection and for taking action. Can you please elaborate a little more on where you think they could be hosted later?

@amshuman-kr

@hardikdr I fixed the link above.

@amshuman-kr

> As a quick remedy, would it make sense to already change the behaviour (away from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr, haven't you reported (somewhere else) that you saw erratic behaviour, so it wasn't possible to even do that properly?

@vlerenc Please bear with me if I have already said this before. But there are two parts to the eviction and waiting.

  1. Pod eviction itself and waiting for the pod to terminate. This is already handled: MCM waits for the terminationGracePeriodSeconds specified in the pod itself.
  2. Waiting for the volume to be detached. This is problematic. As reported in #468 ("Erratic serialized drain if there are a large number of volumes attached per node"), the volume being detached goes in and out of node.status.volumesAttached a few times before going away for good. Perhaps this is a race condition in the KCM/kubelet; if so, we could think about contributing a fix. Or, worse, it might be the behaviour of the infra API itself; then there is no point in contributing, and we can think of doing a few retries at fixed intervals to account for this behaviour (see the sketch after this list). But for a small number of disks per node (say 10-15?), I would expect drain to happen normally during a machine rollout already today, since there is already a check on node.status.volumesAttached.
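A small sketch of the kind of check referred to in point 2, using client-go's core/v1 types: whether a given volume still appears in node.status.volumesAttached. The package name is illustrative; this is not a quote of MCM's actual drain code.

```go
package drainutil

import corev1 "k8s.io/api/core/v1"

// isVolumeAttached reports whether the given volume still shows up in
// node.status.volumesAttached. Because of the flapping reported in #468, a caller
// would want to re-check this a few times at fixed intervals before trusting it.
func isVolumeAttached(node *corev1.Node, volumeName corev1.UniqueVolumeName) bool {
	for _, attached := range node.Status.VolumesAttached {
		if attached.Name == volumeName {
			return true
		}
	}
	return false
}
```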

@hardikdr @prashanth26 Apart from this, are there any conflicts between the timeout values for drain and for machine rollout?

@vlerenc (Member) commented Jun 13, 2020

@amshuman-kr All good, that's what I meant. I wrote my question before the out-of-band sync, I believe. As for the reason for this erratic behaviour: I tend to believe it's not an IaaS issue (that would surprise me), but an update problem/race condition in K8s. Anyhow, during the out-of-band sync we discussed whether we can consider a volume detached only if the detached status was observed for a given length of time (e.g. 5s, or whatever you have observed as "stable").

Would it then make sense to include this check earlier? As for a low number of volumes, we have seen that Azure does not cope well/is very slow even with a few volumes. We ran into this practically whenever we touched an Azure cluster. Then again, it depends on how often and when we do that.
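A sketch of the "observed as stable" idea, assuming client-go and the isVolumeAttached helper from the earlier sketch. The 1s poll interval and the caller-supplied window (e.g. 5s) are illustrative; this is not existing MCM drain logic.

```go
package drainutil

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForStableDetach returns once the volume has stayed out of
// node.status.volumesAttached for a full stability window, resetting the clock
// whenever the attachment flaps back in (as reported in #468).
func waitForStableDetach(ctx context.Context, kube kubernetes.Interface,
	nodeName string, volumeName corev1.UniqueVolumeName, window time.Duration) error {

	var detachedSince time.Time
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for {
		node, err := kube.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		switch {
		case isVolumeAttached(node, volumeName):
			detachedSince = time.Time{} // attachment flapped back in; restart the window
		case detachedSince.IsZero():
			detachedSince = time.Now()
		case time.Since(detachedSince) >= window:
			return nil // stably detached for the whole window
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```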

@gardener-robot added the lifecycle/stale label (Nobody worked on this for 6 months, will further age) on Sep 19, 2020
@hardikdr (Member, Author)

/priority critical

@gardener-robot added the priority/critical label (Needs to be resolved soon, because it impacts users negatively) on Nov 13, 2020
@vlerenc (Member) commented Nov 14, 2020

See also gardener/gardener#87 for a similar problem on Gardener level.

We should develop something like a scoring function that tells us whether we have destabilised a landscape or shoot with an update beyond a certain threshold. Threshold, because we cannot expect all shoots/pods to come up again (what this even means is not clear), so there needs to be some fuzziness. For example: "the number of pods that are not running (pending, crash-looping, container-creating, etc.) hasn't gone up by more than 30% of the number of pods that were drained from the affected nodes". Only then do we continue; otherwise we brace ourselves (pause) and watch whether the situation improves.

It's more complicated than that (pods are not guaranteed to come up in the same number and may fail for many other reasons), but I believe we need to develop this notion/formula. It's also what we as human operators do. We don't expect all shoots/pods to be running again after an update, but we watch the tendency, and in case of issues we intervene, pause, and analyse what's going on. Automation can't analyse, but it can pause and alert.
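For illustration, the 30% example above boils down to a simple predicate; the function and parameter names are made up, and the hard part is choosing and measuring the inputs, not the arithmetic.

```go
package rolloutguard

// rolloutDestabilised returns true if the number of not-running pods (pending,
// crash-looping, container-creating, ...) has grown by more than the given fraction
// of the pods that were drained from the affected nodes.
func rolloutDestabilised(notRunningBefore, notRunningNow, drainedPods int, threshold float64) bool {
	if drainedPods == 0 {
		return false
	}
	increase := notRunningNow - notRunningBefore
	return float64(increase) > threshold*float64(drainedPods)
}
```

For example, with 40 drained pods and a threshold of 0.3, a rise of the not-running count from 5 to 20 (an increase of 15 > 12) would trigger a pause.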

@vlerenc changed the title from "Controlled eviction of pods during cluster-rollout." to "Controlled Eviction of Pods during Cluster Rollout" on Nov 17, 2020
@amshuman-kr

Just cross-linking gardener/gardener#87 (comment) here. It is a limited point about how to propagate the information calculated by MCM, because it might be relevant at higher levels (extensions, gardenlet).

@gardener-robot added the priority/2 label and removed the priority/critical label on Mar 8, 2021
@prashanth26 added the priority/4 and area/high-availability labels and removed the priority/2 label on Jul 21, 2021
@gardener-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 18, 2022
@himanshu-kun (Contributor)

Summary of grooming discussion

Portions of this have been addressed through #561, but we still need to deal with scenarios where we end up with a lot of unschedulable pods during a rollout and would want to dynamically slow down the drain when we notice such a case.

@himanshu-kun added the needs/planning label and removed the lifecycle/rotten label on Feb 21, 2023
@gardener-robot added the lifecycle/stale label on Oct 31, 2023
@gardener-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 9, 2024