Controlled Eviction of Pods during Cluster Rollout #470

Open
hardikdr opened this issue Jun 8, 2020 · 11 comments
Labels
  • area/high-availability: High availability related
  • kind/enhancement: Enhancement, improvement, extension
  • lifecycle/rotten: Nobody worked on this for 12 months (final aging stage)
  • needs/planning: Needs (more) planning with other MCM maintainers
  • priority/4: Priority (lower number equals higher priority)

Comments

@hardikdr (Member) commented Jun 8, 2020

What would you like to be added:
During rolling updates, we decrease the replicas of the old MachineSet and gradually increase those of the new MachineSet, respecting maxSurge/maxUnavailable (a minimal sketch of these settings follows the list below). Currently, when the newer machines join, we do not wait until both of the following have happened:

  • the older machines are completely terminated, and
  • the workload on them has been evicted and is placed/running successfully on the newer machines.
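For concreteness, here is a minimal sketch of how these rollout knobs might be expressed against MCM's machine/v1alpha1 types. The type and field names mirror the upstream Deployment API as I recall them from the MCM package and should be verified against the actual code; the values are purely illustrative.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"

	// Assumed import path of MCM's machine API types; verify before use.
	v1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

func main() {
	// Allow at most one surplus machine during the rollout and zero unavailable machines.
	maxSurge := intstr.FromInt(1)
	maxUnavailable := intstr.FromInt(0)

	strategy := v1alpha1.MachineDeploymentStrategy{
		Type: v1alpha1.RollingUpdateMachineDeploymentStrategyType,
		RollingUpdate: &v1alpha1.RollingUpdateMachineDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
	fmt.Printf("rolling-update strategy: %+v\n", strategy)
}
```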

This behavior can pose issues for infrastructures with slow volume detachments, where in-flight workload may pile up over time. Notable issues include:

  • Many critical pods may be unavailable at the same time due to slow volume detachments.
  • Cloud API rate limits may be hit because the CCM constantly retries the operations.

Probable solution:
One way to tackle such a situation would be to control the flow of machines being drained. MCM already supports a pause field, which allows us to pause an ongoing rolling update.

We could think of a small sub-controller in MCM that does the following (a rough sketch follows the list):

  • Check the number of pods stuck in a certain condition, indicating slow infra operations.
  • Pause the machine deployment if the count goes beyond a certain configurable number.
  • Unpause the machine deployment when the count drops back below the threshold.
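A rough sketch of such a sub-controller loop, assuming client-go plus MCM's generated clientset. The stuck-pod heuristic, the 5-minute grace period, the threshold wiring, and the clientset import path are all assumptions for illustration, not existing MCM code.

```go
package rolloutguard

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"

	// Import path of MCM's generated clientset as I recall it; verify before use.
	mcmclientset "github.com/gardener/machine-controller-manager/pkg/client/clientset/versioned"
)

// reconcile is a sketch of the proposed sub-controller: count pods that look stuck,
// pause the MachineDeployment above a configurable threshold, unpause below it.
// The "Pending for more than 5 minutes" heuristic is purely illustrative.
func reconcile(ctx context.Context, kube kubernetes.Interface, mcm mcmclientset.Interface,
	namespace, mdName string, pauseThreshold int) error {

	// Pods that have not been placed yet are a rough proxy for workload that could
	// not be rescheduled after eviction (e.g. due to slow volume detachments).
	pods, err := kube.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		return err
	}

	stuck := 0
	for _, p := range pods.Items {
		if time.Since(p.CreationTimestamp.Time) > 5*time.Minute {
			stuck++
		}
	}

	// Toggle spec.paused on the MachineDeployment based on the threshold.
	paused := stuck > pauseThreshold
	patch := []byte(fmt.Sprintf(`{"spec":{"paused":%t}}`, paused))
	_, err = mcm.MachineV1alpha1().MachineDeployments(namespace).Patch(
		ctx, mdName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```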

Open for further discussion around possible ideas.

@hardikdr added the kind/enhancement label on Jun 8, 2020

@amshuman-kr commented Jun 9, 2020

Using pause/unpause is one option. There could be others (e.g. slowing down the machine rollout by dynamically tweaking the various timeout configurations).

But I think the more important part is to decouple the component that detects the issue from the component that takes corrective action. Having the same component both detect and act has two limitations.

  1. For a given issue/condition, it restricts us to a small, fixed set of actions hard-coded into the component. Other components cannot take any action unless they also detect the issue themselves.
  2. For a given set of fixed actions, the set of issues/conditions that trigger those actions is also fixed/hard-coded into the component.

I think it is better to decouple the issue detection and action taking into separate components (as I mentioned in the yet-to-be-reopened gardener/gardener#1797).

We might want to slow down/pause machine rollouts for more reasons than just evicted pods taking too long to come up elsewhere. Also, someone other than MCM (gardenlet for starters) might be interested in the fact that there are infra issues in a shoot/seed.
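Purely as an illustration of this decoupling idea (everything below is hypothetical and not part of MCM's API), the detector could limit itself to publishing what it observed, e.g. as a condition on the MachineDeployment status or on a dedicated resource, and leave it to any interested actor (MCM itself, gardenlet, an extension) to decide independently what to do about it:

```go
package detector

// ConditionStatus and RolloutCondition are hypothetical shapes, sketched here only
// to show the separation between observing and acting.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// RolloutCondition is an action-free observation about the rollout.
type RolloutCondition struct {
	Type    string // e.g. "RolloutDegraded"
	Status  ConditionStatus
	Reason  string // e.g. "SlowVolumeDetach"
	Message string
}

// Detector only publishes observations.
type Detector interface {
	Observe() []RolloutCondition
}

// Actor consumes observations; different actors can react differently to the same
// condition (pause the rollout, slow it down, raise an alert, ...).
type Actor interface {
	React(conditions []RolloutCondition) error
}
```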

@vlerenc (Member) commented Jun 9, 2020

Thanks @hardikdr @amshuman-kr. As a quick remedy, would it make sense to already change the behaviour (away from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr, haven't you reported (somewhere else) that you saw erratic behaviour, so it wasn't possible to even do that properly?

@hardikdr (Member, Author)

> as I mentioned in the yet-to-be-reopened #1797
@amshuman-kr the link is not clickable, can you please point to the correct one?

Also, overall it sounds good to have separate components for detection and for taking action. Can you please elaborate a little more on where you think they could be hosted later?

@amshuman-kr

@hardikdr I fixed the link above.

@amshuman-kr

> As a quick remedy, would it make sense to already change the behaviour (away from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr, haven't you reported (somewhere else) that you saw erratic behaviour, so it wasn't possible to even do that properly?

@vlerenc Please bear with me if I have already said this before. But there are two parts to the eviction and waiting.

  1. Pod eviction itself and waiting for the pod to terminate. This is already handled: MCM waits for the terminationGracePeriodSeconds specified in the pod itself.
  2. Waiting for the volume to be detached. This is problematic. As reported in #468 ("Erratic serialized drain if there are a large number of volumes attached per node"), the volume being detached goes in and out of node.status.volumesAttached a few times before going away for good. Perhaps this is a race condition in the KCM/kubelet; if so, we could think about contributing a fix. Or, worse, it might be the behaviour of the infra API itself; then there is no point in contributing, and we can think of doing a few retries at fixed intervals to account for this behaviour (see the sketch after this list). But for a small number of disks per node (say 10-15?), I would expect drain to happen normally during a machine rollout already today, since there is already a check on node.status.volumesAttached.
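A small sketch of the kind of check referred to in point 2, using client-go's core/v1 types: whether a given volume still appears in node.status.volumesAttached. The package name is illustrative; this is not a quote of MCM's actual drain code.

```go
package drainutil

import corev1 "k8s.io/api/core/v1"

// isVolumeAttached reports whether the given volume still shows up in
// node.status.volumesAttached. Because of the flapping reported in #468, a caller
// would want to re-check this a few times at fixed intervals before trusting it.
func isVolumeAttached(node *corev1.Node, volumeName corev1.UniqueVolumeName) bool {
	for _, attached := range node.Status.VolumesAttached {
		if attached.Name == volumeName {
			return true
		}
	}
	return false
}
```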

@hardikdr @prashanth26 Apart from this, are there any conflicts between the timeout values for drain and for machine rollout?

@vlerenc (Member) commented Jun 13, 2020

@amshuman-kr All good, that's what I meant. I wrote my question before the out-of-band sync, I believe. As for the reason for this erratic behaviour: I tend to believe it's not an IaaS issue (that would surprise me), but an update problem/race condition in K8s. Anyhow, during the out-of-band sync we discussed whether we can consider a volume detached only if the detached status was observed for a given length of time (e.g. 5s, or whatever you have observed as "stable").

Would it then make sense to include this check earlier? As for a low number of volumes, we have seen that Azure does not cope well/is very slow even with a few volumes. We ran into this practically whenever we touched an Azure cluster. Then again, it depends on how often and when we do that.
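A sketch of the "observed as stable" idea, assuming client-go and the isVolumeAttached helper from the earlier sketch. The 1s poll interval and the caller-supplied window (e.g. 5s) are illustrative; this is not existing MCM drain logic.

```go
package drainutil

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForStableDetach returns once the volume has stayed out of
// node.status.volumesAttached for a full stability window, resetting the clock
// whenever the attachment flaps back in (as reported in #468).
func waitForStableDetach(ctx context.Context, kube kubernetes.Interface,
	nodeName string, volumeName corev1.UniqueVolumeName, window time.Duration) error {

	var detachedSince time.Time
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for {
		node, err := kube.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		switch {
		case isVolumeAttached(node, volumeName):
			detachedSince = time.Time{} // attachment flapped back in; restart the window
		case detachedSince.IsZero():
			detachedSince = time.Now()
		case time.Since(detachedSince) >= window:
			return nil // stably detached for the whole window
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```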

@gardener-robot added the lifecycle/stale label (Nobody worked on this for 6 months, will further age) on Sep 19, 2020
@hardikdr (Member, Author)

/priority critical

@gardener-robot added the priority/critical label (Needs to be resolved soon, because it impacts users negatively) on Nov 13, 2020
@vlerenc (Member) commented Nov 14, 2020

See also gardener/gardener#87 for a similar problem on Gardener level.

We should develop something like a scoring function that tells us whether we have destabilised a landscape or shoot with an update beyond a certain threshold. Threshold, because we cannot expect all shoots/pods to come up again (what this even means is not clear), so there needs to be some fuzziness. For example: "the number of pods that are not running (pending, crash-looping, container-creating, etc.) hasn't gone up by more than 30% of the number of pods that were drained from the affected nodes". Only then do we continue; otherwise we brace ourselves (pause) and watch whether the situation improves.

It's more complicated than that (pods are not guaranteed to come up in the same number and may fail for many other reasons), but I believe we need to develop this notion/formula. It's also what we as human operators do. We don't expect all shoots/pods to be running again after an update, but we watch the tendency, and in case of issues we intervene, pause, and analyse what's going on. Automation can't analyse, but it can pause and alert.
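For illustration, the 30% example above boils down to a simple predicate; the function and parameter names are made up, and the hard part is choosing and measuring the inputs, not the arithmetic.

```go
package rolloutguard

// rolloutDestabilised returns true if the number of not-running pods (pending,
// crash-looping, container-creating, ...) has grown by more than the given fraction
// of the pods that were drained from the affected nodes.
func rolloutDestabilised(notRunningBefore, notRunningNow, drainedPods int, threshold float64) bool {
	if drainedPods == 0 {
		return false
	}
	increase := notRunningNow - notRunningBefore
	return float64(increase) > threshold*float64(drainedPods)
}
```

For example, with 40 drained pods and a threshold of 0.3, a rise of the not-running count from 5 to 20 (an increase of 15 > 12) would trigger a pause.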

@vlerenc changed the title from "Controlled eviction of pods during cluster-rollout." to "Controlled Eviction of Pods during Cluster Rollout" on Nov 17, 2020
@amshuman-kr

Just cross-linking gardener/gardener#87 (comment) here. It is a limited point about how to propagate the information calculated by MCM, because it might be relevant at higher levels (extensions, gardenlet).

@gardener-robot added the priority/2 label and removed the priority/critical label on Mar 8, 2021
@prashanth26 added the priority/4 and area/high-availability labels and removed the priority/2 label on Jul 21, 2021
@gardener-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 18, 2022
@himanshu-kun (Contributor)

Summary of grooming discussion

Portions of this have been addressed through #561, but we still need to deal with scenarios where we end up with a lot of unschedulable pods during a rollout and would want to dynamically slow down the drain when we notice such a case.

@himanshu-kun added the needs/planning label and removed the lifecycle/rotten label on Feb 21, 2023
@gardener-robot added the lifecycle/stale label on Oct 31, 2023
@gardener-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 9, 2024