
deschedule pods that fail to start or restart too often #62

Closed
kabakaev opened this issue Dec 6, 2017 · 44 comments · Fixed by #393
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@kabakaev

kabakaev commented Dec 6, 2017

It is not uncommon for pods to get scheduled on nodes that are unable to start them.
For example, a node may have network issues and be unable to mount a networked persistent volume, or it may be unable to pull a Docker image, or it may have a Docker configuration issue that only shows up on container startup.

Another common issue is a container that gets restarted by its liveness probe because of some local node problem (e.g. a wrong routing table, slow storage, network latency or packet drops). In that case, the pod is unhealthy most of the time and hangs in a restart loop forever, with no chance of being migrated to another node.

As of now, there is no way to reschedule pods with faulty containers. It may be helpful to introduce two new strategies (a hypothetical policy sketch follows the list):

  • container-restart-rate: re-schedule a pod if it has been unhealthy for $notReadyPeriod seconds and one of its containers was restarted $maxRestartCount times.
  • pod-startup-failure: re-schedule a pod that was scheduled on a node but has been unable to start all of its containers within $maxStartupTime seconds.
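
A hypothetical policy sketch of what these two strategies could look like, borrowing the DeschedulerPolicy format used later in this thread. Neither strategy nor any of the parameter names below exists in the descheduler; this only illustrates the requested behaviour:

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "ContainerRestartRate":               # hypothetical strategy name
     enabled: true
     params:
        notReadyPeriodSeconds: 600      # pod has been unhealthy for at least this long
        maxRestartCount: 5              # and one of its containers restarted at least this many times
  "PodStartupFailure":                  # hypothetical strategy name
     enabled: true
     params:
        maxStartupTimeSeconds: 300      # pod failed to start all of its containers within this window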

A similar issue is filed against Kubernetes: kubernetes/kubernetes#13385

@ravisantoshgudimetla
Contributor

Seems like a reasonable ask. @kabakaev I am planning to defer this to the 0.6 release or later. Hope you are OK with that.

@ravisantoshgudimetla ravisantoshgudimetla added this to the Future-release milestone Jan 3, 2018
@bgrant0607
Member

Ref also kubernetes/kubernetes#14796

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 22, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 22, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mbdas

mbdas commented Sep 9, 2019

A whole bunch of issues were referred to this one, and then it gets auto-closed. Should users just write a controller that deletes pods after too many restarts, etc.?

@mbdas

mbdas commented Sep 9, 2019

/reopen

@k8s-ci-robot
Contributor

@mbdas: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k82cn

k82cn commented Sep 10, 2019

/reopen

@k8s-ci-robot
Contributor

@k82cn: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Sep 10, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ravisantoshgudimetla
Contributor

/reopen

@k8s-ci-robot
Contributor

@ravisantoshgudimetla: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Dec 4, 2019
@ravisantoshgudimetla
Contributor

ravisantoshgudimetla commented Dec 4, 2019

#89 tried addressing this. Let's make sure we get this in before the next release.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@damemi
Contributor

damemi commented Jan 29, 2020

/reopen
/remove-lifecycle rotten

@pohly

pohly commented Feb 7, 2020

This looks like a reasonable proposal. Ephemeral inline volumes have the same problem: a pod gets scheduled onto a node, then the CSI driver's NodePublishVolume finds that it cannot create the volume, and the pod is stuck.

@pohly

pohly commented Feb 7, 2020

For my own understanding: is the proposal to delete such a failed pod and then let a higher-level controller (like a stateful set) create a new pod, or is the proposal to just move the pod off the node and schedule it again?

@mbdas

mbdas commented Feb 7, 2020

If I'm not mistaken, a pod gets scheduled only once in its entire lifetime. So unless it is deleted and replaced by a controller/operator, no new scheduling will happen. Now there is a chance the replacement pod may be scheduled back onto the bad node (for that specific use case), but proper fleet management will eventually remove a node with a high failure rate, and in most cases the pod will land on a good node. For use cases where a fresh pod launch is all that is required, any node is fine.

@damemi
Contributor

damemi commented Feb 7, 2020

It should also be noted that the descheduler only considers pods for eviction that have an ownerReference (unless this is explicitly overridden), so pods that aren't managed by a controller that would recreate (and thus reschedule) them are not evicted by default.

@pohly

pohly commented Feb 10, 2020

If I'm not mistaken, a pod gets scheduled only once in its entire lifetime.

That's what I thought, thanks for the confirmation.

Now there is a chance the replacement pod may be scheduled back onto the bad node (for that specific use case), but proper fleet management will eventually remove a node with a high failure rate.

That may be true for "broken" nodes, but not for a node that simply doesn't have enough storage capacity left for a certain kind of inline ephemeral volume. I was proposing to add capacity tracking to ensure that Kubernetes will eventually pick a node that has enough capacity, but that KEP has been postponed.

@seanmalloy
Member

A new strategy, RemovePodsHavingTooManyRestarts, was added in PR #254.
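
For reference, a minimal RemovePodsHavingTooManyRestarts configuration sketch; the parameter names below follow the strategy's documentation around that release, but verify them against the README of the descheduler version you run:

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
     enabled: true
     params:
        podsHavingTooManyRestarts:
           podRestartThreshold: 100          # evict pods whose containers have restarted at least this many times
           includingInitContainers: true     # count init container restarts as well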

Looking at the original requirements provided in the issue description, there is still a request to add a strategy that can:

pod-startup-failure: re-schedule a pod that was scheduled on a node but has been unable to start all of its containers within $maxStartupTime seconds.

@kabakaev do you still have a need for this feature?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2020
@kabakaev
Author

@seanmalloy, I've tested the RemovePodsHavingTooManyRestarts strategy on v0.18.0.
It works well for the case of too many restarts triggered by a livenessProbe, and it presumably helps with CrashLoopBackOff too. Thus, the first part of the feature request, container-restart-rate, is implemented.

Unfortunately, the second part, pod-startup-failure, is not addressed by PR #254.

I've tested the second case by breaking the CSI node plugin on one of the k8s nodes. It led to a new pod hanging in the ContainerCreating state forever. The descheduler did not evict the pod.
(Let me know if a "how to reproduce" guide is needed.)

It seems all the necessary info is already present in the pod object (a trimmed example follows below):

  • metadata.creationTimestamp;
  • ownerReferences;
  • status.phase = Pending.

I'd imagine an extra descheduler policy that evicts a pod which has been in status.phase != Running for more than a configured period since metadata.creationTimestamp.
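
For illustration, the relevant fields of such a stuck pod would look roughly like this trimmed manifest (the name, timestamp and owner are made up):

---
apiVersion: v1
kind: Pod
metadata:
  name: example-csi-pod                    # hypothetical pod stuck in ContainerCreating
  creationTimestamp: "2020-08-21T10:00:00Z"
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: example-csi-rs                   # hypothetical owner; the descheduler only evicts owned pods by default
    controller: true
status:
  phase: Pending                           # never reached Running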

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2020
@seanmalloy
Member

I'd imagine an extra descheduler policy that evicts a pod which has been in status.phase != Running for more than a configured period since metadata.creationTimestamp

@kabakaev thanks for the info. How about using the PodLifeTime strategy? We would need to add an additional strategy parameter to handle status.phase != Running.

Maybe something like this ...

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
     enabled: true
     params:
        maxPodLifeTimeSeconds: 300
        podStatusPhase:
        - pending

@damemi @ingvagabund @lixiang233 please add any additional ideas you have. Thanks!

@lixiang233
Contributor

@seanmalloy I think PodLifeTime already has the ability to evict non-running pods (except Succeeded and Failed pods). Do you mean to only evict pods in certain status phases?

@ingvagabund
Contributor

PodLifeTime checks the meta.CreationTimestamp field, nothing else. Also, any pod that is not Failed and not Succeeded is processed. So PodLifeTime should work out of the box. No need to modify it.

@seanmalloy
Member

@seanmalloy I think PodLifeTime already has the ability to evict non-running pods (except Succeeded and Failed pods). Do you mean to only evict pods in certain status phases?

@lixiang233 yes, based on my understanding of the problem, I think it might be reasonable to have an option to only consider pods with a certain status phase for eviction (i.e. the phase is Pending).

PodLifeTime checks the meta.CreationTimestamp field, nothing else. Also, any pod that is not Failed and not Succeeded is processed. So PodLifeTime should work out of the box. No need to modify it.

@ingvagabund I think the use case is to deschedule pods that are Pending for a short period of time. For example, evict all pods that are in the Pending status and are more than 5 minutes old. Right now, configuring PodLifeTime to evict all pods that are more than 5 minutes old would probably cause too much disruption in a cluster.

I know we have a lot of recently added configuration options for including/excluding pods based on different criteria (e.g. namespace and priority). But what do you think of adding one more? We could try adding it to only the PodLifeTime strategy.

@kabakaev do my above comments make sense to you? Would this newly proposed feature handle your use case?

@ingvagabund
Contributor

ingvagabund commented Aug 26, 2020

@ingvagabund I think the use case is to deschedule pods that are Pending for a short period of time.

Yeah, with such a short period of time it makes sense to limit the phase, though maybe not to every phase. Pending is the first phase after a pod is accepted. I can't find any field in the pod's status saying when a pod transitioned into a given phase. Also, other phases (Failed, Succeeded) are completely ignored, which leaves only Running and Unknown, and Running is the default one in most cases. A podStatusPhase field is fine, though I would limit it to just Pending and Running for now.

@seanmalloy
Member

I think this can be implemented now. The consensus is to add a new strategy parameter to the PodLifeTime strategy. The new parameter will filter pods considered for eviction based on podStatusPhases. It should only allow filtering on the Pending and Running phases.

Maybe something like this ...

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
     enabled: true
     params:
        maxPodLifeTimeSeconds: 300
        podStatusPhases:                       # <=== this is the default if not specified
        - pending
        - running

Another example ...

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
     enabled: true
     params:
        maxPodLifeTimeSeconds: 300
        podStatusPhases:
        - pending                                     # <==== only evict pending pods

@lixiang233
Contributor

@seanmalloy @ingvagabund @kabakaev Does anyone plan to work on this? If not, I'd love to help with the feature.

@lixiang233
Contributor

/assign

@seanmalloy
Member

@seanmalloy @ingvagabund @kabakaev Does anyone plan to work on this? If not, I'd love to help with the feature.

@lixiang233 this feature enhancement is all yours. Thanks!

@damemi
Contributor

damemi commented Sep 3, 2020

I am not sure the implementation proposed here addresses what was actually requested in #62 (comment). Correct me if I'm wrong, but it seems like we're talking about adding a parameter that will only evict Pending pods (and not Running pods), which was added in #393.

Which led to the request:

I'd imagine an extra descheduler policy that evicts a pod which has been in status.phase != Running

But my understanding of the problem above is more that a pod that was in Pending state didn't get evicted at all. I think that points to a bigger bug, because the current pod selector should only exclude Succeeded and Failed pods:

fieldSelectorString := "spec.nodeName=" + node.Name + ",status.phase!=" + string(v1.PodSucceeded) + ",status.phase!=" + string(v1.PodFailed)

Is there somewhere else in the code where we are only selecting Running pods for eviction?

Also, is there a use case for excluding all Running pods from eviction with this strategy?

@lixiang233
Contributor

If a CNI/CSI plugin fails to set up a pod, or the pod's image is not available on a node, the pod will be in ContainerCreating or ImagePullBackOff status and its status.phase will be Pending. If we want to recreate these pods quickly, I think we can only use the PodLifeTime strategy with maxPodLifeTimeSeconds set to a short period of time, and we should limit the phase to protect running pods. So here we add a new parameter to include only Pending pods.

@damemi Do you mean we should let every strategy customize the phases it excludes?

@kabakaev
Author

kabakaev commented Sep 4, 2020

...a pod that was in Pending state didn't get evicted at all. I think that points to a bigger bug...

@damemi, the first statement is true, but that is because I didn't enable the PodLifeTime strategy while testing this. With PodLifeTime enabled, the pending pod would probably have been deleted as well, after some ridiculously long delay configured in maxPodLifeTimeSeconds. So there should be no bug in the descheduler, if you mean my test outcome.

we're talking about adding a parameter that will only evict Pending pods (and not Running pods)

Yes, my understanding is that PodLifeTime with the podStatusPhases=Pending parameter should evict only pending pods.

@damemi
Contributor

damemi commented Sep 4, 2020

Ah, I see: you want to set a short lifetime for these pending pods without also evicting many running pods because of it. Sounds good, I understand now. Thanks for clarifying!

damemi pushed a commit to damemi/descheduler that referenced this issue Sep 7, 2021