
add remove too many restarts policy #89

Conversation

@liubin commented Apr 25, 2018

This PR will evict pods that have too many restarts due to host-originated problems (network/storage/firewall). Usually, these pods should be re-scheduled to other nodes.
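
For context, here is a minimal sketch of the check this strategy performs; the names are hypothetical (evictPod stands in for the descheduler's eviction helper, v1 is k8s.io/api/core/v1), and the real parameter and helper names are discussed in the review below:

// evictPodsHavingTooManyRestarts sketches the per-node loop: any pod whose
// cumulative container restart count exceeds the threshold is evicted.
func evictPodsHavingTooManyRestarts(pods []*v1.Pod, threshold int32, evictPod func(*v1.Pod)) {
	for _, pod := range pods {
		var restarts int32
		for _, cs := range pod.Status.ContainerStatuses {
			restarts += cs.RestartCount
		}
		if restarts <= threshold {
			continue
		}
		evictPod(pod)
	}
}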

@aveshagarwal (Contributor)

Can one of the admins verify this patch?

@openshift-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test) on Apr 25, 2018
@k8s-ci-robot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) labels on Apr 25, 2018
@liubin force-pushed the feature/add-toomanyrestarts-strategy branch from 4d08495 to d8ccf67 on April 25, 2018 03:08
@aveshagarwal (Contributor)

@liubin thanks for this PR. I will review it soon.

@aveshagarwal (Contributor)

/ok-to-test

@openshift-ci-robot removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test) on Apr 25, 2018
@ingvagabund (Contributor)

$ hack/verify-gofmt.sh
!!! 'gofmt -s' needs to be run on the following files: 
./pkg/descheduler/strategies/toomanyrestarts_test.go

}
} else if restarts <= strategy.Params.PodsHavingTooManyRestarts.PodeRestartThresholds {
continue
}
Contributor

An init container is a special case of a container, so calcContainerRestarts could be defined as:

func calcContainerRestarts(pod *v1.Pod, countInitContainers bool) int32 {
	var restarts int32 = 0

	for _, cs := range pod.Status.ContainerStatuses {
		restarts += cs.RestartCount
	}

	if countInitContainers {
		for _, cs := range pod.Status.InitContainerStatuses {
			restarts += cs.RestartCount
		}
	}

	return restarts
}

which simplifies the code to:

			params := strategy.Params.PodsHavingTooManyRestarts
			restarts := calcContainerRestarts(pod, params.IncludingInitContainers)
			if restarts <= params.PodeRestartThresholds {
				continue
			}

continue
}

glog.V(1).Infof("RemovePodsHavingTooManyRestarts will evicted pod: %#v, container restarts: %d, initContainer restarts: %d", pod.Name, restarts, initRestarts)
Contributor

@aveshagarwal do we need to distinguish between ordinary and init containers? We could just report the total number of containers that got restarted. WDYT?

Contributor

Yeah, that's fine; it's helpful to distinguish them in the log message.
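
For reference, a possible form of that log line that keeps both counts, as agreed above (a sketch; restarts and initRestarts are the totals computed by the strategy):

glog.V(1).Infof("RemovePodsHavingTooManyRestarts evicting pod %q: container restarts: %d, initContainer restarts: %d", pod.Name, restarts, initRestarts)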

pod := test.BuildTestPod(fmt.Sprintf("pod-%d", i), 100, 0, node.Name)
pod.ObjectMeta.OwnerReferences = test.GetNormalPodOwnerRefList()

// pod i will has 25 * i restarts.
Contributor

s/has/have
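
For illustration, one way such a test can seed those restart counts is to set them directly on the fake pod's status (a sketch, assuming the v1 API types):

// Give pod i a cumulative restart count of 25 * i.
pod.Status.ContainerStatuses = []v1.ContainerStatus{
	{RestartCount: int32(25 * i)},
}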

@aveshagarwal (Contributor)

@liubin In general, it looks good to me, but I still need to review the code more thoroughly.

One question: what is this strategy relying on to ensure that evicted pods get scheduled to other nodes by the default scheduler, rather than back onto the same node? I am NOT looking for a guarantee, just a higher probability, as there is no predicate/priority function that takes restarts into account. For the same reason, I wonder if you really need to look into the cause of the restarts and then decide which pod to evict, rather than evicting any pod with several restarts.

@liubin (Author) commented Apr 27, 2018

@aveshagarwal Good question. I have the same question, and some others: restarts is a cumulative value, so 1000 restarts in one year and 1000 restarts in one hour have different meanings.

I will do more research and try to find the most practical answer.
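
One hypothetical refinement along those lines (not part of this PR): compare the restart rate since the pod started rather than the raw count, so that 1000 restarts in an hour and 1000 restarts in a year are treated differently. A minimal sketch, assuming pod.Status.StartTime is set and the "time" package is imported:

// restartsPerHour normalizes a pod's cumulative restart count by its age.
func restartsPerHour(pod *v1.Pod, restarts int32) float64 {
	if pod.Status.StartTime == nil {
		return 0 // pod has not started yet
	}
	age := time.Since(pod.Status.StartTime.Time)
	if age < time.Minute {
		return 0 // too young to judge
	}
	return float64(restarts) / age.Hours()
}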

@liubin (Author) commented Apr 27, 2018

@ingvagabund I tried:

# ./hack/verify-gofmt.sh 
# echo $?
0
# go version
go version go1.8.7 linux/amd64

# ./hack/verify-gofmt.sh 
# echo $?
0
# go version
go version go1.8.3 linux/amd64

$ hack/verify-gofmt.sh 
$ go version
go version go1.10.1 darwin/amd64

Does the CI have some different settings, so I can find the reason for the failure?

@aveshagarwal (Contributor)

@liubin meanwhile, could you split it into 2 commits during your next update: one commit for the auto-generated files and one for the main code changes, for easier review.

@ingvagabund (Contributor) commented May 22, 2018

@aveshagarwal @ravisantoshgudimetla we should bump the go version to 1.10.*.

@ingvagabund (Contributor)

#94

@ingvagabund (Contributor)

@liubin

$ go version
go version go1.10.2 linux/amd64
$ ./hack/verify-gofmt.sh
$ echo $?
0

@openshift-merge-robot

/retest

1 similar comment
@openshift-merge-robot

/retest

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale) on May 7, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Jun 6, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ravisantoshgudimetla (Contributor)

@liubin would you be interested in picking this up again, as I see there are some use-cases around it?

@ravisantoshgudimetla (Contributor)

/reopen

@k8s-ci-robot reopened this on Dec 4, 2019
@k8s-ci-robot (Contributor)

@ravisantoshgudimetla: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor)

@liubin: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Dec 4, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liubin
To complete the pull request process, please assign ravisantoshgudimetla
You can assign the PR to them by writing /assign @ravisantoshgudimetla in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@damemi (Contributor) commented Mar 18, 2020

/reopen
/remove-lifecycle-rotten

@k8s-ci-robot reopened this on Mar 18, 2020
@k8s-ci-robot (Contributor)

@damemi: Reopened this PR.

In response to this:

/reopen
/remove-lifecycle-rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@damemi (Contributor) commented Mar 18, 2020

@liubin sorry to resurrect an old thread, but are you still interested in rebasing this PR? If not, we can still pick your commits into a new PR.

@seanmalloy (Member)

/kind feature

@k8s-ci-robot added the kind/feature label (Categorizes issue or PR as related to a new feature) on Mar 26, 2020
@damemi (Contributor) commented Mar 30, 2020

Since there's still interest in this, I've picked @liubin's commits into a new, rebased PR here: #254

Thank you for this work @liubin! If it's okay, I'm going to close this PR now and we can move discussion to the new one.

/close

@k8s-ci-robot (Contributor)

@damemi: Closed this PR.

In response to this:

Since there's still interest in this, I've picked @liubin's commits into a new, rebased PR here: #254

Thank you for this work @liubin! If it's okay, I'm going to close this PR now and we can move discussion to the new one.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knelasevero pushed a commit to knelasevero/descheduler that referenced this pull request May 12, 2023
…ncy-openshift-4.14-atomic-openshift-descheduler

Updating atomic-openshift-descheduler images to be consistent with ART