
add remove too many restarts policy #89

Conversation

@liubin commented Apr 25, 2018

This PR will evict pods that have too many restarts due to host-originated problems (network/storage/firewall). Usually, these pods should be re-scheduled to other nodes.
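
For context, here is a minimal sketch of the check this strategy performs; the names are hypothetical (evictPod stands in for the descheduler's eviction helper, v1 is k8s.io/api/core/v1), and the real parameter and helper names are discussed in the review below:

// evictPodsHavingTooManyRestarts sketches the per-node loop: any pod whose
// cumulative container restart count exceeds the threshold is evicted.
func evictPodsHavingTooManyRestarts(pods []*v1.Pod, threshold int32, evictPod func(*v1.Pod)) {
	for _, pod := range pods {
		var restarts int32
		for _, cs := range pod.Status.ContainerStatuses {
			restarts += cs.RestartCount
		}
		if restarts <= threshold {
			continue
		}
		evictPod(pod)
	}
}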

@aveshagarwal (Contributor)

Can one of the admins verify this patch?

@openshift-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test) on Apr 25, 2018
@k8s-ci-robot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) labels on Apr 25, 2018
@liubin force-pushed the feature/add-toomanyrestarts-strategy branch from 4d08495 to d8ccf67 on April 25, 2018 03:08
@aveshagarwal (Contributor)

@liubin thanks for this PR. I will review it soon.

@aveshagarwal (Contributor)

/ok-to-test

@openshift-ci-robot removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test) on Apr 25, 2018
@ingvagabund (Contributor)

$ hack/verify-gofmt.sh
!!! 'gofmt -s' needs to be run on the following files: 
./pkg/descheduler/strategies/toomanyrestarts_test.go

}
} else if restarts <= strategy.Params.PodsHavingTooManyRestarts.PodeRestartThresholds {
continue
}
Contributor

An init container is a special case of a container, so calcContainerRestarts could be defined as:

func calcContainerRestarts(pod *v1.Pod, countInitContainers bool) int32 {
	var restarts int32 = 0

	for _, cs := range pod.Status.ContainerStatuses {
		restarts += cs.RestartCount
	}

	if countInitContainers {
		for _, cs := range pod.Status.InitContainerStatuses {
			restarts += cs.RestartCount
		}
	}

	return restarts
}

which simplifies the code to:

			params := strategy.Params.PodsHavingTooManyRestarts
			restarts := calcContainerRestarts(pod, params.IncludingInitContainers)
			if restarts <= params.PodeRestartThresholds {
				continue
			}

continue
}

glog.V(1).Infof("RemovePodsHavingTooManyRestarts will evicted pod: %#v, container restarts: %d, initContainer restarts: %d", pod.Name, restarts, initRestarts)
Contributor

@aveshagarwal do we need to distinguish between ordinary and init containers? We could just report the total number of containers that got restarted. WDYT?

Contributor

Yeah, that's fine; it's helpful to distinguish them in the log message.
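
For reference, a possible form of that log line that keeps both counts, as agreed above (a sketch; restarts and initRestarts are the totals computed by the strategy):

glog.V(1).Infof("RemovePodsHavingTooManyRestarts evicting pod %q: container restarts: %d, initContainer restarts: %d", pod.Name, restarts, initRestarts)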

pod := test.BuildTestPod(fmt.Sprintf("pod-%d", i), 100, 0, node.Name)
pod.ObjectMeta.OwnerReferences = test.GetNormalPodOwnerRefList()

// pod i will has 25 * i restarts.
Contributor

s/has/have
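
For illustration, one way such a test can seed those restart counts is to set them directly on the fake pod's status (a sketch, assuming the v1 API types):

// Give pod i a cumulative restart count of 25 * i.
pod.Status.ContainerStatuses = []v1.ContainerStatus{
	{RestartCount: int32(25 * i)},
}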

@aveshagarwal (Contributor)

@liubin In general, it looks good to me, but I still need to review the code more thoroughly.

One question: what is this strategy relying on to ensure that evicted pods get scheduled to other nodes by the default scheduler, rather than back onto the same node? I am NOT looking for a guarantee, just a higher probability, as there is no predicate/priority function that takes restarts into account. For the same reason, I wonder if you really need to look into the cause of the restarts and then decide which pod to evict, rather than evicting any pod with several restarts.

@liubin (Author) commented Apr 27, 2018

@aveshagarwal Good question. I have the same question, and some others: restarts is a cumulative value, so 1000 restarts in one year and 1000 restarts in one hour have different meanings.

I will do more research and try to find the most practical answer.
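
One hypothetical refinement along those lines (not part of this PR): compare the restart rate since the pod started rather than the raw count, so that 1000 restarts in an hour and 1000 restarts in a year are treated differently. A minimal sketch, assuming pod.Status.StartTime is set and the "time" package is imported:

// restartsPerHour normalizes a pod's cumulative restart count by its age.
func restartsPerHour(pod *v1.Pod, restarts int32) float64 {
	if pod.Status.StartTime == nil {
		return 0 // pod has not started yet
	}
	age := time.Since(pod.Status.StartTime.Time)
	if age < time.Minute {
		return 0 // too young to judge
	}
	return float64(restarts) / age.Hours()
}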

@liubin (Author) commented Apr 27, 2018

@ingvagabund I tried:

# ./hack/verify-gofmt.sh 
# echo $?
0
# go version
go version go1.8.7 linux/amd64

# ./hack/verify-gofmt.sh 
# echo $?
0
# go version
go version go1.8.3 linux/amd64

$ hack/verify-gofmt.sh 
$ go version
go version go1.10.1 darwin/amd64

Does the CI have some different settings, so I can find the reason for the failure?

@aveshagarwal (Contributor)

@liubin meanwhile, could you split it into 2 commits during your next update: one commit for the auto-generated files and one for the main code changes, for easier review.

@ingvagabund (Contributor) commented May 22, 2018

@aveshagarwal @ravisantoshgudimetla we should bump the go version to 1.10.*.

@ingvagabund (Contributor)

#94

@ingvagabund (Contributor)

@liubin

$ go version
go version go1.10.2 linux/amd64
$ ./hack/verify-gofmt.sh
$ echo $?
0

@openshift-merge-robot

/retest

1 similar comment
@openshift-merge-robot

/retest

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale) on May 7, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Jun 6, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ravisantoshgudimetla (Contributor)

@liubin would you be interested in picking this up again, as I see there are some use-cases around it?

@ravisantoshgudimetla (Contributor)

/reopen

@k8s-ci-robot reopened this on Dec 4, 2019
@k8s-ci-robot (Contributor)

@ravisantoshgudimetla: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor)

@liubin: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Dec 4, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liubin
To complete the pull request process, please assign ravisantoshgudimetla
You can assign the PR to them by writing /assign @ravisantoshgudimetla in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@damemi (Contributor) commented Mar 18, 2020

/reopen
/remove-lifecycle-rotten

@k8s-ci-robot reopened this on Mar 18, 2020
@k8s-ci-robot (Contributor)

@damemi: Reopened this PR.

In response to this:

/reopen
/remove-lifecycle-rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@damemi (Contributor) commented Mar 18, 2020

@liubin sorry to resurrect an old thread, but are you still interested in rebasing this PR? If not, we can still pick your commits into a new PR.

@seanmalloy (Member)

/kind feature

@k8s-ci-robot added the kind/feature label (Categorizes issue or PR as related to a new feature) on Mar 26, 2020
@damemi (Contributor) commented Mar 30, 2020

Since there's still interest in this, I've picked @liubin's commits into a new, rebased PR here: #254

Thank you for this work @liubin! If it's okay, I'm going to close this PR now and we can move discussion to the new one.

/close

@k8s-ci-robot (Contributor)

@damemi: Closed this PR.

In response to this:

Since there's still interest in this, I've picked @liubin's commits into a new, rebased PR here: #254

Thank you for this work @liubin! If it's okay, I'm going to close this PR now and we can move discussion to the new one.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knelasevero pushed a commit to knelasevero/descheduler that referenced this pull request May 12, 2023
…ncy-openshift-4.14-atomic-openshift-descheduler

Updating atomic-openshift-descheduler images to be consistent with ART