
Improve podEvictor statistics #503

Closed
damemi opened this issue Feb 19, 2021 · 45 comments
Labels: kind/feature, lifecycle/rotten

@damemi (Contributor) commented Feb 19, 2021

As suggested in #501 (comment), it would be nice to improve the pod evictor type to report eviction statistics for individual strategies. Some suggestions were:

Number of evicted pods (in this strategy): XX
Number of evicted pods in this run: XX
Total number of evicted pods in all strategies: XX

x-ref: this could also be reported as Prometheus metrics (#348)
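
To make it concrete, the bookkeeping inside the pod evictor could be as small as a per-strategy tally kept next to the existing total. A rough sketch only; the type, fields, and method names below are invented for illustration, not actual descheduler code:

```go
package evictions

import "sync"

// evictionStats is a hypothetical set of counters the pod evictor could keep
// so that both per-strategy and total eviction counts are available at the
// end of a descheduling run.
type evictionStats struct {
	mu          sync.Mutex
	total       int            // evictions across all strategies
	perStrategy map[string]int // evictions keyed by strategy name
}

// record is called once per successful eviction.
func (s *evictionStats) record(strategy string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.perStrategy == nil {
		s.perStrategy = map[string]int{}
	}
	s.perStrategy[strategy]++
	s.total++
}

// snapshot returns copies so logging/metrics can read the stats without
// holding the lock.
func (s *evictionStats) snapshot() (int, map[string]int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	perStrategy := make(map[string]int, len(s.perStrategy))
	for name, n := range s.perStrategy {
		perStrategy[name] = n
	}
	return s.total, perStrategy
}
```

The "evicted in this run" number would then just be the difference between snapshots taken before and after a strategy runs.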

@damemi added the kind/feature label Feb 19, 2021
@ingvagabund (Contributor)

I prefer to report the statistics as metrics, so we don't have to accumulate much in the pod evictor itself.

@damemi (Contributor, Author) commented Feb 22, 2021

I am just suggesting that, since those metrics will have to be calculated somewhere, doing it in podEvictor makes sense because it already has access to the information. Metrics can then use the podEvictor instance to report them when requested.
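
For illustration, the metrics side could then read from the podEvictor at scrape time instead of keeping its own tallies. This is a sketch using plain client_golang; the interface and the TotalEvicted accessor are hypothetical, not existing code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// evictionCounter stands in for whatever type ends up holding the tallies;
// the interface and its method are hypothetical placeholders.
type evictionCounter interface {
	TotalEvicted() int
}

// RegisterEvictedGauge exposes the evictor's running total as a gauge whose
// value is read from the evictor whenever Prometheus scrapes, rather than
// being pushed by the evictor.
func RegisterEvictedGauge(e evictionCounter) {
	prometheus.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Namespace: "descheduler",
			Name:      "pods_evicted",
			Help:      "Total number of pods evicted in the last descheduling run.",
		},
		func() float64 { return float64(e.TotalEvicted()) },
	))
}
```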

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 23, 2021
@a7i (Contributor) commented Jun 15, 2021

@damemi I would be happy to contribute to this. Any docs highlighting the decisions made to date?

@damemi (Contributor, Author) commented Jun 15, 2021

@a7i nothing concrete, though if you would like to put some ideas together and share a doc, that would be a great place to start the discussion. Right now we have one metric, pods_evicted, that's reported by the PodEvictor after a run.

As suggested above, it would be good to have some similar reports on a per-strategy basis. From there we could probably even come up with some additional meta metrics that are specific to the different strategies themselves.
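
Purely as an illustration of what a strategy-specific metric could look like (nothing like this exists today; the metric and names are invented), LowNodeUtilization could report how it classified nodes in its last run:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// nodesClassified is a hypothetical strategy-specific metric: how many nodes
// the LowNodeUtilization strategy put into each bucket on its last run.
var nodesClassified = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "descheduler",
		Name:      "low_node_utilization_nodes",
		Help:      "Nodes classified by the LowNodeUtilization strategy in the last run.",
	},
	// e.g. "underutilized", "overutilized", "appropriately_utilized"
	[]string{"classification"},
)

func init() {
	prometheus.MustRegister(nodesClassified)
	// usage: nodesClassified.WithLabelValues("underutilized").Set(float64(count))
}
```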

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 17, 2021
@damemi (Contributor, Author) commented Jul 30, 2021

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ingvagabund (Contributor)

/reopen

@k8s-ci-robot reopened this Aug 30, 2021
@k8s-ci-robot (Contributor)

@ingvagabund: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ingvagabund (Contributor)

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Aug 30, 2021
@pravarag (Contributor) commented Sep 14, 2021

I'd like to work on this issue if no one is working on it already 🙂 @damemi @ingvagabund

@ingvagabund (Contributor)

Not aware of anyone working on this at the moment. That said, it requires some design and probably a discussion to get started (e.g. in a Google doc). @damemi wdyt?

@damemi (Contributor, Author) commented Sep 14, 2021

Yeah, I think we already have some good patterns started in the code for metrics reporting that could be fleshed out more. @pravarag feel free to take this on if you'd like.

@a7i (Contributor) commented Sep 15, 2021

It would be great to have the following:

  • pods evicted successfully
  • pod evictions that failed
  • pods skipped
  • total pods under consideration

Overall and per strategy. A rough sketch of what that could look like is below.
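
For example, assuming plain client_golang counters keyed by a strategy label (metric and label names here are placeholders, not the project's existing metrics):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch of the counters proposed above, each labelled by strategy so that
// per-strategy numbers and overall numbers (sum over the label) come from a
// single set of metrics.
var (
	podsEvictedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "descheduler", Name: "pods_evicted_total", Help: "Pods successfully evicted."},
		[]string{"strategy"})
	podEvictionsFailed = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "descheduler", Name: "pod_evictions_failed_total", Help: "Eviction attempts that failed."},
		[]string{"strategy"})
	podsSkipped = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "descheduler", Name: "pods_skipped_total", Help: "Pods considered but skipped."},
		[]string{"strategy"})
	podsConsidered = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "descheduler", Name: "pods_considered_total", Help: "Total pods examined by a strategy."},
		[]string{"strategy"})
)

func init() {
	prometheus.MustRegister(podsEvictedTotal, podEvictionsFailed, podsSkipped, podsConsidered)
}
```

The overall numbers would then fall out of a sum over the strategy label at query time, so no separate totals need to be maintained.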

@pravarag (Contributor)

/assign

@pravarag (Contributor)

@damemi @ingvagabund I'm trying to reproduce pod evictions in a local cluster to better understand how the statistics are currently reported. I have a 3-node cluster whose resources are not heavily utilized; the node stats are:

NAME            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
10.177.140.38   161m         4%     3460Mi          26%
10.208.40.245   182m         4%     3849Mi          28%
10.74.193.204   149m         3%     4002Mi          30%

And here are the logs from the descheduler pod:

->  k logs descheduler-7bdbc8f9b7-d9r46 -nkube-system
I0920 14:27:38.995798       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1632148058\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1632148058\" (2021-09-20 13:27:38 +0000 UTC to 2022-09-20 13:27:38 +0000 UTC (now=2021-09-20 14:27:38.995739889 +0000 UTC))"
I0920 14:27:38.995912       1 secure_serving.go:195] Serving securely on [::]:10258
I0920 14:27:38.996045       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0920 14:27:40.554774       1 node.go:46] "Node lister returned empty list, now fetch directly"
I0920 14:27:40.973812       1 duplicates.go:99] "Processing node" node="10.177.140.38"
I0920 14:27:41.225473       1 duplicates.go:99] "Processing node" node="10.208.40.245"
I0920 14:27:41.500405       1 duplicates.go:99] "Processing node" node="10.74.193.204"
I0920 14:27:41.717340       1 pod_antiaffinity.go:81] "Processing node" node="10.177.140.38"
I0920 14:27:41.823705       1 pod_antiaffinity.go:81] "Processing node" node="10.208.40.245"
I0920 14:27:41.879063       1 pod_antiaffinity.go:81] "Processing node" node="10.74.193.204"
I0920 14:27:42.198284       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.177.140.38" usage=map[cpu:1172m memory:1327634Ki pods:20] usagePercentage=map[cpu:29.974424552429667 memory:9.74255252448638 pods:18.181818181818183]
I0920 14:27:42.198333       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.208.40.245" usage=map[cpu:1044m memory:1137170Ki pods:12] usagePercentage=map[cpu:26.70076726342711 memory:8.344874004635447 pods:10.909090909090908]
I0920 14:27:42.198354       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.74.193.204" usage=map[cpu:1355m memory:1552914Ki pods:15] usagePercentage=map[cpu:34.65473145780051 memory:11.395720666245547 pods:13.636363636363637]
I0920 14:27:42.198369       1 lownodeutilization.go:100] "Criteria for a node under utilization" CPU=20 Mem=20 Pods=20
I0920 14:27:42.198380       1 lownodeutilization.go:101] "Number of underutilized nodes" totalNumber=0
I0920 14:27:42.198392       1 lownodeutilization.go:114] "Criteria for a node above target utilization" CPU=50 Mem=50 Pods=50
I0920 14:27:42.198403       1 lownodeutilization.go:115] "Number of overutilized nodes" totalNumber=0
I0920 14:27:42.198415       1 lownodeutilization.go:118] "No node is underutilized, nothing to do here, you might tune your thresholds further"
I0920 14:27:42.198439       1 descheduler.go:152] "Number of evicted pods" totalEvicted=0
I0920 14:32:42.198973       1 node.go:46] "Node lister returned empty list, now fetch directly"
I0920 14:32:42.261831       1 pod_antiaffinity.go:81] "Processing node" node="10.177.140.38"
I0920 14:32:42.295166       1 pod_antiaffinity.go:81] "Processing node" node="10.208.40.245"
I0920 14:32:42.336749       1 pod_antiaffinity.go:81] "Processing node" node="10.74.193.204"
I0920 14:32:42.479844       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.177.140.38" usage=map[cpu:1172m memory:1327634Ki pods:20] usagePercentage=map[cpu:29.974424552429667 memory:9.74255252448638 pods:18.181818181818183]
I0920 14:32:42.479892       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.208.40.245" usage=map[cpu:1044m memory:1137170Ki pods:12] usagePercentage=map[cpu:26.70076726342711 memory:8.344874004635447 pods:10.909090909090908]
I0920 14:32:42.479914       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.74.193.204" usage=map[cpu:1355m memory:1552914Ki pods:15] usagePercentage=map[cpu:34.65473145780051 memory:11.395720666245547 pods:13.636363636363637]
I0920 14:32:42.479930       1 lownodeutilization.go:100] "Criteria for a node under utilization" CPU=20 Mem=20 Pods=20
I0920 14:32:42.479941       1 lownodeutilization.go:101] "Number of underutilized nodes" totalNumber=0
I0920 14:32:42.479953       1 lownodeutilization.go:114] "Criteria for a node above target utilization" CPU=50 Mem=50 Pods=50
I0920 14:32:42.479963       1 lownodeutilization.go:115] "Number of overutilized nodes" totalNumber=0
I0920 14:32:42.479982       1 lownodeutilization.go:118] "No node is underutilized, nothing to do here, you might tune your thresholds further"
I0920 14:32:42.480009       1 duplicates.go:99] "Processing node" node="10.177.140.38"
I0920 14:32:42.516420       1 duplicates.go:99] "Processing node" node="10.208.40.245"
I0920 14:32:42.549396       1 duplicates.go:99] "Processing node" node="10.74.193.204"
I0920 14:32:42.595868       1 descheduler.go:152] "Number of evicted pods" totalEvicted=0

I wanted to check: if I decrease the threshold values to 10 here, would that be a good way to trigger pod evictions so that I can look at the current statistics logging?

@damemi (Contributor, Author) commented Sep 27, 2021

@pravarag in your logs you don't have any underutilized nodes, so lowering thresholds won't help (there are already no nodes with all three resources below the set values). Instead, you want to raise the threshold values, so that anything with usage under those values counts as underutilized.

You also don't have any overutilized nodes, so you should lower the targetThresholds as well. For replicating evictions, cordoning certain nodes while you create test pods will help create the uneven distribution you want.

@pravarag (Contributor)

Thanks @damemi for the suggestions above. I also have a few questions about adding new metrics. I've identified that the changes will mainly take place in these files:

  1. metrics.go - where the new metrics we want to add will be defined.
  2. evictions.go - where the new metrics will be updated, just as pods_evicted is today.

Now, do we also want to update the logging to reflect the new metrics being added? Something to include in every strategy, like this log?

And one more question: for the pods_evicted metric, the help text says that the number of pods evicted can be broken down per strategy and per namespace as well. I'm guessing the code for that calculation still needs to be added, so do we need an extra metric per strategy, like pods_evicted_per_strategy?

So far, I'm working on adding a few new metrics such as pods_evicted_success, pods_evicted_failed, and pods_skipped.
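
For what it's worth, one option for the pods_evicted_per_strategy question above is to keep a single counter and put the breakdowns on labels. A rough sketch only; the labels and the helper below are hypothetical, not the current code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// A single labelled counter covers the per-strategy and per-namespace
// breakdowns, so a separate pods_evicted_per_strategy metric would not be
// needed.
var podsEvicted = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "descheduler",
		Name:      "pods_evicted",
		Help:      "Number of evicted pods, by strategy, namespace and result.",
	},
	[]string{"strategy", "namespace", "result"},
)

func init() {
	prometheus.MustRegister(podsEvicted)
}

// recordEviction is a hypothetical call site, invoked by the evictor after
// each eviction attempt.
func recordEviction(strategy, namespace string, err error) {
	result := "success"
	if err != nil {
		result = "error"
	}
	podsEvicted.WithLabelValues(strategy, namespace, result).Inc()
}
```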

@k8s-ci-robot added the lifecycle/stale label Dec 16, 2022
@Dentrax (Contributor) commented Dec 21, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Dec 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 21, 2023
@ingvagabund removed the lifecycle/stale label Mar 21, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 19, 2023
@pravarag (Contributor)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jul 12, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 24, 2024
@pravarag (Contributor)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jan 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 24, 2024
@seanmalloy (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Apr 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 21, 2024
@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Dentrax (Contributor) commented Oct 2, 2024

/reopen

@k8s-ci-robot (Contributor)

@Dentrax: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
