
Improve podEvictor statistics #503

Closed
damemi opened this issue Feb 19, 2021 · 45 comments
Labels: kind/feature, lifecycle/rotten

@damemi (Contributor) commented Feb 19, 2021

As suggested in #501 (comment), it would be nice to improve the pod evictor type to report eviction statistics for individual strategies. Some suggestions were:

Number of evicted pods (in this strategy): XX
Number of evicted pods in this run: XX
Total number of evicted pods in all strategies: XX

x-ref: this could also be reported as Prometheus metrics (#348)
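
To make it concrete, the bookkeeping inside the pod evictor could be as small as a per-strategy tally kept next to the existing total. A rough sketch only; the type, fields, and method names below are invented for illustration, not actual descheduler code:

```go
package evictions

import "sync"

// evictionStats is a hypothetical set of counters the pod evictor could keep
// so that both per-strategy and total eviction counts are available at the
// end of a descheduling run.
type evictionStats struct {
	mu          sync.Mutex
	total       int            // evictions across all strategies
	perStrategy map[string]int // evictions keyed by strategy name
}

// record is called once per successful eviction.
func (s *evictionStats) record(strategy string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.perStrategy == nil {
		s.perStrategy = map[string]int{}
	}
	s.perStrategy[strategy]++
	s.total++
}

// snapshot returns copies so logging/metrics can read the stats without
// holding the lock.
func (s *evictionStats) snapshot() (int, map[string]int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	perStrategy := make(map[string]int, len(s.perStrategy))
	for name, n := range s.perStrategy {
		perStrategy[name] = n
	}
	return s.total, perStrategy
}
```

The "evicted in this run" number would then just be the difference between snapshots taken before and after a strategy runs.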

@damemi added the kind/feature label Feb 19, 2021
@ingvagabund (Contributor)

I prefer to report the statistics as metrics, so we don't have to accumulate much in the pod evictor itself.

@damemi (Contributor, Author) commented Feb 22, 2021

I am just suggesting that, since those metrics will have to be calculated somewhere, doing it in podEvictor makes sense because it already has access to the information. Metrics can then use the podEvictor instance to report them when requested.
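
For illustration, the metrics side could then read from the podEvictor at scrape time instead of keeping its own tallies. This is a sketch using plain client_golang; the interface and the TotalEvicted accessor are hypothetical, not existing code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// evictionCounter stands in for whatever type ends up holding the tallies;
// the interface and its method are hypothetical placeholders.
type evictionCounter interface {
	TotalEvicted() int
}

// RegisterEvictedGauge exposes the evictor's running total as a gauge whose
// value is read from the evictor whenever Prometheus scrapes, rather than
// being pushed by the evictor.
func RegisterEvictedGauge(e evictionCounter) {
	prometheus.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Namespace: "descheduler",
			Name:      "pods_evicted",
			Help:      "Total number of pods evicted in the last descheduling run.",
		},
		func() float64 { return float64(e.TotalEvicted()) },
	))
}
```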

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 23, 2021
@a7i (Contributor) commented Jun 15, 2021

@damemi I would be happy to contribute to this. Any docs highlighting the decisions made to date?

@damemi (Contributor, Author) commented Jun 15, 2021

@a7i nothing concrete, though if you would like to put some ideas together and share a doc, that would be a great place to start the discussion. Right now we have one metric, pods_evicted, that's reported by the PodEvictor after a run.

As suggested above, it would be good to have some similar reports on a per-strategy basis. From there we could probably even come up with some additional meta metrics that are specific to the different strategies themselves.
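
Purely as an illustration of what a strategy-specific metric could look like (nothing like this exists today; the metric and names are invented), LowNodeUtilization could report how it classified nodes in its last run:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// nodesClassified is a hypothetical strategy-specific metric: how many nodes
// the LowNodeUtilization strategy put into each bucket on its last run.
var nodesClassified = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "descheduler",
		Name:      "low_node_utilization_nodes",
		Help:      "Nodes classified by the LowNodeUtilization strategy in the last run.",
	},
	// e.g. "underutilized", "overutilized", "appropriately_utilized"
	[]string{"classification"},
)

func init() {
	prometheus.MustRegister(nodesClassified)
	// usage: nodesClassified.WithLabelValues("underutilized").Set(float64(count))
}
```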

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 17, 2021
@damemi (Contributor, Author) commented Jul 30, 2021

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ingvagabund (Contributor)

/reopen

@k8s-ci-robot reopened this Aug 30, 2021
@k8s-ci-robot (Contributor)

@ingvagabund: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ingvagabund (Contributor)

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Aug 30, 2021
@pravarag (Contributor) commented Sep 14, 2021

I'd like to work on this issue if no one is working on it already 🙂 @damemi @ingvagabund

@ingvagabund (Contributor)

Not aware of anyone working on this at the moment. That said, it requires some design and probably a discussion to get started (e.g. in a Google doc). @damemi wdyt?

@damemi (Contributor, Author) commented Sep 14, 2021

Yeah, I think we already have some good patterns started in the code for metrics reporting that could be fleshed out more. @pravarag feel free to take this on if you'd like.

@a7i (Contributor) commented Sep 15, 2021

It would be great to have the following:

  • pods evicted successfully
  • pod evictions that failed
  • pods skipped
  • total pods under consideration

Overall and per strategy. A rough sketch of what that could look like is below.
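
For example, assuming plain client_golang counters keyed by a strategy label (metric and label names here are placeholders, not the project's existing metrics):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch of the counters proposed above, each labelled by strategy so that
// per-strategy numbers and overall numbers (sum over the label) come from a
// single set of metrics.
var (
	podsEvictedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "descheduler", Name: "pods_evicted_total", Help: "Pods successfully evicted."},
		[]string{"strategy"})
	podEvictionsFailed = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "descheduler", Name: "pod_evictions_failed_total", Help: "Eviction attempts that failed."},
		[]string{"strategy"})
	podsSkipped = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "descheduler", Name: "pods_skipped_total", Help: "Pods considered but skipped."},
		[]string{"strategy"})
	podsConsidered = prometheus.NewCounterVec(
		prometheus.CounterOpts{Namespace: "descheduler", Name: "pods_considered_total", Help: "Total pods examined by a strategy."},
		[]string{"strategy"})
)

func init() {
	prometheus.MustRegister(podsEvictedTotal, podEvictionsFailed, podsSkipped, podsConsidered)
}
```

The overall numbers would then fall out of a sum over the strategy label at query time, so no separate totals need to be maintained.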

@pravarag (Contributor)

/assign

@pravarag (Contributor)

@damemi @ingvagabund I'm trying to reproduce pod evictions in a local cluster to better understand how the statistics are currently reported. I have a 3-node cluster whose resources are not heavily utilized; the node stats are:

NAME            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
10.177.140.38   161m         4%     3460Mi          26%
10.208.40.245   182m         4%     3849Mi          28%
10.74.193.204   149m         3%     4002Mi          30%

And here are the logs from the descheduler pod:

->  k logs descheduler-7bdbc8f9b7-d9r46 -nkube-system
I0920 14:27:38.995798       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1632148058\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1632148058\" (2021-09-20 13:27:38 +0000 UTC to 2022-09-20 13:27:38 +0000 UTC (now=2021-09-20 14:27:38.995739889 +0000 UTC))"
I0920 14:27:38.995912       1 secure_serving.go:195] Serving securely on [::]:10258
I0920 14:27:38.996045       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0920 14:27:40.554774       1 node.go:46] "Node lister returned empty list, now fetch directly"
I0920 14:27:40.973812       1 duplicates.go:99] "Processing node" node="10.177.140.38"
I0920 14:27:41.225473       1 duplicates.go:99] "Processing node" node="10.208.40.245"
I0920 14:27:41.500405       1 duplicates.go:99] "Processing node" node="10.74.193.204"
I0920 14:27:41.717340       1 pod_antiaffinity.go:81] "Processing node" node="10.177.140.38"
I0920 14:27:41.823705       1 pod_antiaffinity.go:81] "Processing node" node="10.208.40.245"
I0920 14:27:41.879063       1 pod_antiaffinity.go:81] "Processing node" node="10.74.193.204"
I0920 14:27:42.198284       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.177.140.38" usage=map[cpu:1172m memory:1327634Ki pods:20] usagePercentage=map[cpu:29.974424552429667 memory:9.74255252448638 pods:18.181818181818183]
I0920 14:27:42.198333       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.208.40.245" usage=map[cpu:1044m memory:1137170Ki pods:12] usagePercentage=map[cpu:26.70076726342711 memory:8.344874004635447 pods:10.909090909090908]
I0920 14:27:42.198354       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.74.193.204" usage=map[cpu:1355m memory:1552914Ki pods:15] usagePercentage=map[cpu:34.65473145780051 memory:11.395720666245547 pods:13.636363636363637]
I0920 14:27:42.198369       1 lownodeutilization.go:100] "Criteria for a node under utilization" CPU=20 Mem=20 Pods=20
I0920 14:27:42.198380       1 lownodeutilization.go:101] "Number of underutilized nodes" totalNumber=0
I0920 14:27:42.198392       1 lownodeutilization.go:114] "Criteria for a node above target utilization" CPU=50 Mem=50 Pods=50
I0920 14:27:42.198403       1 lownodeutilization.go:115] "Number of overutilized nodes" totalNumber=0
I0920 14:27:42.198415       1 lownodeutilization.go:118] "No node is underutilized, nothing to do here, you might tune your thresholds further"
I0920 14:27:42.198439       1 descheduler.go:152] "Number of evicted pods" totalEvicted=0
I0920 14:32:42.198973       1 node.go:46] "Node lister returned empty list, now fetch directly"
I0920 14:32:42.261831       1 pod_antiaffinity.go:81] "Processing node" node="10.177.140.38"
I0920 14:32:42.295166       1 pod_antiaffinity.go:81] "Processing node" node="10.208.40.245"
I0920 14:32:42.336749       1 pod_antiaffinity.go:81] "Processing node" node="10.74.193.204"
I0920 14:32:42.479844       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.177.140.38" usage=map[cpu:1172m memory:1327634Ki pods:20] usagePercentage=map[cpu:29.974424552429667 memory:9.74255252448638 pods:18.181818181818183]
I0920 14:32:42.479892       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.208.40.245" usage=map[cpu:1044m memory:1137170Ki pods:12] usagePercentage=map[cpu:26.70076726342711 memory:8.344874004635447 pods:10.909090909090908]
I0920 14:32:42.479914       1 nodeutilization.go:170] "Node is appropriately utilized" node="10.74.193.204" usage=map[cpu:1355m memory:1552914Ki pods:15] usagePercentage=map[cpu:34.65473145780051 memory:11.395720666245547 pods:13.636363636363637]
I0920 14:32:42.479930       1 lownodeutilization.go:100] "Criteria for a node under utilization" CPU=20 Mem=20 Pods=20
I0920 14:32:42.479941       1 lownodeutilization.go:101] "Number of underutilized nodes" totalNumber=0
I0920 14:32:42.479953       1 lownodeutilization.go:114] "Criteria for a node above target utilization" CPU=50 Mem=50 Pods=50
I0920 14:32:42.479963       1 lownodeutilization.go:115] "Number of overutilized nodes" totalNumber=0
I0920 14:32:42.479982       1 lownodeutilization.go:118] "No node is underutilized, nothing to do here, you might tune your thresholds further"
I0920 14:32:42.480009       1 duplicates.go:99] "Processing node" node="10.177.140.38"
I0920 14:32:42.516420       1 duplicates.go:99] "Processing node" node="10.208.40.245"
I0920 14:32:42.549396       1 duplicates.go:99] "Processing node" node="10.74.193.204"
I0920 14:32:42.595868       1 descheduler.go:152] "Number of evicted pods" totalEvicted=0

I wanted to check: if I decrease the threshold values to 10 here, would that be a good way to trigger pod evictions so that I can look at the current statistics logging?

@damemi (Contributor, Author) commented Sep 27, 2021

@pravarag in your logs you don't have any underutilized nodes, so lowering thresholds won't help (there are already no nodes with all three resources below the set values). Instead, you want to raise the threshold values, so that anything with usage under those values counts as underutilized.

You also don't have any overutilized nodes, so you should lower the targetThresholds as well. For replicating evictions, cordoning certain nodes while you create test pods will help create the uneven distribution you want.

@pravarag (Contributor)

Thanks @damemi for the suggestions above. I also have a few questions about adding new metrics. I've identified that the changes will mainly take place in these files:

  1. metrics.go - where the new metrics we want to add will be defined.
  2. evictions.go - where the new metrics will be updated, just as pods_evicted is today.

Now, do we also want to update the logging to reflect the new metrics being added? Something to include in every strategy, like this log?

And one more question: for the pods_evicted metric, the help text says that the number of pods evicted can be broken down per strategy and per namespace as well. I'm guessing the code for that calculation still needs to be added, so do we need an extra metric per strategy, like pods_evicted_per_strategy?

So far, I'm working on adding a few new metrics such as pods_evicted_success, pods_evicted_failed, and pods_skipped.
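
For what it's worth, one option for the pods_evicted_per_strategy question above is to keep a single counter and put the breakdowns on labels. A rough sketch only; the labels and the helper below are hypothetical, not the current code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// A single labelled counter covers the per-strategy and per-namespace
// breakdowns, so a separate pods_evicted_per_strategy metric would not be
// needed.
var podsEvicted = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "descheduler",
		Name:      "pods_evicted",
		Help:      "Number of evicted pods, by strategy, namespace and result.",
	},
	[]string{"strategy", "namespace", "result"},
)

func init() {
	prometheus.MustRegister(podsEvicted)
}

// recordEviction is a hypothetical call site, invoked by the evictor after
// each eviction attempt.
func recordEviction(strategy, namespace string, err error) {
	result := "success"
	if err != nil {
		result = "error"
	}
	podsEvicted.WithLabelValues(strategy, namespace, result).Inc()
}
```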

@k8s-ci-robot added the lifecycle/stale label Dec 16, 2022
@Dentrax (Contributor) commented Dec 21, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Dec 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 21, 2023
@ingvagabund removed the lifecycle/stale label Mar 21, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 19, 2023
@pravarag (Contributor)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jul 12, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 24, 2024
@pravarag (Contributor)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jan 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 24, 2024
@seanmalloy (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Apr 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 21, 2024
@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Dentrax (Contributor) commented Oct 2, 2024

/reopen

@k8s-ci-robot (Contributor)

@Dentrax: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
