fix: update metric when there are zero disruption candidates #1187

jmdeal · 2024-04-16T00:10:26Z

Fixes #N/A

Description
Moves the eligible node metric update to the top-level disruption controller from the individual consolidation implementations. This both reduces code duplication and, more importantly, fixes a bug where the karpenter_disruption_eligible_nodes metric is not updated if the number of candidates is zero. This results in the last non-zero number of candidates being reported indefinitely.
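
For illustration only, here is a minimal sketch of the failure mode described above. It is not the actual Karpenter code: the `gauge` type, `reportBuggy`, and `reportFixed` are hypothetical names, and the early return is just one assumed way a zero-candidate pass could skip the update.

```go
package main

import "fmt"

// gauge is a hypothetical stand-in for a Prometheus gauge.
// It only remembers the last value it was set to.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

// reportBuggy models the assumed pre-fix behavior: the gauge is only
// written when at least one candidate exists, so a drop to zero is
// never recorded and the previous value lingers.
func reportBuggy(g *gauge, candidates []string) {
	if len(candidates) == 0 {
		return
	}
	g.Set(float64(len(candidates)))
}

// reportFixed models the post-fix behavior: the gauge is written on
// every pass, including when there are zero candidates.
func reportFixed(g *gauge, candidates []string) {
	g.Set(float64(len(candidates)))
}

func main() {
	buggy, fixed := &gauge{}, &gauge{}
	// Two reconciliation passes: first two candidates, then none.
	rounds := [][]string{{"node-a", "node-b"}, {}}
	for _, candidates := range rounds {
		reportBuggy(buggy, candidates)
		reportFixed(fixed, candidates)
	}
	fmt.Println(buggy.value, fixed.value) // prints "2 0"
}
```

In the sketch, the buggy reporter keeps showing 2 after the candidates disappear, while the fixed reporter drops to 0.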

How was this change tested?
Tested in a personal cluster via the Karpenter AWS provider

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

coveralls · 2024-04-16T00:27:44Z

Pull Request Test Coverage Report for Build 8980552943

Details

47 of 48 (97.92%) changed or added relevant lines in 5 files are covered.
9 unchanged lines in 2 files lost coverage.
Overall coverage decreased (-0.06%) to 78.758%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/test/expectations/expectations.go | 26 | 27 | 96.3% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/disruption/drift.go | 2 | 89.29% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 86.67% |

Totals
Change from base Build 8975908762:	-0.06%
Covered Lines:	8335
Relevant Lines:	10583

💛 - Coveralls

njtran · 2024-04-23T16:32:42Z

pkg/controllers/disruption/expiration_test.go

+		BeforeEach(func() {
+			eligibleNodesMetric = ExpectFullyQualifiedNameFromCollector(disruption.EligibleNodesGauge)
+		})
+		It("should correctly report eligible nodes", func() {


I understand making this its own separate test is nice to test individual things, but I'd rather not add more tests (when we're already having to increase the timeouts). Can you just add this eligible metric check into the existing disruption tests? If you can do this for each of the disruption tests, it'll make sure that the metric is working properly in all the different ways we're testing the codepaths too.

jmdeal · 2024-04-30

I had a pretty large set of updates for the disruption suite in general in my original consolidation race condition fix PR that I'm going to incorporate into a new PR. The main change was a rework of how we handle faking the clock, which significantly sped up the test suite (~5x speed improvement IIRC). I think with that coming as well we can justify a standalone test, but I could also see incorporating this elsewhere.

njtran · 2024-05-07

Synced offline; opting for follow-ups to solve this problem.

jonathan-innis · 2024-04-24T06:56:14Z

Nice work tracking this down! This is a great simplifying change and some solid testing added!

njtran

/lgtm
/approve

k8s-ci-robot · 2024-05-07T22:35:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmdeal, njtran

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [njtran]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from engedaam and tallaxes April 16, 2024 00:10

k8s-ci-robot added the cncf-cla: yes and size/M labels Apr 16, 2024

jonathan-innis reviewed Apr 16, 2024

pkg/controllers/disruption/controller.go (resolved)

jmdeal mentioned this pull request Apr 16, 2024

Modification of Karpenter Grafana dashboards capacity and performance aws/karpenter-provider-aws#5935

Merged

3 tasks

jmdeal force-pushed the disruption-metric-fix branch from c7b8845 to 91f95b4 April 16, 2024 05:46

k8s-ci-robot added the size/L label and removed the size/M label Apr 16, 2024

jmdeal force-pushed the disruption-metric-fix branch 4 times, most recently from a7ee1a2 to ae89213 April 16, 2024 19:18

njtran reviewed Apr 22, 2024

fix: always update eligible disruption metric

220c58f

jmdeal force-pushed the disruption-metric-fix branch 2 times, most recently from e9446b6 to ff1b84b April 23, 2024 02:09

refactor metrics test

db934c3

jmdeal force-pushed the disruption-metric-fix branch from ff1b84b to db934c3 April 23, 2024 02:12

njtran reviewed Apr 23, 2024

jonathan-innis reviewed Apr 24, 2024

pkg/test/expectations/expectations.go (outdated, resolved)

pkg/controllers/state/suite_test.go (outdated, resolved)

jonathan-innis mentioned this pull request Apr 30, 2024

cleanup: clean redundant condition checking #1188

Closed

jmdeal added 2 commits May 6, 2024 18:07

PR comments

282ac91

Merge branch 'main' into disruption-metric-fix

c74d32f

njtran reviewed May 7, 2024

k8s-ci-robot assigned njtran May 7, 2024

k8s-ci-robot added the lgtm label May 7, 2024

k8s-ci-robot added the approved label May 7, 2024

k8s-ci-robot merged commit 6197752 into kubernetes-sigs:main May 7, 2024
12 checks passed

jmdeal deleted the disruption-metric-fix branch May 9, 2024 23:58

jonathan-innis pushed a commit that referenced this pull request May 16, 2024

fix: update metric when there are zero disruption candidates (#1187)

97d1c84
