fix: update metric when there are zero disruption candidates #1187

Merged 4 commits on May 7, 2024

Conversation

jmdeal (Member) commented Apr 16, 2024

Fixes #N/A

Description
Moves the eligible node metric update to the top-level disruption controller from the individual consolidation implementations. This both reduces code duplication and, more importantly, fixes a bug where the karpenter_disruption_eligible_nodes metric is not updated if the number of candidates is zero. This results in the last non-zero number of candidates being reported indefinitely.
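
The stale-gauge behavior described above can be sketched in a few lines of Go. This is a minimal illustration, not the Karpenter code: the `gauge` type stands in for a Prometheus gauge, and `reportInMethod` and `reportInController` are hypothetical names. It assumes the bug took the form of an early return on an empty candidate list, which is one plausible shape of the problem rather than something confirmed by this description.

```go
package main

import "fmt"

// gauge is a hypothetical stand-in for a Prometheus gauge.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

// reportInMethod mimics the assumed buggy pattern: the method returns early
// when there are no candidates, so the gauge is never reset to zero.
func reportInMethod(g *gauge, candidates []string) {
	if len(candidates) == 0 {
		return // the metric update below is skipped
	}
	g.Set(float64(len(candidates)))
}

// reportInController mimics the fixed pattern: the caller sets the gauge
// unconditionally, including when the candidate list is empty.
func reportInController(g *gauge, candidates []string) {
	g.Set(float64(len(candidates)))
}

func main() {
	stale, fresh := &gauge{}, &gauge{}
	// Two reconcile rounds: first two candidates, then none.
	rounds := [][]string{{"node-a", "node-b"}, {}}
	for _, c := range rounds {
		reportInMethod(stale, c)
		reportInController(fresh, c)
	}
	fmt.Println(stale.value, fresh.value) // prints "2 0"
}
```

After the second round the first gauge still reports 2, which is the "last non-zero number of candidates" symptom, while the unconditionally updated gauge reports 0.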

How was this change tested?
Tested in a personal cluster via the Karpenter AWS provider

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 16, 2024
coveralls commented Apr 16, 2024

Pull Request Test Coverage Report for Build 8980552943

Details

  • 47 of 48 (97.92%) changed or added relevant lines in 5 files are covered.
  • 9 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.06%) to 78.758%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/test/expectations/expectations.go | 26 | 27 | 96.3%

Files with Coverage Reduction | New Missed Lines | %
pkg/controllers/disruption/drift.go | 2 | 89.29%
pkg/controllers/provisioning/scheduling/preferences.go | 7 | 86.67%

Totals | Coverage Status
Change from base Build 8975908762 | -0.06%
Covered Lines | 8335
Relevant Lines | 10583

💛 - Coveralls

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 16, 2024
@jmdeal jmdeal force-pushed the disruption-metric-fix branch 4 times, most recently from a7ee1a2 to ae89213 Compare April 16, 2024 19:18
@jmdeal jmdeal force-pushed the disruption-metric-fix branch 2 times, most recently from e9446b6 to ff1b84b Compare April 23, 2024 02:09
	BeforeEach(func() {
		eligibleNodesMetric = ExpectFullyQualifiedNameFromCollector(disruption.EligibleNodesGauge)
	})
	It("should correctly report eligible nodes", func() {
Contributor:

I understand making this its own separate test is nice to test individual things, but I'd rather not add more tests (when we're already having to increase the timeouts). Can you just add this eligible metric check into the existing disruption tests? If you can do this for each of the disruption tests, it'll make sure that the metric is working properly in all the different ways we're testing the codepaths too.

Member Author:

I had a pretty large set of updates for the disruption suite in general in my original consolidation race condition fix PR that I'm going to incorporate into a new PR. The main change was a rework of how we handle faking the clock which significantly sped up the test suite (~5x speed improvement IIRC). I think with that coming as well we can justify a standalone test, but I could also see incorporating this elsewhere.

Contributor:

Synced offline, opting for followups to solve this problem

jonathan-innis (Member) commented:

Nice work tracking this down! This is a great simplifying change and some solid testing added!

njtran (Contributor) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2024
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmdeal, njtran

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2024
@k8s-ci-robot k8s-ci-robot merged commit 6197752 into kubernetes-sigs:main May 7, 2024
12 checks passed
@jmdeal jmdeal deleted the disruption-metric-fix branch May 9, 2024 23:58
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants