Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: increase scheduler performance #982

Merged
merged 1 commit into from
Feb 14, 2024
Merged

Conversation

tzneal
Copy link
Contributor

@tzneal tzneal commented Jan 30, 2024

Fixes #N/A

Description

A few changes that overall increase performance 10-60% depending on the number of pods/nodes. Its around a ~34% overall improvement:

name               old pods/sec  new pods/sec  delta
Scheduling1-12       3.89k ± 2%    4.36k ± 4%  +12.10%  (p=0.000 n=10+9)
Scheduling50-12      1.44k ±11%    1.70k ± 5%  +18.29%  (p=0.000 n=9+9)
Scheduling100-12     1.30k ±10%    2.09k ± 3%  +60.63%  (p=0.000 n=9+9)
Scheduling500-12     1.47k ± 3%    2.06k ± 1%  +40.65%  (p=0.000 n=8+9)
Scheduling1000-12    1.42k ± 5%    1.99k ± 1%  +39.32%  (p=0.000 n=10+10)
Scheduling2000-12    1.43k ± 1%    1.83k ± 2%  +28.36%  (p=0.000 n=9+9)
Scheduling5000-12    1.09k ± 6%    1.67k ± 2%  +53.75%  (p=0.000 n=9+10)

How was this change tested?

Benchmarking and unit testing/

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

A few changes that overall increase performance 10-60% depending on the
number of pods/nodes. Its around a ~34% overall improvement:

name               old pods/sec  new pods/sec  delta
Scheduling1-12       3.89k ± 2%    4.36k ± 4%  +12.10%  (p=0.000 n=10+9)
Scheduling50-12      1.44k ±11%    1.70k ± 5%  +18.29%  (p=0.000 n=9+9)
Scheduling100-12     1.30k ±10%    2.09k ± 3%  +60.63%  (p=0.000 n=9+9)
Scheduling500-12     1.47k ± 3%    2.06k ± 1%  +40.65%  (p=0.000 n=8+9)
Scheduling1000-12    1.42k ± 5%    1.99k ± 1%  +39.32%  (p=0.000 n=10+10)
Scheduling2000-12    1.43k ± 1%    1.83k ± 2%  +28.36%  (p=0.000 n=9+9)
Scheduling5000-12    1.09k ± 6%    1.67k ± 2%  +53.75%  (p=0.000 n=9+10)
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 30, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 30, 2024
@coveralls
Copy link

Pull Request Test Coverage Report for Build 7715418157

  • 0 of 33 (100.0%) changed or added relevant lines in 4 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.03%) to 80.289%

Files with Coverage Reduction New Missed Lines %
pkg/controllers/disruption/drift.go 2 87.34%
pkg/scheduling/requirement.go 2 98.99%
Totals Coverage Status
Change from base Build 7706153571: 0.03%
Covered Lines: 7788
Relevant Lines: 9700

💛 - Coveralls

@@ -56,10 +56,22 @@ const PrintStats = false
//nolint:gosec
var r = rand.New(rand.NewSource(42))

// To run the benchmarks use:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 🎉

@@ -237,9 +237,38 @@ func labelHint(r Requirements, key string, allowedUndefined sets.Set[string]) st
return ""
}

// badKeyError allows lazily generating the error string in the case of a bad key error. When requirements fail
// to match, we are most often interested in the failure and not why it fails.
type badKeyError struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's surprising to me that templating out those error values into a string is so expensive. What's the performance hit that we take from not taking this change in and just returning errors like normal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the errors is a large part of the performance increase for anti-affinity pods in particular due to the repeated failures to match topologies. String interpolation is expensive and it happens a lot in these cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. This is good to know 👍🏼

return options
}
for domain := range t.domains {
for domain := range t.emptyDomains {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 🎉

@jonathan-innis
Copy link
Member

Worth sharing the CPU profile that you generated from pprof as part of the PR just for posterity?

@tzneal
Copy link
Contributor Author

tzneal commented Feb 5, 2024

Worth sharing the CPU profile that you generated from pprof as part of the PR just for posterity?

I don't think it would be that helpful. We should instead set some minimum benchmarks and enable the CI based tests for performance.

Copy link
Member

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonathan-innis, tzneal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jonathan-innis,tzneal]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 686b75d into main Feb 14, 2024
12 checks passed
jigisha620 pushed a commit to jigisha620/karpenter that referenced this pull request Feb 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants