perf: increase scheduler performance #982

tzneal · 2024-01-30T17:58:51Z

Fixes #N/A

Description

A few changes that overall increase performance 10-60% depending on the number of pods/nodes. Its around a ~34% overall improvement:

name               old pods/sec  new pods/sec  delta
Scheduling1-12       3.89k ± 2%    4.36k ± 4%  +12.10%  (p=0.000 n=10+9)
Scheduling50-12      1.44k ±11%    1.70k ± 5%  +18.29%  (p=0.000 n=9+9)
Scheduling100-12     1.30k ±10%    2.09k ± 3%  +60.63%  (p=0.000 n=9+9)
Scheduling500-12     1.47k ± 3%    2.06k ± 1%  +40.65%  (p=0.000 n=8+9)
Scheduling1000-12    1.42k ± 5%    1.99k ± 1%  +39.32%  (p=0.000 n=10+10)
Scheduling2000-12    1.43k ± 1%    1.83k ± 2%  +28.36%  (p=0.000 n=9+9)
Scheduling5000-12    1.09k ± 6%    1.67k ± 2%  +53.75%  (p=0.000 n=9+10)

How was this change tested?

Benchmarking and unit testing/

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

A few changes that overall increase performance 10-60% depending on the number of pods/nodes. Its around a ~34% overall improvement: name old pods/sec new pods/sec delta Scheduling1-12 3.89k ± 2% 4.36k ± 4% +12.10% (p=0.000 n=10+9) Scheduling50-12 1.44k ±11% 1.70k ± 5% +18.29% (p=0.000 n=9+9) Scheduling100-12 1.30k ±10% 2.09k ± 3% +60.63% (p=0.000 n=9+9) Scheduling500-12 1.47k ± 3% 2.06k ± 1% +40.65% (p=0.000 n=8+9) Scheduling1000-12 1.42k ± 5% 1.99k ± 1% +39.32% (p=0.000 n=10+10) Scheduling2000-12 1.43k ± 1% 1.83k ± 2% +28.36% (p=0.000 n=9+9) Scheduling5000-12 1.09k ± 6% 1.67k ± 2% +53.75% (p=0.000 n=9+10)

coveralls · 2024-01-30T18:13:48Z

Pull Request Test Coverage Report for Build 7715418157

0 of 33 (100.0%) changed or added relevant lines in 4 files are covered.
4 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.03%) to 80.289%

Files with Coverage Reduction	New Missed Lines	%
pkg/controllers/disruption/drift.go	2	87.34%
pkg/scheduling/requirement.go	2	98.99%

Totals
Change from base Build 7706153571:	0.03%
Covered Lines:	7788
Relevant Lines:	9700

💛 - Coveralls

jonathan-innis · 2024-01-30T20:22:27Z

pkg/controllers/provisioning/scheduling/scheduling_benchmark_test.go

@@ -56,10 +56,22 @@ const PrintStats = false
 //nolint:gosec
 var r = rand.New(rand.NewSource(42))

+// To run the benchmarks use:


jonathan-innis · 2024-01-30T20:24:10Z

pkg/scheduling/requirements.go

@@ -237,9 +237,38 @@ func labelHint(r Requirements, key string, allowedUndefined sets.Set[string]) st
 	return ""
 }

+// badKeyError allows lazily generating the error string in the case of a bad key error. When requirements fail
+// to match, we are most often interested in the failure and not why it fails.
+type badKeyError struct {


It's surprising to me that templating out those error values into a string is so expensive. What's the performance hit that we take from not taking this change in and just returning errors like normal?

Changing the errors is a large part of the performance increase for anti-affinity pods in particular due to the repeated failures to match topologies. String interpolation is expensive and it happens a lot in these cases.

Got it. This is good to know 👍🏼

jonathan-innis · 2024-01-30T20:24:58Z

pkg/controllers/provisioning/scheduling/topologygroup.go

-		return options
-	}
-	for domain := range t.domains {
+	for domain := range t.emptyDomains {


jonathan-innis · 2024-01-30T20:26:12Z

Worth sharing the CPU profile that you generated from pprof as part of the PR just for posterity?

tzneal · 2024-02-05T16:03:46Z

Worth sharing the CPU profile that you generated from pprof as part of the PR just for posterity?

I don't think it would be that helpful. We should instead set some minimum benchmarks and enable the CI based tests for performance.

jonathan-innis

/lgtm
/approve

k8s-ci-robot · 2024-02-14T01:32:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonathan-innis, tzneal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jonathan-innis,tzneal]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 30, 2024

k8s-ci-robot requested review from jackfrancis and jmdeal January 30, 2024 17:58

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 30, 2024

jonathan-innis reviewed Jan 30, 2024

View reviewed changes

jonathan-innis approved these changes Feb 14, 2024

View reviewed changes

k8s-ci-robot assigned jonathan-innis Feb 14, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2024

k8s-ci-robot merged commit 686b75d into main Feb 14, 2024
12 checks passed

jigisha620 pushed a commit to jigisha620/karpenter that referenced this pull request Feb 16, 2024

perf: increase scheduler performance (kubernetes-sigs#982)

cf9c5c9

jigisha620 mentioned this pull request Feb 16, 2024

chore: Adding cherry picked commits to release-v0.34.x #1020

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: increase scheduler performance #982

perf: increase scheduler performance #982

tzneal commented Jan 30, 2024

coveralls commented Jan 30, 2024

jonathan-innis Jan 30, 2024

jonathan-innis Jan 30, 2024

tzneal Feb 5, 2024

jonathan-innis Feb 13, 2024

jonathan-innis Jan 30, 2024

jonathan-innis commented Jan 30, 2024

tzneal commented Feb 5, 2024

jonathan-innis left a comment

k8s-ci-robot commented Feb 14, 2024

perf: increase scheduler performance #982

perf: increase scheduler performance #982

Conversation

tzneal commented Jan 30, 2024

coveralls commented Jan 30, 2024

Pull Request Test Coverage Report for Build 7715418157

💛 - Coveralls

jonathan-innis Jan 30, 2024

Choose a reason for hiding this comment

jonathan-innis Jan 30, 2024

Choose a reason for hiding this comment

tzneal Feb 5, 2024

Choose a reason for hiding this comment

jonathan-innis Feb 13, 2024

Choose a reason for hiding this comment

jonathan-innis Jan 30, 2024

Choose a reason for hiding this comment

jonathan-innis commented Jan 30, 2024

tzneal commented Feb 5, 2024

jonathan-innis left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Feb 14, 2024