enhance: Optimize workload based replica selection policy #36181

Merged
merged 2 commits into from
Sep 20, 2024

Conversation

weiliu1031
Contributor

@weiliu1031 weiliu1031 commented Sep 11, 2024

issue: #35859

This PR introduces two new parameters: toleranceFactor and checkRequestNum. After every checkRequestNum requests have been assigned, the policy recomputes each querynode's workload score.

If the difference in workload scores is less than toleranceFactor, the replica selection policy falls back to round_robin, which reduces the average cost to about 500ns.

If the difference is larger than toleranceFactor, the replica selection policy computes each querynode's score and selects the node with the smallest score on every assignment.
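The behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `balancer` type, `newBalancer`, `imbalanced`, and `selectNode` are hypothetical names, the "diff" is assumed to be the relative gap `(max - min) / min` between node scores, and each assignment is assumed to add one unit to the chosen node's score.

```go
package main

import "errors"

// balancer is a hypothetical, simplified stand-in for the proxy's replica selector.
type balancer struct {
	toleranceFactor float64         // largest relative score gap still treated as balanced (assumed definition)
	checkRequestNum int64           // re-evaluate the workload after this many assignments
	assigned        int64           // number of requests assigned so far
	useScores       bool            // true while the last check found an imbalance
	scores          map[int64]int64 // per-node workload score; higher means busier
}

func newBalancer(toleranceFactor float64, checkRequestNum int64) *balancer {
	if checkRequestNum <= 0 {
		checkRequestNum = 1 // guard against modulo by zero
	}
	return &balancer{
		toleranceFactor: toleranceFactor,
		checkRequestNum: checkRequestNum,
		scores:          map[int64]int64{},
	}
}

// imbalanced reports whether the relative gap between the busiest and the
// least busy node exceeds the tolerance factor.
func (b *balancer) imbalanced(nodes []int64) bool {
	minS, maxS := b.scores[nodes[0]], b.scores[nodes[0]]
	for _, n := range nodes[1:] {
		s := b.scores[n]
		if s < minS {
			minS = s
		}
		if s > maxS {
			maxS = s
		}
	}
	if minS == 0 {
		return maxS > 0 // avoid dividing by zero
	}
	return float64(maxS-minS)/float64(minS) > b.toleranceFactor
}

// selectNode picks a target node: round-robin while the workload is balanced,
// otherwise the node with the smallest score.
func (b *balancer) selectNode(nodes []int64) (int64, error) {
	if len(nodes) == 0 {
		return -1, errors.New("no available nodes")
	}
	// Only every checkRequestNum assignments is the workload compared.
	if b.assigned%b.checkRequestNum == 0 {
		b.useScores = b.imbalanced(nodes)
	}
	var target int64
	if b.useScores {
		target = nodes[0]
		for _, n := range nodes[1:] {
			if b.scores[n] < b.scores[target] {
				target = n
			}
		}
	} else {
		target = nodes[b.assigned%int64(len(nodes))]
	}
	b.assigned++
	b.scores[target]++ // assumption: one request adds one unit of workload
	return target, nil
}
```

With three nodes that all have a score of 100, a tolerance factor of 0.1, and a check interval of 10, this sketch hands out nodes in round-robin order. If one node has a much lower score than the others, the check at the start of the interval switches to score-based selection and that node is chosen first.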

@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Sep 11, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels Sep 11, 2024
@weiliu1031 weiliu1031 force-pushed the optimize_replica_select branch from d210076 to d8db3e1 Compare September 11, 2024 08:01
Contributor

mergify bot commented Sep 11, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Sep 11, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031 weiliu1031 force-pushed the optimize_replica_select branch from d8db3e1 to bb96b15 Compare September 11, 2024 12:52
Contributor

mergify bot commented Sep 11, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Sep 11, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031 weiliu1031 force-pushed the optimize_replica_select branch from bb96b15 to d21a606 Compare September 12, 2024 02:15
Contributor

mergify bot commented Sep 12, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Sep 12, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031 weiliu1031 force-pushed the optimize_replica_select branch 2 times, most recently from f4f31df to 4206540 Compare September 12, 2024 09:58
@weiliu1031
Contributor Author

before this pr:

goos: linux
goarch: amd64
pkg: github.com/milvus-io/milvus/internal/proxy
cpu: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
BenchmarkSelectNode_QNWithSameWorkload-8        	   10000	    115017 ns/op	   14542 B/op	     284 allocs/op
BenchmarkSelectNode_QNWithDifferentWorkload-8   	   10000	    107768 ns/op	   14536 B/op	     284 allocs/op

after this pr:

BenchmarkSelectNode_QNWithSameWorkload
BenchmarkSelectNode_QNWithSameWorkload-8        	 2865908	       493.0 ns/op	      40 B/op	       2 allocs/op
BenchmarkSelectNode_QNWithDifferentWorkload
BenchmarkSelectNode_QNWithDifferentWorkload-8   	 1000000	      1196 ns/op	      40 B/op	       2 allocs/op
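
To put the two benchmark runs side by side, the ratio of the ns/op figures can be computed directly. The snippet below is only a small arithmetic helper; `speedup` is a hypothetical name and the inputs are the ns/op values quoted above.

```go
package main

// speedup returns how many times faster the "after" measurement is
// compared with the "before" measurement (both in ns/op).
func speedup(beforeNs, afterNs float64) float64 {
	return beforeNs / afterNs
}
```

Calling `speedup(115017, 493.0)` gives roughly 233 for the same-workload case, and `speedup(107768, 1196)` gives roughly 90 for the different-workload case. Allocations drop from 284 to 2 per operation in both cases.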

@weiliu1031 weiliu1031 force-pushed the optimize_replica_select branch from 4206540 to 36cebca Compare September 12, 2024 10:06
Contributor

mergify bot commented Sep 12, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031
Contributor Author

rerun ut

@weiliu1031
Contributor Author

/run-cpu-e2e

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 95.83333% with 4 lines in your changes missing coverage. Please review.

Project coverage is 82.37%. Comparing base (f652612) to head (2694d09).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
internal/proxy/look_aside_balancer.go 95.18% 3 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #36181       +/-   ##
===========================================
+ Coverage   71.72%   82.37%   +10.64%     
===========================================
  Files        1276     1276               
  Lines      150629   150653       +24     
===========================================
+ Hits       108039   124100    +16061     
+ Misses      37614    21583    -16031     
+ Partials     4976     4970        -6     
Files with missing lines Coverage Δ
pkg/util/paramtable/component_param.go 98.29% <100.00%> (+<0.01%) ⬆️
internal/proxy/look_aside_balancer.go 96.22% <95.18%> (-3.78%) ⬇️

... and 286 files with indirect coverage changes

@ShineYellow

rerun go-sdk

1 similar comment
@yellow-shine
Collaborator

rerun go-sdk

@mergify mergify bot added the ci-passed label Sep 13, 2024
Key: "proxy.workloadToleranceFactor",
Version: "2.4.12",
DefaultValue: "0.1",
Doc: "tolerance factor for query node workload difference",
Contributor

Please explain in more detail what ToleranceFactor means.

@mergify mergify bot removed the ci-passed label Sep 19, 2024
Contributor

mergify bot commented Sep 19, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Sep 19, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

This PR introduces two new parameters: toleranceFactor and checkRequestNum.
After every checkRequestNum requests have been assigned, the policy
recomputes each querynode's workload score.

If the difference in workload scores is less than toleranceFactor, the
replica selection policy falls back to round_robin, which reduces the
average cost to about 200ns.

If the difference is larger than toleranceFactor, the replica selection
policy computes each querynode's score and selects the node with the
smallest score on every assignment.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
@weiliu1031 weiliu1031 force-pushed the optimize_replica_select branch from cba3950 to 2694d09 Compare September 19, 2024 09:24
@mergify mergify bot added the ci-passed label Sep 19, 2024
Contributor

@XuanYang-cn XuanYang-cn left a comment

/lgtm

@congqixia
Contributor

/approve

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia, weiliu1031, XuanYang-cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot merged commit 3b10085 into milvus-io:master Sep 20, 2024
16 checks passed
weiliu1031 added a commit to weiliu1031/milvus that referenced this pull request Sep 20, 2024
…36181)

issue: milvus-io#35859

This PR introduces two new parameters: toleranceFactor and checkRequestNum.
After every checkRequestNum requests have been assigned, the policy
recomputes each querynode's workload score.

If the difference in workload scores is less than toleranceFactor, the
replica selection policy falls back to round_robin, which reduces the
average cost to about 500ns.

If the difference is larger than toleranceFactor, the replica selection
policy computes each querynode's score and selects the node with the
smallest score on every assignment.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
sre-ci-robot pushed a commit that referenced this pull request Sep 26, 2024
…36384)

issue: #35859
pr: #36181

This PR introduces two new parameters: toleranceFactor and checkRequestNum.
After every checkRequestNum requests have been assigned, the policy
recomputes each querynode's workload score.

If the difference in workload scores is less than toleranceFactor, the
replica selection policy falls back to round_robin, which reduces the
average cost to about 500ns.

If the difference is larger than toleranceFactor, the replica selection
policy computes each querynode's score and selects the node with the
smallest score on every assignment.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Labels
approved ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement lgtm size/L Denotes a PR that changes 100-499 lines.
6 participants