schedule: support patrol region concurrency #8094
base: master
Conversation
[REVIEW NOTIFICATION] This pull request has not been approved. To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. A reviewer can indicate their review by submitting an approval review.
Skipping CI for Draft Pull Request.
Signed-off-by: lhy1024 <admin@liudos.us>
force-pushed from 4c59016 to d1f4b8a
Signed-off-by: lhy1024 <admin@liudos.us>
force-pushed from 2e61d30 to 948dc77
Overall LGTM. Is this PR ready?
I am preparing some tests for different scenarios.
Signed-off-by: lhy1024 <admin@liudos.us>
force-pushed from 948dc77 to c198b08
pkg/schedule/config/config.go
Outdated
```diff
@@ -63,6 +63,8 @@ const (
 	defaultRegionScoreFormulaVersion = "v2"
 	defaultLeaderSchedulePolicy      = "count"
 	defaultStoreLimitVersion         = "v1"
+	defaultPatrolRegionConcurrency   = 1
+	defaultPatrolRegionBatchLimit    = 128
```
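For context, a minimal sketch of how such defaults are typically surfaced as schedule-config options; the field names, TOML keys, and `Adjust` helper below are assumptions for illustration, not this PR's exact code:

```go
// Hypothetical config wiring; field names and keys are assumptions.
type ScheduleConfig struct {
	PatrolRegionConcurrency uint64 `toml:"patrol-region-concurrency" json:"patrol-region-concurrency"`
	PatrolRegionBatchLimit  uint64 `toml:"patrol-region-batch-limit" json:"patrol-region-batch-limit"`
}

// Adjust falls back to the new defaults when the user leaves the fields unset.
func (c *ScheduleConfig) Adjust() {
	if c.PatrolRegionConcurrency == 0 {
		c.PatrolRegionConcurrency = defaultPatrolRegionConcurrency // 1
	}
	if c.PatrolRegionBatchLimit == 0 {
		c.PatrolRegionBatchLimit = defaultPatrolRegionBatchLimit // 128
	}
}
```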
Maybe we can use `max(128, region_count/1024)`?
```diff
@@ -461,7 +461,7 @@ func (oc *Controller) checkAddOperator(isPromoting bool, ops ...*Operator) (bool
 		return false, NotInCreateStatus
 	}
 	if !isPromoting && oc.wopStatus.getCount(op.Desc()) >= oc.config.GetSchedulerMaxWaitingOperator() {
-		log.Debug("exceed max return false", zap.Uint64("waiting", oc.wopStatus.ops[op.Desc()]), zap.String("desc", op.Desc()), zap.Uint64("max", oc.config.GetSchedulerMaxWaitingOperator()))
+		log.Debug("exceed max return false", zap.Uint64("waiting", oc.wopStatus.getCount(op.Desc())), zap.String("desc", op.Desc()), zap.Uint64("max", oc.config.GetSchedulerMaxWaitingOperator()))
```
Avoid a data race: the log line read the map directly while the condition read it through the synchronized getter.
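For illustration, a minimal sketch of why reading through a synchronized getter matters here; the struct and field names below are assumptions, not PD's exact implementation:

```go
import "sync"

// Hypothetical synchronized counter; names are assumptions.
type wopStatus struct {
	mu  sync.RWMutex
	ops map[string]uint64
}

// getCount takes the read lock, so it cannot race with writers that
// mutate the map under the write lock.
func (s *wopStatus) getCount(desc string) uint64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.ops[desc]
}

// Reading s.ops[desc] directly, as the old log line did, bypasses the
// lock and races with any goroutine updating the map concurrently.
```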
force-pushed from e71f635 to a0ec33d
Codecov Report. Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #8094      +/-   ##
==========================================
+ Coverage   77.29%   77.36%   +0.07%
==========================================
  Files         471      471
  Lines       61445    61515      +70
==========================================
+ Hits        47491    47590      +99
+ Misses      10395    10362      -33
- Partials     3559     3563       +4
```

Flags with carried forward coverage won't be shown.
Signed-off-by: lhy1024 <admin@liudos.us>
/test pull-integration-realcluster-test
pkg/schedule/coordinator.go
Outdated
```go
	hbStreams           *hbstream.HeartbeatStreams
	pluginInterface     *PluginInterface
	diagnosticManager   *diagnostic.Manager
	patrolRegionContext *PatrolRegionContext
```
Shall we let the checker controller manage the patrol logic?
Move PatrolRegion and patrolRegionContext to checkers?
WDYT
/test pull-integration-realcluster-test
```go
}

func calculateScanLimit(cluster sche.CheckerCluster) int {
	scanLimit := max(patrolScanRegionMinLimit, cluster.GetTotalRegionCount()/patrolRegionPartition)
```
Considering that we have 10 million regions, will it be too large?
How about the metrics for CPU and goroutine counts in this case? 🤔
I tested this branch with 10 million regions. Before 20:45 the limit was 128; after 20:45 the limit was increased.
The batch limit was initialized to 128. After the first round, getTotalRegionCount returned the region count of the whole cluster, which effectively raised the limit, so the scan speed went up.
CPU usage increased by roughly 70% (from 160% to 230%), while the goroutine count changed very little.
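Based on this discussion, one possible shape for the calculation with an upper bound is sketched below, using Go 1.21's built-in `min`/`max`; the `patrolScanRegionMaxLimit` constant and its value are assumptions, not part of this PR:

```go
const (
	patrolScanRegionMinLimit = 128
	patrolRegionPartition    = 1024
	// Hypothetical cap so that a cluster with e.g. 10 million regions
	// does not produce an unbounded batch size.
	patrolScanRegionMaxLimit = 8192
)

func calculateScanLimit(totalRegionCount int) int {
	scanLimit := max(patrolScanRegionMinLimit, totalRegionCount/patrolRegionPartition)
	return min(scanLimit, patrolScanRegionMaxLimit)
}
```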
Signed-off-by: lhy1024 <admin@liudos.us>
force-pushed from ff02183 to ae0778f
Signed-off-by: lhy1024 <admin@liudos.us>
rest lgtm
```go
		// wait for the regionChan to be drained
		if len(c.patrolRegionContext.regionChan) > 0 {
			continue
		}
```
Do we need to wait here? If regionChan is full, the send will block anyway.
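A tiny self-contained illustration of this point: a send on a full buffered channel blocks until a receiver frees a slot, so backpressure does not require an explicit length check. The channel name and buffer size here are arbitrary:

```go
package main

import "fmt"

func main() {
	regionChan := make(chan int, 2)
	regionChan <- 1
	regionChan <- 2 // the buffer is now full

	go func() { <-regionChan }() // a consumer frees one slot

	regionChan <- 3 // blocks here until the consumer receives
	fmt.Println("send unblocked once a slot was freed")
}
```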
```go
}

func (p *PatrolRegionContext) stop() {
	close(p.regionChan)
```
It would be better to add a failpoint that waits for all remaining regions in regionChan to be consumed before closing regionChan, rather than the time.Sleep(100 * time.Millisecond) in L171. That is more stable.
If possible, we could always wait, even outside of testing.
The time.Sleep(100 * time.Millisecond) in L171 is used to wait for the regions to be consumed:
co.PatrolRegions()
re.Empty(oc.GetOperators())
For example, if we enable this failpoint, it will wait 100 ms for the goroutines to consume regions, and then check re.Empty(oc.GetOperators()) immediately after the failpoint.
I understand the role of L171, but I think the sleep is a destabilizing factor. It would be better to wait until consumption finishes, or to actively drain all remaining regions before exiting.
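As one way to avoid the sleep, shutdown could wait for the consumers deterministically. The sketch below is an assumption about how PatrolRegionContext might be structured, not the PR's actual code:

```go
import "sync"

// Hypothetical shape of the patrol context; field names are assumptions.
type patrolRegionContext struct {
	regionChan chan uint64 // region IDs waiting to be checked
	wg         sync.WaitGroup
}

func (p *patrolRegionContext) startWorkers(n int, check func(uint64)) {
	for i := 0; i < n; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			// Ranging over the channel drains it and exits once it is closed.
			for id := range p.regionChan {
				check(id)
			}
		}()
	}
}

func (p *patrolRegionContext) stop() {
	close(p.regionChan) // no sends may happen after this point
	p.wg.Wait()         // returns only after every queued region is consumed
}
```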
Signed-off-by: lhy1024 <admin@liudos.us>
What problem does this PR solve?
Issue Number: Close #7963, #7706
What is changed and how does it work?
Check List
Tests
Release note