schedule: support patrol region concurrency #8094

Open

wants to merge 39 commits into base: master

Conversation

lhy1024
Contributor

@lhy1024 lhy1024 commented Apr 18, 2024

What problem does this PR solve?

Issue Number: Close #7963 #7706

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Release note

None.

Contributor

ti-chi-bot bot commented Apr 18, 2024

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

Contributor

ti-chi-bot bot commented Apr 18, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue, do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.), and release-note-none (Denotes a PR that doesn't merit a release note.) labels Apr 18, 2024
@ti-chi-bot ti-chi-bot bot requested review from JmPotato and rleungx April 18, 2024 08:29
@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 18, 2024
Contributor

@nolouch nolouch left a comment

overall lgtm, is this PR ready?

@lhy1024
Contributor Author

lhy1024 commented May 14, 2024

overall lgtm, is this PR ready?

I am preparing some tests for different scenarios.

@lhy1024 lhy1024 changed the title from "Patrol concurrency" to "schedule: support patrol region concurrency" May 23, 2024
@lhy1024 lhy1024 marked this pull request as ready for review May 23, 2024 17:58
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 23, 2024
@@ -63,6 +63,8 @@ const (
defaultRegionScoreFormulaVersion = "v2"
defaultLeaderSchedulePolicy = "count"
defaultStoreLimitVersion = "v1"
defaultPatrolRegionConcurrency = 1
defaultPatrolRegionBatchLimit = 128
Contributor Author

Maybe we can use max(128, region_count/1024).
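
A minimal sketch of that suggested adaptive limit, for illustration only; the function name and the regionCount parameter are hypothetical, while 128 matches defaultPatrolRegionBatchLimit in the hunk above:

func adaptivePatrolBatchLimit(regionCount int) int {
	const (
		minBatchLimit   = 128  // defaultPatrolRegionBatchLimit above
		regionPartition = 1024 // divisor suggested in this comment
	)
	if limit := regionCount / regionPartition; limit > minBatchLimit {
		return limit
	}
	return minBatchLimit
}

The calculateScanLimit function discussed later in this PR follows the same max(minimum, total/partition) shape.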

@@ -461,7 +461,7 @@ func (oc *Controller) checkAddOperator(isPromoting bool, ops ...*Operator) (bool
return false, NotInCreateStatus
}
if !isPromoting && oc.wopStatus.getCount(op.Desc()) >= oc.config.GetSchedulerMaxWaitingOperator() {
log.Debug("exceed max return false", zap.Uint64("waiting", oc.wopStatus.ops[op.Desc()]), zap.String("desc", op.Desc()), zap.Uint64("max", oc.config.GetSchedulerMaxWaitingOperator()))
log.Debug("exceed max return false", zap.Uint64("waiting", oc.wopStatus.getCount(op.Desc())), zap.String("desc", op.Desc()), zap.Uint64("max", oc.config.GetSchedulerMaxWaitingOperator()))
Contributor Author

This avoids a data race when reading the waiting-operator count.
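
A minimal sketch of the kind of lock-guarded accessor this implies; the type, field, and method shapes below are assumptions for illustration, not PD's actual definitions:

package operator

import "sync"

// waitingOperatorStatus is a hypothetical stand-in for the structure behind
// oc.wopStatus; the real PD type may differ.
type waitingOperatorStatus struct {
	mu  sync.RWMutex
	ops map[string]uint64
}

// getCount reads the waiting-operator counter under a read lock, so concurrent
// patrol workers do not race with writers on the map.
func (s *waitingOperatorStatus) getCount(desc string) uint64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.ops[desc]
}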


codecov bot commented May 24, 2024

Codecov Report

Attention: Patch coverage is 85.26316%, with 14 lines in your changes missing coverage. Please review.

Project coverage is 77.36%. Comparing base (4cd42b3) to head (ab9ef1e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8094      +/-   ##
==========================================
+ Coverage   77.29%   77.36%   +0.07%     
==========================================
  Files         471      471              
  Lines       61445    61515      +70     
==========================================
+ Hits        47491    47590      +99     
+ Misses      10395    10362      -33     
- Partials     3559     3563       +4     
Flag: unittests, Coverage Δ: 77.36% <85.26%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.

@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 4, 2024
@lhy1024
Contributor Author

lhy1024 commented Jul 4, 2024

/test pull-integration-realcluster-test

hbStreams *hbstream.HeartbeatStreams
pluginInterface *PluginInterface
diagnosticManager *diagnostic.Manager
patrolRegionContext *PatrolRegionContext
Member

Shall we let the checker controller manage the patrol logic?

Contributor Author

Move PatrolRegion and patrolRegionContext to checkers?

Member

WDYT

@lhy1024
Contributor Author

lhy1024 commented Jul 4, 2024

/test pull-integration-realcluster-test

}

func calculateScanLimit(cluster sche.CheckerCluster) int {
scanlimit := max(patrolScanRegionMinLimit, cluster.GetTotalRegionCount()/patrolRegionPartition)
Member

Considering that we have 10 million regions, will it be too large?

Contributor Author

We tested 10 million regions with the simulator and it was OK (simulator screenshots attached).

Member

@JmPotato JmPotato Jul 9, 2024

How about the metrics for CPU and goroutine counts in this case? 🤔

Member

@HuSharp HuSharp Jul 9, 2024

This branch with 10 million regions:

Before 20:45 the limit was 128; after 20:45 the limit was increased.

The initial batch limit was 128. After the first round, getTotalRegionCount returned the region count of the whole cluster, which effectively adjusted the limit upward, so the speed went up. (screenshot)

CPU increased by roughly 70% (from 160% to 230%). (screenshot)

The goroutine count changed very little. (screenshot)

Memory increased by about 7 GB. (screenshot)

Operator check: (screenshot)

@ti-chi-bot ti-chi-bot bot added and then removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Jul 12, 2024
@lhy1024 lhy1024 requested review from JmPotato and rleungx July 15, 2024 08:31
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 1, 2024
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 8, 2024
lhy1024 and others added 2 commits August 8, 2024 19:18
Member

@okJiang okJiang left a comment

rest lgtm

Comment on lines +144 to +147
// wait for the regionChan to be drained
if len(c.patrolRegionContext.regionChan) > 0 {
continue
}
Member

Do we need to wait here? If regionChan is full, the send will block anyway.
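
For context, a rough, self-contained sketch (not the PR's code) of the producer/consumer shape being discussed: the patrol loop feeds a buffered channel and worker goroutines drain it. A send only blocks once the buffer is full, which is the behavior this question refers to; all names below are illustrative.

package main

import (
	"fmt"
	"sync"
)

type region struct{ id uint64 }

func main() {
	const workers = 4
	regionChan := make(chan *region, 1024) // buffered, like the PR's regionChan

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range regionChan {
				_ = r // the real code would run the checkers on r here
			}
		}()
	}

	// Producer: a send only blocks once the buffer is full, so the scan can run
	// well ahead of the workers; an explicit len(regionChan) > 0 check (as in
	// the snippet above) instead holds the scan position until the backlog is
	// fully drained.
	for id := uint64(1); id <= 10000; id++ {
		regionChan <- &region{id: id}
	}
	close(regionChan)
	wg.Wait()
	fmt.Println("patrol round finished")
}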

}

func (p *PatrolRegionContext) stop() {
close(p.regionChan)
Member

It would be better to add a failpoint that waits for all remaining regions in regionChan to be consumed before closing regionChan, rather than using time.Sleep(100 * time.Millisecond) at L171. That would be more stable.

If possible, we could always wait, even when not in testing.

Contributor Author

time.Sleep(100 * time.Millisecond) at L171 is used to wait for the regions to be consumed.

co.PatrolRegions()
re.Empty(oc.GetOperators())

For example, if we enable this failpoint, it waits 100 ms for the goroutines to consume the regions, and then checks re.Empty(oc.GetOperators()) immediately after the failpoint.

Member

I understand the role of L171, but I think the sleep is a destabilizing factor. It would be fine to either wait for consumption to finish, or actively consume all remaining regions before exiting.
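
A minimal sketch of the deterministic alternative being suggested here, with assumed field and method names: track the consumer goroutines with a sync.WaitGroup and have stop() block until regionChan is fully drained, so tests do not need a sleep.

package checker

import "sync"

type regionInfo struct{ id uint64 }

// patrolRegionContext is a simplified stand-in for the PR's PatrolRegionContext.
type patrolRegionContext struct {
	regionChan chan *regionInfo
	wg         sync.WaitGroup
}

func (p *patrolRegionContext) startWorkers(n int, check func(*regionInfo)) {
	for i := 0; i < n; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for r := range p.regionChan {
				check(r)
			}
		}()
	}
}

// stop closes the channel and then waits until the workers have consumed every
// remaining region, so a test can run the patrol and immediately assert on the
// produced operators without time.Sleep.
func (p *patrolRegionContext) stop() {
	close(p.regionChan)
	p.wg.Wait()
}

With this shape, a test could call stop() (or an equivalent wait) right after PatrolRegions and then assert re.Empty(oc.GetOperators()) without racing the workers.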

tools/pd-ctl/tests/config/config_test.go (outdated review thread, resolved)
Labels
dco-signoff: yes (Indicates the PR's author has signed the dco.), release-note-none (Denotes a PR that doesn't merit a release note.), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

checker: make patrol region sooner
7 participants