executor: remove redundant memory pre-allocations in parallel sort executor #54073
Conversation
Hi @xzhangxian1008. Thanks for your PR. PRs from untrusted users cannot be marked as trusted. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc @windtalker @yibin87
  spillHelper: spillHelper,
- batchRows:   make([]chunk.Row, 0, maxSortedRowsLimit),
+ batchRows:   make([]chunk.Row, 0),
Can you add some benchmarks for it?
parallel sort benchmark?
Yes. BTW, can you check BenchmarkUnionScanTableReadDescRead, BenchmarkUnionScanIndexReadDescRead and BenchmarkUnionScanIndexLookUpDescRead, which can test parallel sort?
okk
fixed:
BenchmarkUnionScanTableReadDescRead:
1863 618281 ns/op 148918 B/op 2677 allocs/op
BenchmarkUnionScanIndexReadDescRead:
1922 620293 ns/op 153910 B/op 2750 allocs/op
BenchmarkUnionScanIndexLookUpDescRead:
1744 673479 ns/op 248234 B/op 2820 allocs/op
master:
BenchmarkUnionScanTableReadDescRead:
1826 599392 ns/op 215502 B/op 2674 allocs/op
BenchmarkUnionScanIndexReadDescRead:
1964 593482 ns/op 219032 B/op 2740 allocs/op
BenchmarkUnionScanIndexLookUpDescRead:
1791 659653 ns/op 320267 B/op 2815 allocs/op
dataset: tpch10
sql1: explain analyze select L_COMMENT, L_EXTENDEDPRICE from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;
sql2: explain analyze select * from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;
|        | Sql1  | Sql2   |
|--------|-------|--------|
| Master | 7.30s | 15.36s |
| Fixed  | 6.94s | 16.36s |
add it to PR's description.
done
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## master #54073 +/- ##
=================================================
- Coverage 72.6029% 56.5476% -16.0554%
=================================================
Files 1516 1643 +127
Lines 434689 615329 +180640
=================================================
+ Hits 315597 347954 +32357
- Misses 99623 244155 +144532
- Partials 19469 23220 +3751
Could you update the PR description by adding how the performance regression is introduced and how this fixes it?
updated
/retest
@xzhangxian1008: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
  spillHelper: spillHelper,
- batchRows:   make([]chunk.Row, 0, maxSortedRowsLimit),
+ batchRows:   make([]chunk.Row, 0),
You should set a small init size unless slices have to resize.
We don't need to reserve a small initial size any more, as I save the chunks and pre-allocate the whole memory before sorting.
@@ -129,7 +128,7 @@ func (p *parallelSortWorker) multiWayMergeLocalSortedRows() ([]chunk.Row, error)
  func (p *parallelSortWorker) sortBatchRows() {
      slices.SortFunc(p.batchRows, p.keyColumnsLess)
      p.localSortedRows = append(p.localSortedRows, chunk.NewIterator4Slice(p.batchRows))
-     p.batchRows = make([]chunk.Row, 0, p.maxSortedRowsLimit)
+     p.batchRows = make([]chunk.Row, 0)
ditto
ditto
How about we just save the original
I think this is very good.
done
@@ -126,28 +127,39 @@ func (p *parallelSortWorker) multiWayMergeLocalSortedRows() ([]chunk.Row, error)
      return resultSortedRows, nil
  }

  func (p *parallelSortWorker) fillBatchRows() {
      p.batchRows = make([]chunk.Row, 0, p.rowNumInChunkIters)
Looks like if fillBatchRows returns batchRows, then there is no need to keep batchRows as a field of parallelSortWorker?
I have deleted it.
LGTM
/retest
/retest
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hawkingrei, windtalker The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
[LGTM Timeline notifier] Timeline:
/retest
What problem does this PR solve?
Issue Number: close #54070
Problem Summary:
What changed and how does it work?
From Clinic we can see that TiDB memory usage is higher than normal and that most of the memory is allocated in the sort executor, so we suspect that excessive allocation caused the performance regression.
To verify this suspicion we checked the CPU usage and found that most CPU time is spent on memory allocation in the sort executor. Moreover, the GC STW duration is very high. So we can be confident that excessive memory allocation is what caused the performance regression in benchbot.
To eliminate re-allocation when a slice grows, we used to set a large capacity when creating slices. However, reserving that much memory is wasteful, and the total waste can be very large, because TP SQL statements usually process only a few rows while sort executors are created many times (shown in the following picture). So we can remove the pre-allocation in the sort executor to fix this regression. This fix should not have a strong impact on sort performance, because the main bottleneck in parallel sort is IO, not memory allocation.
With this pr, performance regression is fixed.
fixed:
BenchmarkUnionScanTableReadDescRead:
1863 618281 ns/op 148918 B/op 2677 allocs/op
BenchmarkUnionScanIndexReadDescRead:
1922 620293 ns/op 153910 B/op 2750 allocs/op
BenchmarkUnionScanIndexLookUpDescRead:
1744 673479 ns/op 248234 B/op 2820 allocs/op
master:
BenchmarkUnionScanTableReadDescRead:
1826 599392 ns/op 215502 B/op 2674 allocs/op
BenchmarkUnionScanIndexReadDescRead:
1964 593482 ns/op 219032 B/op 2740 allocs/op
BenchmarkUnionScanIndexLookUpDescRead:
1791 659653 ns/op 320267 B/op 2815 allocs/op
dataset: tpch10
sql1: explain analyze select L_COMMENT, L_EXTENDEDPRICE from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;
sql2: explain analyze select * from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.