executor: remove redundant memory pre-allocations in parallel sort executor #54073
Conversation
Hi @xzhangxian1008. Thanks for your PR. PRs from untrusted users cannot be marked as trusted. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc @windtalker @yibin87
  spillHelper: spillHelper,
- batchRows:   make([]chunk.Row, 0, maxSortedRowsLimit),
+ batchRows:   make([]chunk.Row, 0),
Can you add some benchmarks for it?
parallel sort benchmark?
Yes. BTW, can you check BenchmarkUnionScanTableReadDescRead, BenchmarkUnionScanIndexReadDescRead and BenchmarkUnionScanIndexLookUpDescRead, which can test parallel sort?
okk
fixed:
BenchmarkUnionScanTableReadDescRead:
1863 618281 ns/op 148918 B/op 2677 allocs/op
BenchmarkUnionScanIndexReadDescRead:
1922 620293 ns/op 153910 B/op 2750 allocs/op
BenchmarkUnionScanIndexLookUpDescRead:
1744 673479 ns/op 248234 B/op 2820 allocs/op
master:
BenchmarkUnionScanTableReadDescRead:
1826 599392 ns/op 215502 B/op 2674 allocs/op
BenchmarkUnionScanIndexReadDescRead:
1964 593482 ns/op 219032 B/op 2740 allocs/op
BenchmarkUnionScanIndexLookUpDescRead:
1791 659653 ns/op 320267 B/op 2815 allocs/op
dataset: tpch10
sql1: explain analyze select L_COMMENT, L_EXTENDEDPRICE from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;
sql2: explain analyze select * from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;
|        | Sql1  | Sql2   |
|--------|-------|--------|
| Master | 7.30s | 15.36s |
| Fixed  | 6.94s | 16.36s |
add it to PR's description.
done
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## master #54073 +/- ##
=================================================
- Coverage 72.6029% 56.5476% -16.0554%
=================================================
Files 1516 1643 +127
Lines 434689 615329 +180640
=================================================
+ Hits 315597 347954 +32357
- Misses 99623 244155 +144532
- Partials 19469 23220 +3751
Could you update the PR description by adding how the performance regression is introduced and how this fixes it?
updated
/retest
@xzhangxian1008: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
  spillHelper: spillHelper,
- batchRows:   make([]chunk.Row, 0, maxSortedRowsLimit),
+ batchRows:   make([]chunk.Row, 0),
You should set a small init size unless slices have to resize.
We don't need to reserve a small initial size any more, as I save the chunks and pre-allocate the whole memory before sorting.
@@ -129,7 +128,7 @@ func (p *parallelSortWorker) multiWayMergeLocalSortedRows() ([]chunk.Row, error)
  func (p *parallelSortWorker) sortBatchRows() {
      slices.SortFunc(p.batchRows, p.keyColumnsLess)
      p.localSortedRows = append(p.localSortedRows, chunk.NewIterator4Slice(p.batchRows))
-     p.batchRows = make([]chunk.Row, 0, p.maxSortedRowsLimit)
+     p.batchRows = make([]chunk.Row, 0)
ditto
ditto
How about we just save the original
I think this is very good.
done
@@ -126,28 +127,39 @@ func (p *parallelSortWorker) multiWayMergeLocalSortedRows() ([]chunk.Row, error)
      return resultSortedRows, nil
  }

  func (p *parallelSortWorker) fillBatchRows() {
      p.batchRows = make([]chunk.Row, 0, p.rowNumInChunkIters)
Looks like if fillBatchRows returns batchRows, then there is no need to keep batchRows as a field of parallelSortWorker?
I have deleted it.
LGTM
/retest
/retest
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hawkingrei, windtalker The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
[LGTM Timeline notifier] Timeline:
/retest
What problem does this PR solve?
Issue Number: close #54070
Problem Summary:
What changed and how does it work?
From Clinic we can see that TiDB memory usage is higher than normal and that most of the memory is allocated in the sort executor, so we suspect that excessive allocation caused the performance regression.
To verify this suspicion we checked the CPU usage and found that most CPU time is spent on memory allocation in the sort executor. Moreover, the GC STW duration is very high. So we can be confident that excessive memory allocation is what caused the performance regression in benchbot.
To eliminate re-allocation when a slice grows, we used to set a large capacity when creating slices. However, reserving that much memory is wasteful, and the total waste can be very large, because TP SQL statements usually process only a few rows while sort executors are created many times (shown in the following picture). So we can remove the pre-allocation in the sort executor to fix this regression. This fix should not have a strong impact on sort performance, because the main bottleneck in parallel sort is IO, not memory allocation.
With this pr, performance regression is fixed.
fixed:
BenchmarkUnionScanTableReadDescRead:
1863 618281 ns/op 148918 B/op 2677 allocs/op
BenchmarkUnionScanIndexReadDescRead:
1922 620293 ns/op 153910 B/op 2750 allocs/op
BenchmarkUnionScanIndexLookUpDescRead:
1744 673479 ns/op 248234 B/op 2820 allocs/op
master:
BenchmarkUnionScanTableReadDescRead:
1826 599392 ns/op 215502 B/op 2674 allocs/op
BenchmarkUnionScanIndexReadDescRead:
1964 593482 ns/op 219032 B/op 2740 allocs/op
BenchmarkUnionScanIndexLookUpDescRead:
1791 659653 ns/op 320267 B/op 2815 allocs/op
dataset: tpch10
sql1: explain analyze select L_COMMENT, L_EXTENDEDPRICE from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;
sql2: explain analyze select * from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.