executor: remove redundant memory pre-allocations in parallel sort executor #54073

Merged
merged 4 commits into from
Jun 24, 2024

Conversation

Contributor

@xzhangxian1008 xzhangxian1008 commented Jun 18, 2024

What problem does this PR solve?

Issue Number: close #54070

Problem Summary:

What changed and how does it work?

From Clinic we can see that TiDB's memory usage is higher than normal and that most of the memory is allocated in the sort executor. We therefore suspect that excessive memory allocation caused the performance regression.
[image: mem-cmp]
[image: memory]

To verify this suspicion, we checked the CPU usage and found that most of the CPU time is consumed by memory allocation in the sort executor. Moreover, the GC STW duration is very high. So I think we can conclude that excessive memory allocation is what causes the performance regression in benchbot.
[image: cpu]
[image: stw]

To avoid memory reallocation when a slice grows, we set a large capacity when creating the slices. However, reserving that much memory is wasteful, and the total waste becomes very large because TP SQL statements usually process only a few rows while sort executors are created many times (as shown in the following picture). So I think we can remove the pre-allocation in the sort executor to fix this regression. I do not expect this fix to have a strong impact on sort performance, because the main bottleneck in parallel sort is I/O, not memory allocation.
[image: sort]
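
To illustrate the waste described above, here is a minimal standalone sketch, not the TiDB code. The `row` type is a hypothetical stand-in for `chunk.Row`, and the value of `limit` is an assumption; the real `maxSortedRowsLimit` may differ. It compares how much backing memory a slice reserves when it is created with a large capacity versus with none, for a query that only touches a few rows.

```go
package main

import (
	"fmt"
	"unsafe"
)

// row is a hypothetical stand-in for chunk.Row (a pointer plus an index).
type row struct {
	c   *int
	idx int
}

// reservedBytes reports how much backing memory a slice holds,
// regardless of how many elements are actually in use.
func reservedBytes(rows []row) uintptr {
	return uintptr(cap(rows)) * unsafe.Sizeof(row{})
}

// collect appends n rows to buf and returns the resulting slice.
func collect(buf []row, n int) []row {
	for i := 0; i < n; i++ {
		buf = append(buf, row{idx: i})
	}
	return buf
}

func main() {
	const limit = 1 << 16 // assumed value, for illustration only
	const n = 10          // a TP query usually processes only a few rows

	eager := collect(make([]row, 0, limit), n) // capacity reserved up front
	lazy := collect(make([]row, 0), n)         // capacity grows on demand

	fmt.Println("eager reserves", reservedBytes(eager), "bytes for", len(eager), "rows")
	fmt.Println("lazy reserves", reservedBytes(lazy), "bytes for", len(lazy), "rows")
}
```

Both slices hold the same ten rows, but the eager one keeps the full reserved capacity alive for every executor instance, which is the cost that adds up when many short-lived sort executors are created.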

With this PR, the performance regression is fixed.
[image: 20240618-152416]

fixed:
BenchmarkUnionScanTableReadDescRead      1863    618281 ns/op    148918 B/op    2677 allocs/op
BenchmarkUnionScanIndexReadDescRead      1922    620293 ns/op    153910 B/op    2750 allocs/op
BenchmarkUnionScanIndexLookUpDescRead    1744    673479 ns/op    248234 B/op    2820 allocs/op

master:
BenchmarkUnionScanTableReadDescRead      1826    599392 ns/op    215502 B/op    2674 allocs/op
BenchmarkUnionScanIndexReadDescRead      1964    593482 ns/op    219032 B/op    2740 allocs/op
BenchmarkUnionScanIndexLookUpDescRead    1791    659653 ns/op    320267 B/op    2815 allocs/op
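
The B/op and allocs/op columns above are the kind of figures Go's benchmark memory reporting produces. As a rough sketch of how the same effect could be measured in isolation, the program below uses `testing.Benchmark` from the standard library outside of `go test`. The `row` type, the `fill` helper, and the `limit` constant are all hypothetical and are not taken from the TiDB code base; the numbers it prints depend on the machine.

```go
package main

import (
	"fmt"
	"testing"
)

// row is a hypothetical stand-in for chunk.Row.
type row struct {
	c   *int
	idx int
}

// limit is an assumed capacity, for illustration only.
const limit = 1 << 16

// sink keeps results reachable so the allocations are not optimized away.
var sink []row

// fill appends n rows to buf and returns the resulting slice.
func fill(buf []row, n int) []row {
	for i := 0; i < n; i++ {
		buf = append(buf, row{idx: i})
	}
	return buf
}

// benchEager pre-allocates the full capacity on every iteration.
func benchEager(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = fill(make([]row, 0, limit), 10)
	}
}

// benchLazy starts with no capacity and lets append grow the slice.
func benchLazy(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = fill(make([]row, 0), 10)
	}
}

func main() {
	eager := testing.Benchmark(benchEager)
	lazy := testing.Benchmark(benchLazy)
	fmt.Println("eager:", eager.String(), eager.MemString())
	fmt.Println("lazy: ", lazy.String(), lazy.MemString())
}
```

In this sketch the lazy variant performs more allocations per operation but far fewer bytes per operation, which mirrors the pattern in the results above: allocs/op barely changes while B/op drops.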

dataset: tpch10
sql1: explain analyze select L_COMMENT, L_EXTENDEDPRICE from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;

sql2: explain analyze select * from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;

          Sql1     Sql2
Master    7.30s    15.36s
Fixed     6.94s    16.36s

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Remove redundant memory pre-allocations in parallel sort executor

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 18, 2024

tiprow bot commented Jun 18, 2024

Hi @xzhangxian1008. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@xzhangxian1008
Contributor Author

/cc @windtalker @yibin87

@ti-chi-bot ti-chi-bot bot requested review from windtalker and yibin87 June 18, 2024 03:49
  spillHelper: spillHelper,
- batchRows: make([]chunk.Row, 0, maxSortedRowsLimit),
+ batchRows: make([]chunk.Row, 0),
Member

Can you add some benchmarks for it?

Contributor Author

> Can you add some benchmarks for it?

A parallel sort benchmark?

Member

Yes. BTW, can you check BenchmarkUnionScanTableReadDescRead, BenchmarkUnionScanIndexReadDescRead, and BenchmarkUnionScanIndexLookUpDescRead, which can test parallel sort?

Contributor Author

> Yes. BTW, can you check BenchmarkUnionScanTableReadDescRead, BenchmarkUnionScanIndexReadDescRead, and BenchmarkUnionScanIndexLookUpDescRead, which can test parallel sort?

OK.

Contributor Author

> Yes. BTW, can you check BenchmarkUnionScanTableReadDescRead, BenchmarkUnionScanIndexReadDescRead, and BenchmarkUnionScanIndexLookUpDescRead, which can test parallel sort?

fixed:
BenchmarkUnionScanTableReadDescRead      1863    618281 ns/op    148918 B/op    2677 allocs/op
BenchmarkUnionScanIndexReadDescRead      1922    620293 ns/op    153910 B/op    2750 allocs/op
BenchmarkUnionScanIndexLookUpDescRead    1744    673479 ns/op    248234 B/op    2820 allocs/op

master:
BenchmarkUnionScanTableReadDescRead      1826    599392 ns/op    215502 B/op    2674 allocs/op
BenchmarkUnionScanIndexReadDescRead      1964    593482 ns/op    219032 B/op    2740 allocs/op
BenchmarkUnionScanIndexLookUpDescRead    1791    659653 ns/op    320267 B/op    2815 allocs/op

dataset: tpch10
sql1: explain analyze select L_COMMENT, L_EXTENDEDPRICE from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;

sql2: explain analyze select * from lineitem where L_SUPPKEY > 95000 order by L_COMMENT desc, L_EXTENDEDPRICE asc;

          Sql1     Sql2
Master    7.30s    15.36s
Fixed     6.94s    16.36s

Member

Please add it to the PR description.

Contributor Author

> Please add it to the PR description.

Done.

codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.5476%. Comparing base (8f56847) to head (ebcd77e).
Report is 45 commits behind head on master.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #54073         +/-   ##
=================================================
- Coverage   72.6029%   56.5476%   -16.0554%     
=================================================
  Files          1516       1643        +127     
  Lines        434689     615329     +180640     
=================================================
+ Hits         315597     347954      +32357     
- Misses        99623     244155     +144532     
- Partials      19469      23220       +3751     
Flag Coverage Δ
integration 37.9172% <95.4545%> (?)
unit 71.8359% <100.0000%> (+0.2424%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.9656% <ø> (ø)
parser ∅ <ø> (∅)
br 52.2830% <ø> (+9.9816%) ⬆️

Contributor

@zanmato1984 zanmato1984 left a comment

Could you update the PR description by adding how the performance regression is introduced and how this fixes it?

@xzhangxian1008
Contributor Author

> Could you update the PR description by adding how the performance regression is introduced and how this fixes it?

Updated.

@xzhangxian1008
Contributor Author

/retest

tiprow bot commented Jun 18, 2024

@xzhangxian1008: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

@xzhangxian1008
Contributor Author

/cc @zanmato1984 @hawkingrei

@hawkingrei
Member

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Jun 19, 2024
  spillHelper: spillHelper,
- batchRows: make([]chunk.Row, 0, maxSortedRowsLimit),
+ batchRows: make([]chunk.Row, 0),
Member

You should set a small initial capacity; otherwise the slice will have to resize.

Contributor Author

> You should set a small initial capacity; otherwise the slice will have to resize.

We don't need to reserve a small initial capacity anymore, since I now save the chunks and pre-allocate all of the memory right before sorting.

@@ -129,7 +128,7 @@ func (p *parallelSortWorker) multiWayMergeLocalSortedRows() ([]chunk.Row, error)
  func (p *parallelSortWorker) sortBatchRows() {
  	slices.SortFunc(p.batchRows, p.keyColumnsLess)
  	p.localSortedRows = append(p.localSortedRows, chunk.NewIterator4Slice(p.batchRows))
- 	p.batchRows = make([]chunk.Row, 0, p.maxSortedRowsLimit)
+ 	p.batchRows = make([]chunk.Row, 0)
Member

ditto

Contributor Author

> ditto

Ditto.

@windtalker
Contributor

How about we just save the original chunk in addChunkToBatchRows and only construct batchRows right before sortBatchRows? Then in sortBatchRows we already know the number of rows and can pre-allocate the slice without waste.
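
The suggestion above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code merged in this PR: the `chunk` and `worker` types and the `addChunk` and `sortBuffered` methods are hypothetical names, and rows are reduced to plain integers so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sort"
)

// chunk is a hypothetical stand-in for a TiDB chunk: a batch of rows.
type chunk struct {
	keys []int
}

// worker buffers whole chunks and defers building the row slice until sort time.
type worker struct {
	chunks   []*chunk
	rowCount int // total number of rows across the buffered chunks
}

// addChunk only records the chunk and updates the running row count.
func (w *worker) addChunk(c *chunk) {
	w.chunks = append(w.chunks, c)
	w.rowCount += len(c.keys)
}

// sortBuffered builds the row slice with exactly the needed capacity,
// sorts it, and resets the buffer for the next batch.
func (w *worker) sortBuffered() []int {
	rows := make([]int, 0, w.rowCount) // exact size is known here: no waste, no regrowth
	for _, c := range w.chunks {
		rows = append(rows, c.keys...)
	}
	sort.Ints(rows)
	w.chunks, w.rowCount = nil, 0
	return rows
}

func main() {
	w := &worker{}
	w.addChunk(&chunk{keys: []int{5, 1, 4}})
	w.addChunk(&chunk{keys: []int{3, 2}})
	rows := w.sortBuffered()
	fmt.Println(rows, len(rows), cap(rows)) // prints "[1 2 3 4 5] 5 5"
}
```

Because the row count is accumulated while chunks arrive, the single `make` call at sort time gets both benefits: the slice never has to grow, and no capacity is reserved for rows that never show up.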

@xzhangxian1008
Contributor Author

> How about we just save the original chunk in addChunkToBatchRows and only construct batchRows right before sortBatchRows? Then in sortBatchRows we already know the number of rows and can pre-allocate the slice without waste.

I think this is a very good idea.

@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 20, 2024
@xzhangxian1008
Contributor Author

> How about we just save the original chunk in addChunkToBatchRows and only construct batchRows right before sortBatchRows? Then in sortBatchRows we already know the number of rows and can pre-allocate the slice without waste.

Done.

@xzhangxian1008
Contributor Author

/cc @windtalker @hawkingrei

@ti-chi-bot ti-chi-bot bot requested a review from hawkingrei June 20, 2024 03:23
@@ -126,28 +127,39 @@ func (p *parallelSortWorker) multiWayMergeLocalSortedRows() ([]chunk.Row, error)
  	return resultSortedRows, nil
  }

+ func (p *parallelSortWorker) fillBatchRows() {
+ 	p.batchRows = make([]chunk.Row, 0, p.rowNumInChunkIters)
Contributor

It looks like if fillBatchRows returns batchRows, there is no need to keep batchRows as a field of parallelSortWorker?

Contributor Author

> It looks like if fillBatchRows returns batchRows, there is no need to keep batchRows as a field of parallelSortWorker?

I have deleted it.

@xzhangxian1008
Contributor Author

/cc @windtalker @hawkingrei

@ti-chi-bot ti-chi-bot bot requested a review from windtalker June 21, 2024 05:34
Contributor

@windtalker windtalker left a comment

LGTM

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jun 21, 2024
@xzhangxian1008
Contributor Author

/retest

@xzhangxian1008
Contributor Author

/retest

ti-chi-bot bot commented Jun 24, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hawkingrei, windtalker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [hawkingrei,windtalker]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 24, 2024

ti-chi-bot bot commented Jun 24, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-06-21 05:55:31.419603964 +0000 UTC m=+353457.905092792: ☑️ agreed by windtalker.
  • 2024-06-24 03:09:26.144269797 +0000 UTC m=+602692.629758630: ☑️ agreed by hawkingrei.

@xzhangxian1008
Contributor Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit bec113a into pingcap:master Jun 24, 2024
23 checks passed
@xzhangxian1008 xzhangxian1008 deleted the fix-54070 branch June 24, 2024 08:42
Labels
approved lgtm ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

PR#53537 caused 20-34% performance regression in oltp_read_only and oltp_read_write
4 participants