Refactor loser tree code in SortPreservingMerge per PR comments #4407

alamb · 2022-11-28T20:46:28Z

Which issue does this PR close?

re #4300.

Rationale for this change

I wanted to get the merge speed improvements into DataFusion
I wanted an excuse to work on the code myself for a bit

What changes are included in this PR?

Implements suggestions from @tustvold and @viirya in #4301

Are these changes tested?

covered by existing tests

TODO: need to run benchmarks

Are there any user-facing changes?

No

tustvold · 2022-11-28T21:03:44Z

datafusion/core/src/physical_plan/sorts/sort_preserving_merge.rs

+    /// The tree update could not be completed (e.g. the input was not
+    /// ready or had an error). The caller should return the `Poll`
+    /// result to its caller
+    Incomplete(Poll<Option<ArrowResult<RecordBatch>>>),


Suggested change

-    Incomplete(Poll<Option<ArrowResult<RecordBatch>>>),
+    Pending,
+    Error(ArrowError),

Given we never seem to return TreeUpdate::Incomplete(Poll::Ready(None)) or TreeUpdate::Incomplete(Poll::Ready(Some(Ok(_))))
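
A minimal sketch of the shape this suggestion leads to, using only the standard library. `SketchError` stands in for `ArrowError`, and the `Complete` variant and the `classify` helper are hypothetical names that do not come from the PR; the actual code may differ.

```rust
use std::task::Poll;

/// Stand-in for `ArrowError` so the sketch needs only the standard library (assumption).
type SketchError = String;

/// Result of one loser tree update with the two "incomplete" cases split out.
/// `Complete` is a hypothetical name for the success case.
#[derive(Debug, PartialEq)]
enum TreeUpdate {
    /// The tree was updated successfully.
    Complete,
    /// An input was not ready yet.
    Pending,
    /// An input returned an error.
    Error(SketchError),
}

/// Hypothetical helper: classify the outcome of polling one input.
/// Only the three outcomes that can actually occur are representable.
fn classify(polled: Poll<Result<(), SketchError>>) -> TreeUpdate {
    match polled {
        Poll::Ready(Ok(())) => TreeUpdate::Complete,
        Poll::Ready(Err(e)) => TreeUpdate::Error(e),
        Poll::Pending => TreeUpdate::Pending,
    }
}
```

Compared with wrapping a full `Poll<Option<...>>`, the two dedicated variants make the unreachable combinations impossible to construct.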

Done in 1bdb25a

alamb · 2022-11-28T22:44:55Z

Here are the performance results. They are somewhat of a mixed bag (some show a few percent less). The full results are in results.txt.

The first run is the second time I ran cargo bench against 0d334cf (i.e. no code change), and it reports some regressions. The second run is with the changes in this PR.

alamb

Given this PR makes the code more readable (in my opinion) I would like to merge it.

@richox or @tustvold do you have any concerns?

If not, I plan to do so over the next day or two.

tustvold

I think removing TreeUpdate would further simplify this code, but I'm happy for this to go in as is

tustvold · 2023-01-11T14:49:13Z

datafusion/core/src/physical_plan/sorts/sort_preserving_merge.rs

+
+/// The result of updating the loser tree. It is the same as an Option
+/// but with specific names for easier readability
+enum TreeUpdate {


I don't really see the need for this over Poll<Result<()>>. In particular, it is confusing what its semantics are w.r.t. wakers.

if let Err(e) = self.build()? { return Poll::Ready(Err(e)) }

is not significantly more verbose.

I tried it out and you are right -- I think Poll made the code better -- in 2237abe
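
A rough sketch of what using `Poll` directly can look like, under stated assumptions: `SketchError` and `SketchBatch` stand in for `ArrowError` and `RecordBatch`, and `build` and `poll_next_sketch` are hypothetical functions, not the code from 2237abe.

```rust
use std::task::Poll;

/// Stand-in for `ArrowError` (assumption, keeps the sketch std-only).
type SketchError = String;
/// Stand-in for `RecordBatch` (assumption).
type SketchBatch = Vec<i32>;

/// Hypothetical tree-building step. It simply echoes the state of its input
/// so that each outcome can be exercised.
fn build(input_state: Poll<Result<(), SketchError>>) -> Poll<Result<(), SketchError>> {
    input_state
}

/// Hypothetical caller in the style of a `Stream::poll_next` body.
fn poll_next_sketch(
    input_state: Poll<Result<(), SketchError>>,
) -> Poll<Option<Result<SketchBatch, SketchError>>> {
    match build(input_state) {
        // In real code `Pending` implies the callee registered the task's
        // waker; this sketch has no waker and only propagates the value.
        Poll::Pending => return Poll::Pending,
        // Surface the error as the next stream item.
        Poll::Ready(Err(e)) => return Poll::Ready(Some(Err(e))),
        // Tree is ready: fall through and produce output.
        Poll::Ready(Ok(())) => {}
    }
    Poll::Ready(Some(Ok(vec![1, 2, 3])))
}
```

Because `Poll::Pending` already has a well-known meaning with respect to wakers, readers do not need to learn the semantics of a custom enum.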

tustvold · 2023-01-11T14:51:46Z

datafusion/core/src/physical_plan/sorts/sort_preserving_merge.rs

+                    self.aborted = true;
+                    return TreeUpdate::Error(e);
+                }
+                Poll::Pending => return TreeUpdate::Pending,


If this method returned Poll, this could just use ready! or even ?
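
To illustrate the two shortcuts mentioned here, the following sketch uses the standard library's `std::task::ready!` macro and the `?` operator on `Poll<Result<T, E>>`. The function names and the `SketchError` alias are hypothetical and only serve the example.

```rust
use std::task::{ready, Poll};

/// Stand-in for `ArrowError` (assumption, keeps the sketch std-only).
type SketchError = String;

/// Hypothetical update step written with `ready!`.
fn update_with_ready(
    input: Poll<Result<(), SketchError>>,
) -> Poll<Result<&'static str, SketchError>> {
    // `ready!` returns `Poll::Pending` from this function early,
    // otherwise it evaluates to the value inside `Poll::Ready`.
    let result: Result<(), SketchError> = ready!(input);
    Poll::Ready(result.map(|()| "updated"))
}

/// Hypothetical update step written with `?` followed by `ready!`.
fn update_with_question_mark(
    input: Poll<Result<(), SketchError>>,
) -> Poll<Result<&'static str, SketchError>> {
    // `?` on `Poll<Result<T, E>>` returns `Poll::Ready(Err(e))` early
    // and otherwise evaluates to a `Poll<T>`.
    let polled: Poll<()> = input?;
    // `ready!` then handles the remaining `Pending` case.
    ready!(polled);
    Poll::Ready(Ok("updated"))
}
```

Either form replaces an explicit `match` with arms for the error and pending cases.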

ursabot · 2023-01-12T22:43:41Z

Benchmark runs are scheduled for baseline = 0d27fcb and contender = 82bbaa3. 82bbaa3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb added 5 commits November 28, 2022 15:09

Add docstrings for Sort Preserving Merge / Loser Tree

c968ae7

refactor: Extract loser tree initialization into its own function

b603f4b

refactor: Extract loser tree update into its own function

b062a36

Update types

5f6bc21

Remove redundant update

731ac13

github-actions bot added the core Core DataFusion crate label Nov 28, 2022

alamb mentioned this pull request Nov 28, 2022

Use tournament loser tree for k-way sort-merging, increase merge speed by 50% #4301

Merged

tustvold reviewed Nov 28, 2022

Add TreeUpdate::Pending and TreeUpdate:Error

1bdb25a

alamb commented Jan 8, 2023

tustvold approved these changes Jan 11, 2023

alamb added 2 commits January 12, 2023 10:36

Merge remote-tracking branch 'apache/master' into alamb/tournament-merging-updates

53d93c0

Simplify using Poll directly

2237abe

alamb merged commit 82bbaa3 into apache:master Jan 12, 2023

alamb deleted the alamb/tournament-merging-updates branch August 8, 2023 20:14
