Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use row encoding for SortExec #5292

Closed
wants to merge 44 commits into from
Closed

Conversation

jaylmiller
Copy link
Contributor

Which issue does this PR close?

Closes #5230.

Rationale for this change

Using arrow row format for SortExec should improve sorting speed. Additionally preserving that row encoding to be used
in SortPreservingMergeExec should improve perf by minimizing the amount of encoding needed to be done.

What changes are included in this PR?

  • Changes are all made within the sorts module (datafusion/core/src/physical_plan/sorts).
  • Had to change several files across the module to allow for streaming row encoding data between SortExec and SortPreservingMergeExec

Are these changes tested?

  • Existing unit tests for SortExec/SortPreservingMergeExec cover this for the most part.
  • For the new structs defined for handling selections of rows have new test.

Are there any user-facing changes?

Had to add a few new public exports since SortKeyCursor needs to be public (since it is used in an integration test).

@github-actions github-actions bot added the core Core DataFusion crate label Feb 15, 2023
update memory sizes in spill related unit tests to account for rows

limit sorts dont use row encoding.
@ozankabak
Copy link
Contributor

This looks good. Now that we have the benchmark too, can we have some numbers quantifying the performance from a before vs. after perspective?

One, we would make sure that we are really helping 🙂 Two, we can announce this to the community at large if the numbers look good!

@jaylmiller
Copy link
Contributor Author

jaylmiller commented Feb 15, 2023

This looks good. Now that we have the benchmark too, can we have some numbers quantifying the performance from a before vs. after perspective?

One, we would make sure that we are really helping 🙂 Two, we can announce this to the community at large if the numbers look good!

I could post some preliminary bench results from my macbook once I finish up this last thing (row serialization).

Also there are some regressions on the merge bench that I want to try and cleanup as well before taking this out of draft status 💪

- use SortedStream instead of dynamic dispatch everywhere, gets rid of 
all the Box::pin being when passing streams between funcs
fix lil typo error in sort bench
@jaylmiller jaylmiller marked this pull request as ready for review February 16, 2023 23:02
@jaylmiller jaylmiller marked this pull request as draft February 17, 2023 00:46
@jaylmiller jaylmiller marked this pull request as ready for review February 17, 2023 01:25
@alamb
Copy link
Contributor

alamb commented Mar 14, 2023

What is the status of this PR? Does it have more work needed or is it ready for final review? Or does it need more benchmark results?

@jaylmiller
Copy link
Contributor Author

jaylmiller commented Mar 14, 2023

Coding-wise everything is finished and code is ready to review. But in terms of bench results, I'm not 100% confident yet.

Sort micro-benchmarks are looking pretty good: significant improvements on cases where row encoding is actually used, minor regressions--mostly within error bars--on cases without row encoding but of course more experienced contributors would know better about how significant these regressions actually are:

group                                                     main-sort                                rows-sort
-----                                                     ---------                                ---------
sort f64                                                  1.00     10.8±0.23ms        ? ?/sec      1.04     11.2±0.93ms        ? ?/sec
sort f64 preserve partitioning                            1.00      4.0±0.27ms        ? ?/sec      1.04      4.1±0.28ms        ? ?/sec
sort i64                                                  1.00      9.5±0.55ms        ? ?/sec      1.09     10.3±0.74ms        ? ?/sec
sort i64 preserve partitioning                            1.00      3.3±0.10ms        ? ?/sec      1.06      3.5±0.13ms        ? ?/sec
sort mixed tuple                                          1.28     28.3±3.35ms        ? ?/sec      1.00     22.2±1.60ms        ? ?/sec
sort mixed tuple preserve partitioning                    1.00      3.6±0.17ms        ? ?/sec      1.15      4.1±1.09ms        ? ?/sec
sort mixed utf8 dictionary tuple                          2.84     52.7±8.27ms        ? ?/sec      1.00     18.6±1.29ms        ? ?/sec
sort mixed utf8 dictionary tuple preserve partitioning    1.02      4.2±0.92ms        ? ?/sec      1.00      4.1±0.55ms        ? ?/sec
sort utf8 dictionary                                      1.00      3.7±0.21ms        ? ?/sec      1.04      3.9±0.33ms        ? ?/sec
sort utf8 dictionary preserve partitioning                1.00  1487.2±1444.67µs        ? ?/sec    1.01  1502.8±315.79µs        ? ?/sec
sort utf8 dictionary tuple                                3.26    57.0±11.35ms        ? ?/sec      1.00     17.5±2.08ms        ? ?/sec
sort utf8 dictionary tuple preserve partitioning          1.13      4.1±1.08ms        ? ?/sec      1.00      3.6±0.52ms        ? ?/sec
sort utf8 high cardinality                                1.01     28.0±3.70ms        ? ?/sec      1.00     27.6±3.81ms        ? ?/sec
sort utf8 high cardinality preserve partitioning          1.00     11.1±1.48ms        ? ?/sec      1.21     13.5±3.38ms        ? ?/sec
sort utf8 low cardinality                                 1.00     15.3±5.08ms        ? ?/sec      1.10     16.9±6.20ms        ? ?/sec
sort utf8 low cardinality preserve partitioning           1.03      8.1±2.21ms        ? ?/sec      1.00      7.8±1.75ms        ? ?/sec
sort utf8 tuple                                           1.96     56.8±8.36ms        ? ?/sec      1.00     29.0±4.82ms        ? ?/sec
sort utf8 tuple preserve partitioning                     1.02      6.7±0.95ms        ? ?/sec      1.00      6.5±0.46ms        ? ?/sec

In summary, I'd like to get an opinion on these micro bench results. And then also ideally, we can run the e2e bench comparisons (#5561) on tpch and parquet and get a bit more data on whether this change is worth merging.

@alamb
Copy link
Contributor

alamb commented Mar 15, 2023

In summary, I'd like to get an opinion on these micro bench results. And then also ideally, we can run the e2e bench comparisons (#5561) on tpch and parquet and get a bit more data on whether this change is worth merging.

Ok, cool -- thank you -- I will put this on my worklist for tomorrow morning. Thank you

@alamb
Copy link
Contributor

alamb commented Mar 16, 2023

Starting to run some tests locally / review this PR more carefully

@alamb
Copy link
Contributor

alamb commented Mar 16, 2023

I ran this command against both main and this branch:

cargo run --release --bin parquet -- sort  --path ~/access-log-gen/ --scale-factor 1.0

The results seem to be consistent with the microbenchmarks reported above that this branch is currently slower in several cases

main_sort.txt
new_sort.txt

@jaylmiller
Copy link
Contributor Author

Thanks @alamb . Any ideas on what might be going on? I'm a bit stumped 😅

@alamb
Copy link
Contributor

alamb commented Mar 17, 2023

Thanks @alamb . Any ideas on what might be going on? I'm a bit stumped 😅

I am also a bit stumped. I think I need to sit down and get some profiling data. I'll try and do it later today but realistically it probably won't be until this weekend.

@jaylmiller
Copy link
Contributor Author

No rush! Does anything stick out to you in this small block of code (this is where the root of the perf issues might be coming from...)

https://github.com/jaylmiller/arrow-datafusion/blob/d20263866a77720da33b69a6416cd086574f6968/datafusion/core/src/physical_plan/sorts/sort.rs#L1087-L1112

@alamb
Copy link
Contributor

alamb commented Mar 19, 2023

No rush! Does anything stick out to you in this small block of code (this is where the root of the perf issues might be coming from...)

https://github.com/jaylmiller/arrow-datafusion/blob/d20263866a77720da33b69a6416cd086574f6968/datafusion/core/src/physical_plan/sorts/sort.rs#L1103-L1105

            let mut to_sort: Vec<(usize, Row)> = rows.into_iter().enumerate().collect();
            to_sort.sort_unstable_by(|(_, row_a), (_, row_b)| row_a.cmp(row_b));
            let sorted_indices = to_sort.iter().map(|(idx, _)| *idx).collect::<Vec<_>>();

I think the sort_unstable_by will effectively copy the rows around (to sort them in the vec) and then calling the take kernel will effectively copy them again to form the output.

Update -- not sure about this anymore

I am going to run some profiling experiments and see if I can confirm that this copy takes a significant amount of the execution time.

@alamb
Copy link
Contributor

alamb commented Mar 19, 2023

I ran some profiling of using datafusion-cli to run this query

create table  test as select * from  '/tmp/logs.parquet' ORDER BY request_method;

like

./target/release/datafusion-cli  -f /tmp/script.sql

Here is the flamegraph that came from hotspot. I tried to annotate it the best I can but I am not quite sure how to explain it

Screenshot 2023-03-19 at 7 12 57 AM

I haven't had a chance to study this PR in detail -- but I think we are at the point where we should profile the benches / queries that got slower and figure out what is going on.

Given we don't have much (any?) measurements showing a performance improvement I feel like something else must be going on

@tustvold
Copy link
Contributor

tustvold commented Mar 19, 2023

MutableArrayData would suggest the majority of time is spent reconstructing the sorted data, it is not impossible that decoding from the sorted rows may be faster, or using the interleave kernel. At the very least this should reduce the amount of realloc.

@jaylmiller
Copy link
Contributor Author

jaylmiller commented Mar 19, 2023

I think we are at the point where we should profile the benches / queries that got slower and figure out what is going on.

Thanks for the ideas and suggestions. I will start looking into this more.

Given we don't have much (any?) measurements showing a performance improvement I feel like something else must be going on

There are several micro bench cases with significant improvements (mostly the tuple sort cases) which is why i'm a bit confused why the e2e parquet bench is showing worse perf on the tuple sorts

MutableArrayData would suggest the majority of time is spent reconstructing the sorted data, it is not impossible that decoding from the sorted rows may be faster, or using the interleave kernel. At the very least this should reduce the amount of realloc.

Will look into the interleave kernel that is a good idea thanks

@alamb
Copy link
Contributor

alamb commented Mar 28, 2023

Marking as draft to signify this PR has feedback and is not waiting for another review at the moment.

@alamb alamb marked this pull request as draft March 28, 2023 20:27
@jaylmiller
Copy link
Contributor Author

jaylmiller commented Apr 5, 2023

Marking as draft to signify this PR has feedback and is not waiting for another review at the moment.

Sorry about the inactivity @alamb . Update on my end is that I kind of hit a wall with this. I see that @tustvold has some PRs open that seem like they are working towards tackling using row encoding for SortExec (probably will end up being better than my attempt here anyways). So perhaps we should just close this one?

@alamb
Copy link
Contributor

alamb commented Apr 6, 2023

Update on my end is that I kind of hit a wall with this. I see that @tustvold has some PRs open that seem like they are working towards tackling using row encoding for SortExec (probably will end up being better than my attempt here anyways). So perhaps we should just close this one?

I think that would be best. Thank you for your efforts @jaylmiller -- I did not realize how tricky this would turn out to be. I think your work and initial findings have significantly influenced what @tustvold is up to so while maybe this code won't get merged as is it will live on in spirit.

Thank you very much, again

@alamb alamb closed this Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Arrow Row Format in SortExec to improve performance
4 participants