Speed up `create_batch_from_map` #339

Dandandan · 2021-05-14T15:42:12Z

Which issue does this PR close?

Closes #338
Closes #431

To be reviewed/merged after #320

db-benchmark results before this PR:

q1 took 33 ms
q2 took 369 ms
q3 took 1875 ms
q4 took 46 ms
q5 took 1756 ms
q7 took 1686 ms
q10 OOM

This PR (~20% faster for queries with smaller groups, NO OOM)

q1 took 34 ms
q2 took 323 ms
q3 took 1355 ms
q4 took 49 ms
q5 took 1252 ms
q7 took 1294 ms
q10 took 9550 ms

Rationale for this change

Previously, arrays were created per row in an inefficient way:

There is overhead for generating the array structure for each row by using ScalarValue::to_array.
The single-row arrays are concatenated at the end, which is slow and would be unnecessary if the full arrays were built directly.
Intermediate Vecs are generated, causing more memory usage / allocations / fragmentation.
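
To make the difference concrete, here is a minimal std-only sketch of the two patterns. `Int64Column`, `build_per_row` and `build_direct` are hypothetical stand-ins invented for illustration; they are not Arrow or DataFusion types, and the real code works on Arrow arrays.

```rust
/// Hypothetical stand-in for a nullable Int64 array: values plus one validity flag per slot.
#[derive(Debug, Default, PartialEq)]
struct Int64Column {
    values: Vec<i64>,
    valid: Vec<bool>,
}

/// Old pattern: build a one-row column per scalar, then concatenate them all.
fn build_per_row(scalars: &[Option<i64>]) -> Int64Column {
    // One small allocation pair per row, kept alive in an intermediate Vec.
    let singles: Vec<Int64Column> = scalars
        .iter()
        .map(|&s| Int64Column {
            values: vec![s.unwrap_or_default()],
            valid: vec![s.is_some()],
        })
        .collect();

    // Concatenation copies every value a second time.
    let mut out = Int64Column::default();
    for single in &singles {
        out.values.extend_from_slice(&single.values);
        out.valid.extend_from_slice(&single.valid);
    }
    out
}

/// New pattern: a single pass that pushes straight into the final buffers.
fn build_direct(scalars: impl IntoIterator<Item = Option<i64>>) -> Int64Column {
    let iter = scalars.into_iter();
    let (lower, _) = iter.size_hint();
    let mut out = Int64Column {
        values: Vec::with_capacity(lower),
        valid: Vec::with_capacity(lower),
    };
    for s in iter {
        out.values.push(s.unwrap_or_default());
        out.valid.push(s.is_some());
    }
    out
}
```

Both functions produce the same column; the second one avoids the per-row allocations and the final copy.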

What changes are included in this PR?

Using ScalarValue::iter_to_array to create arrays instead, removing use of most intermediate Vecs / Arrays and concatenation.
This is not as efficient as it could be when data was already contained in typed/contiguous memory, but should be OK for most queries, and much better than before this PR.
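
As a rough illustration of the idea behind building one array from an iterator of scalars, here is a simplified std-only model. The `Scalar` and `Column` enums and the `iter_to_column` function are assumptions made up for this sketch; the actual `ScalarValue::iter_to_array` covers many more types and returns Arrow arrays.

```rust
/// Hypothetical, reduced model of a dynamically typed scalar.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

/// Hypothetical typed column produced from a stream of scalars.
#[derive(Debug, PartialEq)]
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

/// Peek at the first scalar to pick the output type, then collect all values
/// in one pass. A scalar of a different type is reported as an error.
fn iter_to_column(scalars: impl IntoIterator<Item = Scalar>) -> Result<Column, String> {
    let mut iter = scalars.into_iter().peekable();

    let first_is_int = match iter.peek() {
        None => return Err("empty iterator: cannot infer a type".to_string()),
        Some(Scalar::Int64(_)) => true,
        Some(Scalar::Utf8(_)) => false,
    };

    if first_is_int {
        iter.map(|s| match s {
            Scalar::Int64(v) => Ok(v),
            other => Err(format!("expected Int64, got {:?}", other)),
        })
        .collect::<Result<Vec<_>, String>>()
        .map(Column::Int64)
    } else {
        iter.map(|s| match s {
            Scalar::Utf8(v) => Ok(v),
            other => Err(format!("expected Utf8, got {:?}", other)),
        })
        .collect::<Result<Vec<_>, String>>()
        .map(Column::Utf8)
    }
}
```

The values still pass through a per-row enum, which is why this is not as fast as reading from already typed, contiguous memory.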

My view is that at some point data in aggregations should be stored in contiguous arrays and only referenced (with offsets) from other places.
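
One possible shape for that, sketched with std only: group keys live back to back in one buffer, and the lookup table stores only group indices into it. `GroupState` and its fields are hypothetical names for this sketch and do not describe the hash aggregate code in this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical aggregation state for COUNT(*) grouped by a string key.
#[derive(Default)]
struct GroupState {
    key_bytes: String,                // all group keys, stored contiguously
    key_offsets: Vec<usize>,          // group i spans key_offsets[i]..key_offsets[i + 1]
    counts: Vec<u64>,                 // accumulator state, one entry per group
    lookup: HashMap<u64, Vec<usize>>, // key hash -> candidate group indices
}

impl GroupState {
    fn new() -> Self {
        GroupState {
            key_offsets: vec![0],
            ..Default::default()
        }
    }

    /// Borrow the key of a group from the contiguous buffer.
    fn key(&self, group: usize) -> &str {
        &self.key_bytes[self.key_offsets[group]..self.key_offsets[group + 1]]
    }

    /// Count one more row for `key`, creating the group on first sight.
    fn update(&mut self, key: &str) {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let hash = hasher.finish();

        // Resolve hash collisions by comparing against the stored key bytes.
        let existing = self
            .lookup
            .get(&hash)
            .and_then(|candidates| candidates.iter().copied().find(|&g| self.key(g) == key));

        let group = match existing {
            Some(g) => g,
            None => {
                let g = self.counts.len();
                self.key_bytes.push_str(key);
                self.key_offsets.push(self.key_bytes.len());
                self.counts.push(0);
                self.lookup.entry(hash).or_default().push(g);
                g
            }
        };
        self.counts[group] += 1;
    }
}
```

With this layout the output columns are already contiguous when the aggregation finishes, so no per-row conversion is needed at the end.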

Are there any user-facing changes?

No

alamb · 2021-05-25T19:44:54Z

> My view is that at some point data in aggregations should be stored in contiguous arrays and only referenced (with offsets) from other places.

I think this makes a lot of sense.

The reason we can't use Arrow arrays for this is that for now they are not mutable -- making some version of an ArrowVec would be helpful (I think I remember @ritchie46 mentioning he made something like this for polars-rs)
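
For illustration only, a mutable column of that kind could look roughly like the sketch below. `MutableInt64` is a hypothetical type written for this comment; it is not the polars-rs structure mentioned above and not an Arrow API.

```rust
/// Hypothetical growable, nullable i64 column with a packed validity bitmap.
#[derive(Debug, Default)]
struct MutableInt64 {
    values: Vec<i64>,
    validity: Vec<u8>, // one bit per slot, 1 = valid, least significant bit first
}

impl MutableInt64 {
    fn len(&self) -> usize {
        self.values.len()
    }

    /// Set or clear the validity bit of slot `i`.
    fn set_valid(&mut self, i: usize, valid: bool) {
        let mask = 1u8 << (i % 8);
        if valid {
            self.validity[i / 8] |= mask;
        } else {
            self.validity[i / 8] &= !mask;
        }
    }

    /// Append a value, growing the bitmap by one byte every eight slots.
    fn push(&mut self, v: Option<i64>) {
        let i = self.values.len();
        if i % 8 == 0 {
            self.validity.push(0);
        }
        self.values.push(v.unwrap_or_default());
        self.set_valid(i, v.is_some());
    }

    /// Overwrite an existing slot in place, e.g. to update an accumulator.
    fn set(&mut self, i: usize, v: Option<i64>) {
        self.values[i] = v.unwrap_or_default();
        self.set_valid(i, v.is_some());
    }

    /// Read a slot; out-of-range indices and null slots both yield `None`.
    fn get(&self, i: usize) -> Option<i64> {
        if i >= self.values.len() {
            return None;
        }
        if self.validity[i / 8] & (1u8 << (i % 8)) != 0 {
            Some(self.values[i])
        } else {
            None
        }
    }
}
```

The in-place `set` is the part an immutable array cannot offer.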

alamb

Thanks @Dandandan -- this is quite cool.

There appears to be a test failure on this PR. I can't say I followed all the details, but the overall approach looks really nice

FYI @jimexist -- the signature of ScalarValue::iter_to_array is changed in this PR

alamb · 2021-05-25T19:48:06Z

datafusion/src/scalar.rs

-    pub fn iter_to_array<'a>(
-        scalars: impl IntoIterator<Item = &'a ScalarValue>,
+    pub fn iter_to_array(
+        scalars: impl IntoIterator<Item = ScalarValue>,


Yeah, this is unfortunate -- I was trying too hard to avoid the need for owned ScalarValues -- but I think since ScalarValues effectively own the underlying storage, if the source data is in some other form, you end up having to create one anyway.

But I think this change is for the better; 👍
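
A small std-only sketch of what the owned signature means for callers. `Scalar`, `iter_to_values`, `from_stored` and `from_raw` are hypothetical names used for illustration, not the DataFusion API.

```rust
/// Hypothetical one-variant model of an owned scalar.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
}

/// Consumer that takes owned scalars, mirroring the new signature style.
fn iter_to_values(scalars: impl IntoIterator<Item = Scalar>) -> Vec<Option<i64>> {
    scalars.into_iter().map(|Scalar::Int64(v)| v).collect()
}

/// Caller that already holds scalars by reference: it now has to clone them.
fn from_stored(stored: &[Scalar]) -> Vec<Option<i64>> {
    iter_to_values(stored.iter().cloned())
}

/// Caller whose data is in another form: scalars are created on the fly and
/// handed over directly. With an `Item = &Scalar` signature this caller would
/// first need to collect them into a temporary Vec just to have something to borrow.
fn from_raw(raw: &[i64]) -> Vec<Option<i64>> {
    iter_to_values(raw.iter().map(|&v| Scalar::Int64(Some(v))))
}
```

The first caller pays for a clone per value, the second one saves an intermediate Vec, which is the trade-off described above.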

Dandandan · 2021-05-25T20:06:15Z

datafusion/src/scalar.rs

@@ -381,19 +380,74 @@ impl ScalarValue {
                                )))
                            }
                        })
-                        .collect::<Result<Vec<_>>>()?;
-
-                    // it is annoying that one can not create


FYI @alamb, I also saw some opportunity to simplify / optimize build_array_primitive / build_array_string

codecov-commenter · 2021-05-25T20:27:23Z

Codecov Report

Merging #339 (f8bfe3b) into master (ee8b5bf) will decrease coverage by 0.06%.
The diff coverage is 60.95%.

@@            Coverage Diff             @@
##           master     #339      +/-   ##
==========================================
- Coverage   74.86%   74.79%   -0.07%     
==========================================
  Files         146      146              
  Lines       24495    24607     +112     
==========================================
+ Hits        18338    18406      +68     
- Misses       6157     6201      +44

Impacted Files	Coverage Δ
datafusion/src/scalar.rs	`56.19% <42.85%> (-2.48%)`	⬇️
datafusion/src/physical_plan/hash_aggregate.rs	`86.54% <97.14%> (+1.32%)`	⬆️
datafusion-cli/src/print_format.rs	`84.44% <0.00%> (-5.97%)`	⬇️
...tafusion/src/physical_plan/datetime_expressions.rs	`67.29% <0.00%> (-2.52%)`	⬇️
datafusion/src/physical_plan/functions.rs	`92.70% <0.00%> (-0.08%)`	⬇️
datafusion-cli/src/main.rs	`0.00% <0.00%> (ø)`
benchmarks/src/bin/tpch.rs	`30.84% <0.00%> (+0.01%)`	⬆️
datafusion/src/optimizer/filter_push_down.rs	`97.78% <0.00%> (+0.04%)`	⬆️
datafusion/src/optimizer/constant_folding.rs	`91.69% <0.00%> (+0.05%)`	⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jorgecarleitao

LGTM. Great stuff!

Dandandan mentioned this pull request May 19, 2021

Add Compare to GroupByScalar #364

Closed

jorgecarleitao mentioned this pull request May 25, 2021

Experimenting with arrow2 #68

Closed

Dandandan mentioned this pull request May 25, 2021

Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426

Merged

4 tasks

Rebase changes

96ca0d6

Dandandan force-pushed the speed_hash_aggregate branch from 515a9bc to 96ca0d6 Compare May 25, 2021 17:58

Fmt

acf81c1

Dandandan marked this pull request as ready for review May 25, 2021 19:06

Use macro

1100c26

Dandandan requested review from jorgecarleitao, alamb and andygrove May 25, 2021 19:11

Support floats too

214eb7e

alamb approved these changes May 25, 2021

Avoid temporary vec for primitive / string

9cc0ea0

Dandandan commented May 25, 2021

Clippy

f8bfe3b

Dandandan mentioned this pull request May 26, 2021

Simplified creation of array from scalar. #432

Closed

jorgecarleitao approved these changes May 27, 2021

jorgecarleitao merged commit 9e7bd2d into apache:master May 27, 2021

jorgecarleitao mentioned this pull request May 27, 2021

Fixed typo / logical merge conflict #433

Merged

houqp added the datafusion (Changes in the datafusion crate) and performance (Make DataFusion faster) labels Jul 29, 2021
