Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up create_batch_from_map #339

Merged
merged 6 commits into from
May 27, 2021

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented May 14, 2021

Which issue does this PR close?

Closes #338
Closes #431

To be reviewed/merged after #320

Benchmark results db-benchmark:

#320

q1 took 33 ms
q2 took 369 ms
q3 took 1875 ms
q4 took 46 ms
q5 took 1756 ms
q7 took 1686 ms
q10 OOM

This PR (~20% faster for queries with smaller groups, NO OOM)

q1 took 34 ms
q2 took 323 ms
q3 took 1355 ms
q4 took 49 ms
q5 took 1252 ms
q7 took 1294 ms
q10 took 9550 ms

Rationale for this change

Previously, arrays were created per-row in a inefficient way:

  • There is overhead for generating the array structure for each row by Using ScalarValue::to_array
  • The single-row arrays are concatenated afterwards at the end, which is slow and would be unnecessary if they are created immediately instead
  • Intermediate Vecs are generated, causing more memory usage / allocations / fragmentation.

What changes are included in this PR?

Using ScalarValue::iter_to_array to create arrays instead, removing use of most intermediate Vecs / Arrays and concatenation.
This is not as efficient as it could be when data was already contained in typed/contiguous memory, but should be OK for most queries, and much better than before this PR.

My view is that at some point data in aggregations should be stored in contiguous arrays and only referenced (with offsets) to from other places.

Are there any user-facing changes?

No

@Dandandan Dandandan marked this pull request as ready for review May 25, 2021 19:06
@alamb
Copy link
Contributor

alamb commented May 25, 2021

My view is that at some point data in aggregations should be stored in contiguous arrays and only referenced (with offsets) to from other places.

I think this makes a lot of sense.

The reason we can't use Arrow arrays for this is that for now they are not mutable -- making some version of an ArrowVec would be helpful (I think I remember @ritchie46 mentioning he made something like this for polars-rs)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dandandan -- this is quite cool.

There appears to be a test failure on this PR. I can't say I followed all the details, but the overall approach looks really nice

FYI @jimexist -- the signature of ScalarValue::iter_to_array is changed in this PR

pub fn iter_to_array<'a>(
scalars: impl IntoIterator<Item = &'a ScalarValue>,
pub fn iter_to_array(
scalars: impl IntoIterator<Item = ScalarValue>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is unfortunate -- I was trying to hard to avoid the need for owned ScalarValues -- but I think since SclarValues effectively own the underlying storage, if the source data is in some other form, you end up having to create one anyways.

But I think this change is for the better; 👍

@@ -381,19 +380,74 @@ impl ScalarValue {
)))
}
})
.collect::<Result<Vec<_>>>()?;

// it is annoying that one can not create
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @alamb also saw some opportunity simplifying / optimizing build_array_primitive / build_array_string

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks 💯

@codecov-commenter
Copy link

codecov-commenter commented May 25, 2021

Codecov Report

Merging #339 (f8bfe3b) into master (ee8b5bf) will decrease coverage by 0.06%.
The diff coverage is 60.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #339      +/-   ##
==========================================
- Coverage   74.86%   74.79%   -0.07%     
==========================================
  Files         146      146              
  Lines       24495    24607     +112     
==========================================
+ Hits        18338    18406      +68     
- Misses       6157     6201      +44     
Impacted Files Coverage Δ
datafusion/src/scalar.rs 56.19% <42.85%> (-2.48%) ⬇️
datafusion/src/physical_plan/hash_aggregate.rs 86.54% <97.14%> (+1.32%) ⬆️
datafusion-cli/src/print_format.rs 84.44% <0.00%> (-5.97%) ⬇️
...tafusion/src/physical_plan/datetime_expressions.rs 67.29% <0.00%> (-2.52%) ⬇️
datafusion/src/physical_plan/functions.rs 92.70% <0.00%> (-0.08%) ⬇️
datafusion-cli/src/main.rs 0.00% <0.00%> (ø)
benchmarks/src/bin/tpch.rs 30.84% <0.00%> (+0.01%) ⬆️
datafusion/src/optimizer/filter_push_down.rs 97.78% <0.00%> (+0.04%) ⬆️
datafusion/src/optimizer/constant_folding.rs 91.69% <0.00%> (+0.05%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee8b5bf...f8bfe3b. Read the comment docs.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great stuff!

@jorgecarleitao jorgecarleitao merged commit 9e7bd2d into apache:master May 27, 2021
@houqp houqp added datafusion Changes in the datafusion crate performance Make DataFusion faster labels Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate performance Make DataFusion faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify creation of array from iterator of scalars Speed up create_batch_from_map
5 participants