[ARROW-12441] [DataFusion] Cross join implementation #11

Dandandan · 2021-04-19T05:58:39Z

This is a first (naive, but probably not that bad) implementation of the cross join and CROSS JOIN syntax.

https://issues.apache.org/jira/browse/ARROW-12441

The left side gets loaded into memory and the right side is streamed and gets combined with the left side.

To keep memory usage down, we keep a "cursor" of values on the left side and producing the batches one by one instead of concatenating the result of the full cartesian product.

FYI @andygrove @alamb @jorgecarleitao

This also makes query 9 run in DataFusion (though performance is not OK, but I believe that should be not related to the cross join itself, but is caused by another issue).

codecov-commenter · 2021-04-19T06:06:18Z

Codecov Report

Merging #11 (734b172) into master (c365a4f) will increase coverage by 0.12%.
The diff coverage is 81.33%.

❗ Current head 734b172 differs from pull request most recent head 844cb4b. Consider uploading reports for the commit 844cb4b to get more accurate results

@@            Coverage Diff             @@
##           master      #11      +/-   ##
==========================================
+ Coverage   70.41%   70.53%   +0.12%     
==========================================
  Files         123      124       +1     
  Lines       21261    21453     +192     
==========================================
+ Hits        14970    15132     +162     
- Misses       6291     6321      +30

Impacted Files	Coverage Δ
...lista/rust/core/src/serde/logical_plan/to_proto.rs	`0.00% <ø> (ø)`
datafusion/src/optimizer/constant_folding.rs	`91.95% <0.00%> (-0.36%)`	⬇️
datafusion/src/optimizer/projection_push_down.rs	`98.66% <ø> (ø)`
datafusion/src/physical_plan/hash_utils.rs	`100.00% <ø> (+2.89%)`	⬆️
datafusion/src/physical_plan/mod.rs	`87.09% <ø> (ø)`
datafusion/src/logical_plan/plan.rs	`79.83% <61.53%> (-0.56%)`	⬇️
datafusion/src/physical_plan/cross_join.rs	`76.56% <76.56%> (ø)`
datafusion/src/optimizer/hash_build_probe_order.rs	`62.60% <88.88%> (+6.93%)`	⬆️
benchmarks/src/bin/tpch.rs	`38.86% <100.00%> (+0.52%)`	⬆️
datafusion/src/logical_plan/builder.rs	`89.18% <100.00%> (+0.30%)`	⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c365a4f...844cb4b. Read the comment docs.

jorgecarleitao · 2021-04-19T06:09:59Z

Amazing! :D

Since this will land on the next release, I suggest that we add / migrate the corresponding JIRA issue here, so that we can map issues <> features, bugs, etc. to create a change log.

alamb

I think this is looking very cool -- thanks @Dandandan

I think the input handling of the cross join may be reversed from the intention

datafusion/src/physical_plan/cross_join.rs

alamb · 2021-04-19T12:24:12Z

datafusion/src/physical_plan/cross_join.rs

+    /// left (build) side which gets loaded in memory
+    left: Arc<dyn ExecutionPlan>,
+    /// right (probe) side which are combined with left side
+    right: Arc<dyn ExecutionPlan>,


My reading of the code below is that it is the right side that is loaded into memory and then the left is streamed through.

I think the changes to the planner i this PR are trying to send the smaller input to the left hand input, I think it might make sense to change the implementation to buffer the left side rather than the right side

relevant code:

let merge = MergeExec::new(self.left.clone());

left side should be loaded into memory (is the intention, but I am pretty sure we are doing this?), and we try to make the smallest side the left side (as far as possible).

datafusion/tests/sql.rs

alamb · 2021-04-20T10:38:24Z

I rewrote master to remove the history of the other languages (and 50MB of history). This means this PR will need to be "rebased" against the current master.

The best way I found to do this was to find the relevant commits (via git log) and then cherry-pick them one at a time. For example (change COMMIT-SHA to whatever the commit actually is):

git reset --hard apache/master
git cherry-pick COMMIT-SHA
git push -f your_rep

alamb · 2021-04-20T11:04:29Z

@Dandandan I am happy to rebase this PR against new master. Let me know if that would be helpful

Dandandan · 2021-04-20T18:01:48Z

Looks like this is updated too

Dandandan · 2021-04-20T18:27:52Z

@alamb I updated this and replied with some comments on the locking for the reference to the record batch

datafusion/src/physical_plan/cross_join.rs

* Initial commit * initial commit * failing test * table scan projection * closer * test passes, with some hacks * use DataFrame (#2) * update README * update dependency * code cleanup (#3) * Add support for Filter operator and BinaryOp expressions (#4) * GitHub action (#5) * Split code into producer and consumer modules (#6) * Support more functions and scalar types (#7) * Use substrait 0.1 and datafusion 8.0 (#8) * use substrait 0.1 * use datafusion 8.0 * update datafusion to 10.0 and substrait to 0.2 (#11) * Add basic join support (#12) * Added fetch support (#23) Added fetch to consumer Added limit to producer Added unit tests for limit Added roundtrip_fill_none() for testing when None input can be converted to 0 Update src/consumer.rs Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> * Upgrade to DataFusion 13.0.0 (#25) * Add sort consumer and producer (#24) Add consumer Add producer and test Modified error string * Add serializer/deserializer (#26) * Add plan and function extension support (#27) * Add plan and function extension support * Removed unwraps * Implement GROUP BY (#28) * Add consumer, producer and tests for aggregate relation Change function extension registration from absolute to relative anchor (reference) Remove operator to/from reference * Fixed function registration bug * Add test * Addressed PR comments * Changed field reference from mask to direct reference (#29) * Changed field reference from masked reference to direct reference * Handle unsupported case (struct with child) * Handle SubqueryAlias (#30) Fixed aggregate function register bug * Add support for SELECT DISTINCT (#31) Add test case * Implement BETWEEN (#32) * Add case (#33) * Implement CASE WHEN * Add more case to test * Addressed comments * feat: support explicit catalog/schema names in ReadRel (#34) * feat: support explicit catalog/schema names in ReadRel Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix: use re-exported expr crate Signed-off-by: Ruihang Xia <waynestxia@gmail.com> Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * move files to subfolder * RAT * remove rust.yaml * revert .gitignore changes * tomlfmt * tomlfmt Signed-off-by: Ruihang Xia <waynestxia@gmail.com> Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: JanKaul <jankaul@mailbox.org> Co-authored-by: nseekhao <37189615+nseekhao@users.noreply.github.com> Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

… `interval` (#11466) * Unparser rule for datatime cast (#10) * use timestamp as the identifier for date64 * rename * implement CustomDialectBuilder * fix * dialect with interval style (#11) --------- Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> * fmt * clippy * doc * Update datafusion/sql/src/unparser/expr.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * update the doc for CustomDialectBuilder * fix doc test --------- Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

… `interval` (apache#11466) * Unparser rule for datatime cast (apache#10) * use timestamp as the identifier for date64 * rename * implement CustomDialectBuilder * fix * dialect with interval style (apache#11) --------- Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> * fmt * clippy * doc * Update datafusion/sql/src/unparser/expr.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * update the doc for CustomDialectBuilder * fix doc test --------- Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Dandandan mentioned this pull request Apr 19, 2021

ARROW-12441: [Rust][DataFusion] Support cartesian join apache/arrow#10092

Closed

alamb reviewed Apr 19, 2021

View reviewed changes

jorgecarleitao added the datafusion Changes in the datafusion crate label Apr 20, 2021

alamb force-pushed the master branch from 8345b5e to 4123600 Compare April 20, 2021 10:34

Dandandan added 12 commits April 20, 2021 19:59

Cross join implementation

da21e95

Add to ballista, debug line

ba01761

Add to tpch test, format

4f0ab2c

Simplify a bit

5456485

Row-by-row processing for the left side to keep memory down

2dc88df

Fix

129a00b

Fmt

c07e869

Clippy

f6b762f

Fix doc, don't include as much debug info in memoryexec debug

a6cef8f

Use join

01eec41

Fix doc

9a692d6

Add test cases with partitions

178a349

Dandandan force-pushed the cross_join branch from 0007743 to 178a349 Compare April 20, 2021 18:01

Dandandan added 2 commits April 20, 2021 20:06

Make clear that mutex is locked for very short amount of time

844cb4b

Unwrap the lock

fd115e0

alamb approved these changes Apr 21, 2021

View reviewed changes

datafusion/src/physical_plan/cross_join.rs Show resolved Hide resolved

alamb changed the title ~~[DataFusion] Cross join implementation~~ [ARROW-12441] [DataFusion] Cross join implementation Apr 22, 2021

alamb merged commit 57eeb64 into apache:master Apr 22, 2021

jorgecarleitao added the enhancement New feature or request label Apr 22, 2021

yjshen referenced this pull request in yjshen/datafusion Dec 21, 2021

wip (#11)

626591f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ARROW-12441] [DataFusion] Cross join implementation #11

[ARROW-12441] [DataFusion] Cross join implementation #11

Dandandan commented Apr 19, 2021 •

edited

Loading

codecov-commenter commented Apr 19, 2021 •

edited

Loading

jorgecarleitao commented Apr 19, 2021

alamb left a comment

alamb Apr 19, 2021

Dandandan Apr 19, 2021

alamb commented Apr 20, 2021

alamb commented Apr 20, 2021

Dandandan commented Apr 20, 2021

Dandandan commented Apr 20, 2021

[ARROW-12441] [DataFusion] Cross join implementation #11

[ARROW-12441] [DataFusion] Cross join implementation #11

Conversation

Dandandan commented Apr 19, 2021 • edited Loading

codecov-commenter commented Apr 19, 2021 • edited Loading

Codecov Report

jorgecarleitao commented Apr 19, 2021

alamb left a comment

Choose a reason for hiding this comment

alamb Apr 19, 2021

Choose a reason for hiding this comment

Dandandan Apr 19, 2021

Choose a reason for hiding this comment

alamb commented Apr 20, 2021

alamb commented Apr 20, 2021

Dandandan commented Apr 20, 2021

Dandandan commented Apr 20, 2021

Dandandan commented Apr 19, 2021 •

edited

Loading

codecov-commenter commented Apr 19, 2021 •

edited

Loading