[DataFusion] Optimize hash join inner workings, null handling fix #24

Dandandan · 2021-04-21T16:24:35Z

> 5% speed up for query 5 by speeding up some hot functions

Don't store hashes in the hash table, and allocate capacity upfront (this adds some memory usage for inputs that are not high-cardinality, but it grows at the same rate as the left side input). This avoids re-hashing (copying + calling hash functions) when resizing the hash table.
Do some more optimization for hashing primitives and avoid .value(i), which does a bounds check.
Don't call combine_hashes on joins with one column.
Fix null equality / enable test.
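
As a rough illustration of the "allocate capacity upfront" point only, here is a minimal sketch using the standard library `HashMap`. The `build_side` function and its `row_hashes` input are hypothetical names, not taken from the PR, and the sketch does not reproduce the PR's actual table layout.

```rust
use std::collections::HashMap;

/// Builds a map from a precomputed row hash to the indices of the
/// build-side rows that produced it. `row_hashes` is a hypothetical input.
fn build_side(row_hashes: &[u64]) -> HashMap<u64, Vec<u64>> {
    // Reserve room for the worst case (every row has a distinct hash) so the
    // table does not have to grow and re-insert its entries during the loop.
    let mut map: HashMap<u64, Vec<u64>> = HashMap::with_capacity(row_hashes.len());
    for (row, &hash) in row_hashes.iter().enumerate() {
        map.entry(hash).or_default().push(row as u64);
    }
    map
}
```

The trade-off matches the one described above: when many rows share a hash, some of the reserved slots stay unused, but the reservation never exceeds what the left side input could need.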

Closes #44
Fixes #195

PR

Query 5 avg time: 88.59 ms

Master

Query 5 avg time: 95.91 ms

codecov-commenter · 2021-04-24T14:41:48Z

Codecov Report

Merging #24 (e1c0a4e) into master (245f0b8) will increase coverage by 0.16%.
The diff coverage is 63.00%.

❗ Current head e1c0a4e differs from pull request most recent head 0e1bdb4. Consider uploading reports for the commit 0e1bdb4 to get more accurate results

@@            Coverage Diff             @@
##           master      #24      +/-   ##
==========================================
+ Coverage   76.24%   76.41%   +0.16%     
==========================================
  Files         134      134              
  Lines       23051    23181     +130     
==========================================
+ Hits        17576    17714     +138     
+ Misses       5475     5467       -8

Impacted Files	Coverage Δ
ballista/rust/client/src/context.rs	`0.00% <0.00%> (ø)`
ballista/rust/executor/src/main.rs	`0.00% <0.00%> (ø)`
benchmarks/src/bin/nyctaxi.rs	`0.00% <ø> (ø)`
datafusion-examples/examples/flight_server.rs	`0.00% <0.00%> (ø)`
datafusion-examples/examples/simple_udaf.rs	`0.00% <ø> (ø)`
datafusion/src/physical_plan/expressions/case.rs	`72.91% <16.66%> (-0.39%)`	⬇️
benchmarks/src/bin/tpch.rs	`35.35% <33.33%> (+0.20%)`	⬆️
datafusion/src/datasource/csv.rs	`73.33% <69.86%> (-9.28%)`	⬇️
datafusion/src/physical_plan/csv.rs	`79.09% <75.72%> (-4.13%)`	⬇️
datafusion/src/physical_plan/hash_join.rs	`85.94% <78.57%> (-0.49%)`	⬇️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Dandandan · 2021-04-26T20:28:41Z

@alamb @jorgecarleitao FYI if you have time, I am not planning any changes to this PR for now.

I think the more important part of the PR is the fix wrt null handling which turned out to be a one line change.

The other part is some small performance tweaks and prepares it a bit for further improvements down the road.

jorgecarleitao

Sorry for the delay, @Dandandan and thanks for the nudge.

I went through this. I have one change that I think we should make, but otherwise this looks great. Super cool.

jorgecarleitao · 2021-04-26T20:45:09Z

datafusion/src/physical_plan/hash_join.rs

@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
        let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();

        match (left_array.is_null($left), left_array.is_null($right)) {
-            (true, true) => true,


This was the line, right? Damn, so many wrong things for this one.

Really great finding, @Dandandan !

Yep! To be fair I also added this line myself 😆

I think we also should start testing DataFusion more rigorously wrt null handling, but one step at a time.
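
To spell out what removing the `(true, true) => true` arm means for join keys, here is a minimal sketch. It assumes `Option<i64>` as a stand-in for a nullable column value, and `join_keys_equal` is a hypothetical name rather than the macro in the diff above. Under SQL semantics `NULL = NULL` is not true, so two null keys should not produce a match.

```rust
/// Hypothetical join-key comparison over nullable values, with `None`
/// standing in for SQL NULL.
fn join_keys_equal(left: Option<i64>, right: Option<i64>) -> bool {
    match (left, right) {
        // Both sides present: compare the values.
        (Some(l), Some(r)) => l == r,
        // Any NULL involved, including NULL vs NULL: no match.
        _ => false,
    }
}
```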

Even if it is, I noticed the values method doesn't take the offset into account right now, so this is maybe an optimization worth doing once hashing is included as an Arrow kernel.

jorgecarleitao

💯

alamb · 2021-04-26T21:55:21Z

Sorry for the delay @Dandandan -- I am about out of time today, but I will take a careful look tomorrow morning

alamb

I am a little confused about the lack of keys in the hash map, but otherwise things look good to me
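
One possible reading of a hash map without keys, sketched under assumptions: the map holds only hash values and row indices, and equality is confirmed afterwards against the column data, so hash collisions cannot produce wrong matches. `toy_hash` and `hash_join` are hypothetical names, and the deliberately weak hash is there only to force collisions. This is not the PR's implementation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a real hash function; deliberately weak so
/// that different keys can collide.
fn toy_hash(key: i64) -> u64 {
    key.rem_euclid(4) as u64
}

/// Returns (left_row, right_row) pairs whose keys are equal.
fn hash_join(left: &[i64], right: &[i64]) -> Vec<(usize, usize)> {
    // Build side: only hash -> row indices is stored, not the key itself.
    let mut map: HashMap<u64, Vec<usize>> = HashMap::with_capacity(left.len());
    for (row, &key) in left.iter().enumerate() {
        map.entry(toy_hash(key)).or_default().push(row);
    }

    // Probe side: candidates only share a hash, so each one is confirmed
    // against the actual column values before it is emitted.
    let mut pairs = Vec::new();
    for (r_row, &r_key) in right.iter().enumerate() {
        if let Some(candidates) = map.get(&toy_hash(r_key)) {
            for &l_row in candidates {
                if left[l_row] == r_key {
                    pairs.push((l_row, r_row));
                }
            }
        }
    }
    pairs
}
```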

alamb · 2021-04-27T13:04:04Z

datafusion/src/physical_plan/hash_join.rs

@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
        let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();

        match (left_array.is_null($left), left_array.is_null($right)) {
-            (true, true) => true,


The fact we are now worrying about and fixing null handling is a sign, in my mind, of DataFusion's maturation 🤣

alamb · 2021-04-27T13:04:57Z

datafusion/src/physical_plan/hash_join.rs

+                }
+            }
+        } else {
+            if $multi_col {


special casing multi-value join columns is a good one
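
A minimal sketch of the special case, assuming plain slices in place of Arrow arrays. `combine`, `hash_one`, and `hash_column` are hypothetical helpers, and `combine` is not DataFusion's `combine_hashes`. With a single join column the hash is written directly; the combining step runs only when several columns contribute. Zipping the slices also avoids an indexed lookup per row.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical way to merge two hash values; not DataFusion's function.
fn combine(l: u64, r: u64) -> u64 {
    l.wrapping_mul(31).wrapping_add(r)
}

/// Hashes a single value with the standard library's default hasher.
fn hash_one(value: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Folds one key column into `hashes` (one slot per row).
fn hash_column(values: &[i64], hashes: &mut [u64], multi_col: bool) {
    if multi_col {
        // Several key columns: merge with the hash accumulated so far.
        for (hash, &value) in hashes.iter_mut().zip(values) {
            *hash = combine(*hash, hash_one(value));
        }
    } else {
        // Single key column: no previous hash to merge, so skip `combine`.
        for (hash, &value) in hashes.iter_mut().zip(values) {
            *hash = hash_one(value);
        }
    }
}
```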

Speed ups in hash join

f63f0e7

Dandandan force-pushed the misc_speed branch from 25d018a to f63f0e7 Compare April 21, 2021 16:25

Dandandan added 3 commits April 21, 2021 19:23

Fix test

fc22fa7

Update hash

c343dc5

Update commit hash everywhere

8999ccb

Dandandan changed the title ~~Micro optimize hash join functions~~ [DataFusion] Micro optimize hash join functions Apr 21, 2021

Use primitive everywhere

493069b

Dandandan changed the title ~~[DataFusion] Micro optimize hash join functions~~ [DataFusion] [WIP] Micro optimize hash join functions Apr 21, 2021

Undo combine_hashes

ee4fc3a

Dandandan changed the title ~~[DataFusion] [WIP] Micro optimize hash join functions~~ [DataFusion] Micro optimize hash join functions Apr 21, 2021

Dandandan changed the title ~~[DataFusion] Micro optimize hash join functions~~ [DataFusion] Optimize hash join inner workings Apr 21, 2021

Dandandan added 3 commits April 24, 2021 16:16

Merge, fix conflicts

2a68497

Delete

74e4fe5

Fixes

7672164

Dandandan added 7 commits April 24, 2021 17:00

Avoid combine_hashes call for single columns

f14f838

Update comment

c599bcf

Revert "Avoid combine_hashes call for single columns"

e1ac6a9

This reverts commit f14f838.

Fix null handling

f11f20e

Revert "Revert "Avoid combine_hashes call for single columns""

2cb8470

This reverts commit e1ac6a9.

Fix null handling

2dacc01

Unignore test

3ebd266

Dandandan changed the title ~~[DataFusion] Optimize hash join inner workings~~ [DataFusion] Optimize hash join inner workings, null fix Apr 24, 2021

Dandandan changed the title ~~[DataFusion] Optimize hash join inner workings, null fix~~ [DataFusion] Optimize hash join inner workings, null hanling fix Apr 24, 2021

Dandandan changed the title ~~[DataFusion] Optimize hash join inner workings, null hanling fix~~ [DataFusion] Optimize hash join inner workings, null handling fix Apr 24, 2021

jorgecarleitao reviewed Apr 26, 2021

Use normal hasher for booleans

2f30453

Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>

jorgecarleitao approved these changes Apr 26, 2021

alamb approved these changes Apr 27, 2021

Dandandan added 2 commits April 27, 2021 17:06

Add extra documentation to hash join hashmap structure

0e1bdb4

empty

2d8fd7c

alamb merged commit 3371574 into apache:master Apr 28, 2021

houqp added the ballista, bug, datafusion, and enhancement labels Jul 29, 2021
