-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFusion] Optimize hash join inner workings, null handling fix #24
Conversation
Codecov Report
@@ Coverage Diff @@
## master #24 +/- ##
==========================================
+ Coverage 76.24% 76.41% +0.16%
==========================================
Files 134 134
Lines 23051 23181 +130
==========================================
+ Hits 17576 17714 +138
+ Misses 5475 5467 -8
Continue to review full report at Codecov.
|
@alamb @jorgecarleitao FYI if you have time, I am not planning any changes to this PR for now. I think the more important part of the PR is the fix wrt null handling which turned out to be a one line change. The other part is some small performance tweaks and prepares it a bit for further improvements down the road. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, @Dandandan and thanks for the nudge.
I went through this. I have a change that I think we should do it, but otherwise looks great. Super cool.
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem { | |||
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap(); | |||
|
|||
match (left_array.is_null($left), left_array.is_null($right)) { | |||
(true, true) => true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was the line, right? Damm, so many wrong things for this one.
Really great finding, @Dandandan !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! To be fair I also added this line myself 😆
I think we also should start testing DataFusion more rigorously wrt null handling, but one step at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if it is, I noticed the values method doesn't take offset into account right now, so this is maybe an optimization worth doing once hashing is included as arrow kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact we are now worrying about and fixing null handling is a sign, in my mind, of DataFusion's maturation 🤣
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
Sorry for the delay @Dandandan -- I am about out of time today, but I will take a careful look tomorrow morning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused about the lack of keys in the hash map, but otherwise things look good to me
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem { | |||
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap(); | |||
|
|||
match (left_array.is_null($left), left_array.is_null($right)) { | |||
(true, true) => true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact we are now worrying about and fixing null handling is a sign, in my mind, of DataFusion's maturation 🤣
} | ||
} | ||
} else { | ||
if $multi_col { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
special casing multi-value join columns is a good one
* Initial commit * initial commit * failing test * table scan projection * closer * test passes, with some hacks * use DataFrame (#2) * update README * update dependency * code cleanup (#3) * Add support for Filter operator and BinaryOp expressions (#4) * GitHub action (#5) * Split code into producer and consumer modules (#6) * Support more functions and scalar types (#7) * Use substrait 0.1 and datafusion 8.0 (#8) * use substrait 0.1 * use datafusion 8.0 * update datafusion to 10.0 and substrait to 0.2 (#11) * Add basic join support (#12) * Added fetch support (#23) Added fetch to consumer Added limit to producer Added unit tests for limit Added roundtrip_fill_none() for testing when None input can be converted to 0 Update src/consumer.rs Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> * Upgrade to DataFusion 13.0.0 (#25) * Add sort consumer and producer (#24) Add consumer Add producer and test Modified error string * Add serializer/deserializer (#26) * Add plan and function extension support (#27) * Add plan and function extension support * Removed unwraps * Implement GROUP BY (#28) * Add consumer, producer and tests for aggregate relation Change function extension registration from absolute to relative anchor (reference) Remove operator to/from reference * Fixed function registration bug * Add test * Addressed PR comments * Changed field reference from mask to direct reference (#29) * Changed field reference from masked reference to direct reference * Handle unsupported case (struct with child) * Handle SubqueryAlias (#30) Fixed aggregate function register bug * Add support for SELECT DISTINCT (#31) Add test case * Implement BETWEEN (#32) * Add case (#33) * Implement CASE WHEN * Add more case to test * Addressed comments * feat: support explicit catalog/schema names in ReadRel (#34) * feat: support explicit catalog/schema names in ReadRel Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix: use re-exported expr crate Signed-off-by: Ruihang Xia <waynestxia@gmail.com> Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * move files to subfolder * RAT * remove rust.yaml * revert .gitignore changes * tomlfmt * tomlfmt Signed-off-by: Ruihang Xia <waynestxia@gmail.com> Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: JanKaul <jankaul@mailbox.org> Co-authored-by: nseekhao <37189615+nseekhao@users.noreply.github.com> Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
…calculations, limit/order/distinct (#11756) * Fix unparser derived table with columns include calculations, limit/order/distinct (#24) * compare format output to make sure the two level of projects match * add method to find inner projection that could be nested under limit/order/distinct * use format! for matching in unparser sort optimization too * refactor * use to_string and also put comments in * clippy * fix unparser derived table contains cast (#25) * fix unparser derived table contains cast * remove dbg
> 5% speed up for query 5 by speeding up some hot functions
.value(i)
which does a bound check.Closes #44
Fixes #195
PR
Master