Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Left join implementation is incorrect for 0 or multiple batches on the right side #238

Merged
merged 10 commits into from
May 2, 2021

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented May 2, 2021

Which issue does this PR close?

Closes #235
Closes #239
Closes #143

Rationale for this change

Fixes behavior of left join with regard to multiple batches on the right side and 0 right side batches.

This way it also also likely much faster as it avoids keeping generating / keeping / indexing into a HashSet for each right side batch.

What changes are included in this PR?

  • We add a Vec<bool> to keep track of left-side rows that didn't match with the right side. (number o left rows bytes extra memory usage for left joins, not that much compared to storing the left side data and hashmap + indices. It could be more memory-efficient by using a bitmap instead.
  • Generate rows for any item that is not marked as visited.
  • Add some tests for 0 / multiple batches on the right

Are there any user-facing changes?

Only correctness improvements

@Dandandan Dandandan requested a review from andygrove May 2, 2021 11:08
@codecov-commenter
Copy link

codecov-commenter commented May 2, 2021

Codecov Report

Merging #238 (ab760af) into master (c945b03) will increase coverage by 0.30%.
The diff coverage is 96.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #238      +/-   ##
==========================================
+ Coverage   76.46%   76.76%   +0.30%     
==========================================
  Files         135      134       -1     
  Lines       23250    23248       -2     
==========================================
+ Hits        17777    17847      +70     
+ Misses       5473     5401      -72     
Impacted Files Coverage Δ
datafusion/src/physical_plan/hash_utils.rs 100.00% <ø> (ø)
datafusion/src/physical_plan/hash_join.rs 87.21% <96.42%> (+1.27%) ⬆️
ballista/rust/core/src/client.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/main.rs 0.00% <0.00%> (ø)
ballista/rust/scheduler/src/main.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/execution_loop.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/flight_service.rs 0.00% <0.00%> (ø)
...ust/core/src/execution_plans/unresolved_shuffle.rs 50.00% <0.00%> (ø)
ballista/rust/executor/src/lib.rs
ballista/rust/scheduler/src/lib.rs 21.13% <0.00%> (+1.62%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c945b03...ab760af. Read the comment docs.

@Dandandan Dandandan changed the title Fix left join unmatched rows Fix Left join implementation is incorrect for 0 or multiple batches on the right side May 2, 2021
@Dandandan Dandandan mentioned this pull request May 2, 2021
3 tasks
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I pulled this branch locally and confirmed that this fixes the failing tests. Thanks, @Dandandan!

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yap, the implementation makes a lot of sense. Thanks a lot @Dandandan !

datafusion/src/physical_plan/hash_join.rs Outdated Show resolved Hide resolved
@Dandandan
Copy link
Contributor Author

Merging this when it's green

@Dandandan Dandandan merged commit 3072df6 into apache:master May 2, 2021
@alamb alamb added the datafusion Changes in the datafusion crate label May 3, 2021
@houqp houqp added the bug Something isn't working label Jul 31, 2021
mustafasrepo added a commit that referenced this pull request Dec 25, 2023
* Add first reverse support for partitioned join conversion

* Minor changes

* minor changes

* Minor changes

* Add ordering requirement propogation

* Remove wrong check

* Simplifications

* Simplifications

* Minor changes

* Minor changes

* add test case

* Review

* Propagate group by and aggregate through join

* Minor changes

* Minor changes

* Simplifications

* Buggy state

* Minor changes

* Simplifications

* Add comments

* Update comments

* Update join_pipeline_selection.rs

* Mini

* Update comments

* Fix formatting

* Review

---------

Co-authored-by: metesynnada <100111937+metesynnada@users.noreply.github.com>
Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
mustafasrepo added a commit that referenced this pull request Dec 25, 2023
* Add first reverse support for partitioned join conversion

* Minor changes

* minor changes

* Minor changes

* Add ordering requirement propogation

* Remove wrong check

* Simplifications

* Simplifications

* Minor changes

* Minor changes

* add test case

* Review

* Propagate group by and aggregate through join

* Minor changes

* Minor changes

* Simplifications

* Buggy state

* Minor changes

* Simplifications

* Add comments

* Update comments

* Update join_pipeline_selection.rs

* Mini

* Update comments

* Fix formatting

* Review

---------

Co-authored-by: metesynnada <100111937+metesynnada@users.noreply.github.com>
Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Changes in the datafusion crate
Projects
None yet
6 participants