Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support semi join #470

Merged
merged 10 commits into from
Jun 2, 2021
Merged

Support semi join #470

merged 10 commits into from
Jun 2, 2021

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Jun 1, 2021

Which issue does this PR close?

Closes #461

Rationale for this change

Adds semi join support. This will enable us to rewrite WHERE id IN (..) and WHERE EXISTS to a (semi) join.

What changes are included in this PR?

Add semi join to hash join implementation.
What's not in this PR

  • Semi join sql (should be added to sqlparser if we want to) / dataframe syntax
  • Using the semi join in the planner for in/exists

Are there any user-facing changes?

No

@codecov-commenter
Copy link

codecov-commenter commented Jun 1, 2021

Codecov Report

Merging #470 (9cb007a) into master (c8ab5a4) will increase coverage by 0.50%.
The diff coverage is 77.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #470      +/-   ##
==========================================
+ Coverage   75.30%   75.80%   +0.50%     
==========================================
  Files         152      153       +1     
  Lines       25275    25924     +649     
==========================================
+ Hits        19033    19652     +619     
- Misses       6242     6272      +30     
Impacted Files Coverage Δ
...sta/rust/core/src/serde/logical_plan/from_proto.rs 36.17% <0.00%> (-0.05%) ⬇️
...lista/rust/core/src/serde/logical_plan/to_proto.rs 62.32% <0.00%> (-0.08%) ⬇️
...ta/rust/core/src/serde/physical_plan/from_proto.rs 39.63% <0.00%> (-0.15%) ⬇️
...ista/rust/core/src/serde/physical_plan/to_proto.rs 50.46% <0.00%> (-0.16%) ⬇️
datafusion/src/logical_plan/builder.rs 90.08% <0.00%> (-0.39%) ⬇️
datafusion/src/logical_plan/plan.rs 80.62% <ø> (+0.52%) ⬆️
datafusion/src/physical_plan/planner.rs 80.32% <0.00%> (-0.14%) ⬇️
datafusion/src/optimizer/hash_build_probe_order.rs 58.53% <20.00%> (-1.97%) ⬇️
datafusion/src/physical_plan/hash_join.rs 86.45% <97.29%> (+0.39%) ⬆️
datafusion/src/physical_plan/hash_utils.rs 98.50% <100.00%> (+0.02%) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8ab5a4...9cb007a. Read the comment docs.

@alamb
Copy link
Contributor

alamb commented Jun 1, 2021

I plan to review this tomorrow

@Dandandan Dandandan mentioned this pull request Jun 2, 2021
@Dandandan
Copy link
Contributor Author

Related PR for anti join here: #482

@houqp
Copy link
Member

houqp commented Jun 2, 2021

should the readme be updated to include semi join as well?

@Dandandan
Copy link
Contributor Author

Dandandan commented Jun 2, 2021

should the readme be updated to include semi join as well?

I think for now we don't - I would say that at least need to have in or exists to be mapped to a semi join, as currently there is not really a way for anyone to create it (other than creating a LogicalPlan explicitly).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @Dandandan ! Nice work

@@ -187,6 +187,7 @@ impl OptimizerRule for HashBuildProbeOrder {
| LogicalPlan::CreateExternalTable { .. }
| LogicalPlan::Explain { .. }
| LogicalPlan::Union { .. }
| LogicalPlan::Join { .. }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should you also possibly conditionalize this on being a SemiJoin? like

Suggested change
| LogicalPlan::Join { .. }
| LogicalPlan::Join if *join_type == JoinType::Semi { .. }

So if we add a new join type in the future it doesn't accidentally get allowed / disallowed in some confusing way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried that in earlier commit... But than the exhaustiveness checking gets in the way unfortunately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only way the compiler can do it now is by explicitly matching on all the join types, but as this feature is not added https://github.com/rust-lang/rfcs/blob/master/text/2535-or-patterns.md that would be quite verbose currently

Copy link
Contributor Author

@Dandandan Dandandan Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote it a bit to use a function in the body of the join match

datafusion/src/physical_plan/hash_join.rs Outdated Show resolved Hide resolved
datafusion/src/physical_plan/hash_join.rs Show resolved Hide resolved
@@ -106,6 +106,13 @@ fn should_swap_join_order(left: &LogicalPlan, right: &LogicalPlan) -> bool {
}
}

fn supports_swap(join_type: JoinType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 33ff660 into apache:master Jun 2, 2021
@houqp houqp added datafusion Changes in the datafusion crate enhancement New feature or request labels Jul 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for semi (hash) join
4 participants