[SPARK-32299] [SQL] Decide SMJ Join Orientation adaptively #29097

Closed
wants to merge 8 commits into from

Contributor

@mayurdb mayurdb commented Jul 14, 2020

What changes were proposed in this pull request?

Change the SortMergeJoin (SMJ) orientation at runtime using adaptive query execution.

Why are the changes needed?

For a SortMergeJoin of type EquiJoin, the left and right sides of the join are decided by the order in which the user wrote them. In SMJ, the left side of the join is streamed and the right side is buffered (for matching values). Because of this, B SMJ A would perform better than A SMJ B if sizeOf(B) > sizeOf(A).

With adaptive query execution, once both ShuffleQueryStages feeding the join have completed, and if neither of them is smaller than the broadcast threshold (so the join will not be converted to a BroadcastHashJoin), the join orientation can be changed at run time.
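
The rule described above can be sketched roughly as follows. This is only an illustration in plain Python, not the code in this patch: `StageStats`, `should_swap_smj_sides`, and the `broadcast_threshold` parameter are hypothetical names, and the sizes are assumed to be the runtime byte sizes of the two completed shuffle stages.

```python
from dataclasses import dataclass


@dataclass
class StageStats:
    """Hypothetical stand-in for the runtime statistics of a completed shuffle stage."""
    size_in_bytes: int


def should_swap_smj_sides(left: StageStats, right: StageStats, broadcast_threshold: int) -> bool:
    """Return True if the streamed (left) and buffered (right) sides should be swapped."""
    # If either side is below the broadcast threshold, the join is expected to
    # become a broadcast hash join instead, so the orientation is left alone.
    if min(left.size_in_bytes, right.size_in_bytes) < broadcast_threshold:
        return False
    # The left side is streamed and the right side is buffered, so under the
    # size heuristic the larger side should end up on the left.
    return right.size_in_bytes > left.size_in_bytes
```

For example, with a threshold of 10 bytes, a 100-byte left side and a 500-byte right side would be swapped, while the reverse arrangement would be kept as written.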

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Added unit tests
  • Ran AdaptiveQueryExecSuite

@AmplabJenkins

Can one of the admins verify this patch?

@mayurdb changed the title from "Spark 32299" to "[SPARK-32299] [SQL] Decide SMJ Join Orientation adaptively" on Jul 14, 2020
@mayurdb
Contributor Author

mayurdb commented Jul 14, 2020

cc @maryannxue @cloud-fan

@gatorsmile
Member

gatorsmile commented Jul 16, 2020

That also depends on the data values, right? Not always faster.

Contributor

@c21 c21 left a comment

I have a similar concern to @gatorsmile's. I think this also depends on the run-time cardinality of the data.

E.g., suppose the left side is smaller than the right side, but every row on the left side is the same and every row on the right side is unique. We should buffer the right side here even though the right side is larger, because if we buffer the left side, we essentially need to read all of the left side into the buffer.
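
A toy model of this case, assuming (as a simplification) that the buffered side only ever holds the rows matching the current streamed join key. `peak_buffered_rows` is a hypothetical helper for illustration, not a Spark API, and rows are reduced to their join keys.

```python
from collections import Counter


def peak_buffered_rows(streamed_keys, buffered_keys):
    """Largest number of buffered-side rows held at once in a toy sort-merge join.

    Only rows whose key also appears on the streamed side are ever buffered,
    so the peak is the largest group of matching keys on the buffered side.
    """
    counts = Counter(buffered_keys)
    return max((counts[key] for key in set(streamed_keys)), default=0)


# Left side: small, every row has the same key. Right side: larger, all keys unique.
left = [7, 7, 7]
right = list(range(10))

buffer_right = peak_buffered_rows(streamed_keys=left, buffered_keys=right)
buffer_left = peak_buffered_rows(streamed_keys=right, buffered_keys=left)
```

Here `buffer_right` is 1 while `buffer_left` is 3, i.e. buffering the smaller left side means holding all of it at once, which is the situation described above.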

In addition, this PR swaps the left and right sides based on total size. However, at run time, each task/partition can have a different amount of data on its left and right sides. I think simply swapping the left and right sides here might cause some tasks to regress while others improve.
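
To make the per-partition point concrete, here is a small sketch with made-up partition sizes. `partitions_favoring_swap` is a hypothetical helper, and the numbers are illustrative only, not measurements.

```python
def partitions_favoring_swap(left_sizes, right_sizes):
    """Indices of partitions where the right side is larger than the left side."""
    return [
        index
        for index, (left, right) in enumerate(zip(left_sizes, right_sizes))
        if right > left
    ]


# Illustrative per-partition sizes in bytes (assumed, not measured).
left_sizes = [10, 80, 5]
right_sizes = [60, 20, 40]

# The totals favor a swap (right is larger overall) ...
swap_by_total = sum(right_sizes) > sum(left_sizes)
# ... but not every partition agrees with that decision.
favoring = partitions_favoring_swap(left_sizes, right_sizes)
```

With these numbers the total-size rule would swap the sides, yet only partitions 0 and 2 have a larger right side; partition 1 would end up buffering its larger side after the swap.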

@github-actions

github-actions bot commented Dec 1, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 1, 2020
@github-actions github-actions bot closed this Dec 2, 2020
4 participants