[FEA] Conditional hash join for inner joins #9696

Closed
jlowe opened this issue Nov 16, 2021 · 1 comment · Fixed by #9917
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

jlowe commented Nov 16, 2021

Is your feature request related to a problem? Please describe.
Currently the RAPIDS Accelerator for Apache Spark accelerates a hash-based inner join with an inequality condition as an inner join on the equality condition followed by a filter on the inequality condition. This works fine as long as the equality join is the primary discriminator, but some inner joins in applications can "explode," generating many output rows because keys are heavily replicated in the left or right table. If the inner join explodes and the inequality condition then filters out most rows, performance approaches that of a nested loop join followed by the filter, which is poor. The GPU ends up manifesting a very large number of rows from the input tables only to have the filter eliminate them.

It is much more efficient in that case to have the join evaluate not only the equality condition via the hash lookup but also evaluate the inequality after the hash lookup succeeds before declaring the candidate row pair a join match. This avoids manifesting many rows that will not ultimately survive.
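The difference between the two strategies can be sketched on the CPU. This is a minimal illustration, not libcudf code: the `Row` type, the function names, and the `left.value < right.value` inequality are all assumptions made for the example, and `std::unordered_multimap` stands in for the GPU hash table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row type: one equality key plus one value used by the inequality.
struct Row {
  std::int32_t key;
  std::int32_t value;
};

// Left and right gather maps: out.first[k] pairs with out.second[k].
using GatherMaps = std::pair<std::vector<std::size_t>, std::vector<std::size_t>>;
using HashTable  = std::unordered_multimap<std::int32_t, std::size_t>;

// Build a hash table mapping each right-table key to its row indices.
HashTable build_hash(std::vector<Row> const& right) {
  HashTable table;
  table.reserve(right.size());
  for (std::size_t j = 0; j < right.size(); ++j) { table.emplace(right[j].key, j); }
  return table;
}

// Baseline: equality hash join first, inequality filter afterwards.
// `materialized` reports how many row pairs existed before filtering.
GatherMaps join_then_filter(std::vector<Row> const& left,
                            std::vector<Row> const& right,
                            std::size_t& materialized) {
  HashTable const table = build_hash(right);
  GatherMaps joined;
  for (std::size_t i = 0; i < left.size(); ++i) {
    auto const range = table.equal_range(left[i].key);
    for (auto it = range.first; it != range.second; ++it) {
      joined.first.push_back(i);  // every key match is written out
      joined.second.push_back(it->second);
    }
  }
  assert(joined.first.size() == joined.second.size());
  materialized = joined.first.size();

  GatherMaps out;
  for (std::size_t k = 0; k < joined.first.size(); ++k) {
    if (left[joined.first[k]].value < right[joined.second[k]].value) {
      out.first.push_back(joined.first[k]);
      out.second.push_back(joined.second[k]);
    }
  }
  return out;
}

// Fused: the inequality is checked right after the hash lookup succeeds,
// so pairs that fail it are never written out.
GatherMaps conditional_hash_join(std::vector<Row> const& left,
                                 std::vector<Row> const& right) {
  HashTable const table = build_hash(right);
  GatherMaps out;
  for (std::size_t i = 0; i < left.size(); ++i) {
    auto const range = table.equal_range(left[i].key);
    for (auto it = range.first; it != range.second; ++it) {
      if (left[i].value < right[it->second].value) {
        out.first.push_back(i);
        out.second.push_back(it->second);
      }
    }
  }
  return out;
}
```

With three left rows and three right rows sharing one key, the baseline writes out nine candidate pairs before the filter runs, while the fused version only ever stores the pairs that pass the inequality. Both return the same set of matches; the order of matches for a given left row is unspecified because it follows the hash table's iteration order.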

Describe the solution you'd like
libcudf would accept a supplemental condition, represented as an AST expression, in the existing hash-based inner join APIs, including the hash_join-based APIs. Internally, the join kernel performs the hash lookup on the equality keys; if the lookup hits, it evaluates the AST expression to produce a boolean result indicating whether the row pair should be considered a join match. The API can take two pairs of table_views: one pair for the left and right equality keys, and one pair for the AST expression evaluation. The result is two gather maps, for the left and right tables respectively, as it is for the hash-based inner join today.
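The shape of such an interface can be sketched in plain C++. Everything here is a hypothetical stand-in rather than the libcudf API: `Table` plays the role of a `table_view`, `RowPredicate` plays the role of the AST expression, and `mixed_inner_join_sketch` is a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using Column = std::vector<std::int64_t>;
using Table  = std::vector<Column>;  // column-major stand-in for a table_view
// Stand-in for the AST expression: sees row i of the left conditional table
// and row j of the right conditional table.
using RowPredicate =
  std::function<bool(Table const&, std::size_t, Table const&, std::size_t)>;
using GatherMaps = std::pair<std::vector<std::size_t>, std::vector<std::size_t>>;

std::size_t num_rows(Table const& t) { return t.empty() ? 0 : t.front().size(); }

// Combine the per-column hashes of one row of the equality keys.
std::size_t hash_row(Table const& keys, std::size_t row) {
  std::size_t h = 0;
  for (auto const& col : keys) { h = h * 31 + std::hash<std::int64_t>{}(col[row]); }
  return h;
}

bool rows_equal(Table const& a, std::size_t i, Table const& b, std::size_t j) {
  for (std::size_t c = 0; c < a.size(); ++c) {
    if (a[c][i] != b[c][j]) { return false; }
  }
  return true;
}

GatherMaps mixed_inner_join_sketch(Table const& left_equality,
                                   Table const& right_equality,
                                   Table const& left_conditional,
                                   Table const& right_conditional,
                                   RowPredicate const& predicate) {
  assert(left_equality.size() == right_equality.size());
  assert(num_rows(left_equality) == num_rows(left_conditional));
  assert(num_rows(right_equality) == num_rows(right_conditional));

  // Build: hash every row of the right equality keys.
  std::unordered_multimap<std::size_t, std::size_t> table;
  for (std::size_t j = 0; j < num_rows(right_equality); ++j) {
    table.emplace(hash_row(right_equality, j), j);
  }

  // Probe: confirm key equality on a hash hit, then evaluate the predicate
  // on the conditional tables before recording the pair.
  GatherMaps out;
  for (std::size_t i = 0; i < num_rows(left_equality); ++i) {
    auto const range = table.equal_range(hash_row(left_equality, i));
    for (auto it = range.first; it != range.second; ++it) {
      std::size_t const j = it->second;
      if (rows_equal(left_equality, i, right_equality, j) &&
          predicate(left_conditional, i, right_conditional, j)) {
        out.first.push_back(i);
        out.second.push_back(j);
      }
    }
  }
  return out;
}
```

Because the equality keys arrive as their own tables, a caller is free to pass columns that were computed beforehand, while the conditional tables carry only the columns the predicate reads. The returned pair has the same form as the gather maps produced by a plain hash-based inner join.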

Describe alternatives you've considered
The interface could instead take just two table_views for the left and right tables, plus two separate vectors of ints specifying which columns are the equality keys. It seems simpler, though, to pass separate table_views for the equality keys and the AST expression, especially when the application needs to generate the result of an expression for the equality portion of the join (e.g., t1_col2 + 1 == t2_col3 * 2).

Additional context
This is part of #5401

@jlowe jlowe added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Nov 16, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Nov 19, 2021
@vyasr vyasr added this to the Conditional Joins milestone Dec 16, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Jan 18, 2022
This PR implements mixed equality/inequality joins for inner, left, and full joins. This resolves #9696 and contributes to #5401. For the moment, all APIs are functional only, but an object-oriented API is planned to support caching of the hash table.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Yunsong Wang (https://github.com/PointKernel)
  - Jason Lowe (https://github.com/jlowe)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9917