[FEA] Conditional hash join for left semi and left anti joins #9695
Labels
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Spark
Functionality that helps Spark RAPIDS
Is your feature request related to a problem? Please describe.
The RAPIDS Accelerator for Apache Spark currently does not accelerate left semi or left anti joins that have both an equality and an inequality condition. The only way to accelerate such a join today is a nested loop join with an AST condition, but this often performs far worse than the CPU implementation, which first performs a hash-based lookup on the equality condition and then evaluates the inequality condition before treating the row as a "join hit."
Describe the solution you'd like
libcudf would accept a supplemental condition, represented as an AST expression, in the existing left semi and left anti hash-based join APIs. Internally, the join kernel performs the hash lookup on the equality keys; if the lookup hits, it evaluates the AST expression to produce a boolean result indicating whether the row should be considered a join match. The API could take two pairs of table_views: one pair for the left and right equality keys, and one pair for the AST expression evaluation. The result is a gather map for the left table, as it is for the hash-based semi/anti join today.
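To make the requested semantics concrete, here is a minimal CPU-only sketch of the probe logic described above. It is not libcudf code and not a proposed signature: `conditional_hash_join`, `join_kind`, and the `predicate` callback (standing in for the AST expression) are hypothetical names for illustration, and it assumes a single non-null integer equality key per row.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical illustration only; none of these names exist in libcudf.
enum class join_kind { LEFT_SEMI, LEFT_ANTI };

// left_keys / right_keys: the equality-key columns (one int key per row,
// nulls ignored for simplicity).
// predicate(l, r): stand-in for the AST expression, evaluated on left row l
// and right row r of the tables used for the condition.
// Returns a gather map of left-row indices, as the existing hash-based
// semi/anti joins do.
std::vector<std::size_t> conditional_hash_join(
    const std::vector<int>& left_keys,
    const std::vector<int>& right_keys,
    const std::function<bool(std::size_t, std::size_t)>& predicate,
    join_kind kind)
{
  // Build phase: map each right-side equality key to its row indices.
  std::unordered_multimap<int, std::size_t> table;
  table.reserve(right_keys.size());
  for (std::size_t r = 0; r < right_keys.size(); ++r) {
    table.emplace(right_keys[r], r);
  }

  // Probe phase: a left row matches only if some right row has an equal key
  // AND the extra condition holds for that pair of rows.
  std::vector<std::size_t> gather_map;
  for (std::size_t l = 0; l < left_keys.size(); ++l) {
    bool matched = false;
    auto range = table.equal_range(left_keys[l]);
    for (auto it = range.first; it != range.second && !matched; ++it) {
      matched = predicate(l, it->second);
    }
    // Semi keeps matched rows; anti keeps unmatched rows.
    if (matched == (kind == join_kind::LEFT_SEMI)) {
      gather_map.push_back(l);
    }
  }
  return gather_map;
}
```

The point of the sketch is that the condition is evaluated only for rows that already hit in the hash lookup, rather than for every left/right pair as in a nested loop join. A GPU kernel would do the same work per thread, which this sequential version does not attempt to model.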
Describe alternatives you've considered
The interface could specify just two table_views for the left and right tables, plus two vectors of ints to specify which columns are the equality keys. However, it seems simpler to pass separate table_views for the equality keys and the AST expression, especially when the application needs to generate the result of an expression for the equality portion of the join (e.g., t1_col2 + 1 == t2_col3 * 2).
Additional context
This is part of #5401