Implement ExistenceJoin Iterator using an auxiliary left semijoin #4796
Signed-off-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Gera Shegalov <gera@apache.org>
build |
Signed-off-by: Gera Shegalov <gera@apache.org>
build |
…issue589-gathermap-as-an-existence-column
Signed-off-by: Gera Shegalov <gera@apache.org>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AbstractGpuJoinIterator.scala
// cuDF executes left semijoin, the gatherer is constructed with a new
// gather to gather every row from lhs
//
// we build a new rhs with a the "exists" Boolean column that has as many rows
This feels off: `with a the "exists"`. I think the `a` is a typo.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AbstractGpuJoinIterator.scala
// semijoin lhs-GatherMap labeling rows that have at least one match in the original
// rhs
//
val rhsExistsCB = withResource(Scalar.fromBool(false)) { falseScalar =>
This is way too deeply nested for me. Could we try to break it up some? The `falseScalar` is only used to create `falseCV`. It might also be nice to create a method for `Table.scatter` that takes the columnView and a single Scalar as input and does all of the wrapping/unwrapping, to make this code that much more readable.
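To illustrate the shape of the helper suggested above, here is a minimal CPU-only sketch in plain Scala. It does not use cuDF: `BoolColumn`, `scatterScalar`, and `existsColumn` are hypothetical names invented for this example, and `withResource` is re-implemented locally as a simple loan pattern. The point is only that pushing the wrapping/unwrapping into one helper leaves the call site with a single level of nesting.

```scala
object ScatterHelperSketch {
  // Loan pattern: run the body, then always close the resource.
  def withResource[R <: AutoCloseable, T](r: R)(body: R => T): T =
    try body(r) finally r.close()

  // Hypothetical stand-in for a device column; this is NOT a cuDF class.
  final class BoolColumn(val values: Array[Boolean]) extends AutoCloseable {
    var closed = false
    override def close(): Unit = closed = true
  }

  // Hypothetical helper: copy `target` and overwrite the rows listed in
  // `indices` with the single value `value`. All wrapping and unwrapping
  // that the caller would otherwise do lives here.
  def scatterScalar(value: Boolean, indices: Seq[Int], target: BoolColumn): BoolColumn = {
    val out = target.values.clone()
    indices.foreach(i => out(i) = value)
    new BoolColumn(out)
  }

  // Call site with one level of nesting: the all-false column is the only
  // temporary resource the caller has to manage. The caller owns the result.
  def existsColumn(lhsNumRows: Int, matchedRows: Seq[Int]): BoolColumn =
    withResource(new BoolColumn(Array.fill(lhsNumRows)(false))) { allFalse =>
      scatterScalar(true, matchedRows, allFalse)
    }
}
```

For example, `ScatterHelperSketch.existsColumn(4, Seq(1, 3))` yields a column holding `false, true, false, true`. How the real helper should handle ownership of the Scalar and the intermediate Table is left open here.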
build |
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
build |
This PR implements an iterator for ExistenceJoin
This PR computes ExistenceJoin by executing a left semijoin via cuDF. The lhs GatherMap is used to scatter `true` into a Boolean column of lhs.numRows rows, all initially `false`. The rhs data is not gathered.
The PR also fixes regex matching against SparkPlan node strings. The previously used simple string mentions ExistenceJoin only in the CPU plan; the GPU plan does not print the ExistenceJoin type as part of the Join exec string.
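The semijoin-plus-scatter approach described in this PR can be modeled on the CPU with plain Scala collections. The sketch below is not the plugin code and does not call cuDF; `leftSemiGatherMap`, `existsColumn`, and `existenceJoin` are hypothetical names, keys are plain `Int`s, and null-key semantics are ignored for brevity.

```scala
object ExistenceJoinSketch {
  // Left semijoin gather map: indices of lhs rows whose key has at least
  // one match in rhs. Each lhs row appears at most once, even when rhs
  // contains duplicate keys.
  def leftSemiGatherMap(lhsKeys: Seq[Int], rhsKeys: Seq[Int]): Seq[Int] = {
    val rhsSet = rhsKeys.toSet
    lhsKeys.zipWithIndex.collect { case (k, i) if rhsSet.contains(k) => i }
  }

  // Start from an all-false column with one entry per lhs row, then
  // scatter `true` into the positions named by the gather map.
  def existsColumn(lhsNumRows: Int, gatherMap: Seq[Int]): Array[Boolean] = {
    val exists = Array.fill(lhsNumRows)(false)
    gatherMap.foreach(i => exists(i) = true)
    exists
  }

  // Every lhs row is emitted exactly once, paired with its "exists" flag.
  // No rhs data is gathered into the output.
  def existenceJoin(lhsKeys: Seq[Int], rhsKeys: Seq[Int]): Seq[(Int, Boolean)] = {
    val exists = existsColumn(lhsKeys.length, leftSemiGatherMap(lhsKeys, rhsKeys))
    lhsKeys.zip(exists.toSeq)
  }
}
```

With `lhsKeys = Seq(1, 2, 3, 2)` and `rhsKeys = Seq(2, 2, 5)`, the gather map is `Seq(1, 3)` and `existenceJoin` returns `Seq((1,false), (2,true), (3,false), (2,true))`: the output keeps the lhs row count and order regardless of how many rhs rows match.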
Closes #589
Signed-off-by: Gera Shegalov <gera@apache.org>