Optimize semaphore acquisition in GpuShuffledHashJoinExec #4588
Conversation
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
So, just to be clear, this optimization can never work with UCX, right?
Great point. I missed a case: we need to ignore the config when UCX is turned on and compression is enabled. If we don't, the fallback path currently doesn't handle compressed batches from the shuffle.
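For illustration, the gate could look roughly like this (a hypothetical sketch; the config key names are placeholders, not the plugin's real ones):

```scala
// Hypothetical sketch of the gate described above: disable the optimization
// when the UCX shuffle transport is on and shuffle compression is enabled,
// since the fallback path cannot handle compressed batches.
// The key names below are placeholders, not the plugin's actual configs.
def optimizationEnabled(conf: Map[String, String]): Boolean = {
  val requested  = conf.getOrElse("shuffledHashJoin.optimizeShuffle", "true").toBoolean
  val ucxEnabled = conf.getOrElse("shuffle.ucx.enabled", "false").toBoolean
  val compressed = conf.getOrElse("shuffle.compression.codec", "none") != "none"
  requested && !(ucxEnabled && compressed)
}
```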
/**
 * Removes `GpuCoalesceBatches(GpuShuffleCoalesceExec(build side))` for the build side
 * for the shuffled hash join. The coalesce logic has been moved to the
 * `GpuShuffleCoalesceExec` class, and is handled differently to prevent holding onto the
This is fine for now, but I really would prefer to have us just build the plan the right way from the beginning. Perhaps we can have a flag in GpuExec that says "for this input I know how to handle raw shuffle data", so that the rules that insert the host-side coalesce can deal with it there.
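As a rough illustration of that suggestion (every name here is hypothetical, not the plugin's API), the rule that inserts the host-side coalesce could consult such a flag:

```scala
// Hypothetical sketch: a node advertises which inputs it can consume as raw
// (serialized) shuffle data, and the coalesce-insertion rule skips those inputs.
sealed trait PlanNode {
  def children: Seq[PlanNode]
  // Default: no input understands raw shuffle output.
  def handlesRawShuffleInput(childIndex: Int): Boolean = false
}
final case class ShuffleNode() extends PlanNode { def children: Seq[PlanNode] = Nil }
final case class HostCoalesce(child: PlanNode) extends PlanNode {
  def children: Seq[PlanNode] = Seq(child)
}
final case class JoinNode(build: PlanNode, stream: PlanNode) extends PlanNode {
  def children: Seq[PlanNode] = Seq(build, stream)
  // This join consumes the serialized build-side shuffle (child 0) directly.
  override def handlesRawShuffleInput(childIndex: Int): Boolean = childIndex == 0
}

// A sketch of the rule: wrap shuffle children in a host-side coalesce unless
// the parent declares it can handle the raw data itself. (A real rule would
// recurse over every node type, not just the join.)
def insertHostCoalesce(node: PlanNode): PlanNode = node match {
  case join @ JoinNode(build, stream) =>
    val kids = Seq(insertHostCoalesce(build), insertHostCoalesce(stream)).zipWithIndex.map {
      case (c: ShuffleNode, i) if !join.handlesRawShuffleInput(i) => HostCoalesce(c)
      case (c, _) => c
    }
    JoinNode(kids(0), kids(1))
  case other => other
}
```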
I added a commit that I think addresses many of the points. Enough has changed that I think this needs another pass. One comment from Jason wasn't addressed for sure:
We need something to hold on to that first stream batch. One (admittedly ugly) option is to attach the closeable iterator to the task completion logic. I think we discussed doing this, but I want to confirm that's the approach, or whether there's something better.
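For concreteness, a minimal sketch of that option using Spark's `TaskContext` (the helper name is ours, not the plugin's):

```scala
import org.apache.spark.TaskContext

// Sketch of the "attach to task completion" option: tie an AutoCloseable
// (e.g. the iterator holding the first stream batch) to the Spark task's
// lifecycle so it is released even if the task fails before the join runs.
def closeOnTaskCompletion(resource: AutoCloseable): Unit = {
  // TaskContext.get() returns null off the executor; guard for safety.
  Option(TaskContext.get()).foreach { tc =>
    tc.addTaskCompletionListener[Unit](_ => resource.close())
  }
}
```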
build
Looking into the test failures.
build
@@ -448,9 +458,84 @@ class GpuCoalesceIterator(iter: Iterator[ColumnarBatch],
     batches.append(SpillableColumnarBatch(batch, SpillPriorities.ACTIVE_BATCHING_PRIORITY,
       spillCallback))

   protected def popAll(): Array[ColumnarBatch] = {
     closeOnExcept(batches.map(_.getColumnarBatch())) { wip =>
nit: does this need to be a safeMap?
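For reference, the point of `safeMap` is that if producing one batch throws, the batches already materialized get closed instead of leaking, whereas a plain `map` would leak them. A minimal sketch of the idea (not the plugin's actual implementation):

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of the safeMap idea: map over a sequence producing AutoCloseables,
// and close everything already produced if a later element throws.
def safeMapSketch[A, B <: AutoCloseable](xs: Seq[A])(f: A => B): Seq[B] = {
  val out = ArrayBuffer.empty[B]
  try {
    xs.foreach(x => out += f(x))
    out.toSeq
  } catch {
    case t: Throwable =>
      // Best-effort cleanup; keep the original failure as the primary error.
      out.foreach { b =>
        try b.close() catch { case s: Throwable => t.addSuppressed(s) }
      }
      throw t
  }
}
```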
@revans2 I replied to your comment just now in the original request, but I'm adding a link here so we can continue if there's more I need to look into:
build
Ok, I have a bug with …
build
Closes #4539.
This PR adds an optimization for the shuffled hash join where the GPU semaphore is not acquired while the build side is fetched to the host: we attempt to keep the build side on the host until we load the first batch from the stream side.
Loading that first stream batch is what acquires the semaphore; only then do we bring the build batch to the GPU.
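In sketch form, the ordering looks like this (stand-in types and helpers; the real code works with the shuffle's serialized host buffers and the plugin's `GpuSemaphore`):

```scala
// Hypothetical sketch of the ordering this PR sets up. HostBuild, DeviceBatch,
// and the helpers are stand-ins; GpuSemaphore here is a stub for the real one.
object DeferredAcquireSketch {
  final case class HostBuild(rows: Long)
  final case class DeviceBatch(rows: Long)

  object GpuSemaphore { def acquireIfNecessary(): Unit = () }

  // Step 1: host-only work, no semaphore held -- fetch and concatenate the
  // serialized build-side shuffle buffers on the host.
  def fetchBuildSideToHost(parts: Seq[Long]): HostBuild = HostBuild(parts.sum)

  // Step 2: materializing the first stream batch on the GPU is what
  // acquires the semaphore.
  def loadFirstStreamBatch(rows: Long): DeviceBatch = {
    GpuSemaphore.acquireIfNecessary()
    DeviceBatch(rows)
  }

  // Step 3: the semaphore is already held, so now move the build side over.
  def buildToDevice(hb: HostBuild): DeviceBatch = DeviceBatch(hb.rows)

  def main(args: Array[String]): Unit = {
    val hostBuild   = fetchBuildSideToHost(Seq(10L, 20L))
    val firstStream = loadFirstStreamBatch(5L)
    val buildBatch  = buildToDevice(hostBuild)
    println(s"build=$buildBatch firstStream=$firstStream")
  }
}
```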
This code falls back to the old behavior when batches are not serialized (i.e. our input is not a shuffle) and when the stream side is empty.
It is in draft mode because I need to run more testing with NDS. I don't think the case where the build batch has to be iterated multiple times has been hit in my runs so far (other than in unit tests), so I need to figure out a way to trigger it.