Prevent approx_percentile aggregate from being split between CPU and GPU #3862

andygrove · 2021-10-19T19:12:25Z

Signed-off-by: Andy Grove <andygrove@nvidia.com>

We have a mechanism for tagging the underlying SparkPlan when an operator is not supported on the GPU. We rely on these tags when planning query stages with AQE enabled, because in that case we plan only a subset of the plan, without context about the parent of the query stage being planned. However, we were only respecting these tags for exchange operators and were ignoring them for aggregates. The main change in this PR is to check these tags on all operators.
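The difference between respecting the tag only on exchanges and respecting it on every operator can be illustrated with a small Python model. This is only a sketch under stated assumptions: `PlanNode`, `GPU_SUPPORTED_TAG`, `tag_not_supported`, and `can_run_on_gpu` are hypothetical names made up for illustration, not the plugin's Scala classes or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical tag key; the plugin stores its tag on the SparkPlan in Scala.
GPU_SUPPORTED_TAG = "gpu_supported_reasons"


@dataclass
class PlanNode:
    """Toy stand-in for a physical plan operator that can carry tags."""
    name: str
    tags: Dict[str, Set[str]] = field(default_factory=dict)


def tag_not_supported(node: PlanNode, reason: str) -> None:
    """Record on the node why it cannot run on the GPU."""
    node.tags.setdefault(GPU_SUPPORTED_TAG, set()).add(reason)


def can_run_on_gpu(node: PlanNode, exchanges_only: bool) -> bool:
    """Return False when the node carries a tag that the planner respects."""
    reasons = node.tags.get(GPU_SUPPORTED_TAG, set())
    if not reasons:
        return True
    if exchanges_only and "Exchange" not in node.name:
        # Behaviour before this PR (as modelled here): tag ignored on aggregates.
        return True
    return False


agg = PlanNode("ObjectHashAggregateExec")
tag_not_supported(agg, "illustrative reason recorded in an earlier planning pass")

before_fix = can_run_on_gpu(agg, exchanges_only=True)   # tag is ignored
after_fix = can_run_on_gpu(agg, exchanges_only=False)   # tag is respected
```

In this model, the tagged aggregate is still considered for the GPU when only exchanges are checked, which is how one half of an aggregate could end up on a different device than the other half.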

This change caused some regressions that also had to be addressed in this PR, such as:

- The CoalesceShufflePartitions optimization performs a transformUp and may replace ShuffleQueryStageExec with CustomShuffleReaderExec. Spark then copies the tags from ShuffleQueryStageExec to CustomShuffleReaderExec, including the "no need to replace ShuffleQueryStageExec" tag, so this tag needed to be ignored.
- ObjectHashAggregateExec and SortAggregateExec both had type signatures that declared that BinaryType was not supported, which is not always the case. We do support BinaryType for aggregate buffers, and there is special handling in the aggregate code both for the case where we are able to convert these buffers between CPU and GPU and for the case where we are not. This PR adds BinaryType to the type signatures to prevent these operators from being tagged early on as unsupported on GPU, which was causing regressions in the AQE case.
- SortExec was incorrectly declaring that BinaryType is not supported, so the type checks are updated and new tests are added to demonstrate that we fall back to CPU if a sort expression is binary, but allow non-sort columns to be binary.
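The SortExec rule described in the last item can be sketched in plain Python. This is an illustrative model only: `sort_fallback_reasons` and the string type names are assumptions made for the example, and the real check lives in the plugin's Scala type signatures.

```python
from typing import Dict, List


def sort_fallback_reasons(schema: Dict[str, str], sort_keys: List[str]) -> List[str]:
    """Return reasons to fall back to CPU; an empty list means the sort can stay on GPU.

    Only columns used as sort expressions are checked, so a binary column
    that is merely carried along with the rows does not trigger a fallback.
    """
    return [
        f"sort expression '{key}' has unsupported type binary"
        for key in sort_keys
        if schema[key] == "binary"
    ]


schema = {"id": "int", "payload": "binary"}

carried_along = sort_fallback_reasons(schema, sort_keys=["id"])       # payload is not a sort key
sorted_on_binary = sort_fallback_reasons(schema, sort_keys=["payload"])
```

Sorting on `id` yields no reasons even though `payload` is binary, while sorting on `payload` yields one reason and would fall back to CPU in this model.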

There are some other smaller changes in the PR:

- gpuSupportedTag was moved to a new object RapidsMeta so that it could be referenced from classes outside of the RapidsMeta hierarchy.
- When checking existing tags, we only call RapidsMeta.willNotWorkOnGpu for reasons that have not already been added to the cannotBeReplacedReasons set.
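The de-duplication in the second item can be modelled as follows. This is a minimal Python sketch, not the Scala implementation: `Meta`, `apply_existing_tags`, and the `calls` list are hypothetical and exist only to make the behaviour observable.

```python
from typing import Iterable, List, Set


class Meta:
    """Hypothetical stand-in for a RapidsMeta-like wrapper around an operator."""

    def __init__(self) -> None:
        self.cannot_be_replaced_reasons: Set[str] = set()
        self.calls: List[str] = []  # records every will_not_work_on_gpu call

    def will_not_work_on_gpu(self, reason: str) -> None:
        self.calls.append(reason)
        self.cannot_be_replaced_reasons.add(reason)


def apply_existing_tags(meta: Meta, tagged_reasons: Iterable[str]) -> None:
    """Re-apply reasons found on the plan tag, skipping ones already recorded."""
    for reason in tagged_reasons:
        if reason not in meta.cannot_be_replaced_reasons:
            meta.will_not_work_on_gpu(reason)


meta = Meta()
meta.will_not_work_on_gpu("reason A")                 # recorded during normal tagging
apply_existing_tags(meta, ["reason A", "reason B"])   # only "reason B" is new
```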

Signed-off-by: Andy Grove <andygrove@nvidia.com>

abellina

Could there be a comment in the code or the description of the PR on the gpuSupportedTag change?

abellina · 2021-10-20T13:18:05Z

integration_tests/src/main/python/hash_aggregate_test.py

+def test_hash_groupby_approx_percentile_partial_fallback_to_cpu(aqe_enabled):
+    conf = copy_and_update(_approx_percentile_conf, {
+        'spark.sql.adaptive.enabled': aqe_enabled,
+        'spark.rapids.sql.explain': 'ALL'


Could this test work in a similar way to assert_cpu_and_gpu_are_equal_collect_with_capture, where lists of "exist" and "non_exist" classes are used to assert that the query has indeed fallen back?

I am thinking of the case where the cast gets "pushed up" to a projection after the hash agg in the future.
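The kind of check suggested here can be sketched without Spark. The helper below is hypothetical (`assert_plan_classes` is not part of the integration test framework) and the plan text is invented for illustration; it only shows the idea of asserting on operator class names in a captured plan rather than relying on which expression happens to fall back.

```python
from typing import Iterable


def assert_plan_classes(plan_text: str,
                        exist_classes: Iterable[str],
                        non_exist_classes: Iterable[str]) -> None:
    """Assert that a captured plan mentions some operators and omits others."""
    for cls in exist_classes:
        assert cls in plan_text, f"expected {cls} in plan"
    for cls in non_exist_classes:
        assert cls not in plan_text, f"did not expect {cls} in plan"


# Invented plan text: both halves of the aggregate stayed on the CPU.
captured_plan = """
ObjectHashAggregateExec(final)
  ShuffleExchangeExec
    ObjectHashAggregateExec(partial)
"""

assert_plan_classes(
    captured_plan,
    exist_classes=["ObjectHashAggregateExec"],
    non_exist_classes=["GpuHashAggregateExec"],
)
```

A check written this way keeps passing if the cast moves into a separate projection, because it depends only on which aggregate operators appear in the plan.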

andygrove · 2021-10-25T22:30:12Z

build

andygrove · 2021-10-26T02:36:03Z

build

andygrove · 2021-10-26T23:27:56Z

build

andygrove · 2021-10-27T15:48:27Z

build

andygrove · 2021-10-27T21:38:02Z

build

revans2 · 2021-10-28T14:24:12Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

-          TypeSig.MAP).nested(), TypeSig.all),
+          TypeSig.MAP + TypeSig.BINARY).nested()
+        .withPsNote(TypeEnum.BINARY,
+          "Binary columns are supported but not for the sort expressions")


My main concern is that we have not added notes for the other types that can only go along for the ride, like arrays. We should be consistent. Also, we should not put in a note for maps, because Spark only allows maps to go along for the ride; you cannot sort on them.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/aggregate.scala

andygrove · 2021-11-01T17:33:47Z

build

andygrove · 2021-11-01T20:10:33Z

build failed with:

Error: 1-01T18:35:45.150Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:3.0.0:run (create-parallel-world) on project rapids-4-spark_2.12: An Ant BuildException has occured: The following error occurred while executing this line:
Error: 1-01T18:35:45.150Z] [ERROR] /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3157/dist/maven-antrun/build-parallel-worlds.xml:98: exec returned: 255
Error: 1-01T18:35:45.150Z] [ERROR] around Ant part ...<ant antfile="/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3157/dist/maven-antrun/build-parallel-worlds.xml" target="build-parallel-worlds" />... @ 9:163 in /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3157/dist/target/antrun/build-main.xml

andygrove · 2021-11-02T12:55:50Z

build

Implement fix and basic test

ae86e6d

Signed-off-by: Andy Grove <andygrove@nvidia.com>

andygrove added the bug Something isn't working label Oct 19, 2021

andygrove added this to the Oct 18 - Oct 29 milestone Oct 19, 2021

andygrove self-assigned this Oct 19, 2021

abellina reviewed Oct 20, 2021

andygrove added 3 commits October 25, 2021 16:01

improve test based on PR feedback

5321446

check tags more consistently

a2b96e1

Add test that does not depend on CAST of array falling back to CPU

e0229e6

add license header

e91bab1

andygrove added 6 commits October 26, 2021 09:14

simplify test to use spark.rapids.sql.hashAgg.replaceMode

8480729

Update comments

3743bfa

revert plugin changes

eb864f9

fix some regressions

74edef1

WIP temporarily allow ObjectHashAggregate/SortAggregate/Sort to allow…

96e52a6

… binary input to see what other issues remain

scalastyle

3ca8c75

andygrove added 3 commits October 27, 2021 08:26

add placeholders for BinaryType checks

8689bf0

Merge remote-tracking branch 'nvidia/branch-21.12' into issue-3834

30d075a

ps notes and type checks

36a43aa

andygrove added 4 commits October 27, 2021 10:26

enable more tests

ff500a1

remove redundant and untested type check

1413059

add test for sort fallback to cpu with binary input

b183cdf

test for SortExec with BinaryType

a82d6ef

andygrove requested review from abellina and revans2 October 27, 2021 21:38

andygrove changed the title ~~WIP: Prevent approx_percentile aggregate from being split between CPU and GPU~~ Prevent approx_percentile aggregate from being split between CPU and GPU Oct 27, 2021

andygrove marked this pull request as ready for review October 27, 2021 21:38

revans2 reviewed Oct 28, 2021

andygrove added 2 commits October 28, 2021 08:36

revert changes to aggregate.scala

245aaf6

remove ps note for SortExec BinaryType

f8ef03e

Salonijain27 modified the milestones: Oct 18 - Oct 29, Nov 1 - Nov 12 Oct 29, 2021

revans2 approved these changes Nov 1, 2021

Merge remote-tracking branch 'nvidia/branch-21.12' into issue-3834

abce3d1

andygrove merged commit e03c66b into NVIDIA:branch-21.12 Nov 2, 2021

andygrove deleted the issue-3834 branch November 2, 2021 16:18
