Prevent approx_percentile aggregate from being split between CPU and GPU #3862

Merged
merged 21 commits into from
Nov 2, 2021

Conversation

@andygrove andygrove commented Oct 19, 2021

Signed-off-by: Andy Grove andygrove@nvidia.com

Closes #3834

We tag the underlying SparkPlan when an operator is not supported on the GPU, and we rely on these tags when planning query stages with AQE enabled, because each query stage is planned as a subset of the plan without context about its parent. However, we were only respecting these tags for exchange operators and were ignoring them for aggregates. The main change in this PR is to check these tags on all operators.
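To make the behavior change concrete, here is a minimal Python sketch of the idea, not the plugin's Scala code. `PlanNode`, `GPU_SUPPORTED_TAG`, `can_run_on_gpu`, and the reason string are hypothetical names chosen for illustration, and the assumption that the tag value is a set of reasons is mine.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical tag key; assumed to map to the set of reasons a node must stay on CPU.
GPU_SUPPORTED_TAG = "gpu_supported"


@dataclass
class PlanNode:
    name: str
    tags: Dict[str, Set[str]] = field(default_factory=dict)


def can_run_on_gpu(node: PlanNode, exchanges_only: bool) -> bool:
    """Return False when a previously recorded tag says the node must stay on CPU."""
    reasons = node.tags.get(GPU_SUPPORTED_TAG, set())
    if exchanges_only and "Exchange" not in node.name:
        # Behavior before this PR: tags on non-exchange operators were ignored.
        return True
    # Behavior after this PR: any operator with recorded reasons stays on CPU.
    return not reasons


# An aggregate tagged during an earlier planning pass (illustrative reason text).
partial_agg = PlanNode("HashAggregateExec")
partial_agg.tags[GPU_SUPPORTED_TAG] = {"other half of the aggregate runs on CPU"}

old_decision = can_run_on_gpu(partial_agg, exchanges_only=True)
new_decision = can_run_on_gpu(partial_agg, exchanges_only=False)
```

In this sketch the old check would let the tagged aggregate go to the GPU, while the new check keeps it on the CPU, which is the split this PR is meant to prevent.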

This change caused some regressions that also had to be addressed in this PR, such as:

  • The CoalesceShufflePartitions optimization performs a transformUp and may replace ShuffleQueryStageExec with CustomShuffleReaderExec. When it does, Spark copies the tags from ShuffleQueryStageExec to CustomShuffleReaderExec, including the "no need to replace ShuffleQueryStageExec" tag, so this tag needed to be ignored.
  • ObjectHashAggregateExec and SortAggregateExec both had type signatures declaring that BinaryType was not supported, which is not always the case. We do support BinaryType for aggregate buffers, and the aggregate code has special handling both for the case where we can convert these buffers between CPU and GPU and for the case where we cannot. This PR adds BinaryType to the type signatures so that these operators are not tagged early on as unsupported on GPU, which was causing regressions in the AQE case.
  • SortExec was incorrectly declaring that BinaryType is not supported, so the type checks are updated and new tests are added to demonstrate that we fall back to CPU if a sort expression is binary but allow non-sort columns to be binary.
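The SortExec rule in the last bullet can be sketched as follows. This is a simplified Python model under my own assumptions, not the plugin's TypeSig machinery; `Column` and `sort_can_run_on_gpu` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str  # for example "int", "string", or "binary"


def sort_can_run_on_gpu(columns: List[Column], sort_keys: Iterable[str]) -> bool:
    """Binary columns may be carried through a sort, but may not be sort keys."""
    by_name = {c.name: c for c in columns}
    return all(by_name[key].dtype != "binary" for key in sort_keys)


columns = [Column("id", "int"), Column("payload", "binary")]

# Sorting by a non-binary key while carrying a binary column is allowed.
carried_binary_ok = sort_can_run_on_gpu(columns, ["id"])
# Sorting by the binary column itself falls back to CPU.
binary_key_ok = sort_can_run_on_gpu(columns, ["payload"])
```

The point of the sketch is that the check looks only at the sort expressions, not at every column in the operator's output.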

There are some other smaller changes in the PR:

  • gpuSupportedTag was moved to a new object RapidsMeta so that it could be referenced from classes outside of the RapidsMeta hierarchy.
  • When checking existing tags, we only call RapidsMeta.willNotWorkOnGpu for reasons that have not already been added to the cannotBeReplacedReasons set.
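The second bullet amounts to de-duplicating reasons before re-reporting them. A minimal Python sketch, assuming the reasons are stored as a set of strings; `Meta`, `will_not_work_on_gpu`, and `apply_existing_tags` are hypothetical stand-ins for the Scala code, and the reason strings are made up.

```python
from typing import Set


class Meta:
    """Stand-in for a RapidsMeta-like wrapper that records fallback reasons."""

    def __init__(self) -> None:
        self.cannot_be_replaced_reasons: Set[str] = set()
        self.calls = 0  # counts how often a reason was reported

    def will_not_work_on_gpu(self, because: str) -> None:
        self.calls += 1
        self.cannot_be_replaced_reasons.add(because)


def apply_existing_tags(meta: Meta, tagged_reasons: Set[str]) -> None:
    """Report only the tagged reasons that the meta has not already recorded."""
    new_reasons = tagged_reasons - meta.cannot_be_replaced_reasons
    for reason in sorted(new_reasons):
        meta.will_not_work_on_gpu(reason)


meta = Meta()
meta.will_not_work_on_gpu("reason A")  # recorded during normal tagging
apply_existing_tags(meta, {"reason A", "reason B"})  # only "reason B" is new
```

After the second call, `meta.calls` is 2 rather than 3, because "reason A" was already in the set.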

@andygrove andygrove added the bug Something isn't working label Oct 19, 2021
@andygrove andygrove added this to the Oct 18 - Oct 29 milestone Oct 19, 2021
@andygrove andygrove self-assigned this Oct 19, 2021
@abellina abellina left a comment

Could there be a comment in the code or the description of the PR on the gpuSupportedTag change?

def test_hash_groupby_approx_percentile_partial_fallback_to_cpu(aqe_enabled):
    conf = copy_and_update(_approx_percentile_conf, {
        'spark.sql.adaptive.enabled': aqe_enabled,
        'spark.rapids.sql.explain': 'ALL'

debug?

Could this test work in a similar way to assert_cpu_and_gpu_are_equal_collect_with_capture, where lists of "exist" and "non_exist" classes are used to assert that the query has indeed fallen back?

I'm thinking of the case where the cast gets "pushed up" to a projection after the hash agg in the future.
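For illustration, the kind of check suggested here could look roughly like the sketch below. It is a standalone Python model, not the integration test framework's helper: `assert_plan_contains` is a hypothetical function, and the captured operator list is made-up sample data rather than output from a real run.

```python
from typing import Iterable, Sequence


def assert_plan_contains(plan_classes: Iterable[str],
                         exist_classes: Sequence[str] = (),
                         non_exist_classes: Sequence[str] = ()) -> None:
    """Fail unless every expected operator is present and no forbidden one is."""
    present = set(plan_classes)
    missing = [c for c in exist_classes if c not in present]
    unexpected = [c for c in non_exist_classes if c in present]
    assert not missing, f"expected operators not found: {missing}"
    assert not unexpected, f"unexpected operators found: {unexpected}"


# Made-up operator names standing in for a captured executed plan.
captured = ["ObjectHashAggregateExec", "ShuffleExchangeExec", "GpuProjectExec"]

# Passes: the aggregate stayed on CPU and no GPU aggregate appears in the plan.
assert_plan_contains(captured,
                     exist_classes=["ObjectHashAggregateExec"],
                     non_exist_classes=["GpuHashAggregateExec"])
```

Asserting on operator classes rather than on explain output would keep the test valid even if an expression such as the cast moves to a different operator.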

@andygrove

build

@andygrove

build

@andygrove

build

@andygrove

build

@andygrove

build

@andygrove andygrove requested review from abellina and revans2 October 27, 2021 21:38
@andygrove andygrove changed the title from "WIP: Prevent approx_percentile aggregate from being split between CPU and GPU" to "Prevent approx_percentile aggregate from being split between CPU and GPU" Oct 27, 2021
@andygrove andygrove marked this pull request as ready for review October 27, 2021 21:38
TypeSig.MAP).nested(), TypeSig.all),
TypeSig.MAP + TypeSig.BINARY).nested()
.withPsNote(TypeEnum.BINARY,
"Binary columns are supported but not for the sort expressions")

My main concern is that we have not added notes for the other types that can only go along for a ride, like arrays. We should be consistent. Also we should not put in a note for maps because Spark only allows maps to go along for the ride. You cannot sort on them.

@andygrove

build

@andygrove

build failed with:

Error: 1-01T18:35:45.150Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:3.0.0:run (create-parallel-world) on project rapids-4-spark_2.12: An Ant BuildException has occured: The following error occurred while executing this line:
Error: 1-01T18:35:45.150Z] [ERROR] /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3157/dist/maven-antrun/build-parallel-worlds.xml:98: exec returned: 255
Error: 1-01T18:35:45.150Z] [ERROR] around Ant part ...<ant antfile="/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3157/dist/maven-antrun/build-parallel-worlds.xml" target="build-parallel-worlds" />... @ 9:163 in /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3157/dist/target/antrun/build-main.xml

@andygrove

build

@andygrove andygrove merged commit e03c66b into NVIDIA:branch-21.12 Nov 2, 2021
@andygrove andygrove deleted the issue-3834 branch November 2, 2021 16:18

Successfully merging this pull request may close these issues.

[BUG] Approx_percentile deserialize error when calling "show" rather than "collect"
4 participants