[BUG] Crossjoin performance degraded a lot on RAPIDS 21.10 snapshot #3736
Comments
Is this really specific to Spark 3.2? Looking at the event logs, I see the 3.1 run has a "filter time" metric that is missing from the 3.2 event log. This leads me to think that the 3.1 run was run a while back before #3105, and it too would see a similar performance regression if re-run on the same snapshot. I suspect the regression is linked to #3242, since that changed a broadcast nested loop inner join from being implemented as a cross join followed by a separate filter into a nested loop join with an AST condition evaluated during the join.
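To illustrate the plan shape being discussed, here is a minimal hedged sketch (column names and sizes are made up for illustration, not taken from the benchmark) of an inner join with a non-equality condition, which Spark plans as a BroadcastNestedLoopJoin:

```scala
import org.apache.spark.sql.functions.expr

// Hypothetical example: an inner join with a non-equi condition, which Spark
// plans as a BroadcastNestedLoopJoin. Before #3242 the plugin executed this
// as a cross join producing all row pairs, then applied the condition in a
// separate filter; after #3242 the condition is compiled to an AST and
// evaluated while the join runs, so no separate filter step appears.
val left  = spark.range(1000).toDF("l")
val right = spark.range(1000).toDF("r")
val joined = left.join(right, expr("l < r"), "inner")
joined.explain()  // plan should show a nested loop join carrying the condition
```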
The metrics work I did made the filter time a debug metric.
Ah, good to know, I didn't realize that. That's even more evidence that the 3.1 run is from a different snapshot build, since #3105 removed the filter metric from
Looking at the configs from the two event logs and examining the plugin jars:

For the Spark 3.1.1 run, it was using rapids-4-spark_2.12-21.08.0.jar

For the Spark 3.2 run, it was using rapids-4-spark_2.12-21.10.0-20210928.181339-115.jar

It would be good to check the performance of the Spark 3.1.1 run on the same plugin version used for the Spark 3.2 run.
I ran with the latest jar for 3.1.1 and 3.2.0 in spark2a and confirmed there is a regression.
Updated the headline since @abellina confirmed this is not specific to Spark 3.2.
There are some patches that @jlowe is working on to address these for 21.10, related to AST changes for the join, and some extra calls we now have to
Describe the bug
Crossjoin performance degraded a lot on Spark 3.2rc3 + RAPIDS 21.10 snapshot.
@nvliyuan found that on Spark2a cluster, the performance difference is as below:
Spark 3.1: 13s (event log: app-20210908040027-3651)
Spark 3.2rc3: 324s (event log: app-20211001082934-0655)
As per the query plan, there are some non-GPU plan nodes in the Spark 3.2rc3 event log.
Steps/Code to reproduce bug
Run the microbenchmark cross-join query, which is a 1-million-row self-join:
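The exact benchmark query is not included in the report; the following is a minimal sketch of a 1-million-row self cross join of the kind described (the table shape and the noop write used to force execution are assumptions, not the actual microbenchmark):

```scala
// Hypothetical sketch of a 1-million-row self cross join
// (produces 10^12 output rows, so this is expensive by design).
val df = spark.range(1000000L).toDF("id")
val joined = df.as("a").crossJoin(df.as("b"))
joined.write.format("noop").mode("overwrite").save()  // force full execution
```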
Expected behavior
It should have similar performance as before.
Environment details (please complete the following information)
rapids-4-spark_2.12-21.10.0-20210929.183153-116.jar
cudf-21.10.0-20210930.121728-64.jar
Spark2a 8 nodes standalone cluster.