
Add leafNodeDefaultParallelism support #3408

Merged: 3 commits merged into NVIDIA:branch-21.10 on Sep 9, 2021

Conversation

wbo4958 (Collaborator) commented Sep 8, 2021

This PR fixes #1925 by adding leafNodeDefaultParallelism to the shims layer.

Signed-off-by: Bobby Wang <wbo4958@gmail.com>
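
As a rough illustration of the approach described above, here is a minimal sketch of what such a shim-layer hook could look like; the trait and class names below are assumptions for illustration, not the exact code added by this PR:

  import org.apache.spark.sql.SparkSession

  // Illustrative sketch only; the names here are assumptions, not the code from this PR.
  trait SparkShims {
    // Default partition count for leaf nodes such as GpuRangeExec.
    def leafNodeDefaultParallelism(session: SparkSession): Int
  }

  // Shims for Spark versions before 3.2.0: the config does not exist there,
  // so fall back to SparkContext.defaultParallelism, matching CPU Spark.
  class PreSpark320Shims extends SparkShims {
    override def leafNodeDefaultParallelism(session: SparkSession): Int =
      session.sparkContext.defaultParallelism
  }

  // Shims for Spark 3.2.0+: honor spark.sql.leafNodeDefaultParallelism when it is set,
  // mirroring what SparkSession does in Spark 3.2.0.
  class Spark320Shims extends SparkShims {
    override def leafNodeDefaultParallelism(session: SparkSession): Int =
      session.conf.getOption("spark.sql.leafNodeDefaultParallelism")
        .map(_.toInt)
        .getOrElse(session.sparkContext.defaultParallelism)
  }
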
wbo4958 (Collaborator, Author) commented Sep 8, 2021

build

@@ -378,7 +378,8 @@ case class GpuRangeExec(range: org.apache.spark.sql.catalyst.plans.logical.Range
   val start: Long = range.start
   val end: Long = range.end
   val step: Long = range.step
-  val numSlices: Int = range.numSlices.getOrElse(sparkContext.defaultParallelism)
+  val numSlices: Int = range.numSlices.getOrElse(ShimLoader.getSparkShims
+    .leafNodeDefaultParallelism(ShimLoader.getSparkShims.sessionFromPlan(this)))
gerashegalov (Collaborator) commented on this diff:

We already have an inherited sparkSession function from GpuExec, so we can just write leafNodeDefaultParallelism(sparkSession).

but I think we can get away without a Shim if we simply spell out what https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L781 does. I think it's ok if we "accidentally" understand the 3.2.0+ property in prior Spark versions.

  val numSlices: Int = range.numSlices.getOrElse(
    sparkSession.conf.getOption("spark.sql.leafNodeDefaultParallelism").map(_.toInt)
      .getOrElse(sparkSession.sparkContext.defaultParallelism)
  )

wbo4958 (Collaborator, Author) replied Sep 8, 2021:

What if users have set "spark.sql.leafNodeDefaultParallelism" on a Spark version prior to 3.2? Then the behavior would differ between GPU and CPU. What's your opinion? @gerashegalov

gerashegalov (Collaborator) replied Sep 8, 2021:

The reason I thought it's OK to process spark.sql.leafNodeDefaultParallelism even for earlier versions is that:

  • it does not change query results logically
  • the behavior is easily tweakable by config

In this particular case, if we wanted to be very rigorous about it, we could call ShimLoader.getSparkShims.getSparkShimVersion and check the major and minor versions accordingly, and still avoid introducing extra shim changes. IMO it's not necessary.
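
For reference, a rough sketch of this version-gated alternative, written as it might appear inside GpuRangeExec (it assumes sparkSession and range are in scope, and that SparkShimVersion is a case class carrying major/minor/patch numbers; that shape is an assumption here, not confirmed by the thread):

  // Sketch of the version-gated fallback; the SparkShimVersion shape is assumed.
  val isAtLeast320: Boolean = ShimLoader.getSparkShims.getSparkShimVersion match {
    case SparkShimVersion(major, minor, _) => major > 3 || (major == 3 && minor >= 2)
    case _ => false // other distro version types (e.g. ClouderaShimVersion) would need their own mapping
  }
  val numSlices: Int = range.numSlices.getOrElse {
    if (isAtLeast320) {
      sparkSession.conf.getOption("spark.sql.leafNodeDefaultParallelism").map(_.toInt)
        .getOrElse(sparkSession.sparkContext.defaultParallelism)
    } else {
      sparkSession.sparkContext.defaultParallelism
    }
  }

The catch-all case is exactly where a newly added ShimVersion type could slip through unnoticed, which is the maintenance concern raised in the next reply.
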

wbo4958 (Collaborator, Author) replied:

Yeah, it does not change query results logically, but it will change the number of output files, so I think we should be as rigorous as Spark itself. Just like @revans2 said, users tend to write a script with this configuration and run their queries against different Spark versions.

And I'd prefer to add this in the shim layer, since getSparkShimVersion has to match the different version types (SparkShimVersion, ClouderaShimVersion, ...) to get the major/minor numbers. What if someone adds a new ShimVersion and forgets to update that check? @gerashegalov what's your opinion?

gerashegalov (Collaborator) replied:

IMO, in that scenario the user would rather appreciate seeing consistent behavior between Spark versions.

wbo4958 (Collaborator, Author) replied:

Sorry @gerashegalov, you have not convinced me. I'd like to merge this PR.

gerashegalov (Collaborator) replied:

Well, at least I tried :)

Having consistent behavior used to be a strong argument.

wbo4958 (Collaborator, Author) commented Sep 8, 2021

build

revans2 (Collaborator) left a review comment:

For me this looks fine. If @gerashegalov convinces you not to do it in the shim, I am okay with that too. I don't think it is going to be common for the config in question to be set on older versions of Spark, but I can see it happening, especially if someone is switching back and forth trying to test something between different versions.

sameerz added this to the Aug 30 - Sept 10 milestone on Sep 9, 2021
wbo4958 merged commit 9742ea7 into NVIDIA:branch-21.10 on Sep 9, 2021
wbo4958 deleted the leafNodeParallism branch on September 9, 2021 21:19
Successfully merging this pull request may close these issues:

[FEA] Add default parallelism configuration for Spark SQL queries