Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add explain Plugin API for CPU plan #3850

Merged
merged 67 commits into from
Oct 21, 2021

Conversation

tgravescs
Copy link
Collaborator

@tgravescs tgravescs commented Oct 18, 2021

part of #3659

I specifically did not put in user docs anywhere, we need to decide what is best place or if we want to leave this undocumented until we have tried it on a few customers.

This adds an API that will run the CPU plan through the plugin api's to see what it can convert and what it can't and just outputs a string with the explain output. Since we are looking at the CPU plan it runs other things as compared to when the rapids spark plugin would normally run in the query execution, so we strip those back out. ie cpu plan that we look at here will have CollapseCodegenStages and subqueries and ReuseExchangeAndSubquery applied, whereas normally our plugin runs before those get applied to the plan. See more details in prepareExplainOnly

It currently still requires the rapids 4 spark jar and cudf jars to be present in the Spark application , but doesn't need gpu or cuda.

The main api exposed: com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(df)

The changes in GPUOverrides are mostly moving code from the class into the object and then adding the functions needed to process the CPU plan without doing the convert.

Perhaps in the future we can try to make it more standalone tool but we need to get the GPUoverrides meta stuff where we decide if we can convert or not out by itself.

We may also want to add more API's there to perhaps get some sort of summary report rather then dumping out all the explain text.

I ran this api through the NDS queries and compared to actual GPU explain and it did pretty good when comparing the 2. This is how I found I missed the subquery expressions. So now we follow those and explain on those for cpu side.

I manually tested this on Databricks, only difference is you have to add libraries through UI so that it is available in python notebook.

@tgravescs tgravescs added the feature request New feature or request label Oct 18, 2021
@tgravescs tgravescs added this to the Oct 18 - Oct 29 milestone Oct 18, 2021
@tgravescs tgravescs self-assigned this Oct 18, 2021
@tgravescs
Copy link
Collaborator Author

build

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Copy link
Collaborator Author

build

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Copy link
Collaborator Author

build

@tgravescs
Copy link
Collaborator Author

build

andygrove
andygrove previously approved these changes Oct 19, 2021
* @return String containing the explained plan.
*/
def explainPotentialGpuPlan(df: DataFrame, explain: String = "ALL"): String = {
val gpuOverrideClass = ShimLoader.loadGpuOverrides()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want any kind of error handling here? If this is a public API we should probably have something talking about exceptions and saying what it can throw.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes good point, I don't want to really log from this interface so I'll delcare what can be thrown

*/
def explainPotentialGpuPlan(df: DataFrame, explain: String = "ALL"): String = {
val gpuOverrideClass = ShimLoader.loadGpuOverrides()
val explainMethod = gpuOverrideClass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we try to avoid using reflection in the code we have full control over? We already make another top-level class file, so we can have a shareable API between classloaders

trait ExplainPlan {
   def explainPotentialGpuPlan(df: DataFrame, explain: String = "ALL")
}

In GpuOverrides.scala we can define a class that call GpuOverrides.explain

class GpuExplainPlain extends ExplainPlan {
  override def explainPotentialGpuPlan(df: DataFrame, explain: String): String = {
    GpuOverrides.explainPotentialGpuPlan(df, explain)
  }
}

in ShimLoader

 def newExplainPlan(): ExplainPlan = {
    ShimLoader.newInstanceOf[ExplainPlan]("com.nvidia.spark.rapids.ExplainPlanImpl")
  }

The client API is

ShimLoader.newExplainPlan.explainPotentialGpuPlan

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to look at more closely as I'm not following. your last line the client API, are you saying that is the public user facing api? why would we want to expose the Shimloader as a public api?
you also have a ExplainPlanImpl class, is that supposed to match the GpuExplainPlain?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you mean the client API as in the object ExplainPlan.explainPotentialGpuPlan would call? Basically you are just saying create a trait that is public so can be used by both because it will be in unshimmed code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I think this works, the only issue is the trait can't be called ExplainPlan if we want the pyspark to be able to call it, but we can change name to ExplainPlanType or something.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you mean the client API as in the object ExplainPlan.explainPotentialGpuPlan would call? Basically you are just saying create a trait that is public so can be used by both because it will be in unshimmed code?

that's correct. Having a trait/interface/abstract class loaded via parent classloader enables us to avoid using reflection even if the implementation was loaded via a child classloader.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Copy link
Collaborator Author

build

@tgravescs tgravescs merged commit 9d5ed8d into NVIDIA:branch-21.12 Oct 21, 2021
@tgravescs tgravescs deleted the qualExplain branch October 21, 2021 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants