Fix potential problems and AQE updates in Qual tool #1021

amahussein · 2024-05-17T16:10:01Z

Signed-off-by: Ahmed Hussein (amahussein) a@ahussein.me

Fixes #1019

What this PR fixes:

Fixes a bug introduced by "Refactor TaskEnd to be accessible by Q/P tools" #1000, where the Potential Problems column of the generated CSV file sql_duration_and_executor_cpu_time_percent.csv would be empty.
Fixes an old bug in the Qual tool caused by processing SQL plans as they are created or updated (on "SQLExecutionStart" and AQEUpdate). This bug left some duplicate records in the SQL-to-problematic

Changes:

To fix the old bug caused by AQE updates:
- Avoid processing SQLPlans in the event handler. Instead, process the plans after all the eventlogs are processed.
- This implies removing QualificationAppInfo.processSQLPlan, since its logic is the same as AppSQLPlanAnalyzer.processSQLPlanMetrics().
- Fix the implementation of RunningQualificationApp to make sure that QualificationAppInfo.processSQLPlan is called before aggregateStats(). Otherwise, the RunningQualificationApp would have empty dataSources/problematics/writeDataFormats.
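
To illustrate why analyzing a plan on every event can produce duplicates, here is a minimal sketch. It does not use the tool's real classes: `PlanEvent`, `ExecutionStart`, `AdaptiveUpdate`, and `findProblems` are hypothetical names invented for this example, and the "UDF" check is an arbitrary stand-in for the real problem detection.

```scala
import scala.collection.mutable

object DeferredPlanSketch {
  // Hypothetical, simplified event model; not the tool's real classes.
  sealed trait PlanEvent { def sqlId: Long; def plan: Seq[String] }
  case class ExecutionStart(sqlId: Long, plan: Seq[String]) extends PlanEvent
  case class AdaptiveUpdate(sqlId: Long, plan: Seq[String]) extends PlanEvent

  // Hypothetical check: flags any node whose name mentions a UDF.
  def findProblems(plan: Seq[String]): Seq[String] =
    plan.filter(_.contains("UDF")).map(n => s"UDF in $n")

  // Eager approach: analyze every plan version as its event arrives.
  // The same node is reported once per plan version.
  def eagerProblems(events: Seq[PlanEvent]): Seq[(Long, String)] =
    events.flatMap(e => findProblems(e.plan).map(p => (e.sqlId, p)))

  // Deferred approach: the handler only records the latest plan per SQL ID;
  // the analysis runs once, after all events have been consumed.
  def deferredProblems(events: Seq[PlanEvent]): Seq[(Long, String)] = {
    val latestPlans = mutable.LinkedHashMap.empty[Long, Seq[String]]
    events.foreach(e => latestPlans(e.sqlId) = e.plan)
    latestPlans.toSeq.flatMap { case (id, plan) =>
      findProblems(plan).map(p => (id, p))
    }
  }
}
```

With one `ExecutionStart` and one `AdaptiveUpdate` for the same SQL ID, both containing the same UDF node, the eager version reports the problem twice while the deferred version reports it once.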
To fix the empty column:
- Split the logic of AppSQLPlanAnalyzer.processSQLPlanMetrics() so that visitNode() is separate from the main loop.
- Created QualSQLPlanAnalyzer, which extends AppSQLPlanAnalyzer and overrides visitNode() so that it can update the WriteDataFormats.
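
The split between a main loop and an overridable visitNode() can be sketched roughly as follows. This is only an illustration of the pattern, not the actual implementation: `PlanNode`, `BaseAnalyzer`, `QualAnalyzer`, and the "Write <format>" naming convention are all assumptions made for the example.

```scala
import scala.collection.mutable

object VisitNodeSketch {
  // Hypothetical, simplified plan node; the real tool uses richer types.
  case class PlanNode(id: Long, name: String)

  // Base analyzer: the main loop walks the nodes and delegates to visitNode().
  class BaseAnalyzer {
    val visitedNodes: mutable.ArrayBuffer[Long] = mutable.ArrayBuffer.empty[Long]

    protected def visitNode(sqlId: Long, node: PlanNode): Unit = {
      visitedNodes += node.id
    }

    def processPlans(plans: Seq[(Long, Seq[PlanNode])]): Unit = {
      plans.foreach { case (sqlId, nodes) =>
        nodes.foreach(node => visitNode(sqlId, node))
      }
    }
  }

  // Qualification-flavoured analyzer: reuses the loop, adds its own bookkeeping.
  class QualAnalyzer extends BaseAnalyzer {
    val writeFormats: mutable.LinkedHashSet[String] = mutable.LinkedHashSet.empty[String]

    override protected def visitNode(sqlId: Long, node: PlanNode): Unit = {
      super.visitNode(sqlId, node)
      // Assumed naming convention for this sketch only: "Write <format>".
      if (node.name.startsWith("Write ")) {
        writeFormats += node.name.stripPrefix("Write ")
      }
    }
  }
}
```

The subclass never re-implements the traversal; it only extends what happens at each node, which is the property the refactor relies on.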
Misc changes:
- Changed some data structure types to preserve insertion order or key order.
- For unit tests: updated the order of the unsupported execs in the expected Qualification Execs.
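
The description does not say which collection types were chosen, so the following is only a sketch of the two ordering guarantees using standard Scala collections. The helper names `insertionOrdered` and `keyOrdered` are invented for this example.

```scala
import scala.collection.mutable
import scala.collection.immutable.SortedMap

object OrderedCollectionsSketch {
  // Insertion order is preserved by LinkedHashSet (and LinkedHashMap),
  // unlike the plain hash-based variants.
  def insertionOrdered(execs: Seq[String]): List[String] = {
    val seen = mutable.LinkedHashSet.empty[String]
    execs.foreach(seen += _)
    seen.toList
  }

  // Key order is preserved by SortedMap, which iterates in ascending key order.
  def keyOrdered(stageDurations: Seq[(Int, Long)]): List[Int] =
    SortedMap(stageDurations: _*).keys.toList
}
```

Deterministic iteration order is what makes the expected output in unit tests stable, which is presumably why the expected order of unsupported execs had to be updated once.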


core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/QualSQLPlanAnalyzer.scala

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

core/src/main/scala/org/apache/spark/sql/rapids/tool/ToolUtils.scala

tgravescs · 2024-05-20T13:55:00Z

core/src/main/scala/org/apache/spark/sql/rapids/tool/profiling/ApplicationInfo.scala

@@ -196,7 +196,7 @@ class ApplicationInfo(
  processEvents()

  // Process SQL Plan Metrics after all events are processed
-  val planMetricProcessor: AppSQLPlanAnalyzer = AppSQLPlanAnalyzer.processSQLPlan(this)
+  val planMetricProcessor: AppSQLPlanAnalyzer = AppSQLPlanAnalyzer(this)


why is appIndex not passed in here?

Changed the code to pass appIndex to make it more readable.
The reason it was not passing appIndex before is that it is handled by AppSQLPlanAnalyzer.apply().

val sqlAnalyzer = app match {
  case qApp: QualificationAppInfo => new QualSQLPlanAnalyzer(qApp, appIndex)
  case pApp: ApplicationInfo => new AppSQLPlanAnalyzer(pApp, pApp.index)
}
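
For readers unfamiliar with the pattern, a companion-object apply() that picks the concrete analyzer could look roughly like this. All types here (`App`, `ProfApp`, `QualApp`, `Analyzer`, `QualAnalyzer`) are hypothetical stand-ins, and the default index value is an assumption of the sketch, not something taken from the PR.

```scala
object AnalyzerFactorySketch {
  // Hypothetical stand-ins for the tool's application classes.
  sealed trait App
  case class ProfApp(index: Int) extends App
  case class QualApp(name: String) extends App

  class Analyzer(val app: App, val appIndex: Int)
  class QualAnalyzer(qApp: QualApp, idx: Int) extends Analyzer(qApp, idx)

  object Analyzer {
    // The default index is an assumption made for this sketch.
    def apply(app: App, appIndex: Int = 1): Analyzer = app match {
      case q: QualApp => new QualAnalyzer(q, appIndex)
      // Profiling apps carry their own index, so the argument is ignored here.
      case p: ProfApp => new Analyzer(p, p.index)
    }
  }
}
```

This shows why a call such as `Analyzer(app)` can work without an explicit index, and also why passing it explicitly, as the review suggested, makes the call site easier to read.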

core/src/main/scala/org/apache/spark/sql/rapids/tool/qualification/QualificationAppInfo.scala


amahussein

Thanks @tgravescs
I addressed the comments.

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/QualSQLPlanAnalyzer.scala

core/src/main/scala/org/apache/spark/sql/rapids/tool/ToolUtils.scala

amahussein · 2024-05-20T17:55:21Z


core/src/main/scala/org/apache/spark/sql/rapids/tool/qualification/QualificationAppInfo.scala

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

parthosa · 2024-05-20T17:26:31Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

-  private val sqlPlanNodeIdToStageIds: mutable.HashMap[(Long, Long), Set[Int]] =
-    mutable.HashMap.empty[(Long, Long), Set[Int]]
+  // A map between (SQL ID, Node ID) and the set of stage IDs
+  // TODO: The Qualification should use this map instead of building a new set for each exec.


Does this have any performance impact and do we have a tracking issue or need one?

There is indeed a memory overhead because more objects are allocated in memory. This overhead was introduced in #1000, but that was the price of combining the two tools.
Depending on priorities, we will address the redundant information stored in QualificationAppInfo.
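
As a rough illustration of the TODO in the diff above, a single shared map keyed by (SQL ID, node ID) could be queried by every consumer instead of each exec building its own set of stage IDs. `StageLookup`, `record`, and `stagesFor` are hypothetical names for this sketch and do not correspond to the tool's API.

```scala
import scala.collection.mutable

object StageLookupSketch {
  class StageLookup {
    // Key: (SQL ID, node ID); value: stage IDs associated with that node.
    private val nodeToStages = mutable.HashMap.empty[(Long, Long), Set[Int]]

    def record(sqlId: Long, nodeId: Long, stageId: Int): Unit = {
      val key = (sqlId, nodeId)
      nodeToStages(key) = nodeToStages.getOrElse(key, Set.empty[Int]) + stageId
    }

    // Callers read the shared entry rather than rebuilding a set per exec.
    def stagesFor(sqlId: Long, nodeId: Long): Set[Int] =
      nodeToStages.getOrElse((sqlId, nodeId), Set.empty[Int])
  }
}
```

Because the values are immutable sets, handing the same instance to several readers does not copy it, which is the kind of saving the TODO hints at.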

We do have [FEA] Improve performance of core module #367 as an umbrella for performance issues.

There is a plan to make the database storage a summer project for interns.

There is "Refactor core to converge Qualification and Profiling Tools implementation" #980, which is still open, and we can add subtasks as we think necessary before considering it done.

The incremental refactor changes the code frequently. I find that filing an issue for each possible improvement would be cumbersome and would create a flood of overlapping issues. So, unless there is a bug, I mark possible improvements as TODOs that we can revisit later depending on priorities.

parthosa · 2024-05-20T18:31:21Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+   *
+   * It has the following effect on the visitor object:
+   * 1- It updates the sqlIsDsOrRDD argument to True when the visited node is an RDD or Dataset.
+   * 2- If the SLID is an RDD, the potentialProblems is cleared because once SQL is marked as RDD,


nit: typo. Should this be 'SQLID'?


Fix potential problems and AQE updates in Qual tool

f0f0b20

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me> Fixes NVIDIA#1019

amahussein added the labels bug (Something isn't working) and core_tools (Scope the core module (scala)) May 17, 2024

amahussein requested review from tgravescs, parthosa, cindyyuanjiang and nartal1 May 17, 2024 16:10

amahussein self-assigned this May 17, 2024

tgravescs reviewed May 20, 2024


remove sqlInfoCase.promatic field and add more details to docs

6900d12

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

amahussein commented May 20, 2024


amahussein requested a review from tgravescs May 20, 2024 17:58

parthosa reviewed May 20, 2024


tgravescs previously approved these changes May 20, 2024


fix typo in scaladoc

757aeee

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

amahussein dismissed tgravescs’s stale review via 757aeee May 20, 2024 20:12

tgravescs approved these changes May 20, 2024


amahussein merged commit aaf7dc8 into NVIDIA:dev May 20, 2024
15 checks passed
