Refactor TaskEnd to be accessible by Q/P tools #1000

Merged · 8 commits · May 16, 2024

Conversation

@amahussein (Collaborator) commented on May 8, 2024

Contributes to #980

This PR is another step toward closing the gap between the Q/P tools.

With the changes in this PR, the Qualification tool will create a new subdirectory raw_metrics that contains a folder for each application.

rapids_4_spark_qualification_output
    |
    |->raw_metrics/
          |--> app-id-1/
          |--> app-id-2/
          |--> app-id-n/
                 |
                 |-> raw_information.log
                 |-> sql_level_aggregated_task_metrics.csv
                 |-> stage_level_aggregated_task_metrics.csv
                 |-> job_level_aggregated_task_metrics.csv
                 |-> sql_to_stage_information.csv
                 |-> sql_duration_and_executor_cpu_time_percent.csv
                 |-> sql_plan_metrics_for_application.csv
                 |-> io_metrics.csv
                 |-> *ALL_OTHER_FILES*.csv
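
For illustration only, the per-application layout above could be produced with something like the sketch below; the helper name and the use of java.nio here are assumptions, not the tool's actual code:

    object RawMetricsLayout {
      import java.nio.file.{Files, Path, Paths}

      // Hypothetical helper (not the tool's actual API): creates
      // rapids_4_spark_qualification_output/raw_metrics/<app-id>/ under the given
      // output root and returns the directory where the per-app files are written.
      def rawMetricsDirFor(outputRoot: String, appId: String): Path = {
        val dir = Paths.get(outputRoot,
          "rapids_4_spark_qualification_output", "raw_metrics", appId)
        Files.createDirectories(dir)
        dir
      }
    }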

The following files are generated:

  • raw_information.log contains the text format of the tables (equivalent to profile.log):

  • “Executor Information”: executor_information
  • “Job Information”: job_information
  • “SQL to Stage Information”: sql_to_stage_information
  • “SQL Plan Metrics for Application”: sql_plan_metrics_for_application
  • “WholeStageCodeGen Mapping”: wholestagecodegen_mapping
  • “Failed Tasks”
  • “Failed Stages”
  • “Failed Jobs”
  • “Removed BlockManagers”
  • “Removed Executors”
  • “Stage level aggregated task metrics”: stage_level_aggregated_task_metrics (unlike the Profiler output, stage-level metrics are separated from job-level metrics so that they are easier to process)
  • “Job level aggregated task metrics”: job_level_aggregated_task_metrics
  • “Shuffle Skew Check”: shuffle_skew_check
  • “SQL level aggregated task metrics”: sql_level_aggregated_task_metrics
  • “IO Metrics”: io_metrics
  • “SQL Duration and Executor CPU Time Percent”: sql_duration_and_executor_cpu_time_percent

Changes

  • The TaskCase class has moved to a new Scala class, TaskModel
  • taskEnd, the ArrayBuffer holding all the TaskCase objects, has moved to a new container class, TaskModelManager
  • TaskModelManager becomes a member of AppBase
  • The storage of tasks has changed from an ArrayBuffer to a HashMap. This was done because scanning the ArrayBuffer is an O(N) operation every time the tools fetch tasks by stageID. Evaluating this optimization, I found that the Profiling tool improved by about 10% (the gain depends on how many tasks are in the eventlog). A minimal sketch of the new container follows this list.
  • The StageCompleted event handler adds the stage accumulables for both Q/P tools
  • The analysis code is simplified because the new HashMap already groups tasks by stageID and stageAttempt
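
A minimal sketch of the new task container mentioned above; TaskModel and TaskModelManager are the class names introduced by this PR, but the fields and method names shown here are simplified assumptions rather than the exact implementation:

    import scala.collection.mutable.{ArrayBuffer, HashMap}

    // Simplified stand-in for the per-task record; the real TaskModel carries many more metrics.
    case class TaskModel(stageId: Int, stageAttemptId: Int, taskId: Long, duration: Long)

    // Container replacing the flat taskEnd ArrayBuffer. Tasks are grouped by
    // stageId and stageAttemptId, so fetching the tasks of a stage is a map lookup
    // instead of an O(N) scan over every task.
    class TaskModelManager {
      private val stageAttemptToTasks =
        HashMap[Int, HashMap[Int, ArrayBuffer[TaskModel]]]()

      def addTask(task: TaskModel): Unit = {
        val attempts = stageAttemptToTasks.getOrElseUpdate(task.stageId, HashMap())
        attempts.getOrElseUpdate(task.stageAttemptId, ArrayBuffer()) += task
      }

      // All tasks of a stage across its attempts, without scanning unrelated tasks.
      def getAllTasksForStage(stageId: Int): Iterable[TaskModel] =
        stageAttemptToTasks.get(stageId).map(_.values.flatten).getOrElse(Iterable.empty)
    }

Grouping tasks by stageId and stageAttemptId up front is what removes the repeated O(N) scans described in the list above.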

Second iteration Changes

  • The Qualification tool dumps aggregate metrics so that QualX won't need to run the Profiling tool. This requires:
    • Using a common structure between Q/P to keep track of the mapping between SQLs and (stages/tasks)
    • Implementing the output code for the Qualification tool
  • Two new packages were created (a sketch of this split follows the list below):
    • com.nvidia.spark.rapids.tool.analysis: contains the definitions and implementation of the logic to collect information and build relations/aggregates from the eventlogs. In general, any analysis should be done after processing all the events.
    • com.nvidia.spark.rapids.tool.views: contains the wrappers that render the analyzed information. This package includes the final extraction of data and some cosmetics such as how the records are sorted for each view.
  • Created AppSQLPlanAnalyzer to break the coupling caused by some members of ApplicationInfo: sqlPlanNodeIdToStageIds, wholeStage, unsupportedSQLPlan, and allSQLMetrics. Those members are used to create the SQL-level information output currently consumed by the Estimation_model.
  • In the original code, the AppInfo was filling those members after all the events were processed.
  • If we simply moved those members to AppBase, we would face the following problems:
    • Those members would increase memory usage, but the Qualification tool does not need them to generate QualSummaryInfo. Moving them to a different class means they do not have to hang around for the entire lifetime of the Qual tool, especially when it builds the SummaryInfo data in memory.
    • In the original code there is overlap between those members and the Qual field members. If we simply moved them, we would have to change the Qual tool implementation, which is a large and risky change.
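
To make the analysis/views split concrete, here is a minimal sketch that reuses the TaskModel/TaskModelManager sketch above; the analyzer and view classes below are illustrative assumptions, not the actual contents of the new packages:

    // Hypothetical analyzer in the spirit of com.nvidia.spark.rapids.tool.analysis:
    // aggregates are built only after all events have been processed.
    case class StageAggRow(stageId: Int, numTasks: Int, totalDuration: Long)

    class StageAggAnalyzer(tasks: TaskModelManager, stageIds: Seq[Int]) {
      def aggregate(): Seq[StageAggRow] =
        stageIds.map { stageId =>
          val stageTasks = tasks.getAllTasksForStage(stageId).toSeq
          StageAggRow(stageId, stageTasks.size, stageTasks.map(_.duration).sum)
        }
    }

    // Hypothetical view in the spirit of com.nvidia.spark.rapids.tool.views:
    // only presentation concerns, such as sort order and CSV rendering, live here.
    object StageAggView {
      def toCsvLines(rows: Seq[StageAggRow]): Seq[String] =
        "stageId,numTasks,totalDuration" +:
          rows.sortBy(r => -r.totalDuration)
            .map(r => s"${r.stageId},${r.numTasks},${r.totalDuration}")
    }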

Performance impact

  • Profiling:
    • The Profiling tool's performance improves because the HashMap pulls tasks in O(1) instead of O(N); we used to scan taskEnd by stageID everywhere. Initial profiling shows a ~10% improvement in CPU time.
  • Qualification:
    • Memory consumption will increase because the tool now stores tasks and their metrics. This can be a significant overhead for eventlogs with a large number of tasks.
    • Performance may degrade because GC is triggered more frequently; again, this depends on the number of tasks in the eventlog.
  • QualX:
    • The QualX CLI is expected to take longer to run since the Qualification tool will be doing more work.

ToDos

  • There is a hack in AppSQLPlanAnalyzer to get it to work for both the Qual and Prof tools. The code needs to be cleaned up so it can be used cleanly by both. I got around duplicating PotentialProblem by preventing the AppSQLPlanAnalyzer from writing to the
  • For Qualification, sql_duration_and_executor_cpu_time_percent will have null values for the Potential Problems column. This is because the AppSQLPlanAnalyzer does not perform that update when the Qualification tool is running (this is the hack mentioned above).
  • Start using TaskModel in the Qualification code instead of legacy data structures such as QualificationAppInfo.stageIdToTaskEndSum and QualificationAppInfo.sqlIDToTaskEndSum.
  • Move SQLIDtoFailures into AppBase.

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

@amahussein changed the title from "[WIP] Refactor TaskEnd to be accessible by Q/P tools" to "Refactor TaskEnd to be accessible by Q/P tools" on May 13, 2024
@amahussein marked this pull request as ready for review on May 13, 2024
@nartal1 (Collaborator) left a comment:


Thanks @amahussein for refactoring the code. It's a significant change. Started on the review but still need to go over a lot of it.

@amahussein requested a review from nartal1 on May 15, 2024
@nartal1 (Collaborator) left a comment:


Thanks @amahussein ! I tested this PR and it looks good to me.

Labels: core_tools (Scope the core module (scala))