Refactor TaskEnd to be accessible by Q/P tools #1000

Merged · 8 commits · May 16, 2024

Conversation

@amahussein (Collaborator) commented on May 8, 2024

Contributes to #980

This PR is another step toward closing the gap between the Q/P tools.

With the changes in this PR, the Qualification tool will create a new subdirectory raw_metrics that contains a folder for each application.

rapids_4_spark_qualification_output
    |
    |->raw_metrics/
          |--> app-id-1/
          |--> app-id-2/
          |--> app-id-n/
                 |
                 |-> raw_information.log
                 |-> sql_level_aggregated_task_metrics.csv
                 |-> stage_level_aggregated_task_metrics.csv
                 |-> job_level_aggregated_task_metrics.csv
                 |-> sql_to_stage_information.csv
                 |-> sql_duration_and_executor_cpu_time_percent.csv
                 |-> sql_plan_metrics_for_application.csv
                 |-> io_metrics.csv
                 |-> *ALL_OTHER_FILES*.csv
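
For illustration only, the per-application layout above could be produced with something like the sketch below; the helper name and the use of java.nio here are assumptions, not the tool's actual code:

    object RawMetricsLayout {
      import java.nio.file.{Files, Path, Paths}

      // Hypothetical helper (not the tool's actual API): creates
      // rapids_4_spark_qualification_output/raw_metrics/<app-id>/ under the given
      // output root and returns the directory where the per-app files are written.
      def rawMetricsDirFor(outputRoot: String, appId: String): Path = {
        val dir = Paths.get(outputRoot,
          "rapids_4_spark_qualification_output", "raw_metrics", appId)
        Files.createDirectories(dir)
        dir
      }
    }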

The following files are generated:

  • raw_information.log contains the text format of the tables (equivalent to profile.log):

  • “Executor Information”: executor_information
  • “Job Information”: job_information
  • “SQL to Stage Information”: sql_to_stage_information
  • “SQL Plan Metrics for Application”: sql_plan_metrics_for_application
  • “WholeStageCodeGen Mapping”: wholestagecodegen_mapping
  • “Failed Tasks”
  • “Failed Stages”
  • “Failed Jobs”
  • “Removed BlockManagers”
  • “Removed Executors”
  • “Stage level aggregated task metrics”: stage_level_aggregated_task_metrics (unlike the Profiler output, stage-level metrics are separated from job-level metrics so that they are easier to process)
  • “Job level aggregated task metrics”: job_level_aggregated_task_metrics
  • “Shuffle Skew Check”: shuffle_skew_check
  • “SQL level aggregated task metrics”: sql_level_aggregated_task_metrics
  • “IO Metrics”: io_metrics
  • “SQL Duration and Executor CPU Time Percent”: sql_duration_and_executor_cpu_time_percent

Changes

  • The TaskCase class has moved to a new Scala class, TaskModel
  • taskEnd, the ArrayBuffer holding all the TaskCase objects, has moved to a new container class, TaskModelManager
  • TaskModelManager becomes a member of AppBase
  • The storage of tasks has changed from an ArrayBuffer to a HashMap. This was done because scanning the ArrayBuffer is an O(N) operation every time the tools fetch tasks by stageID. Evaluating this optimization, I found that the Profiling tool improved by about 10% (the gain depends on how many tasks are in the eventlog). A minimal sketch of the new container follows this list.
  • The StageCompleted event handler adds the stage accumulables for both Q/P tools
  • The analysis code is simplified because the new HashMap already groups tasks by stageID and stageAttempt
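
A minimal sketch of the new task container mentioned above; TaskModel and TaskModelManager are the class names introduced by this PR, but the fields and method names shown here are simplified assumptions rather than the exact implementation:

    import scala.collection.mutable.{ArrayBuffer, HashMap}

    // Simplified stand-in for the per-task record; the real TaskModel carries many more metrics.
    case class TaskModel(stageId: Int, stageAttemptId: Int, taskId: Long, duration: Long)

    // Container replacing the flat taskEnd ArrayBuffer. Tasks are grouped by
    // stageId and stageAttemptId, so fetching the tasks of a stage is a map lookup
    // instead of an O(N) scan over every task.
    class TaskModelManager {
      private val stageAttemptToTasks =
        HashMap[Int, HashMap[Int, ArrayBuffer[TaskModel]]]()

      def addTask(task: TaskModel): Unit = {
        val attempts = stageAttemptToTasks.getOrElseUpdate(task.stageId, HashMap())
        attempts.getOrElseUpdate(task.stageAttemptId, ArrayBuffer()) += task
      }

      // All tasks of a stage across its attempts, without scanning unrelated tasks.
      def getAllTasksForStage(stageId: Int): Iterable[TaskModel] =
        stageAttemptToTasks.get(stageId).map(_.values.flatten).getOrElse(Iterable.empty)
    }

Grouping tasks by stageId and stageAttemptId up front is what removes the repeated O(N) scans described in the list above.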

Second iteration Changes

  • The Qualification tool dumps aggregate metrics so that QualX won't need to run the Profiling tool. This requires:
    • Using a common structure between Q/P to keep track of the mapping between SQLs and (stages/tasks)
    • Implementing the output code for the Qualification tool
  • Two new packages were created (a sketch of this split follows the list below):
    • com.nvidia.spark.rapids.tool.analysis: contains the definitions and implementation of the logic to collect information and build relations/aggregates from the eventlogs. In general, any analysis should be done after processing all the events.
    • com.nvidia.spark.rapids.tool.views: contains the wrappers that render the analyzed information. This package includes the final extraction of data and some cosmetics such as how the records are sorted for each view.
  • Created AppSQLPlanAnalyzer to break the coupling caused by some members of ApplicationInfo: sqlPlanNodeIdToStageIds, wholeStage, unsupportedSQLPlan, and allSQLMetrics. Those members are used to create the SQL-level information output currently consumed by the Estimation_model.
  • In the original code, the AppInfo was filling those members after all the events were processed.
  • If we simply moved those members to AppBase, we would face the following problems:
    • Those members would increase memory usage, but the Qualification tool does not need them to generate QualSummaryInfo. Moving them to a different class means they do not have to hang around for the entire lifetime of the Qual tool, especially when it builds the SummaryInfo data in memory.
    • In the original code there is overlap between those members and the Qual field members. If we simply moved them, we would have to change the Qual tool implementation, which is a large and risky change.
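
To make the analysis/views split concrete, here is a minimal sketch that reuses the TaskModel/TaskModelManager sketch above; the analyzer and view classes below are illustrative assumptions, not the actual contents of the new packages:

    // Hypothetical analyzer in the spirit of com.nvidia.spark.rapids.tool.analysis:
    // aggregates are built only after all events have been processed.
    case class StageAggRow(stageId: Int, numTasks: Int, totalDuration: Long)

    class StageAggAnalyzer(tasks: TaskModelManager, stageIds: Seq[Int]) {
      def aggregate(): Seq[StageAggRow] =
        stageIds.map { stageId =>
          val stageTasks = tasks.getAllTasksForStage(stageId).toSeq
          StageAggRow(stageId, stageTasks.size, stageTasks.map(_.duration).sum)
        }
    }

    // Hypothetical view in the spirit of com.nvidia.spark.rapids.tool.views:
    // only presentation concerns, such as sort order and CSV rendering, live here.
    object StageAggView {
      def toCsvLines(rows: Seq[StageAggRow]): Seq[String] =
        "stageId,numTasks,totalDuration" +:
          rows.sortBy(r => -r.totalDuration)
            .map(r => s"${r.stageId},${r.numTasks},${r.totalDuration}")
    }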

Performance impact

  • Profiling:
    • The Profiling tool's performance improves because the HashMap pulls tasks in O(1) instead of O(N); we used to scan taskEnd by stageID everywhere. Initial profiling shows a ~10% improvement in CPU time.
  • Qualification:
    • Memory consumption will increase because the tool now stores tasks and their metrics. This can be a significant overhead for eventlogs with a large number of tasks.
    • Performance may degrade because GC is triggered more frequently; again, this depends on the number of tasks in the eventlog.
  • QualX:
    • The QualX CLI is expected to take longer to run since the Qualification tool will be doing more work.

ToDos

  • There is a hack in AppSQLPlanAnalyzer to get it to work for both the Qual and Prof tools. The code needs to be cleaned up so it can be used cleanly by both. I got around duplicating PotentialProblem by preventing the AppSQLPlanAnalyzer from writing to the
  • For Qualification, sql_duration_and_executor_cpu_time_percent will have null values for the Potential Problems column. This is because the AppSQLPlanAnalyzer does not perform that update when the Qualification tool is running (this is the hack mentioned above).
  • Start using TaskModel in the Qualification code instead of legacy data structures such as QualificationAppInfo.stageIdToTaskEndSum and QualificationAppInfo.sqlIDToTaskEndSum.
  • Move SQLIDtoFailures into AppBase.

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

@amahussein changed the title from "[WIP] Refactor TaskEnd to be accessible by Q/P tools" to "Refactor TaskEnd to be accessible by Q/P tools" on May 13, 2024
@amahussein marked this pull request as ready for review on May 13, 2024
@nartal1 (Collaborator) left a comment:


Thanks @amahussein for refactoring the code. It's a significant change. Started on the review but still need to go over a lot of it.

@amahussein requested a review from nartal1 on May 15, 2024
@nartal1 (Collaborator) left a comment:


Thanks @amahussein ! I tested this PR and it looks good to me.

Labels: core_tools (Scope the core module (scala))