Add all stage metrics to tools output #1151

nartal1 · 2024-06-28T01:00:00Z

This fixes #928 and fixes #1081

This PR adds a new output file(stage_level_all_metrics.csv) which has the information of all accumulables aggregated at stage level including Gpu accumulables(if any) in the eventlog.

In some cases, there is just one entry in the eventlog for a given accumulator. In such cases, the min,median and max values are assigned default value of 0 and the actual value is in total column. This is to similar where we assign default values for sql_metrics_for_application.csv.

We drop those accumulators from output file if no metrics are generated.

The output format is as below:

appIndex,stageId,accumulatorId,name,min,median,max,total
1,"0",46,"duration",2694,4051,4051,6745
1,"0",48,"number of input batches",0,0,0,6
1,"10",1010,"gpuSemaphoreWait",0,1196,68449,11682104
1,"10",1013,"gpuSpillToDiskTime",229,2252,17319,1682175
1,"10",1015,"gpuReadSpillFromDiskTime",121,1790,9318,2034767
1,"10",1016,"gpuSplitAndRetryCount",0,0,0,1
1,"10",1018,"gpuSpillToHostTime",11,2087,15533,1783900
1,"10",1039,"output rows",177,476370,1376040,1015737409
1,"10",1040,"output columnar batches",177,61421,143354,118334405

Added unit test to simulate eventlog containing GPU metrics and capture them.

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

…-issue-928

nartal1 · 2024-06-28T01:01:40Z

@amahussein @tgravescs - Would be good to know if the information in the output file is okay OR any columns need to be added/removed. Thanks!

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

amahussein

This PR adds a new output file(gpu_metrics_for_application.csv) which has the information of Gpu accumulables(if any) in the eventlog.

Do we want to have a separate CSV file for GPU metrics. This is a stageLevel accumulator view. If we name it to something more generic, then it can be used to dump all other accumulators.
CC: @tgravescs Do you have any preferences?

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

amahussein · 2024-06-28T15:59:12Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+    val groupedByAccumulatorId = gpuAccums.groupBy(_.accumulatorId)
+    groupedByAccumulatorId.flatMap { case (accumulatorId, accums) =>
+      // Extract and sort the update values, defaulting to 0 if not present
+      val sortedUpdates = accums.flatMap(_.update).toSeq.sorted


Why do we need to sort here? If we sort the final result in AppGpuMetricsViewTrait, then we are double sorting?

This is to sort the list to get the min, median and max for a given accumulator. The other sort is for the View.

I see.
It looks like a code efficiency problem. Picking min, max, med, and then sum seems to be done in several iterations than necessary.
Anyway, this is not part of this PR.

FYI @bilalbari

core/src/main/scala/com/nvidia/spark/rapids/tool/views/AppGpuMetricsView.scala

core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationInfoSuite.scala

amahussein · 2024-06-28T19:02:43Z

This PR adds a new output file(gpu_metrics_for_application.csv) which has the information of Gpu accumulables(if any) in the eventlog.

Do we want to have a separate CSV file for GPU metrics. This is a stageLevel accumulator view. If we name it to something more generic, then it can be used to dump all other accumulators. CC: @tgravescs Do you have any preferences?

I discussed with Tom offline.
Let's remove the filter and make the file generic for all the Accumulators CPU/GPU.

One main fix to do in this PR is to generate the view by the Qualification tool as well.
this PR will also close issue 1081.

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

amahussein

The PR's description needs to be updated

For convenience, I am attaching sample output. stage_level_aggregated_task_metrics.csv

I can see that we have internal accumulators.
I believe that internal.metrics.* are basically duplicate of the stage_level_aggregated_task_metrics.csv

A couple of questions:

Do we want to list internal.metrics.* in this file? A concern is redundancy and that having same information in multiple files might be confusing.
Should we drop "zero" rows to reduce the size of the output (rows that have 0-value in all the columns)?

CC: @tgravescs

amahussein

I triggered qualification tool. The new file does not get generated under raw_metrics subdirectory

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

nartal1 · 2024-07-02T00:23:44Z

I triggered qualification tool. The new file does not get generated under raw_metrics subdirectory

Thanks @amahussein ! I forgot that we have another path to generate the output files in raw_metrics. Updated the code to generate the csv file under raw_metrics directory. PTAL.

Should we drop "zero" rows to reduce the size of the output (rows that have 0-value in all the columns)?

I have updated the code to remove those rows where metrics are not updated at all i.e value of all columns is zero. Will revert it if we want those rows as well.

Do we want to list internal.metrics.* in this file? A concern is redundancy and that having same information in multiple files might be confusing.

I was not certain if we wanted to remove it since the issue was to dump all the metrics to a output file. @tgravescs - Could you please let us know if we need to keep internal.* ones ? Also, is it okay to remove rows from the output file if all the values are "zeros"?

amahussein · 2024-07-02T15:39:22Z

Thanks @nartal1
I discussed offline with Tom.

For this PR, we can dump everything without filtering any rows. This means:

the rows with all zero-values should be dumped out as well.
the internal metrics should show up too.

Later, I can investigate with Bilal the performance impact of re-aggregating the internal Spark metrics as part of figuring out low-hanging fruits for #367

amahussein · 2024-07-02T16:14:43Z

Also, let's change the name from stage_level_all_metrics_for_application.csv to stage_level_all_metrics.csv

amahussein · 2024-07-02T19:42:07Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+    val groupedByAccumulatorId = gpuAccums.groupBy(_.accumulatorId)
+    groupedByAccumulatorId.flatMap { case (accumulatorId, accums) =>
+      // Extract and sort the update values, defaulting to 0 if not present
+      val sortedUpdates = accums.flatMap(_.update).toSeq.sorted


I see.
It looks like a code efficiency problem. Picking min, max, med, and then sum seems to be done in several iterations than necessary.
Anyway, this is not part of this PR.

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationSummaryInfo.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

amahussein · 2024-07-02T19:49:17Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+   * @return a sequence of AccumProfileResults
+   */
+
+  def generateStageLevelAccums(): Seq[AccumProfileResults] = {


nit: this is actually a stageLevel thing but it is defined in the AppSQLPlanAnalyzer which does not look like a good fit.
I am fine to keep it here for now.

core/src/main/scala/org/apache/spark/sql/rapids/tool/ToolUtils.scala

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

amahussein

Thanks @nartal1
LGTME

amahussein · 2024-07-03T15:33:23Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

+    val groupedByAccumulatorId = gpuAccums.groupBy(_.accumulatorId)
+    groupedByAccumulatorId.flatMap { case (accumulatorId, accums) =>
+      // Extract and sort the update values, defaulting to 0 if not present
+      val sortedUpdates = accums.flatMap(_.update).toSeq.sorted


FYI @bilalbari

nartal1 added 2 commits June 27, 2024 17:43

Add Gpu metrics to profiling tool output

7968997

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

Merge branch 'dev' of github.com:NVIDIA/spark-rapids-tools into tools…

c36411f

…-issue-928

nartal1 added feature request New feature or request core_tools Scope the core module (scala) labels Jun 28, 2024

nartal1 requested a review from amahussein June 28, 2024 01:00

nartal1 self-assigned this Jun 28, 2024

nartal1 requested a review from tgravescs June 28, 2024 01:00

add file

8d7eb6e

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

amahussein requested changes Jun 28, 2024

View reviewed changes

addressed review comments

f34b710

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

nartal1 changed the title ~~Add Gpu metrics for profiling tool output~~ Add all stage metrics to profiling tool output Jun 29, 2024

add comments

8e659ed

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

nartal1 mentioned this pull request Jul 1, 2024

[FEA] Revisit/Update eventlogs of scala unit tests #1154

Open

nartal1 requested a review from amahussein July 1, 2024 00:38

amahussein reviewed Jul 1, 2024

View reviewed changes

amahussein requested changes Jul 1, 2024

View reviewed changes

addressed review comments

9948c01

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

amahussein changed the title ~~Add all stage metrics to profiling tool output~~ Add all stage metrics to tools output Jul 2, 2024

amahussein mentioned this pull request Jul 2, 2024

[FEA] Improve performance of core module #367

Open

address review comments

73d16f0

amahussein requested changes Jul 2, 2024

View reviewed changes

addressed review comments, removed unwanted code.

cca3e75

Signed-off-by: Niranjan Artal <nartal@nvidia.com>

amahussein approved these changes Jul 3, 2024

View reviewed changes

nartal1 merged commit 91fed9d into NVIDIA:dev Jul 3, 2024
14 checks passed

amahussein added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Jul 3, 2024

nartal1 mentioned this pull request Aug 2, 2024

[BUG] Stage level metrics output result is incorrect #1250

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add all stage metrics to tools output #1151

Add all stage metrics to tools output #1151

nartal1 commented Jun 28, 2024 •

edited

Loading

nartal1 commented Jun 28, 2024

amahussein left a comment

amahussein Jun 28, 2024

nartal1 Jul 1, 2024

amahussein Jul 2, 2024

amahussein Jul 3, 2024

amahussein commented Jun 28, 2024

amahussein left a comment •

edited

Loading

amahussein left a comment

nartal1 commented Jul 2, 2024

amahussein commented Jul 2, 2024

amahussein commented Jul 2, 2024

amahussein Jul 2, 2024

amahussein Jul 2, 2024

amahussein left a comment

amahussein Jul 3, 2024

Add all stage metrics to tools output #1151

Add all stage metrics to tools output #1151

Conversation

nartal1 commented Jun 28, 2024 • edited Loading

nartal1 commented Jun 28, 2024

amahussein left a comment

Choose a reason for hiding this comment

amahussein Jun 28, 2024

Choose a reason for hiding this comment

nartal1 Jul 1, 2024

Choose a reason for hiding this comment

amahussein Jul 2, 2024

Choose a reason for hiding this comment

amahussein Jul 3, 2024

Choose a reason for hiding this comment

amahussein commented Jun 28, 2024

amahussein left a comment • edited Loading

Choose a reason for hiding this comment

amahussein left a comment

Choose a reason for hiding this comment

nartal1 commented Jul 2, 2024

amahussein commented Jul 2, 2024

amahussein commented Jul 2, 2024

amahussein Jul 2, 2024

Choose a reason for hiding this comment

amahussein Jul 2, 2024

Choose a reason for hiding this comment

amahussein left a comment

Choose a reason for hiding this comment

amahussein Jul 3, 2024

Choose a reason for hiding this comment

nartal1 commented Jun 28, 2024 •

edited

Loading

amahussein left a comment •

edited

Loading