Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add all stage metrics to tools output #1151

Merged
merged 8 commits into from
Jul 3, 2024
Merged

Conversation

nartal1
Copy link
Collaborator

@nartal1 nartal1 commented Jun 28, 2024

This fixes #928 and fixes #1081

This PR adds a new output file(stage_level_all_metrics.csv) which has the information of all accumulables aggregated at stage level including Gpu accumulables(if any) in the eventlog.

In some cases, there is just one entry in the eventlog for a given accumulator. In such cases, the min,median and max values are assigned default value of 0 and the actual value is in total column. This is to similar where we assign default values for sql_metrics_for_application.csv.

We drop those accumulators from output file if no metrics are generated.

The output format is as below:

appIndex,stageId,accumulatorId,name,min,median,max,total
1,"0",46,"duration",2694,4051,4051,6745
1,"0",48,"number of input batches",0,0,0,6
1,"10",1010,"gpuSemaphoreWait",0,1196,68449,11682104
1,"10",1013,"gpuSpillToDiskTime",229,2252,17319,1682175
1,"10",1015,"gpuReadSpillFromDiskTime",121,1790,9318,2034767
1,"10",1016,"gpuSplitAndRetryCount",0,0,0,1
1,"10",1018,"gpuSpillToHostTime",11,2087,15533,1783900
1,"10",1039,"output rows",177,476370,1376040,1015737409
1,"10",1040,"output columnar batches",177,61421,143354,118334405

Added unit test to simulate eventlog containing GPU metrics and capture them.

@nartal1 nartal1 added feature request New feature or request core_tools Scope the core module (scala) labels Jun 28, 2024
@nartal1 nartal1 requested a review from amahussein June 28, 2024 01:00
@nartal1 nartal1 self-assigned this Jun 28, 2024
@nartal1 nartal1 requested a review from tgravescs June 28, 2024 01:00
@nartal1
Copy link
Collaborator Author

nartal1 commented Jun 28, 2024

@amahussein @tgravescs - Would be good to know if the information in the output file is okay OR any columns need to be added/removed. Thanks!

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds a new output file(gpu_metrics_for_application.csv) which has the information of Gpu accumulables(if any) in the eventlog.

Do we want to have a separate CSV file for GPU metrics. This is a stageLevel accumulator view. If we name it to something more generic, then it can be used to dump all other accumulators.
CC: @tgravescs Do you have any preferences?

val groupedByAccumulatorId = gpuAccums.groupBy(_.accumulatorId)
groupedByAccumulatorId.flatMap { case (accumulatorId, accums) =>
// Extract and sort the update values, defaulting to 0 if not present
val sortedUpdates = accums.flatMap(_.update).toSeq.sorted
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to sort here? If we sort the final result in AppGpuMetricsViewTrait, then we are double sorting?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to sort the list to get the min, median and max for a given accumulator. The other sort is for the View.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
It looks like a code efficiency problem. Picking min, max, med, and then sum seems to be done in several iterations than necessary.
Anyway, this is not part of this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amahussein
Copy link
Collaborator

This PR adds a new output file(gpu_metrics_for_application.csv) which has the information of Gpu accumulables(if any) in the eventlog.

Do we want to have a separate CSV file for GPU metrics. This is a stageLevel accumulator view. If we name it to something more generic, then it can be used to dump all other accumulators. CC: @tgravescs Do you have any preferences?

This PR adds a new output file(gpu_metrics_for_application.csv) which has the information of Gpu accumulables(if any) in the eventlog.

Do we want to have a separate CSV file for GPU metrics. This is a stageLevel accumulator view. If we name it to something more generic, then it can be used to dump all other accumulators. CC: @tgravescs Do you have any preferences?

I discussed with Tom offline.
Let's remove the filter and make the file generic for all the Accumulators CPU/GPU.

  • One main fix to do in this PR is to generate the view by the Qualification tool as well.
  • this PR will also close issue 1081.

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 nartal1 changed the title Add Gpu metrics for profiling tool output Add all stage metrics to profiling tool output Jun 29, 2024
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR's description needs to be updated

For convenience, I am attaching sample output. stage_level_aggregated_task_metrics.csv

I can see that we have internal accumulators.
I believe that internal.metrics.* are basically duplicate of the stage_level_aggregated_task_metrics.csv

A couple of questions:

  • Do we want to list internal.metrics.* in this file? A concern is redundancy and that having same information in multiple files might be confusing.
  • Should we drop "zero" rows to reduce the size of the output (rows that have 0-value in all the columns)?

CC: @tgravescs

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I triggered qualification tool. The new file does not get generated under raw_metrics subdirectory

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1
Copy link
Collaborator Author

nartal1 commented Jul 2, 2024

I triggered qualification tool. The new file does not get generated under raw_metrics subdirectory

Thanks @amahussein ! I forgot that we have another path to generate the output files in raw_metrics. Updated the code to generate the csv file under raw_metrics directory. PTAL.

Should we drop "zero" rows to reduce the size of the output (rows that have 0-value in all the columns)?

I have updated the code to remove those rows where metrics are not updated at all i.e value of all columns is zero. Will revert it if we want those rows as well.

Do we want to list internal.metrics.* in this file? A concern is redundancy and that having same information in multiple files might be confusing.

I was not certain if we wanted to remove it since the issue was to dump all the metrics to a output file. @tgravescs - Could you please let us know if we need to keep internal.* ones ? Also, is it okay to remove rows from the output file if all the values are "zeros"?

@amahussein amahussein changed the title Add all stage metrics to profiling tool output Add all stage metrics to tools output Jul 2, 2024
@amahussein
Copy link
Collaborator

Thanks @nartal1
I discussed offline with Tom.

For this PR, we can dump everything without filtering any rows. This means:

  • the rows with all zero-values should be dumped out as well.
  • the internal metrics should show up too.

Later, I can investigate with Bilal the performance impact of re-aggregating the internal Spark metrics as part of figuring out low-hanging fruits for #367

@amahussein
Copy link
Collaborator

Also, let's change the name from stage_level_all_metrics_for_application.csv to stage_level_all_metrics.csv

val groupedByAccumulatorId = gpuAccums.groupBy(_.accumulatorId)
groupedByAccumulatorId.flatMap { case (accumulatorId, accums) =>
// Extract and sort the update values, defaulting to 0 if not present
val sortedUpdates = accums.flatMap(_.update).toSeq.sorted
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
It looks like a code efficiency problem. Picking min, max, med, and then sum seems to be done in several iterations than necessary.
Anyway, this is not part of this PR.

* @return a sequence of AccumProfileResults
*/

def generateStageLevelAccums(): Seq[AccumProfileResults] = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is actually a stageLevel thing but it is defined in the AppSQLPlanAnalyzer which does not look like a good fit.
I am fine to keep it here for now.

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nartal1
LGTME

val groupedByAccumulatorId = gpuAccums.groupBy(_.accumulatorId)
groupedByAccumulatorId.flatMap { case (accumulatorId, accums) =>
// Extract and sort the update values, defaulting to 0 if not present
val sortedUpdates = accums.flatMap(_.update).toSeq.sorted
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nartal1 nartal1 merged commit 91fed9d into NVIDIA:dev Jul 3, 2024
14 checks passed
@amahussein amahussein added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) core_tools Scope the core module (scala) feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Profiler should dump all SQL metrics [FEA] Add Gpu metrics in Profiling tool
2 participants