
Qualification tool: Add output stats file for Execs(operators) #1225

Merged
merged 16 commits
Aug 12, 2024

Conversation


@nartal1 nartal1 commented Jul 24, 2024

This PR contributes to #1157 .

This PR has the following changes:

  1. It creates a new CSV file (qualification_statistics.csv) with the qualification statistics at the per-SQL level for all Execs present in rapids_4_spark_qualification_output_execs.csv.

Changes made:

  1. Changes made in python user-tools to read qualification output files.
  2. Made changes in qualification.py to create a SparkQualificationStats object, passing in the context. The logic for creating the statistics and generating a CSV output file lives in SparkQualificationStats.
  3. The merge_dataframes function in SparkQualificationStats contains the code that generates the stats. After reading the CSV files (stages, execs and unsupported operators) into dataframes, we preprocess the dataframes by filtering out WholeStageCodegen and exploding the Exec Stage column.
  4. The report is generated per App per SQL ID. Example of calculating the stats:
    For a given Exec (operator), determine all the stages in that SQL where the Exec is present. Sum the StageTaskDurations of
    those stages to get the StageTaskExecDuration. We can also determine the total StageTaskDuration corresponding to that SQL ID from execs.csv. Note that these stats rely on the stage-to-SQL mapping derived from the execs.csv file. Some Execs cannot be mapped to stages, and those are omitted from the calculation/output. This is consistent with the current CSV output files.
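The preprocessing described in steps 3 and 4 can be sketched in pandas roughly as follows (a toy illustration only: the column names, the colon-separated stage list, and the file schemas are assumptions based on the sample output, not the PR's actual code):

```python
import pandas as pd

# Toy stand-ins for two of the qualification CSVs (schemas assumed for
# illustration; the real tool reads stages, execs and unsupported-operators files).
execs = pd.DataFrame({
    'App ID': ['app-1', 'app-1', 'app-1'],
    'SQL ID': [1, 1, 1],
    'Exec Name': ['WholeStageCodegen (1)', 'Scan parquet', 'Project'],
    'Exec Stages': ['5', '5:6', '6'],   # assumed colon-separated stage list
})
stages = pd.DataFrame({
    'App ID': ['app-1', 'app-1'],
    'Stage ID': [5, 6],
    'StageTaskDuration': [500.0, 300.0],
})

# 1) Filter out WholeStageCodegen wrapper nodes.
execs = execs[~execs['Exec Name'].str.startswith('WholeStageCodegen')]

# 2) Explode the Exec Stages column so each (exec, stage) pair is one row.
execs = execs.assign(**{'Stage ID': execs['Exec Stages'].str.split(':')})
execs = execs.explode('Stage ID')
execs['Stage ID'] = execs['Stage ID'].astype(int)

# 3) Join with per-stage task durations and aggregate per App/SQL/operator.
merged = execs.merge(stages, on=['App ID', 'Stage ID'], how='left')
stats = merged.groupby(['App ID', 'SQL ID', 'Exec Name'], as_index=False).agg(
    Count=('Stage ID', 'size'),
    StageTaskExecDuration=('StageTaskDuration', 'sum'),
)
print(stats)
```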

Output:

  1. Generate output file from user-tools cmd.
spark_rapids qualification --platform=onprem  --eventlogs=<EVENTLOGPATH>

Sample output:

App ID,SQL ID,Operator,Count,Stage Task Exec Duration(s),Total SQL Task Duration(s),% of Total SQL Task Duration,Supported
app-20240220043231-0000,1,BroadcastHashJoin,1,577418.50,5278814.00,10.94,True
app-20240220043231-0000,1,ColumnarToRow,5,1334460.475,1200.855,23.665,False
app-20240220043231-0000,5,Scan ExistingRDD,4,35.14,35.296,0.696,False
app-20240220043231-0000,1,Scan parquet,5,1334460.48,5278814.00,25.28,True
app-20240220043231-0000,15,ObjectHashAggregate,5,15.236,4.502,0.089,False
app-20240220043231-0000,15,Scan ExistingRDD,4,14.852,4.072,0.08,False
app-20240220043231-0000,15,filesizehistogramagg,5,15.236,4.502,0.089,False
app-20240220043231-0000,20,Project,2,856774.69,865275.43,99.02,True

### Output Columns description:

Operator - Name of the Exec.
Count - Number of times the operator is executed in the SQL. It also depends on whether the Exec is supported or not supported in any of the stages in that SQL.
Stage Task Exec Duration - task duration, in seconds, of the stages in which the operator appears.
Total SQL Task Duration - total of all task durations for all stages in that SQL ID.
% of Total SQL Task Duration - (Stage Task Exec Duration / Total SQL Task Duration) * 100.
Supported - Whether the operator is supported or not.
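As a sanity check of the percentage formula, plugging in the first sample row above (BroadcastHashJoin, SQL ID 1):

```python
# Values taken from the first sample output row above.
stage_task_exec_duration = 577418.50
total_sql_task_duration = 5278814.00

pct = round(stage_task_exec_duration / total_sql_task_duration * 100, 2)
print(pct)  # 10.94, matching the "% of Total SQL Task Duration" column
```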
  2. Generate the output file using the python cmd (assumes a qual run was completed earlier).
     1. With the `qual_output` argument:
        spark_rapids_dev stats --qual_output=/home/test/<QUAL_OUTPUT_DIRECTORY>/rapids_4_spark_qualification_output
     2. With the `output_folder` argument specifying where the results are stored:
        spark_rapids stats --qual_output=/home/test/<QUAL_OUTPUT_DIRECTORY>/rapids_4_spark_qualification_output --output_folder=/home/test/output_directory
  

Follow on

  1. Discuss how to report the operators that do not map to any stages. In unsupported_operators.csv, we mark those stages as -1. One approach is to display the totalStageTaskDuration of the SQL ID the operator maps to, and keep the StageTaskExecDuration and % of Total SQL Duration columns as zero.
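A minimal sketch of the approach floated above, assuming Stage ID -1 marks an unmapped operator (column names and values are illustrative, not the tool's actual schema):

```python
import pandas as pd

# One mapped and one unmapped (Stage ID == -1) operator for SQL ID 1.
rows = pd.DataFrame({
    'SQL ID': [1, 1],
    'Operator': ['Project', 'Filter'],
    'Stage ID': [5, -1],
    'StageTaskExecDuration': [50.0, 30.0],
})
total_per_sql = {1: 230.0}   # totalStageTaskDuration per SQL ID

# Zero out the duration for unmapped operators, but still report the
# SQL-level total so the row carries some context.
rows.loc[rows['Stage ID'] == -1, 'StageTaskExecDuration'] = 0.0
rows['TotalSQLTaskDuration'] = rows['SQL ID'].map(total_per_sql)
rows['% of Total SQL Duration'] = (
    rows['StageTaskExecDuration'] / rows['TotalSQLTaskDuration'] * 100
)
print(rows)
```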

@nartal1 nartal1 added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Jul 24, 2024
@nartal1 nartal1 self-assigned this Jul 24, 2024
@parthosa parthosa left a comment

Thanks @nartal1 for this feature.

@amahussein amahussein left a comment

Thanks @nartal1 !
Do we create this as a separate tool, similar to what we did for "Predict"?
If we treat it as an independent tool, it can have its own yaml file, and it will be easier to configure and extend without affecting the qualification cmd.

We can also allow this post-phase processing to be enabled/disabled.

It will be nice if we can get an initial quick feedback from @viadea to incorporate his input in the tuning of this PR.


nartal1 commented Jul 25, 2024

> It will be nice if we can get an initial quick feedback from @viadea to incorporate his input in the tuning of this PR.

Discussed with @viadea offline. His feedback is that having it as part of the Qualification tool should be sufficient, so that the stats are produced along with the qual tool results. Asking users to run a CLI just to generate the statistics file (given the qual tool was run) wouldn't be a good experience.
@amahussein - Should we keep it as part of the Qual tool run? Or would it help to create a framework to run this as a separate tool? Wanted to check if you have any preference.

@amahussein

> It will be nice if we can get an initial quick feedback from @viadea to incorporate his input in the tuning of this PR.

> Discussed with @viadea offline. His feedback is that having it as a part of Qualification tool should be sufficient so that the stats are produced along with the qual tool results. Asking users to run CLI just for generating the statistics file (given the qual tool is run) wouldn't be a good experience. @amahussein - Should we keep it as the part of Qual tool run? Or would it help to create a framework to run this as a separate tool? Wanted to check if you have any preference.

Yes, I meant that it runs as part of the Qualification tool (enabled by default).
But there is a command to allow running the stats report on an existing directory path. This is what we did for "prediction". The latter is part of the qualification processing, but it is also possible to run prediction on existing tools folders. This allows getting reports on older runs and iterating on the stats without having to go through the entire cycle.

One last thing: how do we keep compatibility? For example, it is possible that the stats code is scanning a folder that was generated by an older tools version; in that case, it might crash.
Any thoughts about that?


nartal1 commented Jul 29, 2024

Thanks @parthosa for the review. I have addressed all the review comments. PTAL.

@nartal1 nartal1 requested a review from parthosa July 29, 2024 23:45
@tgravescs

> Impacted Stage duration is sum of all stage durations for the SQLID.

Stage durations here seem like they would be very hard to get accurate. I think we should consider just doing task time, and potentially % of total task time per SQL.


nartal1 commented Aug 1, 2024

> Stage durations here seem like they would be very hard to get accurate. I think we should think about just doing task time and potentially % total task time per sql

Looking into this if we can get % total task time per sql

@nartal1 nartal1 changed the title Qualification tool: Add output stats file for unsupported operators Qualification tool: Add output stats file for Execs(operators) Aug 6, 2024

nartal1 commented Aug 7, 2024

> Stage durations here seem like they would be very hard to get accurate. I think we should think about just doing task time and potentially % total task time per SQL

@tgravescs - Updated the PR to report % total task time.
@parthosa, @amahussein @cindyyuanjiang - This includes all the operators where we can map stageID to Execs. Had to include execs.csv file and some changes while merging the dataframes. PTAL.

@nartal1 nartal1 requested a review from tgravescs August 7, 2024 20:25
@cindyyuanjiang cindyyuanjiang self-requested a review August 8, 2024 00:29
@tgravescs

can you update sample output in description?


nartal1 commented Aug 8, 2024

> can you update sample output in description?

Thanks @tgravescs ! I had updated the sample output in a dropdown section earlier. Now it's in the description itself. PTAL.

parthosa previously approved these changes Aug 8, 2024

@parthosa parthosa left a comment

Thanks @nartal1. The final output columns are quite meaningful.

@cindyyuanjiang cindyyuanjiang self-requested a review August 8, 2024 18:03
cindyyuanjiang previously approved these changes Aug 8, 2024

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @nartal1 for this feature!

@tgravescs

Can you please describe each of these columns:

Stage Task Exec Duration(s), Total SQL Duration(s), % of Total SQL Duration, Supported
577418.50, 5278814.00, 10.94, True

  • Stage task exec duration - is the task duration for the stage in seconds
  • Total SQL Duration - is this the total of all task durations for all stages in that SQL ID?
  • % of Total SQL Duration - I assume just the percent based on the above 2 numbers

I want to make sure the columns are clear to the user. From just reading it, I would have expected the total SQL duration to be the wall clock time, but from previous discussions that isn't what we were talking about, so I want to make sure, and then maybe come up with a name that is more obvious to the user.

@tgravescs

> It will be nice if we can get an initial quick feedback from @viadea to incorporate his input in the tuning of this PR.

> Discussed with @viadea offline. His feedback is that having it as a part of Qualification tool should be sufficient so that the stats are produced along with the qual tool results. Asking users to run CLI just for generating the statistics file (given the qual tool is run) wouldn't be a good experience. @amahussein - Should we keep it as the part of Qual tool run? Or would it help to create a framework to run this as a separate tool? Wanted to check if you have any preference.

> Yes, I meant that it runs as part of the Qualification tool (enabled by default). But there is a command to allow running the stats report on existing directory path. This is what we did for "prediction". The latter is part of the processing of the qualification, but it is also possible to run prediction on existing tools folders. This is to allow getting reports on older runs and to reiterate on the stats without having to go through the entire cycle.

> One last thing. How to keep compatibility? for example, it is possible that the stats is scanning a folder that was generated by older tools version. in that case, it might crash. Any thoughts about that?

OK, if we want to have a separate CLI for users to run after the fact, I guess I'm OK with it. It adds more things (documentation and testing mainly), and more commands are always a potential source of user confusion. Please make sure we have follow-ups to add docs/tests.

I thought this requirement was mostly from Felix and we would just get this with the normal qualification runs. Let me talk to Felix and Hao.

user_tools/src/spark_rapids_tools/cmdli/tools_cli.py
```python
merged_df = merged_df.merge(total_duration_df, on=['App ID', 'SQL ID'], how='left')

# Mark unique stage task durations
merged_df['Unique StageTaskDuration'] = ~merged_df.duplicated(
```

So for each operator we will get all stages, and then possibly separate rows if it has both supported and unsupported instances, and we just want to make sure those are unique so we aren't double counting the time, correct?

So does this mean that if you have two operators in a stage, one supported and one not supported, the time will be in there twice, correct? Which is fine, I just want to make sure I'm understanding properly.


Thanks! I updated it to handle this case. Until now, it was only taking into account the same operators within a SQL ID, and the assumption was that an operator would be either supported or unsupported within a stage. But there could be a scenario where the same operator is both supported and unsupported within a stage (due to an underlying Expression not being supported).

Sample output: There are 4 entries of Project for SQL ID=1, Stage ID (5, 6) in execs.csv, and 3 are unsupported as below:

```
unsupported_operators_df
    App ID  SQL ID  Stage ID       Operator
0       1       3         7  HashAggregate
1       1       1         6        Project
2       2       1        -1        Project
3       1       1        -1         Filter
4       1       1         5        Project
5       1       1         5        Project
```

stages_df:

```
   App ID  Stage ID  StageTaskDuration
0       1         5                 50
2       1         4                 20
3       1         3                100
4       1         6                 60
5       1         7                100
```

final dataframe output (before renaming columns):

```
   App ID  SQL ID       Operator  Count  StageTaskDuration  TotalSQLDuration  % of Total SQL Duration  Supported
0       1       1         Filter      2                 70               230                30.434783       True
1       1       1        Project      3                110               230                47.826087      False
2       1       1        Project      1                 50               230                21.739130       True
3       1       1           Sort      3                170               230                73.913043       True
4       1       3  HashAggregate      1                100               100               100.000000      False
```

So I assume the SQL ID 1 example above has some execs, likely the Projects, that are unsupported in the same stages as another exec that is supported, correct? Because if you add up the 4 of those (70 + 110 + 50 + 170) it's more than the 230 total.


Yes, that's correct. In the above example, for SQL ID=1 there are 4 Projects in total:
Stage ID=5: Unsupported=2, Supported=1, StageTaskDuration=50
Stage ID=6: Unsupported=1, StageTaskDuration=60

So we have 2 rows in the output for Project:
StageTaskDuration=110 (50 + 60) for Unsupported
StageTaskDuration=50 for Supported
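The accounting described above can be reproduced with a small pandas sketch (a toy reconstruction using the thread's Project numbers; the dedup keys and column names are assumptions, not the PR's actual code):

```python
import pandas as pd

# SQL ID 1: 'Project' appears in stage 5 (2 unsupported + 1 supported)
# and in stage 6 (1 unsupported), as in the example above.
execs = pd.DataFrame({
    'App ID':    [1, 1, 1, 1],
    'SQL ID':    [1, 1, 1, 1],
    'Stage ID':  [5, 5, 5, 6],
    'Operator':  ['Project'] * 4,
    'Supported': [False, False, True, False],
})
stages = pd.DataFrame({
    'App ID': [1, 1],
    'Stage ID': [5, 6],
    'StageTaskDuration': [50, 60],
})

merged = execs.merge(stages, on=['App ID', 'Stage ID'], how='left')

# Count each stage's duration once per (operator, supported-flag) group, so
# the two unsupported Projects in stage 5 do not double count the 50.
unique = merged.drop_duplicates(
    subset=['App ID', 'SQL ID', 'Operator', 'Supported', 'Stage ID'])
stats = unique.groupby(['App ID', 'SQL ID', 'Operator', 'Supported'])[
    'StageTaskDuration'].sum()
print(stats)  # Unsupported: 50 + 60 = 110; Supported: 50
```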

@amahussein

> ok if we want to have separate cli for users to run after the fact I guess I'm ok with it. it adds more things - documentation and testing mainly, but more commands always potential for user confusion. please make sure we have followups to add docs/tests.

> I thought this requirement was mostly from Felix and would just get this with the normal qualification runs. Let me talk to Felix and Hao

That's a good point @tgravescs.
In fact, I agree that this would be better as an internal CLI, rapids_tools_dev (it should be added to dev_cli.py). The purpose of the new CLI is for the internal team to be able to run the report on existing output folders.
It is unlikely that users will need to run that CLI.

@tgravescs

Yeah, I'm good with a dev command for it. Talked to Hao and Felix, and they agreed the public/external command is not needed.

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 nartal1 merged commit 32239bf into NVIDIA:dev Aug 12, 2024
14 checks passed