
Add cluster details in qualification summary output #921

Merged 7 commits into NVIDIA:dev on Apr 11, 2024

Conversation

@parthosa (Collaborator) commented Apr 10, 2024

Fixes #918. This PR adds the columns Vendor, Driver Host, Cluster ID, and Cluster Name to the user tools qualification summary. This will help customers distinguish between jobs, especially on Databricks, where all applications have the same name.

User Tools Output:

File: qualification_summary.csv

,Vendor,Driver Host,Cluster ID,Cluster Name,App Name,App ID,Speedup Based Recommendation,Estimated GPU Speedup,Estimated GPU Duration,App Duration,Unsupported Operators Stage Duration,Unsupported Operators Stage Duration Percent,Speed Up Estimation Model
0,databricks-azure,123.11.11.11,0220-012345-aaabbcc,job-name-test1,Databricks Shell,app-20240220075000-0000,Recommended,1.89,1206911.55,2282520.00,17298.00,0.76,SPEEDUPS
1,databricks-azure,123.11.11.12,0220-012345-xxxyyzz,job-name-test2,Databricks Shell,app-20240220074214-0000,Recommended,1.62,418620.75,676256.00,2576.00,0.38,SPEEDUPS
2,databricks-azure,123.11.11.13,0220-012345-pppqqrr,job-name-test3,Databricks Shell,app-20240220075434-0000,Recommended,1.56,1122375.53,1750824.00,11822.00,0.68,SPEEDUPS
3,databricks-azure,123.11.11.14,0220-012345-dddeeff,job-name-test4,Databricks Shell,app-20240220083138-0000,Recommended,1.44,209442.94,300577.00,455.00,0.15,SPEEDUPS
4,databricks-azure,123.11.11.15,0220-012345-ggghhii,job-name-test5,Databricks Shell,app-20240220065555-0000,Not Applicable,1.33,1718690.03,2282016.00,17709.00,0.78,SPEEDUPS

Note:

  1. For TCO, we will still group apps, but instead of grouping by App Name alone, we will group by [Vendor, Driver Host, Cluster ID, Cluster Name, App Name].
  2. This change also ensures the Top Candidates view shows meaningful results:
+----+-------------------------+------------------+-------------------------+
|    | App ID                  | App Name         | Estimated GPU Speedup   |
|----+-------------------------+------------------+-------------------------|
|  0 | app-20240220075000-0000 | Databricks Shell | Small                   |
|  1 | app-20240220074214-0000 | Databricks Shell | Small                   |
|  2 | app-20240220075434-0000 | Databricks Shell | Small                   |
|  3 | app-20240220083138-0000 | Databricks Shell | Small                   |
|  4 | app-20240220065555-0000 | Databricks Shell | Small                   |
+----+-------------------------+------------------+-------------------------+

Report Summary:
------------------  -
Total applications  5
Top candidates      5
------------------  -
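The grouping change described in note 1 could be sketched in pandas roughly as follows. This is a minimal illustration using the column names from the CSV sample above; the actual implementation in user tools may differ:

```python
import pandas as pd

# Hypothetical sketch: two Databricks apps share the name "Databricks Shell",
# so grouping by App Name alone would collapse them into one group.
df = pd.DataFrame({
    "Vendor": ["databricks-azure", "databricks-azure"],
    "Driver Host": ["123.11.11.11", "123.11.11.12"],
    "Cluster ID": ["0220-012345-aaabbcc", "0220-012345-xxxyyzz"],
    "Cluster Name": ["job-name-test1", "job-name-test2"],
    "App Name": ["Databricks Shell", "Databricks Shell"],
    "App Duration": [2282520.0, 676256.0],
})

# Group by the full cluster identity instead of App Name alone;
# dropna=False keeps rows whose cluster columns are empty (e.g. OnPrem).
group_keys = ["Vendor", "Driver Host", "Cluster ID", "Cluster Name", "App Name"]
grouped = df.groupby(group_keys, dropna=False)["App Duration"].sum().reset_index()

print(len(grouped))                         # 2: the jobs stay distinct
print(df.groupby(["App Name"]).ngroups)     # 1: App Name alone collapses them
```

Grouping on the composite key is what keeps the per-cluster rows distinct in the summary even when every application reports the same name.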

How To Test

  1. Use the latest dev jar while testing:
    spark_rapids qualification --platform $PLATFORM --eventlogs $EVENTLOGS --tools_jar $SPARK_RAPIDS_TOOLS_DEV_JAR
  2. Tested on Databricks, Dataproc and EMR

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa self-assigned this Apr 10, 2024
@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Apr 10, 2024
@parthosa (Collaborator, Author) commented Apr 10, 2024

Some Design Questions

  1. Should we replace the intermediate cluster_information.json with cluster_information.csv?
    • Since we now convert the JSON to a DataFrame in user tools anyway
    • This would simplify processing of this information in user tools
    • CSV would be more convenient for stakeholders to use.
  2. How should empty columns be handled?
    • For example, on OnPrem and Dataproc only the Spark driver host would be present; the remaining three columns would be empty.
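The JSON-to-CSV conversion raised in question 1, together with the empty-column handling from question 2, could be sketched as below. The payload and field names here are illustrative, not the actual cluster_information.json schema:

```python
import json
import pandas as pd

# Illustrative cluster_information.json payload; the real schema may differ.
# The Dataproc record carries only the driver host, mimicking the case where
# the remaining cluster columns would be empty.
cluster_json = json.loads("""
[
  {"vendor": "dataproc", "driverHost": "10.0.0.1"},
  {"vendor": "databricks-azure", "driverHost": "123.11.11.11",
   "clusterId": "0220-012345-aaabbcc", "clusterName": "job-name-test1"}
]
""")

# Flatten the JSON records into a DataFrame; keys missing from a record
# (clusterId/clusterName on Dataproc) become NaN.
df = pd.json_normalize(cluster_json)

# Write NaNs out as empty CSV fields so the file stays readable for stakeholders.
csv_text = df.fillna("").to_csv(index=False)
print(csv_text)
```

With this approach the empty-column question reduces to a rendering choice: the DataFrame keeps NaN internally, and the CSV simply shows blank fields for platforms that lack those attributes.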

@@ -370,12 +369,6 @@ def check_discount_percentage(discount_type: str, discount_value: int):
self.ctxt.set_ctxt('cpu_discount', cpu_discount)
self.ctxt.set_ctxt('gpu_discount', gpu_discount)

def _create_cluster_report_args(self) -> list:
@parthosa (Collaborator, Author) commented on this diff:
This is no longer required since we now generate cluster information in every case.

@parthosa parthosa added the core_tools Scope the core module (scala) label Apr 11, 2024
@amahussein (Collaborator) left a comment:

Thanks @parthosa
We discussed offline the design questions.
LGTM.

@amahussein amahussein merged commit 5bf9b69 into NVIDIA:dev Apr 11, 2024
15 checks passed
@parthosa parthosa deleted the spark-rapids-tools-918 branch April 22, 2024 16:32
Successfully merging this pull request may close these issues.

[FEA] Write cluster information to qualification summary