
Add cluster details in qualification summary output #921

Merged 7 commits into NVIDIA:dev on Apr 11, 2024

Conversation

@parthosa (Collaborator) commented Apr 10, 2024

Fixes #918. This PR adds the columns Vendor, Driver Host, Cluster ID, and Cluster Name to the user tools qualification summary. This will help customers distinguish between jobs, especially on Databricks, where all applications have the same name.

User Tools Output:

File: qualification_summary.csv

,Vendor,Driver Host,Cluster ID,Cluster Name,App Name,App ID,Speedup Based Recommendation,Estimated GPU Speedup,Estimated GPU Duration,App Duration,Unsupported Operators Stage Duration,Unsupported Operators Stage Duration Percent,Speed Up Estimation Model
0,databricks-azure,123.11.11.11,0220-012345-aaabbcc,job-name-test1,Databricks Shell,app-20240220075000-0000,Recommended,1.89,1206911.55,2282520.00,17298.00,0.76,SPEEDUPS
1,databricks-azure,123.11.11.12,0220-012345-xxxyyzz,job-name-test2,Databricks Shell,app-20240220074214-0000,Recommended,1.62,418620.75,676256.00,2576.00,0.38,SPEEDUPS
2,databricks-azure,123.11.11.13,0220-012345-pppqqrr,job-name-test3,Databricks Shell,app-20240220075434-0000,Recommended,1.56,1122375.53,1750824.00,11822.00,0.68,SPEEDUPS
3,databricks-azure,123.11.11.14,0220-012345-dddeeff,job-name-test4,Databricks Shell,app-20240220083138-0000,Recommended,1.44,209442.94,300577.00,455.00,0.15,SPEEDUPS
4,databricks-azure,123.11.11.15,0220-012345-ggghhii,job-name-test5,Databricks Shell,app-20240220065555-0000,Not Applicable,1.33,1718690.03,2282016.00,17709.00,0.78,SPEEDUPS

Note:

  1. For TCO, we will still group apps, but instead of grouping by App Name alone, we will group by [Vendor, Driver Host, Cluster ID, Cluster Name, App Name].
  2. This change also ensures the Top Candidates view shows meaningful results:
+----+-------------------------+------------------+-------------------------+
|    | App ID                  | App Name         | Estimated GPU Speedup   |
|----+-------------------------+------------------+-------------------------|
|  0 | app-20240220075000-0000 | Databricks Shell | Small                   |
|  1 | app-20240220074214-0000 | Databricks Shell | Small                   |
|  2 | app-20240220075434-0000 | Databricks Shell | Small                   |
|  3 | app-20240220083138-0000 | Databricks Shell | Small                   |
|  4 | app-20240220065555-0000 | Databricks Shell | Small                   |
+----+-------------------------+------------------+-------------------------+

Report Summary:
------------------  -
Total applications  5
Top candidates      5
------------------  -
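The grouping change described in note 1 could be sketched in pandas roughly as follows. This is a minimal illustration using the column names from the CSV sample above; the actual implementation in user tools may differ:

```python
import pandas as pd

# Hypothetical sketch: two Databricks apps share the name "Databricks Shell",
# so grouping by App Name alone would collapse them into one group.
df = pd.DataFrame({
    "Vendor": ["databricks-azure", "databricks-azure"],
    "Driver Host": ["123.11.11.11", "123.11.11.12"],
    "Cluster ID": ["0220-012345-aaabbcc", "0220-012345-xxxyyzz"],
    "Cluster Name": ["job-name-test1", "job-name-test2"],
    "App Name": ["Databricks Shell", "Databricks Shell"],
    "App Duration": [2282520.0, 676256.0],
})

# Group by the full cluster identity instead of App Name alone;
# dropna=False keeps rows whose cluster columns are empty (e.g. OnPrem).
group_keys = ["Vendor", "Driver Host", "Cluster ID", "Cluster Name", "App Name"]
grouped = df.groupby(group_keys, dropna=False)["App Duration"].sum().reset_index()

print(len(grouped))                         # 2: the jobs stay distinct
print(df.groupby(["App Name"]).ngroups)     # 1: App Name alone collapses them
```

Grouping on the composite key is what keeps the per-cluster rows distinct in the summary even when every application reports the same name.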

How To Test

  1. Use the latest dev jar while testing:
    spark_rapids qualification --platform $PLATFORM --eventlogs $EVENTLOGS --tools_jar $SPARK_RAPIDS_TOOLS_DEV_JAR
  2. Tested on Databricks, Dataproc and EMR

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa self-assigned this Apr 10, 2024
@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Apr 10, 2024
@parthosa (Collaborator, Author) commented Apr 10, 2024

Some Design Questions

  1. Should we replace the intermediate cluster_information.json with cluster_information.csv?
    • Since we now convert the JSON to a DataFrame in user tools anyway
    • This would simplify processing of this information in user tools
    • CSV would be more convenient for stakeholders to use.
  2. How should empty columns be handled?
    • For example, on OnPrem and Dataproc only the Spark driver host would be present; the remaining three columns would be empty.
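The JSON-to-CSV conversion raised in question 1, together with the empty-column handling from question 2, could be sketched as below. The payload and field names here are illustrative, not the actual cluster_information.json schema:

```python
import json
import pandas as pd

# Illustrative cluster_information.json payload; the real schema may differ.
# The Dataproc record carries only the driver host, mimicking the case where
# the remaining cluster columns would be empty.
cluster_json = json.loads("""
[
  {"vendor": "dataproc", "driverHost": "10.0.0.1"},
  {"vendor": "databricks-azure", "driverHost": "123.11.11.11",
   "clusterId": "0220-012345-aaabbcc", "clusterName": "job-name-test1"}
]
""")

# Flatten the JSON records into a DataFrame; keys missing from a record
# (clusterId/clusterName on Dataproc) become NaN.
df = pd.json_normalize(cluster_json)

# Write NaNs out as empty CSV fields so the file stays readable for stakeholders.
csv_text = df.fillna("").to_csv(index=False)
print(csv_text)
```

With this approach the empty-column question reduces to a rendering choice: the DataFrame keeps NaN internally, and the CSV simply shows blank fields for platforms that lack those attributes.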

@@ -370,12 +369,6 @@ def check_discount_percentage(discount_type: str, discount_value: int):
self.ctxt.set_ctxt('cpu_discount', cpu_discount)
self.ctxt.set_ctxt('gpu_discount', gpu_discount)

def _create_cluster_report_args(self) -> list:
@parthosa (Collaborator, Author) commented on this diff:
This is no longer required since we now generate cluster information in every case.

@parthosa parthosa added the core_tools Scope the core module (scala) label Apr 11, 2024
@amahussein (Collaborator) left a comment:

Thanks @parthosa
We discussed offline the design questions.
LGTM.

@amahussein amahussein merged commit 5bf9b69 into NVIDIA:dev Apr 11, 2024
15 checks passed
@parthosa parthosa deleted the spark-rapids-tools-918 branch April 22, 2024 16:32
Successfully merging this pull request may close these issues.

[FEA] Write cluster information to qualification summary