[FEA] Remove CLI dependency for EMR and Databricks-AWS platforms in user tool #1196

Merged
cindyyuanjiang merged 8 commits into NVIDIA:dev from the spark-rapids-tools-1191-emr branch
Jul 22, 2024

Conversation

cindyyuanjiang
Collaborator

@cindyyuanjiang cindyyuanjiang commented Jul 16, 2024

Contributes to #1191

Changes

This PR removes the AWS CLI dependency for the EMR and Databricks-AWS platforms, except when the --cluster input is a cluster ID or name.

  • Add instance_descriptions to class CMDDriverBase, which loads and stores the instance catalog JSON files for each platform.
  • Add is_props_file to class ClusterBase to indicate whether the cluster is loaded from a properties file; if it is true, there should not be any CLI calls when running the tool.
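
The two changes above can be pictured with a minimal sketch. It is not the PR's actual code: DriverSketch, ClusterSketch, run_cli_describe, describe_node, and the catalog layout are hypothetical names invented for illustration; only instance_descriptions and is_props_file come from the description.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DriverSketch:
    """Hypothetical stand-in for CMDDriverBase (not the tool's real class)."""
    catalog_path: Path
    instance_descriptions: dict = field(default_factory=dict)
    cli_calls: int = 0

    def load_instance_descriptions(self) -> None:
        # Read the platform's instance catalog once and keep it in memory.
        with open(self.catalog_path, encoding="utf-8") as fh:
            self.instance_descriptions = json.load(fh)

    def run_cli_describe(self, instance_type: str) -> dict:
        # Placeholder for the online path; the real tool would call the CLI here.
        self.cli_calls += 1
        raise RuntimeError(f"CLI lookup for {instance_type} is out of scope for this sketch")


@dataclass
class ClusterSketch:
    """Hypothetical stand-in for ClusterBase."""
    driver: DriverSketch
    is_props_file: bool = False

    def describe_node(self, instance_type: str) -> dict:
        if self.is_props_file:
            # Properties-file input: answer from the cached catalog, no CLI call.
            return self.driver.instance_descriptions[instance_type]
        # Cluster ID/name input: fall back to the CLI.
        return self.driver.run_cli_describe(instance_type)


def demo() -> tuple:
    # The catalog layout and values are invented for illustration.
    catalog = {"g4dn.4xlarge": {"vcpus": 16, "gpus": 1}}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "catalog.json"
        path.write_text(json.dumps(catalog), encoding="utf-8")
        driver = DriverSketch(catalog_path=path)
        driver.load_instance_descriptions()
    cluster = ClusterSketch(driver=driver, is_props_file=True)
    return cluster.describe_node("g4dn.4xlarge"), driver.cli_calls
```

Calling demo() returns the cached description together with a CLI call count of zero, which is the behavior the second bullet asks for when the cluster comes from a properties file.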

Testing

For each platform, we confirmed that no AWS CLI call is made and that the output is the same before and after this PR.
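
One possible way to automate the "no CLI call" check is sketched here, assuming the CLI would be launched through Python's subprocess module (an assumption about the tool, not something this PR states). The functions run_offline_step, run_online_step, and spawns_subprocess are hypothetical helpers, and the cluster ID is a placeholder.

```python
import subprocess
from unittest import mock


def run_offline_step(props: dict) -> str:
    """Hypothetical offline step: uses only the properties passed in."""
    return props["instance_type"]


def run_online_step(cluster_id: str) -> str:
    """Hypothetical online step: shells out to the AWS CLI."""
    result = subprocess.run(
        ["aws", "emr", "describe-cluster", "--cluster-id", cluster_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def spawns_subprocess(func, *args) -> bool:
    """Return True if func tries to start a child process via subprocess."""
    error = AssertionError("unexpected subprocess call")
    # Patch both entry points so no real process is ever started.
    with mock.patch("subprocess.run", side_effect=error), \
         mock.patch("subprocess.Popen", side_effect=error):
        try:
            func(*args)
        except AssertionError:
            return True
    return False
```

With these helpers, spawns_subprocess(run_offline_step, {"instance_type": "g4dn.4xlarge"}) reports False, while spawns_subprocess(run_online_step, "j-PLACEHOLDER") reports True without ever invoking the real CLI.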

Platform: EMR
spark_rapids qualification -v -p emr --eventlogs <my-event-logs> --cluster <my-cluster-props-file>

Stdout
2024-07-16 15:33:36,755 INFO rapids.tools.qualification: ******* [Process-Arguments]: Starting *******
2024-07-16 15:33:36,755 DEBUG rapids.tools.qualification: Processing Output Arguments
2024-07-16 15:33:36,755 DEBUG rapids.tools.qualification: Root directory of local storage is set as: xxxxxx/spark-rapids-tools
2024-07-16 15:33:36,755 INFO rapids.tools.qualification.ctxt: Local workdir root folder is set as xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563
2024-07-16 15:33:36,755 INFO rapids.tools.qualification.ctxt: Dependencies are generated locally in local disk as: xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563/work_dir
2024-07-16 15:33:36,755 INFO rapids.tools.qualification.ctxt: Local output folder is set as: xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563
2024-07-16 15:33:36,755 INFO rapids.tools.qualification: Qualification tool processing the arguments
2024-07-16 15:33:36,861 INFO rapids.tools.qualification: RAPIDS accelerator tools jar is downloaded to work_dir xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563/work_dir/rapids-4-spark-tools_2.12-24.06.1.jar
2024-07-16 15:33:36,861 INFO rapids.tools.qualification: Using Spark RAPIDS Accelerator Tools jar version 24.06.1
2024-07-16 15:33:36,861 INFO rapids.tools.qualification: Checking dependency Apache Spark
2024-07-16 15:33:36,862 INFO rapids.tools.qualification: Checking dependency Hadoop AWS
2024-07-16 15:33:36,862 INFO rapids.tools.qualification: Checking dependency AWS Java SDK Bundled
2024-07-16 15:33:36,894 INFO rapids.tools.qualification: Completed downloading of dependency [Hadoop AWS] => 0.032 seconds
2024-07-16 15:33:37,579 INFO rapids.tools.qualification: Completed downloading of dependency [AWS Java SDK Bundled] => 0.716 seconds
2024-07-16 15:33:40,371 INFO rapids.tools.qualification: Completed downloading of dependency [Apache Spark] => 3.509 seconds
2024-07-16 15:33:40,372 INFO rapids.tools.qualification: Dependencies are processed as: xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563/work_dir/hadoop-aws-3.3.4.jar; xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563/work_dir/aws-java-sdk-bundle-1.12.262.jar; xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563/work_dir/spark-3.5.0-bin-hadoop3/jars/*
2024-07-16 15:33:40,372 INFO rapids.tools.qualification: Total Execution Time: Downloading dependencies for local Mode => 3.511 seconds
2024-07-16 15:33:40,372 DEBUG rapids.tools.qualification: Processing Rapids plugin Arguments {}
2024-07-16 15:33:40,373 INFO rapids.tools.qualification: Loading CPU cluster properties from file xxxxxx/emr_cpu_cluster_props.json
2024-07-16 15:33:40,374 INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
2024-07-16 15:33:40,374 DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
2024-07-16 15:33:40,374 INFO rapids.tools.qualification: Generating input file for Auto-tuner
2024-07-16 15:33:40,374 DEBUG rapids.tools.qualification: Opening file xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563/work_dir/worker_info.yaml to write worker info
2024-07-16 15:33:40,376 INFO rapids.tools.qualification: Generated autotuner worker info: xxxxxx/spark-rapids-tools/qual_20240716223336_7a717563/work_dir/worker_info.yaml
2024-07-16 15:33:40,376 INFO rapids.tools.qualification: WorkerInfo successfully processed into workDir [xxxxxx/Desktop/spark-rapids-tools/qual_20240716223336_7a717563/work_dir/worker_info.yaml]
2024-07-16 15:33:40,376 INFO rapids.tools.qualification: No remote output folder specified.
2024-07-16 15:33:40,376 INFO rapids.tools.qualification: ======= [Process-Arguments]: Finished =======

Platform: Databricks-AWS
spark_rapids qualification -v -p databricks-aws --eventlogs <my-event-logs> --cluster <my-cluster-props-file>

Stdout
2024-07-16 15:40:41,173 INFO rapids.tools.qualification: ******* [Process-Arguments]: Starting *******
2024-07-16 15:40:41,173 DEBUG rapids.tools.qualification: Processing Output Arguments
2024-07-16 15:40:41,173 DEBUG rapids.tools.qualification: Root directory of local storage is set as: xxxxxx/spark-rapids-tools
2024-07-16 15:40:41,173 INFO rapids.tools.qualification.ctxt: Local workdir root folder is set as xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0
2024-07-16 15:40:41,174 INFO rapids.tools.qualification.ctxt: Dependencies are generated locally in local disk as: xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0/work_dir
2024-07-16 15:40:41,174 INFO rapids.tools.qualification.ctxt: Local output folder is set as: xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0
2024-07-16 15:40:41,174 INFO rapids.tools.qualification: Qualification tool processing the arguments
2024-07-16 15:40:41,248 INFO rapids.tools.qualification: RAPIDS accelerator tools jar is downloaded to work_dir xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0/work_dir/rapids-4-spark-tools_2.12-24.06.1.jar
2024-07-16 15:40:41,248 INFO rapids.tools.qualification: Using Spark RAPIDS Accelerator Tools jar version 24.06.1
2024-07-16 15:40:41,249 INFO rapids.tools.qualification: Checking dependency Apache Spark
2024-07-16 15:40:41,249 INFO rapids.tools.qualification: Checking dependency Hadoop AWS
2024-07-16 15:40:41,250 INFO rapids.tools.qualification: Checking dependency AWS Java SDK Bundled
2024-07-16 15:40:41,277 INFO rapids.tools.qualification: Completed downloading of dependency [Hadoop AWS] => 0.028 seconds
2024-07-16 15:40:41,957 INFO rapids.tools.qualification: Completed downloading of dependency [AWS Java SDK Bundled] => 0.707 seconds
2024-07-16 15:40:44,738 INFO rapids.tools.qualification: Completed downloading of dependency [Apache Spark] => 3.489 seconds
2024-07-16 15:40:44,739 INFO rapids.tools.qualification: Dependencies are processed as: xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0/work_dir/hadoop-aws-3.3.4.jar; xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0/work_dir/aws-java-sdk-bundle-1.12.262.jar; xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0/work_dir/spark-3.5.0-bin-hadoop3/jars/*
2024-07-16 15:40:44,740 INFO rapids.tools.qualification: Total Execution Time: Downloading dependencies for local Mode => 3.491 seconds
2024-07-16 15:40:44,740 DEBUG rapids.tools.qualification: Processing Rapids plugin Arguments {}
2024-07-16 15:40:44,740 INFO rapids.tools.qualification: Loading CPU cluster properties from file xxxxxx/db_aws_cpu_cluster_props.json
2024-07-16 15:40:44,741 WARNING rapids.tools.cluster: Cluster configuration: `executors` count 0 does not match the `num_workers` value 8. Using generated names.
2024-07-16 15:40:44,741 INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
2024-07-16 15:40:44,742 DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
2024-07-16 15:40:44,742 INFO rapids.tools.cluster: Node with g4dn.4xlarge supports GPU devices.
2024-07-16 15:40:44,742 INFO rapids.tools.cluster: Node with g4dn.4xlarge supports GPU devices.
2024-07-16 15:40:44,742 INFO rapids.tools.cluster: Node with g4dn.4xlarge supports GPU devices.
2024-07-16 15:40:44,742 INFO rapids.tools.cluster: Node with g4dn.4xlarge supports GPU devices.
2024-07-16 15:40:44,742 INFO rapids.tools.cluster: Node with g4dn.4xlarge supports GPU devices.
2024-07-16 15:40:44,742 INFO rapids.tools.cluster: Node with g4dn.4xlarge supports GPU devices.
2024-07-16 15:40:44,742 INFO rapids.tools.cluster: Node with g4dn.4xlarge supports GPU devices.
2024-07-16 15:40:44,742 INFO rapids.tools.cluster: Node with g4dn.4xlarge supports GPU devices.
2024-07-16 15:40:44,742 INFO rapids.tools.qualification: Generating input file for Auto-tuner
2024-07-16 15:40:44,742 DEBUG rapids.tools.qualification: Opening file xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0/work_dir/worker_info.yaml to write worker info
2024-07-16 15:40:44,744 INFO rapids.tools.qualification: Generated autotuner worker info: xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0/work_dir/worker_info.yaml
2024-07-16 15:40:44,744 INFO rapids.tools.qualification: WorkerInfo successfully processed into workDir [xxxxxx/spark-rapids-tools/qual_20240716224041_2a5ecdc0/work_dir/worker_info.yaml]
2024-07-16 15:40:44,745 INFO rapids.tools.qualification: No remote output folder specified.
2024-07-16 15:40:44,745 INFO rapids.tools.qualification: ======= [Process-Arguments]: Finished =======

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang cindyyuanjiang marked this pull request as ready for review July 18, 2024 23:21
@cindyyuanjiang cindyyuanjiang requested review from parthosa, amahussein, tgravescs and nartal1 and removed request for parthosa July 18, 2024 23:21
@cindyyuanjiang
Collaborator Author

I put some comments (# To deprecate) for code that I think can be deprecated after removing CLI dependency for all platforms. I am planning to remove the unused code in a separate PR after we have finished every platform (in addition to some final refactoring).

@cindyyuanjiang cindyyuanjiang added the bug, user_tools, and usability labels Jul 18, 2024
@amahussein
Collaborator

I put some comments (# To deprecate) for code that I think can be deprecated after removing CLI dependency for all platforms. I am planning to remove the unused code in a separate PR after we have finished every platform (in addition to some final refactoring).

Thanks @cindyyuanjiang
That's a great idea to mark things to be removed.
Can you please change that to a # TODO: to be deprecated description? TODO is more of a standard and can be picked up by almost all IDEs.

@amahussein
Collaborator

@nartal1 I remember you contributed to the usability tasks and you have knowledge in this area.
Can you please take a closer look at Cindy's PR and test it?

@nartal1
Collaborator

nartal1 commented Jul 19, 2024

Thanks @cindyyuanjiang! I tested your PR and it looks good. As per our offline discussion, it would be good to add a log message saying that it is using the cached instance catalog file.

Another thing: we see this log in the Initialization phase, which can be confusing:

2024-07-19 10:29:15,579 INFO rapids.tools.qualification: ******* [Initialization]: Starting *******
2024-07-19 10:29:15,602 INFO rapids.tools.qualification.ctxt: Start connecting to the platform
2024-07-19 10:29:15,602 WARNING rapids.tools.csp: Property profile is not set. Setting default value default from environment variable
2024-07-19 10:29:15,602 WARNING rapids.tools.csp: Property credentialFile is not set. Setting default value /home/nartal/.aws/credentials from environment variable
2024-07-19 10:29:15,602 WARNING rapids.tools.csp: Property cliConfigFile is not set. Setting default value /home/nartal/.aws/config from environment variable
2024-07-19 10:29:15,602 WARNING rapids.tools.csp: Property region is not set. Setting default value us-east-1 from environment variable
2024-07-19 10:29:15,602 WARNING rapids.tools.csp: Property output is not set. Setting default value json from environment variable
2024-07-19 10:29:15,603 WARNING rapids.tools.cmd_driver: Environment report: Private key file path is not set. It is required to SSH on driver node. Set RAPIDS_USER_TOOLS_KEY_PAIR_PATH

Can we remove it if a properties file is used for the cluster argument? It can be in this PR or a follow-on.
@amahussein - I wanted to know your thoughts on removing the above log messages.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@amahussein
Collaborator

Can we remove it if a properties file is used for the cluster argument? It can be in this PR or a follow-on.
@amahussein - I wanted to know your thoughts on removing the above log messages.

Thanks @nartal1 that's a good point.
We have other issues to improve logging which can be addressed then.
@cindyyuanjiang you may want to rerun an e-to-e test after upmerging. I wonder if PR #1188 would require you to handle some new blocks of code.

@cindyyuanjiang
Collaborator Author

@cindyyuanjiang you may want to rerun an e-to-e test after upmerging. I wonder if PR #1188 would require you to handle some new blocks of code.

Thanks @amahussein! I upmerged with the dev branch and did another round of e-to-e testing. The results look good to me.

Collaborator

@amahussein amahussein left a comment

Thanks @cindyyuanjiang!
I don't have specific comments on the changes.
If @nartal1 and @parthosa are okay with the changes, then we can move forward with it.

Collaborator

@nartal1 nartal1 left a comment

Thanks @cindyyuanjiang ! LGTM.

@cindyyuanjiang
Collaborator Author

Follow-up issue for logging: #1207

Collaborator

@parthosa parthosa left a comment

Thanks @cindyyuanjiang. This looks great.

@cindyyuanjiang cindyyuanjiang merged commit 56e6962 into NVIDIA:dev Jul 22, 2024
14 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-1191-emr branch July 22, 2024 18:50
4 participants