Improve cluster node initialisation for CSPs #964

parthosa · 2024-04-25T19:45:18Z

Fixes #959. Currently we initialise cluster nodes from cluster information provided in one of three ways: cluster name, local cluster info file, or cluster inference. However, the current approach has different limitations for different CSPs.

dataproc, databricks-aws, databricks-azure :

When using a local cluster info file or cluster inference, there are two ways to determine the number of executor nodes:
- Using the count property: numInstances for dataproc and num_workers for databricks
- Using the number of entries in the list: instanceNames for dataproc and executors for databricks
Currently we use the second method.
However, if a user wants to test a different cluster with a different number of executor nodes, the user has to add an entry for each of those instances.
This PR addresses this limitation by using the first method to get the number of executors and using the default template for node configurations.
If numInstances != len(instanceNames), show a warning and use numInstances as the number of executors.
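
The rule above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper name resolve_executor_count, the flat dict layout, and the fallback to the list length when the count property is missing are all assumptions made for the example.

```python
import logging

logger = logging.getLogger('rapids.tools.cluster')


def resolve_executor_count(props: dict, count_key: str, names_key: str) -> int:
    """Return the executor count, preferring the explicit count property.

    Hypothetical helper: `props` is assumed to be a flat dict that holds both
    the count property and the list of node entries.
    """
    names = props.get(names_key) or []
    # Assumption: fall back to the list length only if the count is absent.
    count = props.get(count_key, len(names))
    if count != len(names):
        logger.warning('Cluster configuration: `%s` count %d does not match the '
                       '`%s` value %d. Using generated names.',
                       names_key, len(names), count_key, count)
    return count


# dataproc-style input: no instance names listed, count set explicitly
dataproc_props = {'numInstances': 10, 'instanceNames': []}
print(resolve_executor_count(dataproc_props, 'numInstances', 'instanceNames'))
```

The same helper would cover the databricks case by passing num_workers and executors as the two keys.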

emr:

When using a local cluster info file, we run the AWS CLI command aws emr list-instances --cluster-id to fetch the instance list, which is used to calculate the number of executor nodes.
However, this requires the cluster to exist in AWS.
If a user wants to use a sample cluster info file, the tools will crash because the cluster-id does not exist.
This PR addresses this limitation by using the Cluster --> InstanceGroups --> RequestedInstanceCount property to get the number of executors and using the default template for node configurations.
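
A minimal sketch of reading that property from a local cluster info file, without calling AWS. The helper name requested_executor_count is hypothetical, and the InstanceGroupType field and sample values are assumed from the layout of aws emr describe-cluster output rather than taken from this PR.

```python
import json


def requested_executor_count(cluster_info: dict, group_type: str = 'CORE') -> int:
    """Sum RequestedInstanceCount over instance groups of the given type.

    Hypothetical helper; `InstanceGroupType` is an assumed field name based
    on the shape of `aws emr describe-cluster` output.
    """
    groups = cluster_info.get('Cluster', {}).get('InstanceGroups', [])
    return sum(group.get('RequestedInstanceCount', 0)
               for group in groups
               if group.get('InstanceGroupType') == group_type)


# Illustrative sample only; real cluster info files carry many more fields.
sample = json.loads('''
{"Cluster": {"InstanceGroups": [
  {"InstanceGroupType": "MASTER", "RequestedInstanceCount": 1},
  {"InstanceGroupType": "CORE", "RequestedInstanceCount": 5}
]}}
''')
print(requested_executor_count(sample))
```

Because the count comes straight from the file, a sample file with a non-existent cluster-id can still yield an executor count.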

Testing

Manually tested using the sample cluster files in user_tools/tests/spark_rapids_tools_ut/resources/cluster

dataproc

Removed entries from instanceNames and set numInstances to 10.

2024-04-25 10:25:09,087 WARNING rapids.tools.cluster: Cluster configuration: `instanceNames` count 0 does not match the `numInstances` value 10. Using generated names.

databricks-aws/databricks-azure

Removed entries from executors[] and set num_workers to 5.

2024-04-25 10:19:48,525 WARNING rapids.tools.cluster: Cluster configuration: `executors` count 0 does not match the `num_workers` value 5. Using generated names.

emr

Set RequestedInstanceCount to 5 for the CORE instance group.

2024-04-25 10:31:13,182 DEBUG rapids.tools.cmd: submitting system command: <aws emr list-instances --cluster-id j-383838383838383 --instance-group-id ig-e77347373737373>
2024-04-25 10:31:14,100 ERROR rapids.tools.cluster: Failed to fetch configurations for 1 instances in group ig-e77347373737373. Using generated names.
2024-04-25 10:31:14,102 DEBUG rapids.tools.cmd: submitting system command: <aws emr list-instances --cluster-id j-383838383838383 --instance-group-id ig-29292929292929>
2024-04-25 10:31:14,882 ERROR rapids.tools.cluster: Failed to fetch configurations for 5 instances in group ig-29292929292929. Using generated names.

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

cindyyuanjiang · 2024-04-25T21:00:19Z

user_tools/src/spark_rapids_pytools/cloud_api/databricks_aws.py

+        executors_cnt = len(worker_nodes_from_conf) if worker_nodes_from_conf else 0
+        if num_workers != executors_cnt:
+            self.logger.warning('Cluster configuration: `executors` count %d does not match the '
+                                '`num_workers` value %d. Using generated names.', executors_cnt,


nit: it may be a bit difficult to understand "Using generated names" without context?

parthosa

Thanks @cindyyuanjiang. This refers to using the default templates as node configurations. I will update the message.

cindyyuanjiang

Minor question, thanks @parthosa!

nartal1

Thanks @parthosa !

amahussein

Thanks @parthosa !
LGTME!

Refactor cluster initialization to use instance counts

55bffe8

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

parthosa added the bug and user_tools labels Apr 25, 2024

parthosa requested review from cindyyuanjiang, amahussein and nartal1 April 25, 2024 19:45

parthosa self-assigned this Apr 25, 2024

parthosa changed the title ~~Refactor cluster initialization to use instance counts~~ Improve cluster node initialisation for CSPs Apr 25, 2024

cindyyuanjiang reviewed Apr 25, 2024

cindyyuanjiang approved these changes Apr 25, 2024

nartal1 approved these changes Apr 26, 2024

amahussein approved these changes Apr 26, 2024

amahussein merged commit d350eb6 into NVIDIA:dev Apr 26, 2024
16 checks passed

parthosa deleted the spark-rapids-tools-959 branch April 26, 2024 15:12
