Improve cluster node initialisation for CSPs #964

Merged
merged 1 commit into from
Apr 26, 2024

@parthosa parthosa commented Apr 25, 2024

Fixes #959. Currently, we initialise cluster nodes from cluster information supplied in one of three ways: cluster name, a local cluster info file, or cluster inference. However, the current approach has different limitations for different CSPs.

dataproc, databricks-aws, databricks-azure :

  • When using a local cluster info file or cluster inference, there are two ways to determine the number of executor nodes:
    • Using the count property: numInstances for dataproc and num_workers for databricks
    • Using the number of entries in the list: instanceNames for dataproc and executors for databricks
  • Currently we use the second method.
  • However, if a user wants to test a different cluster with a different number of executor nodes, the user has to add an entry for each of those instances.
  • This PR addresses this limitation by using the first method to get the number of executors and using the default template for node configurations.
  • If numInstances != len(instanceNames), show a warning and use numInstances as the number of executors.
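
The count-resolution rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `resolve_executor_count`, the flat dict layout, the logger name, and the fallback to the list length when no count property is present are all assumptions. Only the property names (`numInstances`, `instanceNames`, `num_workers`, `executors`) come from the description.

```python
import logging

logger = logging.getLogger('cluster_sketch')  # hypothetical logger name


def resolve_executor_count(worker_conf: dict,
                           count_key: str = 'numInstances',
                           list_key: str = 'instanceNames') -> int:
    """Return the executor count, preferring the explicit count property."""
    listed = worker_conf.get(list_key) or []
    declared = worker_conf.get(count_key)
    if declared is None:
        # Assumption: with no count property, fall back to the listed entries.
        return len(listed)
    if declared != len(listed):
        # The count property wins; node configs would come from a default template.
        logger.warning('`%s` count %d does not match the `%s` value %d. '
                       'Using the default node template.',
                       list_key, len(listed), count_key, declared)
    return declared
```

For databricks, the same helper would be called with `count_key='num_workers'` and `list_key='executors'`.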

emr:

  • When using a local cluster info file, we use the AWS CLI command aws emr list-instances --cluster-id to fetch the instance list, which is used to calculate the number of executor nodes.
  • However, this requires the cluster to exist in AWS.
  • If a user wants to use a sample cluster info file, the tools will crash because the cluster-id does not exist.
  • This PR addresses this limitation by using the Cluster --> InstanceGroups --> RequestedInstanceCount property to get the number of executors and using the default template for node configurations.
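
As a rough sketch of the EMR lookup, assuming the cluster info file follows the shape of `aws emr describe-cluster` output: the helper name `requested_counts_by_group` and the sample values are hypothetical, and this is not the implementation from this PR.

```python
def requested_counts_by_group(cluster_info: dict) -> dict:
    """Map each instance group type to its total RequestedInstanceCount."""
    groups = cluster_info.get('Cluster', {}).get('InstanceGroups', [])
    counts = {}
    for group in groups:
        # InstanceGroupType is MASTER, CORE or TASK in describe-cluster output.
        group_type = group.get('InstanceGroupType', 'UNKNOWN')
        counts[group_type] = counts.get(group_type, 0) + group.get('RequestedInstanceCount', 0)
    return counts


# Hypothetical sample file content; no AWS call is needed to read the counts.
sample = {
    'Cluster': {
        'InstanceGroups': [
            {'InstanceGroupType': 'MASTER', 'RequestedInstanceCount': 1},
            {'InstanceGroupType': 'CORE', 'RequestedInstanceCount': 5},
        ]
    }
}
executor_count = requested_counts_by_group(sample).get('CORE', 0)
```

Because the count is read from the file itself, a sample cluster info file works even when the cluster-id does not exist in AWS.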

Testing

Manually tested using the sample cluster files in user_tools/tests/spark_rapids_tools_ut/resources/cluster

dataproc

  • Removed entries from instanceNames and set numInstances to 10.
2024-04-25 10:25:09,087 WARNING rapids.tools.cluster: Cluster configuration: `instanceNames` count 0 does not match the `numInstances` value 10. Using generated names.

databricks-aws/databricks-azure

  • Removed entries from executors[] and set num_workers to 5.
2024-04-25 10:19:48,525 WARNING rapids.tools.cluster: Cluster configuration: `executors` count 0 does not match the `num_workers` value 5. Using generated names.

emr

  • Set RequestedInstanceCount to 5 for instance group CORE
2024-04-25 10:31:13,182 DEBUG rapids.tools.cmd: submitting system command: <aws emr list-instances --cluster-id j-383838383838383 --instance-group-id ig-e77347373737373>
2024-04-25 10:31:14,100 ERROR rapids.tools.cluster: Failed to fetch configurations for 1 instances in group ig-e77347373737373. Using generated names.
2024-04-25 10:31:14,102 DEBUG rapids.tools.cmd: submitting system command: <aws emr list-instances --cluster-id j-383838383838383 --instance-group-id ig-29292929292929>
2024-04-25 10:31:14,882 ERROR rapids.tools.cluster: Failed to fetch configurations for 5 instances in group ig-29292929292929. Using generated names.
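
The log above suggests a fetch-then-fall-back flow: the instance list is requested per group, and when that fails, placeholder nodes are generated for the requested count. A minimal sketch of that pattern, where `node_names_for_group`, the injected `fetch` callable, and the generated name format are all assumptions rather than the tool's actual code:

```python
import logging
from typing import Callable, List

logger = logging.getLogger('cluster_sketch')  # hypothetical logger name


def node_names_for_group(group_id: str,
                         requested_count: int,
                         fetch: Callable[[str], List[str]]) -> List[str]:
    """Return instance names for a group, generating placeholders on failure."""
    try:
        names = fetch(group_id)
    except Exception:  # broad on purpose: any fetch failure triggers the fallback
        names = []
    if len(names) != requested_count:
        logger.error('Failed to fetch configurations for %d instances in group %s. '
                     'Using generated names.', requested_count, group_id)
        # Hypothetical naming scheme for the generated placeholder nodes.
        names = [f'{group_id}-node-{idx}' for idx in range(requested_count)]
    return names
```

Passing the fetcher in as a callable keeps the sketch testable without the AWS CLI.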

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa added the bug and user_tools labels Apr 25, 2024
@parthosa parthosa self-assigned this Apr 25, 2024
@parthosa parthosa changed the title from "Refactor cluster initialization to use instance counts" to "Improve cluster node initialisation for CSPs" Apr 25, 2024
executors_cnt = len(worker_nodes_from_conf) if worker_nodes_from_conf else 0
if num_workers != executors_cnt:
    self.logger.warning('Cluster configuration: `executors` count %d does not match the '
                        '`num_workers` value %d. Using generated names.', executors_cnt,

nit: it may be a bit difficult to understand "Using generated names" without context?

Thanks @cindyyuanjiang. This refers to using default templates as node configurations. I will update the message.

@cindyyuanjiang cindyyuanjiang left a comment

Minor question, thanks @parthosa!

@nartal1 nartal1 left a comment

Thanks @parthosa !

@amahussein amahussein left a comment

Thanks @parthosa !
LGTME!

@amahussein amahussein merged commit d350eb6 into NVIDIA:dev Apr 26, 2024
16 checks passed
@parthosa parthosa deleted the spark-rapids-tools-959 branch April 26, 2024 15:12
Development

Successfully merging this pull request may close these issues.

[BUG] Initialization scripts does not correct number of workers