Cache CLI calls for node instance description #952

Merged
merged 2 commits into from
Apr 19, 2024


parthosa
Collaborator

@parthosa parthosa commented Apr 18, 2024

Fixes #949.

Description:

  • Currently, the platform-specific CLI drivers run a command to fetch the instance type description for every node, even when many nodes share the same instance type.
    • For example, aws ec2 describe-instance-types --region us-west-2 --instance-types m5a.12xlarge runs 10 times if the cluster has 10 nodes of that type.
  • This is inefficient and has a significant performance impact, especially when running the user tools on large clusters.
  • For databricks-azure, even though no CLI command is used, the azure-instances-catalog.json file is read for every node.

This PR fixes this issue by caching the instance description calls in the CLI driver.

Changes

  • Introduce instance_descriptions_cache in CMDDriverBase to cache instance type descriptions.
  • The cache key is a platform-specific tuple.
    • The key is chosen based on the variables used in _build_platform_describe_node_instance().
    • The default implementation uses node.instance_type.
    • Other fields, such as region and zone, are added depending on the platform.
  • Every platform implements _exec_platform_describe_node_instance(node), which returns a single entry containing the node's instance description.
    • Previously, databricks-azure returned the entire azure-instances-catalog.json file.
    • Now it returns only the entry for the node's instance type.
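
The caching scheme described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: only CMDDriverBase, instance_descriptions_cache, and _exec_platform_describe_node_instance are names taken from the description. The Node class, _build_instance_cache_key, describe_node_instance, RegionalDriver, and the describe_calls counter are hypothetical names introduced for the example, and the "expensive" call is replaced by a stub.

```python
from typing import Any, Dict, Tuple


class Node:
    """Hypothetical stand-in for a cluster node (not the project's class)."""

    def __init__(self, instance_type: str, region: str = "", zone: str = ""):
        self.instance_type = instance_type
        self.region = region
        self.zone = zone


class CMDDriverBase:
    def __init__(self) -> None:
        # Maps a platform-specific key tuple to an instance description.
        self.instance_descriptions_cache: Dict[Tuple[str, ...], Any] = {}
        # Counter kept only to make the effect of caching visible here.
        self.describe_calls = 0

    def _build_instance_cache_key(self, node: Node) -> Tuple[str, ...]:
        # Hypothetical helper: the default key uses only the instance type.
        return (node.instance_type,)

    def _exec_platform_describe_node_instance(self, node: Node) -> Any:
        # Stub for the expensive step (a CLI call or a catalog file read).
        self.describe_calls += 1
        return {"instance_type": node.instance_type}

    def describe_node_instance(self, node: Node) -> Any:
        # Hypothetical public entry point: look up the cache first and
        # run the platform command only on a miss.
        key = self._build_instance_cache_key(node)
        if key not in self.instance_descriptions_cache:
            self.instance_descriptions_cache[key] = (
                self._exec_platform_describe_node_instance(node)
            )
        return self.instance_descriptions_cache[key]


class RegionalDriver(CMDDriverBase):
    """Hypothetical platform whose describe command also depends on region."""

    def _build_instance_cache_key(self, node: Node) -> Tuple[str, ...]:
        return (node.instance_type, node.region)


driver = CMDDriverBase()
for _ in range(10):
    driver.describe_node_instance(Node("m5a.12xlarge", region="us-west-2"))
print(driver.describe_calls)  # 1
```

Keying on a tuple rather than on the instance type alone keeps platforms whose describe command takes extra arguments (for example a region or zone) from sharing an entry across different argument values.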

Testing

emr

STDOUT Before
[time] INFO rapids.tools.qualification: Loading CPU cluster properties from file /Users/psarthi/Work/cluster/emr/emr-cpu.json
[time] DEBUG rapids.tools.cmd: submitting system command: <aws emr list-instances --cluster-id j-aaabbbbccc --instance-group-id ig-aaabbbcccc>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws emr list-instances --cluster-id j-aaabbbbccc --instance-group-id ig-xxxyyyzzz>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m5a.12xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m5zn.6xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m5zn.6xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m5zn.6xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m5zn.6xlarge>
[time] INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
[time] DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m5a.12xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types g5.4xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types g5.4xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types g5.4xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types g5.4xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types g5.4xlarge>
STDOUT After
[time] INFO rapids.tools.qualification: Loading CPU cluster properties from file /Users/psarthi/Work/cluster/emr/emr-cpu.json
[time] DEBUG rapids.tools.cmd: submitting system command: <aws emr list-instances --cluster-id j-aaabbbcccc --instance-group-id ig-aaaabbbcc>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws emr list-instances --cluster-id j-aaabbbbccc --instance-group-id ig-xxxyyyzzz>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m5a.12xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m5zn.6xlarge>
[time] INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
[time] DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types g5.4xlarge>
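
One way to check Before/After logs like the ones above for repeated CLI calls is to count identical "submitting system command" lines. The snippet below is a small sketch of that idea. The function names are hypothetical, and the regular expression assumes the log line format shown in this PR.

```python
import re
from collections import Counter
from typing import Dict

# Matches the command between angle brackets in lines such as:
# [time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 ...>
CMD_PATTERN = re.compile(r"submitting system command: <(.+)>")


def count_system_commands(log_text: str) -> Counter:
    """Count how many times each system command appears in the log."""
    return Counter(m.group(1) for m in CMD_PATTERN.finditer(log_text))


def repeated_commands(log_text: str) -> Dict[str, int]:
    """Return only the commands that were submitted more than once."""
    counts = count_system_commands(log_text)
    return {cmd: n for cmd, n in counts.items() if n > 1}
```

With caching in place, repeated_commands should report no describe-instance-types entries for a log captured from a single run.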

dataproc

STDOUT Before
[time] INFO rapids.tools.qualification: Loading CPU cluster properties from file /Users/psarthi/Work/cluster/dataproc/dataproc-cpu.json
[time] DEBUG rapids.tools.cmd: submitting system command: <gcloud compute machine-types describe n1-standard-16 --format json --zone us-central1-c>
[time] DEBUG rapids.tools.cmd: submitting system command: <gcloud compute machine-types describe n1-standard-16 --format json --zone us-central1-c>
[time] DEBUG rapids.tools.cmd: submitting system command: <gcloud compute machine-types describe n1-standard-16 --format json --zone us-central1-c>
[time] INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
[time] DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
[time] INFO rapids.tools.cluster: Node with n1-standard-16 supports GPU devices.
[time] INFO rapids.tools.cluster: Node with n1-standard-16 supports GPU devices.
[time] DEBUG rapids.tools.cmd: submitting system command: <gcloud compute machine-types describe n1-standard-16 --format json --zone us-central1-c>
STDOUT After
[time] INFO rapids.tools.qualification: Loading CPU cluster properties from file /Users/psarthi/Work/cluster/dataproc/dataproc-cpu.json
[time] DEBUG rapids.tools.cmd: submitting system command: <gcloud compute machine-types describe n1-standard-16 --format json --zone us-central1-c>
[time] INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
[time] DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
[time] INFO rapids.tools.cluster: Node with n1-standard-16 supports GPU devices.
[time] INFO rapids.tools.cluster: Node with n1-standard-16 supports GPU devices.

databricks-aws

STDOUT Before
[time] INFO rapids.tools.qualification: Loading CPU cluster properties from file /Users/psarthi/Work/cluster/databricks-aws/databricks-aws-cpu.json
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m6gd.xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m6gd.xlarge>
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m6gd.xlarge>
[time] INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
[time] DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
[time] INFO rapids.tools.cluster: Converting node m6gd.xlarge into GPU supported instance-type g5.xlarge
[time] INFO rapids.tools.cluster: Converting node m6gd.xlarge into GPU supported instance-type g5.xlarge
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types g5.xlarge>
STDOUT After
[time] INFO rapids.tools.qualification: Loading CPU cluster properties from file /Users/psarthi/Work/cluster/databricks-aws/databricks-aws-cpu.json
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types m6gd.xlarge>
[time] INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
[time] DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
[time] INFO rapids.tools.cluster: Converting node m6gd.xlarge into GPU supported instance-type g5.xlarge
[time] INFO rapids.tools.cluster: Converting node m6gd.xlarge into GPU supported instance-type g5.xlarge
[time] DEBUG rapids.tools.cmd: submitting system command: <aws ec2 describe-instance-types --region us-west-2 --instance-types g5.xlarge>

databricks-azure: (Multiple file reads)

STDOUT Before
[time] INFO rapids.tools.qualification: Loading CPU cluster properties from file /Users/psarthi/Work/cluster/databricks-azure/databricks-azure-cpu.json
[time] INFO rapids.tools.cmd_driver: The Azure instance type descriptions catalog is loaded from the cache
[time] INFO rapids.tools.cmd_driver: The Azure instance type descriptions catalog is loaded from the cache
[time] INFO rapids.tools.cmd_driver: The Azure instance type descriptions catalog is loaded from the cache
[time] INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
[time] DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
[time] INFO rapids.tools.cluster: Converting node Standard_DS12_v2 into GPU supported instance-type Standard_NC4as_T4_v3
[time] INFO rapids.tools.cluster: Converting node Standard_DS12_v2 into GPU supported instance-type Standard_NC4as_T4_v3
[time] INFO rapids.tools.cmd_driver: The Azure instance type descriptions catalog is loaded from the cache
STDOUT After
[time] INFO rapids.tools.qualification: Loading CPU cluster properties from file /Users/psarthi/Work/cluster/databricks-azure/databricks-azure-cpu.json
[time] INFO rapids.tools.cmd_driver: The Azure instance type descriptions catalog is loaded from the cache
[time] INFO rapids.tools.qualification: Creating GPU cluster by converting the CPU cluster instances to GPU supported types
[time] DEBUG rapids.tools.cmd_driver: Skip converting Master nodes
[time] INFO rapids.tools.cluster: Converting node Standard_DS12_v2 into GPU supported instance-type Standard_NC4as_T4_v3
[time] INFO rapids.tools.cluster: Converting node Standard_DS12_v2 into GPU supported instance-type Standard_NC4as_T4_v3
[time] INFO rapids.tools.cmd_driver: The Azure instance type descriptions catalog is loaded from the cache

onprem

No Impact

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa added bug Something isn't working user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Apr 18, 2024
@parthosa parthosa self-assigned this Apr 18, 2024
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Collaborator

@amahussein amahussein left a comment

I always wanted to do that change but never got the time to work on it.
Thanks @parthosa !
LGTM

@parthosa parthosa merged commit 8ce7b3e into NVIDIA:dev Apr 19, 2024
15 checks passed
@parthosa parthosa deleted the spark-rapids-tools-949 branch April 19, 2024 16:50