
User tools fallback to default zone/region #1054

Merged: 3 commits into NVIDIA:dev on May 31, 2024

Conversation

@nartal1 (Collaborator) commented May 31, 2024:

This fixes #1018.

This PR falls back to the default region/zone (where applicable) for CLI commands when the region/zone is not set by the user.
Previously, the tool would throw an error that did not reveal the exact cause. With this PR, it continues with the default region and logs a warning that the region was not set and that the default value from the environment variable is being used.

In addition, this PR updates how the remaining environment variables are set in sp_types.py; the earlier condition would miss some of them.
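The fallback behavior described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual implementation: the function name and the DEFAULT_REGIONS mapping are assumptions (the PR defines defaults per platform in JSON config, e.g. "defaultValue": "us-central1" for CLOUDSDK_DATAPROC_REGION).

```python
import logging
from typing import Optional

logger = logging.getLogger("spark_rapids.user_tools")

# Illustrative per-platform defaults; the real values live in the
# platform configuration JSON files.
DEFAULT_REGIONS = {"dataproc": "us-central1", "emr": "us-east-1"}

def resolve_region(platform: str, user_region: Optional[str]) -> str:
    """Use the user's region if set; otherwise fall back to the platform
    default and warn, instead of failing later with a cryptic SDK error."""
    if user_region:
        return user_region
    default = DEFAULT_REGIONS[platform]
    logger.warning("Region was not set; falling back to default '%s' for %s.",
                   default, platform)
    return default
```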

Tested on the following platforms:

  1. dataproc
  2. emr
  3. databricks-aws

databricks-azure already has a default defined.

Dataproc failure (before this PR):
 spark_rapids qualification --eventlogs=gs://< PATH TO EVENTLOGS>  --platform=dataproc 
RuntimeError: Error invoking CMD : 
        | ERROR: (gcloud.compute.machine-types.describe) Could not fetch resource:
        |  - Invalid value for field 'zone': 'None'. Must be a match of regex '[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?'
Dataproc completion (with this PR):
Initialization Scripts:
-----------------------
To create a GPU cluster, run the following script:
#!/bin/bash

export CLUSTER_NAME="default-cluster-name"

gcloud dataproc clusters create $CLUSTER_NAME \
    --image-version=2.1.41-debian11 \
    --region us-central1 \
    --zone us-central1-b \
    --master-machine-type n1-standard-16 \
    --num-workers 8 \
    --worker-machine-type n1-standard-64 \
    --num-worker-local-ssds 2 \
    --enable-component-gateway \
    --subnet=default \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/spark-rapids/spark-rapids.sh \
    --worker-accelerator type=nvidia-tesla-t4,count=2 \
    --properties 'spark:spark.driver.memory=50g'

Processing Completed!

EMR failure (before this PR):
spark_rapids qualification --eventlogs=s3://{PATH_TO_EVENTLOGS} --platform=emr --verbose 
File "/home/test/spark-rapids-tools/user_tools/src/spark_rapids_pytools/common/utilities.py", line 333, in exec
    raise RuntimeError(f'{cmd_err_msg}')
RuntimeError: Error invoking CMD : 
        | 
        | Could not connect to the endpoint URL: "https://elasticmapreduce.None.amazonaws.com/"
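As an aside, the odd hostname in that error comes from interpolating an unset region into the endpoint URL; a minimal Python illustration:

```python
# When the region is unset (None), str(None) is interpolated straight into
# the endpoint hostname, producing the invalid URL seen in the error above.
region = None
endpoint = f"https://elasticmapreduce.{region}.amazonaws.com/"
print(endpoint)  # https://elasticmapreduce.None.amazonaws.com/
```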
EMR completion (with this PR):
        Instance types conversions:
------------  --  ----------
m6gd.4xlarge  to  g5.4xlarge
------------  --  ----------
To support acceleration with T4 GPUs, switch the worker node instance types

Processing Completed!

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 nartal1 added the user_tools Scope the wrapper module running CSP, QualX, and reports (python) label May 31, 2024
@nartal1 nartal1 self-assigned this May 31, 2024
@amahussein amahussein added the usability track issues related to the Tools's user experience label May 31, 2024
@amahussein (Collaborator) left a comment:

Thanks @nartal1 !
A couple of minor comments.

@@ -55,15 +55,17 @@
 },
 {
     "envVariableKey": "CLOUDSDK_DATAPROC_REGION",
-    "confProperty": "region"
+    "confProperty": "region",
+    "defaultValue": "us-central1"
 },
 {
     "envVariableKey": "CLOUDSDK_COMPUTE_REGION",
     "confProperty": "region"

Do we need a default for CLOUDSDK_COMPUTE_REGION too?


Comment on lines 732 to 734
for prop_entry in properties_map_arr:
prop_entry_key = prop_entry.get('propKey')
if self.ctxt.get(prop_entry_key) is None:

Can we add a comment to explain what we are trying to do?
I know that the original code did not have much explanation, but it will be nice to add that since we are already modifying the code.

@nartal1 (Collaborator, Author) replied:

Added comment. PTAL
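The snippet under review (the loop over properties_map_arr) can be sketched in a self-contained, commented form roughly as below. This is illustrative: the real method lives in sp_types.py and reads from self.ctxt, and the helper name here is hypothetical.

```python
def apply_property_defaults(ctxt: dict, properties_map_arr: list) -> dict:
    """For each property mapping, fill in its defaultValue (when one is
    defined) only if the context does not already hold a value for that key."""
    for prop_entry in properties_map_arr:
        prop_entry_key = prop_entry.get('propKey')
        if ctxt.get(prop_entry_key) is None:
            default_value = prop_entry.get('defaultValue')
            if default_value is not None:
                ctxt[prop_entry_key] = default_value
    return ctxt
```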

@parthosa (Collaborator) left a comment:

Thanks @nartal1.

Since this updates the region in our env_vars and not the actual CLI configuration, CLI commands such as
aws emr describe-cluster --cluster-id {cluster_id} might crash because they will try to get the region from the CLI config. We might have to add the region explicitly to these CLI commands.

Repro CMD for EMR:
spark_rapids qualification --cluster my-cluster --platform emr --eventlogs <eventlog>

Review thread on user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py (outdated; resolved)
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 (Collaborator, Author) commented May 31, 2024:

> Since this updates region in our env_vars and not the actual CLI configuration, CLI cmds such as
> aws emr describe-cluster --cluster-id {cluster_id} might crash because it will try to get region from the CLI config. We might have to add the region explicitly in these CLI cmds.

Yes @parthosa, you are correct. The region becomes mandatory when --cluster is provided as an argument. We now get a clearer error telling the user to set it explicitly:

Error invoking CMD <aws emr list-clusters --query 'Clusters[?Name==`emr_perfio_on_filecache_on_us_east_1a`]'>:
        |
        | You must specify a region. You can also configure your region by running "aws configure".
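Parthosa's suggestion above (adding the region explicitly to CLI commands) could be sketched as follows; build_describe_cluster_cmd is a hypothetical helper, not code from this PR.

```python
def build_describe_cluster_cmd(cluster_id: str, region: str) -> str:
    """Build the describe-cluster command with an explicit --region flag,
    so it does not depend on 'aws configure' having been run locally."""
    return f"aws emr describe-cluster --cluster-id {cluster_id} --region {region}"

print(build_describe_cluster_cmd("j-EXAMPLE", "us-east-1"))
```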

parthosa previously approved these changes May 31, 2024

@parthosa (Collaborator) left a comment:
Thanks @nartal1. That makes sense. We cannot identify a cluster without the region.

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@cindyyuanjiang (Collaborator) left a comment:

Thanks @nartal1!

@amahussein (Collaborator) left a comment:

Thanks @nartal1 for the fix.
Great teamwork, @cindyyuanjiang and @parthosa!

@nartal1 nartal1 merged commit 270763d into NVIDIA:dev May 31, 2024
15 checks passed
Labels
usability: track issues related to the Tools' user experience
user_tools: Scope the wrapper module running CSP, QualX, and reports (python)
Development

Successfully merging this pull request may close these issues.

[FEA] Handle CLI cmds when CSP SDK is not configured
4 participants