-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve cluster node initialisation for CSPs #964
Conversation
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
executors_cnt = len(worker_nodes_from_conf) if worker_nodes_from_conf else 0 | ||
if num_workers != executors_cnt: | ||
self.logger.warning('Cluster configuration: `executors` count %d does not match the ' | ||
'`num_workers` value %d. Using generated names.', executors_cnt, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it may be a bit difficult to understand "Using generated names" without context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang. This refers to using default templates as node configurations. I will update the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MInor question, thanks @parthosa!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @parthosa !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @parthosa !
LGTME!
Fixes #959. Currently we initialise cluster nodes based on cluster information provided through three ways - cluster name, local cluster info file or cluster inference. However the current approach had different limitations for different CSPs.
dataproc, databricks-aws, databricks-azure :
numInstances
for dataproc andnum_workers
for databricksinstanceNames
for dataproc andexecutors
for databricksnumInstances != len(instanceNames)
, show a warning and usenumInstances
as number of executors.emr:
aws emr list-instances --cluster-id
to fetch the instance list which is used to calculate number of executor nodes.Cluster --> InstanceGroups --> RequestedInstanceCount
property to get the number of executors and use default template for node configurations.Testing
Manually tested with using sample cluster files in
user_tools/tests/spark_rapids_tools_ut/resources/cluster
dataproc
instanceNames
and setnumInstances
to 10.databricks-aws/databricks-azure
executors[]
and setnum_workers
to 5.emr
RequestedInstanceCount
to 5 for instance groupCORE