Include number of executors per node in cluster information #1119
Conversation
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

Force-pushed from b4a4ac0 to 3d45b42
@@ -865,9 +865,13 @@ class QualificationAppInfo(
      logWarning(s"Application $appId: Cluster with variable executor cores detected. " +
        s"Using maximum value.")
    }
    // Group by host name, find max executors per host, and number of unique hosts
    val groupedHosts = activeHosts.groupBy(identity)
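A minimal, self-contained sketch of the grouping logic in the diff above (host names are made up for illustration): count executors per host with `groupBy`, then take the maximum as execs-per-node and the number of distinct hosts as the node count.

```scala
object GroupHostsSketch {
  def main(args: Array[String]): Unit = {
    // One entry per executor, keyed by the host it runs on (hypothetical data).
    val hosts = Seq("node1", "node1", "node2", "node2", "node3")
    // Count executors per host.
    val execsPerHost = hosts.groupBy(identity).map { case (h, xs) => h -> xs.size }
    // Max executors on any single host, and the number of unique hosts.
    val numExecsPerNode = if (execsPerHost.isEmpty) 1 else execsPerHost.values.max
    val numHosts = execsPerHost.size
    println(s"execsPerNode=$numExecsPerNode hosts=$numHosts") // 2 execs/node, 3 hosts
  }
}
```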
So we don't want to use just activeHosts; we want to use all hosts to determine execs per node. That is more likely to be right because, as an application finishes with dynamic allocation on, it can release executors, so the final active count could be inaccurate.
Put the number of executor nodes back to the active hosts for now. Ideally we should do some timelining of this to find the maximum number of executors in use at any time; filed #1121 to follow up on that.
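The timelining idea suggested here (tracked in #1121) can be sketched as a sweep over executor add/remove events to find the peak number of executors alive at any one time. The event shape below is an assumption for illustration, not the tool's actual data model.

```scala
object PeakExecutorsSketch {
  // Hypothetical event: +1 when an executor is added, -1 when it is removed.
  case class ExecEvent(time: Long, delta: Int)

  def peakExecutors(events: Seq[ExecEvent]): Int =
    // Running sum of deltas in time order; the max prefix sum is the peak
    // number of concurrently-alive executors.
    events.sortBy(_.time).scanLeft(0)((alive, e) => alive + e.delta).max

  def main(args: Array[String]): Unit = {
    val events = Seq(
      ExecEvent(0, +1), ExecEvent(1, +1), // two executors start
      ExecEvent(2, -1),                   // one is released (dynamic allocation)
      ExecEvent(3, +1))                   // another starts later
    println(peakExecutors(events)) // peak of 2 alive at once
  }
}
```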
It would also be good to get an event log where dynamic allocation is disabled and the number of execs/hosts changes over time. This is relatively easy in interactive mode: run something, wait for the executors to idle-timeout, then maybe run something again, perhaps smaller.
> we don't want to just use activeHosts, we want to use all hosts to determine execs per node.

Updated the code to use all hosts.

> Put the number of executor nodes back to the active hosts for now

Reverted to the original logic.
Created a scenario for this (I think this meant dynamic allocation to be enabled):

> get an event log where dynamic allocation is disabled and the number of execs/hosts change over time
Steps:
- Created a Dataproc cluster with 4 `n1-standard-16` worker nodes.
- Started a Spark shell with `--conf spark.dynamicAllocation.enabled=true --conf spark.executor.instances=8 --conf spark.executor.cores=8`.
- Ran a large Spark application. The UI shows all 8 executors active (2 per worker node).
- Waited 30 min; 7 executors are dead.
- Ran a small Spark application. The UI shows 1 executor active.

Now, since we calculate execs from all hosts, we get the correct execs-per-node value of 2 instead of 1.
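The difference the scenario demonstrates can be sketched with made-up host data: counting over all hosts an executor ever ran on yields 2 execs per node, while counting only the executors still active at the end undercounts.

```scala
object AllVsActiveHostsSketch {
  // Max executors on any single host; host names below are hypothetical.
  def execsPerNode(hosts: Seq[String]): Int =
    if (hosts.isEmpty) 1 else hosts.groupBy(identity).values.map(_.size).max

  def main(args: Array[String]): Unit = {
    // 8 executors over the app's lifetime, 2 per worker (the large job).
    val allExecHosts = Seq("w1", "w1", "w2", "w2", "w3", "w3", "w4", "w4")
    // Only 1 executor still alive after the idle timeout (the small job).
    val activeExecHosts = Seq("w1")
    println(execsPerNode(allExecHosts))    // 2 (correct)
    println(execsPerNode(activeExecHosts)) // 1 (undercounts)
  }
}
```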
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Do we have any tests covering the dynamic allocation scenario you had in the description?
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Added a unit test for dynamic allocation, with a comment.
There is no way for us to know this on some platforms like YARN, where the platform selects what it schedules by. For now we will just have to assume that it isn't, but document it and warn the user.
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

This is a follow-up to NVIDIA#1119. This fix only works for Dataproc when the cluster argument is not passed on the CLI. The instance type will be set after multiplying `numExecutorCores` by `numExecutorsPerNode`.
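The follow-up's sizing idea can be sketched as follows. The Dataproc machine-type naming (`n1-standard-<cores>`) matches the cluster used in the test scenario above, but the function name and mapping here are assumptions for illustration, not the tool's actual code.

```scala
object DataprocSizingSketch {
  // Hypothetical helper: derive a Dataproc machine type from the executor
  // shape by multiplying cores per executor by executors per node.
  def inferInstanceType(numExecutorCores: Int, numExecutorsPerNode: Int): String = {
    val totalCores = numExecutorCores * numExecutorsPerNode
    s"n1-standard-$totalCores"
  }

  def main(args: Array[String]): Unit = {
    // 8-core executors, 2 per node -> a 16-core machine type.
    println(inferInstanceType(8, 2)) // n1-standard-16
  }
}
```

Note this only holds when a node is not oversubscribed, which is exactly the caveat raised in the follow-up section below.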
Fixes #1117. Currently for cluster information, we calculate the number of nodes correctly but do not track the number of executors per node. This can generate wrong GPU cluster recommendations because there can be multiple executors per node.
This PR adds `numExecsPerNode` to the cluster information output file.

Changes:

Core/Java:
- Calculate `numExecsPerNode` as the maximum number of executors on any host.
- Add it to `ClusterInfo` and related methods.
- Add `Num Executor Per Node` as a new field in the cluster information output CSV file.

Output
Cluster Information Generated from Core:
File: rapids_4_spark_qualification_output_cluster_information.json
Previously:
After this fix:
Follow Up

We cannot simply assume `coresPerNode = numExecsPerNode * coresPerExecutor`, since a node may be oversubscribed.
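The oversubscription pitfall can be shown with hypothetical numbers: a 16-core node running 3 executors of 8 cores each advertises 24 executor cores, so multiplying back would overstate the node's actual size.

```scala
object OversubscriptionSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical oversubscribed node: executor cores exceed physical cores.
    val coresPerExecutor = 8
    val numExecsPerNode = 3
    val actualCoresPerNode = 16 // what the machine really has
    val inferredCoresPerNode = numExecsPerNode * coresPerExecutor
    println(s"inferred=$inferredCoresPerNode actual=$actualCoresPerNode") // 24 vs 16
  }
}
```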