
Qual tool tuning rec based on CPU event log coherently recommend tunings and node setup and infer cluster from eventlog #1188

Merged
merged 81 commits into from
Jul 19, 2024

Conversation

tgravescs
Collaborator

Fixes #1160 and #997.

This has a bunch of changes in it. Currently we have basic integration to run the qualification auto tuner against CPU event logs, but it isn't tied to any node recommendations or other features in the Python user tools, so users can get incorrect or inconsistent behavior; this fixes that.
Note, the assumption here is that you run a similar-sized GPU cluster compared to the CPU cluster. Similar-sized to me means the same number of executors with similar cores/memory (might not be exact). If you change the number of executors, you change the parallelism possibilities.

Here are some of the changes:

  • Tie together the options from Python for passing clusters. If the cluster isn't provided, we need to infer the cluster setup from the event log and infer the GPU cluster from that; if the cluster is provided, we keep the same behavior as before. To make this happen we need a mapping to instance type in the Scala code.
  • Make sure the memory recommendations are really based on the instance type, so that heap, off-heap, etc. all fit.
  • Make sure that we recommend the correct number of nodes based on the instance type. For instance, CPU might run 2 executors on a 32-core box, but for that CSP there are no GPU nodes that can have 2 GPUs on them, so you have to make that 2 nodes of 16 cores each with 1 GPU to keep the number of executors the same.
  • Output on the Python side should be a per-app recommendation for the cluster setup and tunings unless the --cluster option is passed into the Python user_tools. If --cluster is passed, it's assumed each job ran on that type of cluster (but that path still has bugs in my opinion).
  • Tie the Scala-side tunings and inferred node recommendation back to the final Python output in the cluster shape recommendation.
  • Qualification output should include tuning configs and node recommendations in the summary output, and add some * footnotes to draw attention to how we are qualifying jobs.
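The node-count adjustment in the third bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual tool code: the function name, the one-executor-per-GPU assumption, and the instance-type catalog are all hypothetical.

```python
# Hypothetical sketch of the node-count adjustment described above: keep the
# total executor count constant while mapping onto GPU instance types.
# GPU_INSTANCE_TYPES is an illustrative catalog, not a real CSP listing.
GPU_INSTANCE_TYPES = [
    # (name, cores, gpus_per_node)
    ("n1-standard-16-1xT4", 16, 1),
    ("n1-standard-32-1xT4", 32, 1),
]

def recommend_gpu_nodes(cpu_nodes, executors_per_cpu_node):
    """Pick a GPU node shape that preserves the total executor count,
    assuming one executor per GPU."""
    total_executors = cpu_nodes * executors_per_cpu_node
    for name, cores, gpus in GPU_INSTANCE_TYPES:
        if total_executors % gpus == 0:
            return name, total_executors // gpus
    return None, 0

# Example from the description: 2 executors on one 32-core CPU box, but no
# GPU node offers 2 GPUs, so recommend 2 nodes with 1 GPU each.
name, num_nodes = recommend_gpu_nodes(cpu_nodes=1, executors_per_cpu_node=2)
```

The point is only that parallelism (executor count) is held constant while the node shape changes; the real implementation also has to match cores and memory per executor.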

There are a bunch of things that still need further enhancement; most of those have issues filed or are documented in #1152.

Tested via existing unit tests, and then I manually tested qualification against Databricks AWS, EMR, Databricks Azure, Dataproc (all of the flavors), and on-prem.

Example output running on Dataproc without passing the --cluster option, so it does a per-app level recommendation:

+----+-----------------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------+
|    | App Name              | App ID                         | Estimated GPU   | Qualified Node   | Full Cluster                        | GPU Config                         |
|    |                       |                                | Speedup         | Recommendation   | Config                              | Recommendation                     |
|    |                       |                                | Category**      |                  | Recommendations*                    | Breakdown*                         |
|----+-----------------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------|
|  0 | NDS - Power Run       | application_1709275675195_0233 | Large           | n1-standard-32   | application_1709275675195_0233.conf | application_1709275675195_0233.log |
|  1 | NDS - transcode - orc | application_1716441747790_0001 | Large           | n1-standard-16   | application_1716441747790_0001.conf | application_1716441747790_0001.log |
+----+-----------------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------+
* Config Recommendations can be found in /home/user/workspace/spark-rapids-tools2/user_tools/qual_20240712195330_eD3Ff366/rapids_4_spark_qualification_output/tuning
** Estimated GPU Speedup Category assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.

Same recommendation when you do pass in the --cluster option, which implies all the apps ran on this type of cluster and it uses that cluster instead of the cluster info from the event log:


+----+-----------------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------+
|    | App Name              | App ID                         | Estimated GPU   | Qualified Node   | Full Cluster                        | GPU Config                         |
|    |                       |                                | Speedup         | Recommendation   | Config                              | Recommendation                     |
|    |                       |                                | Category**      |                  | Recommendations*                    | Breakdown*                         |
|----+-----------------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------|
|  0 | NDS - Power Run       | application_1709275675195_0233 | Large           | n1-standard-2    | application_1709275675195_0233.conf | application_1709275675195_0233.log |
|  1 | NDS - transcode - orc | application_1716441747790_0001 | Large           | n1-standard-2    | application_1716441747790_0001.conf | application_1716441747790_0001.log |
+----+-----------------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------+
* Config Recommendations can be found in /home/user/workspace/spark-rapids-tools2/user_tools/qual_20240712195506_6DD8468e/rapids_4_spark_qualification_output/tuning
** Estimated GPU Speedup Category assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.

@amahussein amahussein added affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Jul 15, 2024
Collaborator

@parthosa parthosa left a comment

Thanks @tgravescs for these changes. Made an initial pass to test e2e:

  • Currently, the python tool generates a file intermediate_output/cluster_shape_recommendation.json to store the final cluster shape recommendation.
    • This PR does not seem to update the above JSON file correctly.
    • After this PR, we would now have potentially two JSON files with cluster shape recommendation (one generated by scala side and one by python side)
    • Users might get confused as to which file to refer to.
  • In the console output, Col Qualified Node Recommendation: Should this be only node or cluster shape?
  • nit: Can we put 'Estimated GPU Speedup Category assumes the user is using the node...' in the Notes section, e.g.:
Notes:
--------------------
 - Apps with the same name are grouped together and their metrics are averaged
 - 'Estimated GPU Speedup Category' assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.


conversion_items_summary = {}
if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
Collaborator

nit: Can we simplify the nested if conditions by moving cpu_cluster_info and gpu_cluster_info before the first if?

Collaborator Author

I don't follow, what first if statement?

Collaborator

We can probably refactor as:

if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
    gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
    if cpu_cluster_info is not None and gpu_cluster_info is not None:

as

cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
if cpu_cluster_info:
    cpu_instance_type = cpu_cluster_info.get_worker_node().instance_type
    if gpu_cluster_info:

if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
    gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
    if cpu_cluster_info is not None and gpu_cluster_info is not None:
Collaborator

Can we store *_cluster_info.get_worker_node().instance_type as variables to avoid redundant calls and clean up (similar to below)?

Collaborator Author

updating

@@ -989,13 +1106,44 @@ def __infer_cluster_and_update_savings(self, cluster_info_df: pd.DataFrame):

# Log the inferred cluster information and set the context
self._log_inferred_cluster_info(cpu_cluster_obj)
self.logger.info('Inferred Cluster cpu node %s', cpu_cluster_obj)
Collaborator

We might not want to print the entire cluster object here.

Collaborator Author

removed

Collaborator

@amahussein amahussein left a comment

Thanks @tgravescs for the huge effort putting all those pieces together!
I was running the changes and I noticed the following:

  • For on-prem it shows nan. We might need to clarify in the PR description that this is on purpose. I think it will look better if nan is handled and replaced by N/A.
  • I see log messages that could be due to incorrect handling of variables:
     2024-07-15 16:34:12,444 WARNING rapids.tools.qualification: Failed to get the worker node information for the GPU cluster AttributeError:'NoneType' object has no attribute 'get_worker_node'
     2024-07-15 16:34:12,444 WARNING rapids.tools.qualification: Cannot generate the cluster recommendation report because the cluster information is not available.
     2024-07-15 16:34:12,444 ERROR rapids.tools.qualification: Error generating the cluster recommendation report. Reason - AttributeError:'NoneType' object has no attribute 'get_cluster_configuration'
    
  • We previously fixed the flow to avoid requiring CSP authentication to retrieve some node information. It looks like the AutoTuning recommendation did not pick up the same path that has that fix. All the node recommendations are nan for all CSPs without valid CSP authentication. Perhaps @parthosa can help on closing that gap.
  • I was looking into intermediate_output/cluster_shape_recommendation.csv. Both sourceCluster/targetCluster include gpuInfo set to T4. In fact both clusters have the exact same information for Dataproc. (attached here: cluster_shape_recommendation.json)
  • We need to follow up on this PR by revisiting "Recommended cluster should use executors_per_node and cores_per_executor" #1138 to fix the recommendations for Dataproc environments.
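The nan-to-N/A cleanup suggested in the first bullet could look roughly like this, as a hedged sketch using pandas fillna on the summary table before printing; the column name and values are made up for illustration.

```python
import pandas as pd

# Replace missing node recommendations with "N/A" before printing the summary
# table, as suggested above. The column name here is illustrative only.
df = pd.DataFrame({"Qualified Node Recommendation": ["n1-standard-32", None]})
cleaned = df.fillna("N/A")
```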

@tgravescs
Collaborator Author

tgravescs commented Jul 16, 2024

Thanks @tgravescs for these changes. Made an initial pass to test e2e:

  • Currently, the python tool generates a file intermediate_output/cluster_shape_recommendation.json to store the final cluster shape recommendation.

    • This PR does not seem to update the above JSON file correctly.
    • After this PR, we would now have potentially two JSON files with cluster shape recommendation (one generated by scala side and one by python side)
    • Users might get confused as to which file to refer to.
  • In the console output, Col Qualified Node Recommendation: Should this be only node or cluster shape?

  • nit: Can we put 'Estimated GPU Speedup Category assumes the user is using the node...' in Notes sections

Notes:
--------------------
 - Apps with the same name are grouped together and their metrics are averaged
 - 'Estimated GPU Speedup Category' assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.

I intentionally didn't modify intermediate_output/cluster_shape_recommendation.json, as it's in the intermediate output and that can be a follow-up. In my opinion, if it's in this directory it shouldn't be used. I wanted to stop at some point and just get in what we have so multiple people can work on it. It doesn't change what it is for the worse.

  • In the console output, Col Qualified Node Recommendation: Should this be only node or cluster shape?

Right now it prints the node recommendation, so the title matches. If we want to print more cluster information, that is an enhancement. This is only in the text stdout, so it doesn't seem like a big deal to me, but I'm open to opinions. Either way it can be a follow-up, since I see the cluster shape recommendation as a follow-up thing we can do as long as it's coherent before release.

  • nit: Can we put 'Estimated GPU Speedup Category assumes the user is using the node...' in Notes sections

What Notes section? I am not familiar with it. If you are suggesting to put it there in addition, I guess it's fine, but how is that different from where it is now? I want it as close to that table as possible.

@tgravescs
Collaborator Author

Thanks @tgravescs for the huge effort putting all those pieces together! I was running the changes and I noticed the following:

  • For on-prem it shows nan. We might need to clarify in the PR description that this is on purpose. I think it will look better if nan is handled and replaced by N/A.

sure

  • I see log messages that could be due to incorrect handling of variables:
     2024-07-15 16:34:12,444 WARNING rapids.tools.qualification: Failed to get the worker node information for the GPU cluster AttributeError:'NoneType' object has no attribute 'get_worker_node'
     2024-07-15 16:34:12,444 WARNING rapids.tools.qualification: Cannot generate the cluster recommendation report because the cluster information is not available.
     2024-07-15 16:34:12,444 ERROR rapids.tools.qualification: Error generating the cluster recommendation report. Reason - AttributeError:'NoneType' object has no attribute 'get_cluster_configuration'
    

I can look at the NoneType errors; honestly, there were a lot before. The middle one is explicit and should be there.

  • We previously fixed the flow to avoid requiring CSP authentication to retrieve some node information. It looks like the AutoTuning recommendation did not pick up the same path that has that fix. All the node recommendations are nan for all CSPs without valid CSP authentication. Perhaps @parthosa can help on closing that gap.

Sure let me know.

  • I was looking into intermediate_output/cluster_shape_recommendation.csv. Both sourceCluster/targetCluster include gpuInfo set to T4. In fact both clusters have the exact same information for Dataproc. (attached here: cluster_shape_recommendation.json)

Ok, what is the question or bug you are pointing out here? I was trying not to touch this file much; I may have while modifying some variables.

…Category

stdout since we don't want them in csv
Collaborator

@amahussein amahussein left a comment

Ok, what is the question or bug you are pointing out here? I was trying not to touch this file much; I may have while modifying some variables.

My question here is that the source cluster is a CPU cluster; I don't expect to see GPU information within the source cluster.

Collaborator

@amahussein amahussein left a comment

LGTM.
Thanks @tgravescs
