Fix java Qual tool Autotuner output when GPU device is missing #1085

Merged 1 commit on Jun 7, 2024

@cindyyuanjiang cindyyuanjiang commented Jun 7, 2024

Fixes #1030

Problem
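The Java Qual tool's Autotuner produces inconsistent output when the GPU device is not specified in the worker info file (--worker-info) and the platform is set to databricks-aws-t4. The comment reports an a10G default while the recommended conf corresponds to a T4: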

  • - GPU device is missing. Setting default to a10G.
  • --conf spark.rapids.sql.concurrentGpuTasks=2 -> indicates GPU is T4

Changes

  • When the GPU name is missing, the Autotuner should first use the GPU device provided by the platform before falling back to a default
  • Make the platform list more exhaustive
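
The first change can be sketched as follows. This is only an illustration of the lookup order, not the tool's actual code (the tool itself is written in Scala): the names `PLATFORM_GPU`, `FALLBACK_GPU`, and `resolve_gpu` are hypothetical, the platform table contains just the one platform mentioned in this PR, and the a10G fallback is taken from the log line quoted in the problem statement.

```python
from typing import Optional, Tuple

# Hypothetical table: GPU implied by a platform name (assumption, single entry
# taken from this PR's example).
PLATFORM_GPU = {"databricks-aws-t4": "t4"}

# Fallback seen in the "Setting default to a10G" message quoted above.
FALLBACK_GPU = "a10G"


def resolve_gpu(worker_gpu: Optional[str], platform: str) -> Tuple[str, Optional[str]]:
    """Return (gpu_name, comment); comment is None when no default was needed."""
    if worker_gpu:
        # The worker info file names a GPU, so use it as-is.
        return worker_gpu, None
    # Prefer the GPU implied by the platform, then the generic fallback.
    default = PLATFORM_GPU.get(platform, FALLBACK_GPU)
    return default, f"GPU device is missing. Setting default to {default}."


if __name__ == "__main__":
    print(resolve_gpu(None, "databricks-aws-t4"))
```

With this ordering, the comment and the recommended GPU settings are derived from the same device, which is the consistency the PR is after.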

Testing

export TOOLS_JAR
export EVENTLOGS
export WORKER_INFO_FILE
java -XX:+UseG1GC -Xmx50g -cp $TOOLS_JAR:$SPARK_HOME/jars/* com.nvidia.spark.rapids.tool.qualification.QualificationMain --platform databricks-aws-t4 --num-threads 6 --auto-tuner --worker-info $WORKER_INFO_FILE $EVENTLOGS

In WORKER_INFO_FILE:

system:
  numCores: 32
  memory: 131072MiB
  numWorkers: 4
softwareProperties:
  spark.scheduler.mode: FAIR
  spark.sql.cbo.enabled: 'true'
  spark.ui.port: '0'
  spark.yarn.am.memory: 640m

…orm list

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang self-assigned this Jun 7, 2024
@cindyyuanjiang added labels Jun 7, 2024: bug (Something isn't working), core_tools (Scope the core module (scala)), usability (track issues related to the Tools's user experience)
@tgravescs
Collaborator

Thanks for fixing this. It would be nice to add the new fixed output to the description.

@tgravescs
Collaborator

This is likely a separate issue, but it seems odd to me that the tool doesn't fail if we pass in a bad parameter for --platform. For instance, if I pass in databricks-aws-r4, it happily goes along and just changes it to something else:

24/06/07 07:43:57 INFO PlatformFactory: Using platform: databricks-aws

I would have expected this to error out so the user knows they are not going to get what they expect. @mattahrens @amahussein, thoughts on this? If you agree, we can file a separate issue.
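
The stricter behavior suggested in this comment could look roughly like the sketch below. It is not the tool's implementation or API: `SUPPORTED_PLATFORMS` and `validate_platform` are hypothetical names, and the set holds only the two platform names that appear in this thread.

```python
# Hypothetical, deliberately incomplete set of accepted platform names.
SUPPORTED_PLATFORMS = frozenset({"databricks-aws", "databricks-aws-t4"})


def validate_platform(name: str) -> str:
    """Return the platform name unchanged, or raise if it is not recognized."""
    if name not in SUPPORTED_PLATFORMS:
        supported = ", ".join(sorted(SUPPORTED_PLATFORMS))
        # Fail loudly instead of silently substituting another platform.
        raise ValueError(f"Unsupported platform '{name}'. Supported: {supported}")
    return name


if __name__ == "__main__":
    print(validate_platform("databricks-aws-t4"))
    try:
        validate_platform("databricks-aws-r4")
    except ValueError as err:
        print(err)
```

Rejecting the unknown name up front means a typo such as databricks-aws-r4 stops the run instead of being mapped to a different platform.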

@amahussein
Collaborator

Yes, we can file a separate issue for that

@amahussein amahussein merged commit f6e1351 into NVIDIA:dev Jun 7, 2024
16 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-1030 branch June 7, 2024 17:59
Development

Successfully merging this pull request may close these issues.

[BUG] Qual tool platform databricks-aws-t4 print out - GPU device is missing. Setting default to a10G
3 participants