
Update the Qual tool AutoTuner Heuristics against CPU event logs #1069

Merged: 24 commits merged into NVIDIA:dev from autoheur on Jun 7, 2024

Conversation

tgravescs (Collaborator)

fixes #1068

This enhances the heuristics around spark.executor.memory and handles cases where the memory-to-core ratio is too small. If the ratio is too small, the tool throws an exception and does not emit tunings. In the future we should just tag this and recommend appropriate sizes instead.
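
A minimal sketch of the check described above; the threshold constant, object, and method names are hypothetical assumptions, not the AutoTuner's actual code:

// Sketch only: a memory-per-core check along the lines described above.
// MIN_HEAP_PER_CORE_MB and all names here are illustrative assumptions.
object ExecutorMemoryCheckSketch {
  val MIN_HEAP_PER_CORE_MB: Long = 750L // hypothetical minimum per core

  def checkMemoryPerCore(workerMemoryMB: Long, executorCores: Int): Long = {
    val memPerCoreMB = workerMemoryMB / executorCores
    if (memPerCoreMB < MIN_HEAP_PER_CORE_MB) {
      // Per the description above: fail and skip emitting tunings when the
      // memory-to-core ratio is too small.
      throw new IllegalArgumentException(
        s"Only $memPerCoreMB MB per core available; at least " +
        s"$MIN_HEAP_PER_CORE_MB MB per core is required to generate tunings.")
    }
    memPerCoreMB
  }
}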

This also adds in extra overhead, since in the worst case we need space for both pinned memory and spill memory. It gets a little complicated because spill will use pinned memory, but if the pinned pool is already in use it falls back to regular off-heap memory. So here we size for the worst case, which is needing both.
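
A rough sketch of that worst-case sizing; the field breakdown and names are illustrative assumptions, not the AutoTuner's actual calculation:

// Sketch only: count both the pinned pool and spill storage in full when
// sizing off-heap overhead, since spill may not fit in the pinned pool and
// then falls back to regular off-heap memory.
object OffHeapOverheadSketch {
  def recommendedOverheadMB(baseOverheadMB: Long,
                            pinnedPoolMB: Long,
                            spillStorageMB: Long): Long =
    baseOverheadMB + pinnedPoolMB + spillStorageMB
}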

I also added heuristics for configuring the multithreaded readers (number of threads and some sizes) as well as the shuffle reader/writer thread pools, based on the number of cores.
Most of the heuristics are based on what we saw from real customer workloads and NDS results.
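
A sketch of the core-count-based thread sizing; the multiplier and cap are assumptions chosen to reproduce the 20-thread recommendation for the 8-core worker in the example below, not necessarily the tool's exact heuristic:

// Sketch only: derive the multithreaded reader and shuffle reader/writer
// thread counts from the executor core count. With 8 cores this yields 20,
// matching the example recommendations shown below.
object ThreadPoolSizingSketch {
  def recommendNumThreads(executorCores: Int): Int =
    math.min(math.ceil(executorCores * 2.5).toInt, 64)
}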

Most of this testing was on CSPs; I will try to apply more of this to on-prem setups later.

Note that most of this functionality requires the worker information to be passed in via --worker-info ./worker_info-demo-gpu-cluster.yaml

Example:

system:
  numCores: 8 
  memory: 15360MiB
  numWorkers: 4
softwareProperties:
  spark.scheduler.mode: FAIR

With the worker info:

Spark Properties:
--conf spark.executor.cores=8
--conf spark.executor.instances=4
--conf spark.executor.memory=8192m
--conf spark.rapids.filecache.enabled=true
--conf spark.rapids.memory.pinnedPool.size=3584m
--conf spark.rapids.shuffle.multiThreaded.reader.threads=20
--conf spark.rapids.shuffle.multiThreaded.writer.threads=20
--conf spark.rapids.sql.batchSizeBytes=2147483647
--conf spark.rapids.sql.concurrentGpuTasks=3
--conf spark.rapids.sql.multiThreadedRead.numThreads=20
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321db.RapidsShuffleManager
--conf spark.sql.adaptive.coalescePartitions.minPartitionSize=4m
--conf spark.sql.adaptive.coalescePartitions.parallelismFirst=false
--conf spark.sql.shuffle.partitions=200
--conf spark.task.resource.gpu.amount=0.125

Without the worker info:

Spark Properties:
--conf spark.rapids.filecache.enabled=true
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321db.RapidsShuffleManager
--conf spark.sql.shuffle.partitions=200

tgravescs added the 'feature request' label on Jun 5, 2024
tgravescs self-assigned this on Jun 5, 2024
@amahussein (Collaborator) commented:

--worker-info ./worker_info-demo-gpu-cluster.yaml

Should we add that file to the repo? Perhaps inside tests/resources?

@tgravescs (Collaborator, Author) commented:

Should we add that file to the repo? Perhaps inside tests/resources?

Sure I can add it.

I also realized I wanted to add a few more tests to the Suite so I'll do that and push some updates shortly.

amahussein added the 'core_tools' label on Jun 6, 2024
@amahussein (Collaborator) left a comment:
Thanks @tgravescs

@amahussein (Collaborator) left a comment:
I see you filed #1078

Thanks @tgravescs
LGTM

@nartal1 (Collaborator) left a comment:

Thanks @tgravescs !

tgravescs merged commit baa0f46 into NVIDIA:dev on Jun 7, 2024 (15 checks passed)
tgravescs deleted the autoheur branch on June 7, 2024 13:54
Closes: [FEA] Enhance qualification tool Auto Tuner CPU event log recommendations (#1068)