Don't create more streams than required in cuda_pool
#864
Create the number of streams based on `get_num_worker_threads` if the runtime is up; otherwise fall back to `hardware_concurrency`. This is intended to work around HIP event creation being significantly slower when many streams are created. For example, on an AMD EPYC system with 128 hardware threads (`hardware_concurrency == 128`), the `cuda_pool` would create `(3 + 3) * hardware_concurrency = 768` streams per rank, even if the runtime is started with only 8 threads (8 ranks per node, without hyperthreading). If the actual number of runtime threads is used, the number of streams created is instead 48. This can significantly improve performance if many kernels are submitted.

The `cuda_pool` still falls back to using `hardware_concurrency` if it's created before the pika runtime is started.
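
The selection logic and the arithmetic above can be sketched roughly as follows. This is a standalone illustration, not the code from this PR: `pick_num_threads`, `total_streams`, and `streams_per_thread` are hypothetical names, and the `std::optional` argument stands in for "the runtime is up and reports this many worker threads". It uses `std::thread::hardware_concurrency` from the standard library rather than any pika API.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <thread>

// The description gives (3 + 3) streams per thread. What the two groups of
// three represent is not stated there, so this constant only mirrors the sum.
constexpr std::size_t streams_per_thread = 3 + 3;

// Hypothetical helper: runtime_threads holds the worker thread count when the
// runtime is already running, and std::nullopt when it has not been started.
inline std::size_t pick_num_threads(std::optional<std::size_t> runtime_threads)
{
    if (runtime_threads)
    {
        // Assumption: a running runtime always has at least one worker thread.
        assert(*runtime_threads > 0);
        return *runtime_threads;
    }

    // Fallback: the standard library may return 0 if the value is unknown,
    // so clamp to 1 to avoid creating an empty pool.
    unsigned const hw = std::thread::hardware_concurrency();
    return hw == 0 ? 1 : static_cast<std::size_t>(hw);
}

// Total number of streams for a given thread count.
constexpr std::size_t total_streams(std::size_t num_threads)
{
    return streams_per_thread * num_threads;
}

// The two cases from the description, checked at compile time.
static_assert(total_streams(128) == 768);
static_assert(total_streams(8) == 48);
```

Under these assumptions, `total_streams(pick_num_threads(8))` gives 48 streams for a runtime started with 8 threads, while `total_streams(pick_num_threads(std::nullopt))` scales with the machine's hardware thread count, which is the fallback case.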