Improve CUDA polling #609
Merged
Conversation
msimberg requested changes on Feb 23, 2023
libs/pika/async_cuda/include/pika/async_cuda/cuda_polling_helper.hpp (review comments outdated; resolved)
When polling for ready events, push continuations/status onto a temporary vector whilst the lock is held, then invoke the continuations outside of the lock, so that other threads can process events whilst our continuations are running.
To prevent serialization of continuations, only handle one successful event per polling pass. This allows another thread to immediately poll and process the next event, rather than a single thread ending up with N continuations queued.
Instead of exiting after handling an event, reenter the polling loop and look for another one. Do not hold the lock except when modifying the event vectors etc.
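A minimal sketch of this pattern, assuming a mutex-protected vector of events; the names (event_entry, cuda_event_poller) are illustrative and not pika's actual internals:

```cpp
#include <cuda_runtime.h>

#include <functional>
#include <mutex>
#include <utility>
#include <vector>

struct event_entry
{
    cudaEvent_t event;
    std::function<void(cudaError_t)> continuation;
};

class cuda_event_poller
{
    std::mutex mtx_;
    std::vector<event_entry> active_events_;

public:
    void add(cudaEvent_t ev, std::function<void(cudaError_t)> f)
    {
        std::scoped_lock lk(mtx_);
        active_events_.push_back({ev, std::move(f)});
    }

    // One polling pass: hold the lock only while inspecting and modifying the
    // event vector; run the continuations after releasing it so that other
    // threads can poll (and run their own continuations) concurrently.
    void poll()
    {
        std::vector<event_entry> ready;
        {
            std::scoped_lock lk(mtx_);
            for (auto it = active_events_.begin(); it != active_events_.end();)
            {
                if (cudaEventQuery(it->event) == cudaSuccess)    // error handling omitted
                {
                    ready.push_back(std::move(*it));
                    it = active_events_.erase(it);
                }
                else
                {
                    ++it;
                }
            }
        }

        // Invoked outside the lock.
        for (auto& e : ready)
        {
            e.continuation(cudaSuccess);
        }
    }
};
```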
Both async_cuda and async_mpi have get/set pool name functions that can be used to tell each library independently which pool to use for polling; by default both use "pika:polling" (formerly "pika:mpi").
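For illustration only, a standalone sketch of the idea that each library keeps its own polling-pool name with a shared default; the names below are placeholders, not pika's actual API:

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-ins for the per-library pool-name setting; pika's real
// accessors live in async_cuda and async_mpi and may be named differently.
struct polling_config
{
    std::string pool_name = "pika:polling";    // shared default (formerly "pika:mpi")
};

polling_config cuda_polling;
polling_config mpi_polling;

int main()
{
    // Redirect CUDA polling to its own pool while MPI keeps the default.
    cuda_polling.pool_name = "pika:cuda-polling";

    std::cout << "CUDA events polled on: " << cuda_polling.pool_name << '\n'
              << "MPI requests polled on: " << mpi_polling.pool_name << '\n';
}
```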
Completed events are added to a lockfree queue and any thread(s) can invoke the completion(s), whilst another thread is still polling for the actual ready state under a lock.
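A sketch of this hand-off, assuming boost::lockfree::queue as the lock-free container (pika's actual implementation may use a different queue type and element layout): the thread holding the polling lock only moves completed events onto the queue, and any thread can drain the queue and run the completions without taking the lock.

```cpp
#include <cuda_runtime.h>

#include <boost/lockfree/queue.hpp>

#include <functional>
#include <mutex>
#include <utility>
#include <vector>

struct completion
{
    cudaError_t status;
    std::function<void(cudaError_t)> callback;
};

// Raw pointers are stored because boost::lockfree::queue requires elements
// with trivial assignment and destruction.
boost::lockfree::queue<completion*> ready_queue{128};

std::mutex active_mutex;
std::vector<std::pair<cudaEvent_t, std::function<void(cudaError_t)>>> active_events;

// Runs under the lock: move completed events onto the lock-free ready queue.
void enqueue_completed_events()
{
    std::scoped_lock lk(active_mutex);
    for (auto it = active_events.begin(); it != active_events.end();)
    {
        cudaError_t status = cudaEventQuery(it->first);
        if (status != cudaErrorNotReady)
        {
            ready_queue.push(new completion{status, std::move(it->second)});
            it = active_events.erase(it);
        }
        else
        {
            ++it;
        }
    }
}

// Runs on any thread, without the lock: invoke whatever completions are ready.
void invoke_ready_completions()
{
    completion* c = nullptr;
    while (ready_queue.pop(c))
    {
        c->callback(c->status);
        delete c;
    }
}
```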
msimberg approved these changes on Mar 7, 2023
bors merge
bors bot added a commit that referenced this pull request on Mar 8, 2023:
609: Improve CUDA polling r=aurianer a=biddisco (Co-authored-by: John Biddiscombe <biddisco@cscs.ch>)
Build failed.
bors merge
Redesign the way polling for CUDA events is handled.
CUDA events are polled (by any thread on the pool on which polling is enabled) and passed to a lock-free queue when ready. The polling loop first checks ready events and invokes their callbacks, and only then takes the lock and checks outstanding events, which are placed on the ready queue.
This means that as soon as events are ready, any thread can invoke the callback: a single polling thread can find N ready events and place them in the ready queue, and N other threads can start handling the completions, instead of only the polling thread being allowed to handle them.
The locking and completion handling has been reworked significantly and gives much better results.
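The ordering described above can be sketched as a single polling pass; the two helpers below are placeholders for the phases shown in the earlier lock-free queue sketch, not pika functions.

```cpp
// Placeholder phases; see the lock-free queue sketch earlier in the thread
// for what each one would contain.
void run_ready_completions() { /* pop the lock-free ready queue, run callbacks */ }
void query_outstanding_events() { /* take the lock, cudaEventQuery, push ready events */ }

// One pass of the polling loop, in the order described above.
void poll_cuda_events()
{
    // Phase 1, no lock: any thread entering the loop first makes progress on
    // completions that some other thread has already found to be ready.
    run_ready_completions();

    // Phase 2, under the lock: query outstanding CUDA events and publish the
    // completed ones to the ready queue for any thread to pick up.
    query_outstanding_events();
}
```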