Improve CUDA polling #609

Merged
merged 10 commits into from
Mar 8, 2023

biddisco
Contributor

Redesign the way polling for CUDA events is handled.

CUDA events are polled (by any thread on the pool on which polling is enabled) and passed to a lock-free queue when ready. The polling loop first checks the ready queue and invokes the callbacks of any events found there; only then does it take the lock and check the outstanding events, placing those that have completed on the ready queue.
This means that as soon as events are ready, any thread can invoke their callbacks: a single polling thread can find that N events are ready and place them in the ready queue, and N other threads can then start handling the completions, instead of only the polling thread being allowed to handle them.

The locking and completion handling have been reworked significantly and give much better results.
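The two-stage loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `pending_event`, `ready_queue`, and `event_poller` are hypothetical names, `is_ready` stands in for a `cudaEventQuery` check, and a mutex-protected `std::deque` stands in for the lock-free queue so that the sketch needs only the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA event and its completion callback.
// is_ready plays the role of cudaEventQuery(event) == cudaSuccess.
struct pending_event
{
    std::function<bool()> is_ready;
    std::function<void()> callback;
};

// Stand-in for the lock-free ready queue (a locked deque, for brevity only).
class ready_queue
{
public:
    void push(std::function<void()> f)
    {
        std::lock_guard<std::mutex> l(mtx_);
        queue_.push_back(std::move(f));
    }

    bool try_pop(std::function<void()>& out)
    {
        std::lock_guard<std::mutex> l(mtx_);
        if (queue_.empty()) { return false; }
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }

private:
    std::mutex mtx_;
    std::deque<std::function<void()>> queue_;
};

class event_poller
{
public:
    void add(pending_event e)
    {
        std::lock_guard<std::mutex> l(mtx_);
        outstanding_.push_back(std::move(e));
    }

    // One polling pass, callable from any thread.
    // Returns the number of callbacks invoked by this caller.
    std::size_t poll()
    {
        std::size_t invoked = 0;

        // Stage 1: run callbacks already on the ready queue; the event lock
        // is not held while they run.
        std::function<void()> cb;
        while (ready_.try_pop(cb))
        {
            cb();
            ++invoked;
        }

        // Stage 2: under the lock, move newly completed events to the queue.
        std::lock_guard<std::mutex> l(mtx_);
        for (auto it = outstanding_.begin(); it != outstanding_.end();)
        {
            if (it->is_ready())
            {
                ready_.push(std::move(it->callback));
                it = outstanding_.erase(it);
            }
            else { ++it; }
        }
        return invoked;
    }

private:
    std::mutex mtx_;
    std::vector<pending_event> outstanding_;
    ready_queue ready_;
};

// Single-threaded walk-through; returns how many callbacks ran in total.
int demo_two_stage_polling()
{
    event_poller p;
    bool done = false;
    int calls = 0;
    p.add({[&done] { return done; }, [&calls] { ++calls; }});

    std::size_t const before = p.poll();    // event not complete yet
    done = true;
    std::size_t const moved = p.poll();     // event moved to the ready queue
    std::size_t const invoked = p.poll();   // callback runs on the next pass
    assert(before == 0 && moved == 0 && invoked == 1);
    return calls;
}
```

In this sketch the pass that discovers a completed event does not run its callback; the callback is picked up by whichever thread next enters `poll()`, which is what lets several threads share the completion work.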

@pika-bot
Collaborator

Performance test report

pika Performance

Comparison

| BENCHMARK | NO-EXECUTOR |
| --- | --- |
| Future Overhead - Create Thread Hierarchical - Latch | -- |

Info

| Property | Before | After |
| --- | --- | --- |
| pika Datetime | 2022-09-16T08:18:06+00:00 | 2023-02-23T08:51:15+00:00 |
| pika Commit | 190f189 | 41b75b0 |
| Compiler | /apps/daint/SSL/pika/spack/lib/spack/env/clang/clang++ 11.0.1 | /apps/daint/SSL/pika/spack/lib/spack/env/clang/clang++ 11.0.1 |
| Hostname | nid00074 | nid00025 |
| Envfile | | |
| Clustername | daint | daint |
| Datetime | 2022-09-16T10:25:01.976661+02:00 | 2023-02-23T10:03:21.247414+01:00 |

Explanation of Symbols

| Symbol | MEANING |
| --- | --- |
| = | No performance change (confidence interval within ±1%) |
| (=) | Probably no performance change (confidence interval within ±2%) |
| (+)/(-) | Very small performance improvement/degradation (≤1%) |
| +/- | Small performance improvement/degradation (>10%) |
| ++/-- | Large performance improvement/degradation (>10%) |
| +++/--- | Very large performance improvement/degradation (>10%) |
| ? | Probably no change, but quite large uncertainty (confidence interval within ±5%) |
| ?? | Unclear result, very large uncertainty (±10%) |
| ??? | Something unexpected… |

@msimberg
Contributor

bors try

bors bot added a commit that referenced this pull request Feb 23, 2023
msimberg changed the title from "Cuda polling" to "Improve CUDA polling" Feb 23, 2023
@bors
Contributor

bors bot commented Feb 23, 2023

try

Build failed:

@pika-bot
Collaborator

pika-bot commented Mar 6, 2023

Performance test report

pika Performance

Comparison

| BENCHMARK | NO-EXECUTOR |
| --- | --- |
| Future Overhead - Create Thread Hierarchical - Latch | -- |

Info

| Property | Before | After |
| --- | --- | --- |
| pika Datetime | 2022-09-16T08:18:06+00:00 | 2023-03-06T16:13:45+00:00 |
| pika Commit | 190f189 | 40b9399 |
| Hostname | nid00074 | nid00629 |
| Envfile | | |
| Clustername | daint | daint |
| Datetime | 2022-09-16T10:25:01.976661+02:00 | 2023-03-06T17:36:16.002686+01:00 |
| Compiler | /apps/daint/SSL/pika/spack/lib/spack/env/clang/clang++ 11.0.1 | /apps/daint/SSL/pika/spack/lib/spack/env/clang/clang++ 11.0.1 |


biddisco added 10 commits March 7, 2023 08:00
When polling for ready events, push continuations/status onto a
temporary vector whilst the lock is held, then invoke the continuations
outside of the lock, to allow other threads to process events
whilst our continuations are running.
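A minimal sketch of this collect-then-invoke pattern, assuming hypothetical names throughout: `tracked_event`, `event_list`, and the two status constants are illustrative stand-ins (the constants mimic `cudaSuccess` and `cudaErrorNotReady`), and `query` stands in for `cudaEventQuery`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical status codes, standing in for cudaSuccess / cudaErrorNotReady.
constexpr int status_success = 0;
constexpr int status_not_ready = 1;

struct tracked_event
{
    std::function<int()> query;               // stand-in for cudaEventQuery
    std::function<void(int)> continuation;    // receives the final status
};

class event_list
{
public:
    void add(tracked_event e)
    {
        std::lock_guard<std::mutex> l(mtx_);
        events_.push_back(std::move(e));
    }

    std::size_t size()
    {
        std::lock_guard<std::mutex> l(mtx_);
        return events_.size();
    }

    // Returns the number of continuations invoked.
    std::size_t poll()
    {
        // Temporary vector of (continuation, status) pairs.
        std::vector<std::pair<std::function<void(int)>, int>> finished;
        {
            std::lock_guard<std::mutex> l(mtx_);
            for (auto it = events_.begin(); it != events_.end();)
            {
                int const status = it->query();
                if (status != status_not_ready)
                {
                    finished.emplace_back(std::move(it->continuation), status);
                    it = events_.erase(it);
                }
                else { ++it; }
            }
        }    // lock released here

        // Continuations run without the lock, so they may add new events
        // and other threads may poll in the meantime.
        for (auto& f : finished) { f.first(f.second); }
        return finished.size();
    }

private:
    std::mutex mtx_;
    std::vector<tracked_event> events_;
};

// The continuation below re-enters the list; with a non-recursive mutex this
// would deadlock if continuations were invoked while the lock was held.
int demo_invoke_outside_lock()
{
    event_list list;
    int last_status = -1;
    list.add({[] { return status_success; },
        [&list, &last_status](int s) {
            last_status = s;
            list.add({[] { return status_not_ready; }, [](int) {}});
        }});

    std::size_t const invoked = list.poll();
    std::size_t const remaining = list.size();
    assert(invoked == 1 && remaining == 1);
    return last_status;
}
```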

To prevent serialization of continuations, only handle one
successful event on each test. This allows another thread to
immediately poll and process another without a single thread
having N continuations queued up.
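As a rough illustration of handling a single event per poll, the sketch below removes at most one ready event under the lock and runs its continuation after releasing it. `simple_event` and `one_at_a_time_poller` are hypothetical names, and the `ready` flag stands in for the result of a `cudaEventQuery` call.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

struct simple_event
{
    bool ready;                            // stand-in for a cudaEventQuery result
    std::function<void()> continuation;
};

class one_at_a_time_poller
{
public:
    void add(simple_event e)
    {
        std::lock_guard<std::mutex> l(mtx_);
        events_.push_back(std::move(e));
    }

    // Handles at most one ready event per call.
    // Returns true if a continuation was invoked.
    bool poll_one()
    {
        std::function<void()> continuation;
        {
            std::lock_guard<std::mutex> l(mtx_);
            auto it = events_.begin();
            while (it != events_.end() && !it->ready) { ++it; }
            if (it == events_.end()) { return false; }
            continuation = std::move(it->continuation);
            events_.erase(it);
        }    // lock released; another thread can now take the next event
        continuation();
        return true;
    }

private:
    std::mutex mtx_;
    std::vector<simple_event> events_;
};

// Three ready events: the first call handles exactly one of them.
int demo_one_event_per_poll()
{
    one_at_a_time_poller p;
    int handled = 0;
    for (int i = 0; i < 3; ++i)
    {
        p.add({true, [&handled] { ++handled; }});
    }

    bool const first = p.poll_one();
    int const after_first = handled;
    assert(first && after_first == 1);

    while (p.poll_one()) {}
    return handled;
}
```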

Instead of exiting after handling an event, re-enter the
polling loop and look for another one. Do not hold the lock
except when modifying the event vectors etc.
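One way to picture this change, as a sketch with hypothetical names (`loop_event`, `looping_poller`, `take_ready`): the lock is confined to a small helper that removes one ready event from the vector, and the outer loop keeps calling it until nothing ready remains.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

struct loop_event
{
    bool ready;                            // stand-in for a cudaEventQuery result
    std::function<void()> continuation;
};

class looping_poller
{
public:
    void add(loop_event e)
    {
        std::lock_guard<std::mutex> l(mtx_);
        events_.push_back(std::move(e));
    }

    std::size_t pending()
    {
        std::lock_guard<std::mutex> l(mtx_);
        return events_.size();
    }

    // Keeps looking for ready events until none is found.
    // Returns the number of continuations invoked.
    std::size_t poll()
    {
        std::size_t handled = 0;
        std::function<void()> continuation;
        while (take_ready(continuation))
        {
            continuation();    // runs without the lock
            ++handled;
        }
        return handled;
    }

private:
    // The only place where the lock is held: the event vector is modified here.
    bool take_ready(std::function<void()>& out)
    {
        std::lock_guard<std::mutex> l(mtx_);
        for (auto it = events_.begin(); it != events_.end(); ++it)
        {
            if (it->ready)
            {
                out = std::move(it->continuation);
                events_.erase(it);
                return true;
            }
        }
        return false;
    }

    std::mutex mtx_;
    std::vector<loop_event> events_;
};

// Two of three events are ready; one poll() call handles both.
int demo_reenter_polling_loop()
{
    looping_poller p;
    int ran = 0;
    p.add({true, [&ran] { ++ran; }});
    p.add({false, [&ran] { ++ran; }});
    p.add({true, [&ran] { ++ran; }});

    std::size_t const handled = p.poll();
    std::size_t const left = p.pending();
    assert(handled == 2 && left == 1);
    return ran;
}
```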

Both async_cuda and async_mpi have get/set pool name functions
which can be used to tell each library independently which pool
to use for polling, but by default, both use "pika:polling"
(formerly "pika:mpi").
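The idea of two independently configurable pool names sharing one default can be sketched as below. The `sketch::cuda` and `sketch::mpi` namespaces, the `pool_name_holder` helper, and the pool name `"gpu-polling"` are assumptions made for illustration; only the default `"pika:polling"` comes from the commit message, and pika's real functions may live elsewhere and have different signatures.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

namespace sketch {
    namespace detail {
        // Holds one pool name; the default matches the commit message.
        class pool_name_holder
        {
        public:
            std::string get() const
            {
                std::lock_guard<std::mutex> l(mtx_);
                return name_;
            }

            void set(std::string name)
            {
                std::lock_guard<std::mutex> l(mtx_);
                name_ = std::move(name);
            }

        private:
            mutable std::mutex mtx_;
            std::string name_ = "pika:polling";
        };
    }    // namespace detail

    // Hypothetical per-library accessors, each with its own storage.
    namespace cuda {
        inline detail::pool_name_holder& holder()
        {
            static detail::pool_name_holder h;
            return h;
        }
        inline std::string get_pool_name() { return holder().get(); }
        inline void set_pool_name(std::string name) { holder().set(std::move(name)); }
    }    // namespace cuda

    namespace mpi {
        inline detail::pool_name_holder& holder()
        {
            static detail::pool_name_holder h;
            return h;
        }
        inline std::string get_pool_name() { return holder().get(); }
        inline void set_pool_name(std::string name) { holder().set(std::move(name)); }
    }    // namespace mpi
}    // namespace sketch

// Both start on the shared default; changing one leaves the other untouched.
bool demo_independent_pool_names()
{
    bool const same_default = sketch::cuda::get_pool_name() == "pika:polling" &&
        sketch::mpi::get_pool_name() == "pika:polling";

    sketch::cuda::set_pool_name("gpu-polling");    // hypothetical pool name

    bool const independent = sketch::cuda::get_pool_name() == "gpu-polling" &&
        sketch::mpi::get_pool_name() == "pika:polling";
    return same_default && independent;
}
```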

Completed events are added to a lock-free queue and any thread(s) can
invoke the completion(s), whilst another thread is still polling for
the actual ready state under a lock.
@aurianer
Contributor

aurianer commented Mar 7, 2023

bors try

bors bot added a commit that referenced this pull request Mar 7, 2023
@bors
Contributor

bors bot commented Mar 7, 2023

try

Build failed:

msimberg added this to the 0.14.0 milestone Mar 7, 2023
@msimberg
Contributor

msimberg commented Mar 7, 2023

bors try

bors bot added a commit that referenced this pull request Mar 7, 2023
@bors
Contributor

bors bot commented Mar 7, 2023

try

Build failed:

@aurianer
Contributor

aurianer commented Mar 8, 2023

bors merge

bors bot added a commit that referenced this pull request Mar 8, 2023
609: Improve CUDA polling r=aurianer a=biddisco

Co-authored-by: John Biddiscombe <biddisco@cscs.ch>
@bors
Contributor

bors bot commented Mar 8, 2023

Build failed:

@msimberg
Contributor

msimberg commented Mar 8, 2023

bors merge

@bors
Contributor

bors bot commented Mar 8, 2023

bors bot merged commit 78e0435 into main Mar 8, 2023
bors bot deleted the cuda_polling branch March 8, 2023 09:42
4 participants