Improve CUDA polling #609

biddisco · 2023-02-23T08:51:13Z

Redesign the way polling for cuda events is handled

CUDA events are polled by any thread on the pool on which polling is enabled, and are passed to a lock-free queue once they are ready. The polling loop first checks the ready queue and invokes the callbacks of ready events; only then does it take the lock and check the outstanding events, placing those that have completed on the ready queue.
This means that as soon as events are ready, any thread can invoke their callbacks: a single polling thread can find that N events are ready and place them in the ready queue, and N other threads can start handling the completions, instead of only the polling thread being allowed to handle them.

The locking and completion handling have been reworked significantly and give much better results.
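
For readers who want a concrete picture of the two-stage loop described above, here is a minimal sketch in plain C++. It is not the code in this PR: `fake_event`, `event_poller` and all member names are hypothetical, the event is simulated with an atomic flag rather than a real CUDA event, a mutex-protected `std::queue` stands in for the lock-free queue, and the use of `try_to_lock` for the outstanding-event list is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA event. Real code would query the event
// (for example with cudaEventQuery) instead of reading a flag.
struct fake_event
{
    std::atomic<bool> done{false};
};

using callback = std::function<void()>;

class event_poller
{
public:
    // Register an event together with the callback to run once it completes.
    void add(std::shared_ptr<fake_event> ev, callback cb)
    {
        std::lock_guard<std::mutex> lk(pending_mtx_);
        pending_.emplace_back(std::move(ev), std::move(cb));
    }

    // Intended to be called repeatedly by any worker thread.
    // Returns the number of callbacks invoked by this call.
    std::size_t poll()
    {
        // Stage 1: invoke callbacks of events already known to be ready.
        std::size_t invoked = drain_ready();

        // Stage 2 (assumption: try-lock): at most one thread checks the
        // outstanding events; a thread that fails to get the lock returns.
        std::unique_lock<std::mutex> lk(pending_mtx_, std::try_to_lock);
        if (lk.owns_lock())
        {
            for (auto it = pending_.begin(); it != pending_.end();)
            {
                if (it->first->done.load(std::memory_order_acquire))
                {
                    push_ready(std::move(it->second));
                    it = pending_.erase(it);
                }
                else
                {
                    ++it;
                }
            }
        }
        return invoked;
    }

private:
    void push_ready(callback cb)
    {
        std::lock_guard<std::mutex> lk(ready_mtx_);
        ready_.push(std::move(cb));
    }

    std::size_t drain_ready()
    {
        std::size_t n = 0;
        for (;;)
        {
            callback cb;
            {
                std::lock_guard<std::mutex> lk(ready_mtx_);
                if (ready_.empty()) { break; }
                cb = std::move(ready_.front());
                ready_.pop();
            }
            cb();    // no lock is held while the callback runs
            ++n;
        }
        return n;
    }

    std::mutex pending_mtx_;
    std::vector<std::pair<std::shared_ptr<fake_event>, callback>> pending_;
    std::mutex ready_mtx_;    // stand-in: the PR describes a lock-free queue
    std::queue<callback> ready_;
};
```

In this sketch a completed event is moved to the ready queue by one call to `poll()` and its callback runs in stage 1 of a later call, possibly on a different thread, which is the property the description is after.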

pika-bot · 2023-02-23T09:03:40Z

Performance test report

pika Performance

Comparison

BENCHMARK	NO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch	--

Info

Property	Before	After
pika Datetime	2022-09-16T08:18:06+00:00	2023-02-23T08:51:15+00:00
pika Commit	`190f189`	`41b75b0`
Compiler	/apps/daint/SSL/pika/spack/lib/spack/env/clang/clang++ 11.0.1	/apps/daint/SSL/pika/spack/lib/spack/env/clang/clang++ 11.0.1
Hostname	nid00074	nid00025
Envfile
Clustername	daint	daint
Datetime	2022-09-16T10:25:01.976661+02:00	2023-02-23T10:03:21.247414+01:00

Explanation of Symbols

Symbol	MEANING
=	No performance change (confidence interval within ±1%)
(=)	Probably no performance change (confidence interval within ±2%)
(+)/(-)	Very small performance improvement/degradation (≤1%)
+/-	Small performance improvement/degradation (≤5%)
++/--	Large performance improvement/degradation (≤10%)
+++/---	Very large performance improvement/degradation (>10%)
?	Probably no change, but quite large uncertainty (confidence interval within ±5%)
??	Unclear result, very large uncertainty (±10%)
???	Something unexpected…

libs/pika/async_cuda/include/pika/async_cuda/cuda_polling_helper.hpp

msimberg · 2023-02-23T11:01:21Z

bors try

bors · 2023-02-23T11:06:17Z

try

Build failed:

github/linux/hip/fast

pika-bot · 2023-03-06T16:36:47Z

Performance test report

pika Performance

Comparison

BENCHMARK	NO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch	--

Info

Property	Before	After
pika Datetime	2022-09-16T08:18:06+00:00	2023-03-06T16:13:45+00:00
pika Commit	`190f189`	`40b9399`
Hostname	nid00074	nid00629
Envfile
Clustername	daint	daint
Datetime	2022-09-16T10:25:01.976661+02:00	2023-03-06T17:36:16.002686+01:00
Compiler	/apps/daint/SSL/pika/spack/lib/spack/env/clang/clang++ 11.0.1	/apps/daint/SSL/pika/spack/lib/spack/env/clang/clang++ 11.0.1

When polling for ready events, push continuations/status onto a temporary vector whilst the lock is held, then invoke the continuations outside of the lock, to allow other threads to process events whilst our continuations are running.
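
The pattern in this commit message can be sketched as follows. This is an illustration only, assuming a plain `std::mutex` and hypothetical names (`pending_event`, `locked_event_list`); the readiness test is a callable rather than a real CUDA query.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical event record: a readiness test, a continuation and the status
// value that is passed to the continuation.
struct pending_event
{
    std::function<bool()> is_ready;
    std::function<void(int)> continuation;
    int status;
};

class locked_event_list
{
public:
    void add(pending_event ev)
    {
        std::lock_guard<std::mutex> lk(mtx_);
        events_.push_back(std::move(ev));
    }

    std::size_t size()
    {
        std::lock_guard<std::mutex> lk(mtx_);
        return events_.size();
    }

    // Returns the number of continuations invoked.
    std::size_t poll()
    {
        // Temporary vector of (continuation, status) pairs.
        std::vector<std::pair<std::function<void(int)>, int>> ready;
        {
            // The lock is held only while the event list is inspected
            // and modified.
            std::lock_guard<std::mutex> lk(mtx_);
            for (auto it = events_.begin(); it != events_.end();)
            {
                if (it->is_ready())
                {
                    ready.emplace_back(
                        std::move(it->continuation), it->status);
                    it = events_.erase(it);
                }
                else
                {
                    ++it;
                }
            }
        }
        // The lock has been released: other threads can poll, and a
        // continuation may itself call add() without deadlocking.
        for (auto& r : ready) { r.first(r.second); }
        return ready.size();
    }

private:
    std::mutex mtx_;
    std::vector<pending_event> events_;
};
```

A continuation that registers a follow-up event on the same list is the case that would deadlock if continuations ran under this non-recursive lock, so it makes a useful check for the sketch.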

aurianer · 2023-03-07T07:22:24Z

bors try

bors · 2023-03-07T07:38:00Z

try

Build failed:

github/macos/debug

msimberg · 2023-03-07T15:15:07Z

bors try

bors · 2023-03-07T15:28:23Z

try

Build failed:

jenkins/cscs-daint/gcc-11-debug

aurianer · 2023-03-08T08:10:55Z

bors merge

609: Improve CUDA polling r=aurianer a=biddisco

Redesign the way polling for cuda events is handled

Cuda events are polled (by any thread on the pool on which polling is enabled) and passed to a lockfree queue when ready. The polling loop first checks ready events and invokes callbacks, and only then takes the lock and checks outstanding events which are placed on the ready queue.
This means that as soon as events are ready, any thread can invoke the callback - a single polling thread can find N events are ready and place them in the ready queue and N other threads can start handling the completions - instead of only allowing the polling thread to handle them.

The locking and completion handling has been reworked significantly and gives much better results.

Co-authored-by: John Biddiscombe <biddisco@cscs.ch>

bors · 2023-03-08T08:25:47Z

Build failed:

ci/circleci: arm64_build

msimberg · 2023-03-08T08:46:57Z

bors merge

bors · 2023-03-08T09:42:07Z

Build succeeded:

biddisco requested review from aurianer and msimberg as code owners February 23, 2023 08:51

msimberg requested changes Feb 23, 2023

libs/pika/async_cuda/include/pika/async_cuda/cuda_polling_helper.hpp (outdated)

bors bot added a commit that referenced this pull request Feb 23, 2023

Try #609:

af919d5

msimberg changed the title ~~Cuda polling~~ Improve CUDA polling Feb 23, 2023

msimberg assigned biddisco Feb 23, 2023

biddisco force-pushed the cuda_polling branch from eba0687 to 8c4e0fd March 6, 2023 16:13

biddisco added 10 commits March 7, 2023 08:00

Only process one cuda event/continuation per poll

feac9c5

To prevent serialization of continuations, only handle one successful event on each test. This allows another thread to immediately poll and process another event, rather than a single thread having N continuations queued up.

Keep polling until no ready events are available

1badce2

Instead of exiting after handling an event, reenter the polling loop and look for another one. Do not hold the lock except when modifying the event vectors etc.
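
Taken together, the two commits above suggest a loop of roughly the following shape. This is a sketch under assumptions, not the PR's code: `single_step_poller` and its members are hypothetical, and readiness is simulated with a shared atomic flag.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

class single_step_poller
{
public:
    using flag = std::shared_ptr<std::atomic<bool>>;

    void add(flag ready, std::function<void()> cont)
    {
        std::lock_guard<std::mutex> lk(mtx_);
        events_.emplace_back(std::move(ready), std::move(cont));
    }

    std::size_t pending()
    {
        std::lock_guard<std::mutex> lk(mtx_);
        return events_.size();
    }

    // Keep polling until no ready event is found. Each pass takes at most
    // one event, and its continuation runs with the lock released.
    std::size_t poll()
    {
        std::size_t handled = 0;
        std::function<void()> cont;
        while (take_one(cont))
        {
            cont();
            ++handled;
        }
        return handled;
    }

private:
    // Removes at most one ready event; the lock is held only in here,
    // while the event vector is inspected and modified.
    bool take_one(std::function<void()>& out)
    {
        std::lock_guard<std::mutex> lk(mtx_);
        for (auto it = events_.begin(); it != events_.end(); ++it)
        {
            if (it->first->load(std::memory_order_acquire))
            {
                out = std::move(it->second);
                events_.erase(it);
                return true;
            }
        }
        return false;
    }

    std::mutex mtx_;
    std::vector<std::pair<flag, std::function<void()>>> events_;
};
```

Because a thread never holds more than one extracted continuation at a time, a second thread entering `poll()` while the first is running a continuation can pick up the next ready event immediately.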

cuda polling by default will use the same pool as mpi

589e8cd

Both async_cuda and async_mpi have get/set pool name functions, which can be used to tell each library independently which pool to use for polling; by default, both use "pika:polling" (formerly "pika:mpi").
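
As a rough illustration of what "independently" means here, the sketch below keeps one setting per library, both starting from the same default. The namespace `sketch` and the names `pool_name_setting`, `cuda_polling_pool` and `mpi_polling_pool` are hypothetical and are not pika's API; the default string is the one stated in this commit message, and later commits in this PR change the CUDA default again.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

namespace sketch {

    // Hypothetical thread-safe holder for the name of a polling pool.
    class pool_name_setting
    {
    public:
        explicit pool_name_setting(std::string initial)
          : name_(std::move(initial))
        {
        }

        std::string get() const
        {
            std::lock_guard<std::mutex> lk(mtx_);
            return name_;
        }

        void set(std::string name)
        {
            std::lock_guard<std::mutex> lk(mtx_);
            name_ = std::move(name);
        }

    private:
        mutable std::mutex mtx_;
        std::string name_;
    };

    // One setting per library, sharing the same default value.
    inline pool_name_setting& cuda_polling_pool()
    {
        static pool_name_setting s("pika:polling");
        return s;
    }

    inline pool_name_setting& mpi_polling_pool()
    {
        static pool_name_setting s("pika:polling");
        return s;
    }
}    // namespace sketch
```

Changing one setting, for example `sketch::cuda_polling_pool().set("gpu-polling")` with a made-up pool name, leaves the other at its default.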

allow set/get name for pool used by cuda event polling

7fb873b

Cleanup cuda debug output

251259f

Redesign polling completion invocation so all threads can help

226dcee

Completed events are added to a lockfree queue and any thread(s) can invoke the completion(s), whilst another thread is still polling for the actual ready state under a lock.

Invoke ready callbacks at start and end of polling function

6ec2d87

Use resource manager default pool name instead of hardcoded one

867b023

Reset cuda polling pool back to "default"

ede5ed4

biddisco force-pushed the cuda_polling branch from 8c4e0fd to ede5ed4 March 7, 2023 07:00

bors bot added a commit that referenced this pull request Mar 7, 2023

Try #609:

71fa62f

msimberg approved these changes Mar 7, 2023

msimberg added this to the 0.14.0 milestone Mar 7, 2023

bors bot added a commit that referenced this pull request Mar 7, 2023

Try #609:

ec07877

bors bot merged commit 78e0435 into main Mar 8, 2023

bors bot deleted the cuda_polling branch March 8, 2023 09:42
