Switch to task-focused synchronization model #374

jpsamaroo · 2023-02-05T21:00:47Z

This PR switches AMDGPU to a task-focused synchronization model, similar to CUDA.jl. In such a model, tasks have their own device and queue assigned, and all GPU operations on that task are launched in serial. That is to say, kernels don't start until any previously-launched kernels have completed. It is thus no longer necessary to wait on the result of @roc, and instead, one can call AMDGPU.synchronize() on the host to wait for all currently-executing kernels (launched from this task) to complete.
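The in-order behavior described above can be illustrated with a small CPU-only model. This is a minimal sketch, not AMDGPU.jl's implementation: `ToyQueue`, `launch!`, and `toy_synchronize` are hypothetical names, and plain Julia tasks stand in for GPU kernels.

```julia
# Toy model of an in-order queue. All names here are hypothetical and are
# not part of AMDGPU.jl; Julia tasks stand in for GPU kernels.
mutable struct ToyQueue
    last::Task  # most recently launched piece of work
end
ToyQueue() = ToyQueue(@async nothing)

# Launch `f` so that it starts only after the previous launch has finished.
function launch!(f, q::ToyQueue)
    prev = q.last
    q.last = @async begin
        wait(prev)
        f()
    end
    return nothing
end

# Block the caller until everything launched so far has completed.
toy_synchronize(q::ToyQueue) = wait(q.last)

q = ToyQueue()
order = Int[]
launch!(q) do
    sleep(0.05)      # a slow "kernel"
    push!(order, 1)
end
launch!(q) do
    push!(order, 2)  # does not start before the first launch finishes
end
toy_synchronize(q)   # a single wait covers both launches
```

Because each launch waits on its predecessor, the caller never needs to hold on to individual launch results; one call to the synchronize function is enough.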

There are many reasons why we'd want to switch to this model:

- The rest of the ecosystem generally expects this, as CUDA.jl pioneered this approach
- HSA queues are actually in-order, so there's no reason to pretend that they can execute work concurrently
- It simplifies most common forms of GPU programming, as users are used to thinking about serial computations
- The programming model approximately matches how Julia tasks execute

The existing wait-focused model is still fully available, so for the most part behavior remains the same (assuming users were previously using only one task to launch GPU kernels). The previous single-queue behavior can be enabled by calling AMDGPU.Runtime.set_alloc_queue_pool_max!([1,0,0]), which shares a single queue per device and disables the queue pooling logic for low- and high-priority queues.

Todo:

vchuravy

Love it.

luraess · 2023-02-05T21:08:21Z

That's awesome, @jpsamaroo! Thanks, we will look into this with @omlins and @utkinis to finalise AMDGPU support in ImplicitGlobalGrid.jl and ParallelStencil.jl.

Add HIPDevice and HIPStream abstractions
Add `synchronize(::HIPStream)`

Stores a task-local device, queue, context, and stream via `task_local_storage`. Task state can be accessed with `task_local_state`, and stored with `task_local_state!`. The `task_local_state!` function has a functional form for locally-scoped task state changes.

`device()`, `queue()`, `context()`, and `stream()` are available to query TLS for their appropriate values, and `device!()`, `queue!()`, and `stream!()` can be used to set a new value in TLS. Additionally, queue and stream priorities are accessible with `priority()` and settable with `priority!()`.

`default_queue()` now forwards to `queue()`, and is deprecated. `default_device()` and friends still work globally, but users should prefer to use `device()` and `device!()` instead.

There is no longer a single default queue per device; instead, queues are pooled per-device and are assigned to tasks in a round-robin fashion. Separate pool sizes are configurable on a per-queue-priority basis.

Active kernel tracking is now handled by a per-queue background task which waits on and cleans up the active kernel list (which is now moved into the `ROCQueue` object).

A new function `synchronize()` is available to wait on all currently-active kernels on the current queue and/or stream to complete.
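The combination of task-local state and round-robin queue pooling can be sketched with Base Julia alone. This is only an illustration under stated assumptions: `ToyPool`, `next_queue!`, `POOL`, and `toy_queue` are hypothetical names, symbols stand in for real queues, and priorities are not modeled. The only real API used is Base's `task_local_storage`, including its functional form.

```julia
# Hypothetical stand-ins for illustration; these are not AMDGPU.jl types.
struct ToyPool
    queues::Vector{Symbol}       # symbols stand in for real queues
    next::Threads.Atomic{Int}    # number of queues handed out so far
end
ToyPool(names) = ToyPool(collect(names), Threads.Atomic{Int}(0))

# Hand out pooled queues in round-robin order.
function next_queue!(pool::ToyPool)
    i = Threads.atomic_add!(pool.next, 1)  # returns the previous value
    return pool.queues[mod1(i + 1, length(pool.queues))]
end

const POOL = ToyPool([:q1, :q2])

# A task picks up a queue on first use and keeps it afterwards.
toy_queue() = get!(() -> next_queue!(POOL), task_local_storage(), :toy_queue)

# Two tasks receive different queues; one task always sees the same queue.
first_task = fetch(@async begin
    (toy_queue(), toy_queue())
end)
second_task = fetch(@async begin
    toy_queue()
end)

# Functional form: the override applies only inside the `do` block, and the
# previous task-local value (or its absence) is restored afterwards.
scoped = task_local_storage(:toy_queue, :q_special) do
    toy_queue()
end
```

Keeping the assignment in task-local storage means code running in a task never has to pass its queue around explicitly, while the functional form gives a scoped way to substitute a different one temporarily.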

Switch library handles to be allocated per-task
Copy HandleCache mechanism from CUDA.jl
Always include library wrapper code

Also use `queue` instead of `default_queue` to avoid triggering depwarn.

- Use current default device when creating HIPStream from raw hipStream_t, since there appears to be no way to get the device from just hipStream_t.
- Use passed device in TLS for HIPStream creation.

- Construct HIPStream using device from current HIPContext.
- Add docs clarifying distinction between `get_default_device()` & `device()`.

This linked list is append-only and singly-linked, and allows multiple threads to concurrently read from it, while one task may advance the list head serially.
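A list with these properties might look roughly like the sketch below. It is an assumption-laden illustration, not the `LinkedList` added in this PR: `Node`, `ToyList`, `snapshot`, and `advance!` are hypothetical names, and for simplicity both appending and head advancement take a lock while only reads are lock-free. It relies on per-field `@atomic`, which requires Julia 1.7 or newer.

```julia
# Hypothetical sketch; not the LinkedList implementation from this PR.
mutable struct Node{T}
    value::T
    @atomic next::Union{Node{T}, Nothing}
end

mutable struct ToyList{T}
    @atomic head::Union{Node{T}, Nothing}
    @atomic tail::Union{Node{T}, Nothing}
    lock::ReentrantLock  # serializes writers; readers never take it
end
ToyList{T}() where {T} = ToyList{T}(nothing, nothing, ReentrantLock())

# Append a value at the tail.
function Base.push!(list::ToyList{T}, value) where {T}
    node = Node{T}(convert(T, value), nothing)
    lock(list.lock) do
        tail = @atomic list.tail
        if tail === nothing
            @atomic list.head = node
        else
            @atomic tail.next = node
        end
        @atomic list.tail = node
    end
    return list
end

# Lock-free read: walk the `next` pointers starting from the current head.
function snapshot(list::ToyList{T}) where {T}
    out = T[]
    node = @atomic list.head
    while node !== nothing
        push!(out, node.value)
        node = @atomic node.next
    end
    return out
end

# Advance the head past the first node and return its value (or `nothing`).
function advance!(list::ToyList)
    lock(list.lock) do
        head = @atomic list.head
        head === nothing && return nothing
        next = @atomic head.next
        @atomic list.head = next
        if next === nothing
            @atomic list.tail = nothing
        end
        return head.value
    end
end
```

Nodes are never modified after they are unlinked, so a reader that already holds a reference to an old head can still finish its traversal safely.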

- Uses the new `LinkedList` in place of `Vector` for active kernel tracking, which allows mostly-lockless concurrent access to the active kernel set, often with no or minimal allocations.
- Propagates errors from the current or prior kernels in `synchronize` by default, and adds an `errors::Bool` kwarg to `synchronize` to allow skipping error checking if desired.
- Disables all kernel resource cleanup during `wait`, and instead moves this fully into the kernel monitor.
- Also fixes a bug in the kernel monitor where a dead queue may not have all kernel resources cleaned up.
- Removes auto-reset of the queue in TLS, and instead provides `reset_dead_queue!()` to reset the queue if it's dead.
- Properly catches and ignores kernel errors in the queue monitor.
- Makes the `queue` field in `ROCQueue` no longer atomic, as we do not change it.
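The error-propagation behavior with an opt-out flag can be modeled on the CPU as follows. This is a hedged sketch only: `sync_all` is a hypothetical function, Julia tasks stand in for kernels, and the real `synchronize` may differ in which error it reports and how.

```julia
# Hypothetical model of "wait for everything, then optionally rethrow".
# `sync_all` is not AMDGPU.jl API; tasks stand in for GPU kernels.
function sync_all(tasks::Vector{Task}; errors::Bool=true)
    first_error = nothing
    for t in tasks
        try
            wait(t)
        catch err
            # Remember only the earliest failure; keep waiting on the rest.
            if first_error === nothing
                first_error = err
            end
        end
    end
    if errors && first_error !== nothing
        throw(first_error)
    end
    return nothing
end

tasks = [@async(1 + 1), @async(error("simulated kernel failure"))]

sync_all(tasks; errors=false)  # waits for both, ignores the failure

failed = try
    sync_all(tasks)            # default: the failure is rethrown
    false
catch err
    err isa TaskFailedException
end
```

Waiting on every task before rethrowing means no work is left running when the caller sees the error.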

jpsamaroo added the enhancement, performance, needs tests, and needs docs labels on Feb 5, 2023

vchuravy reviewed Feb 5, 2023

src/highlevel.jl (outdated, resolved)

src/tls.jl (outdated, resolved)

jpsamaroo force-pushed the jps/tls-queue branch 2 times, most recently from 5563c12 to ec06fe4, on February 21, 2023 02:20

jpsamaroo force-pushed the jps/tls-queue branch from ec06fe4 to 290abc8 on February 23, 2023 20:30

This was linked to issues Feb 23, 2023

State of queues and streams #337

Closed

rocBLAS: Remove old hand-wrapped code #384

Closed

pxl-th mentioned this pull request Mar 3, 2023

Migrate to KernelAbstractions.jl for handwritten kernels FluxML/NNlib.jl#479

Open

jpsamaroo force-pushed the jps/tls-queue branch 3 times, most recently from cbc005a to c43c447, on March 8, 2023 20:22

pxl-th mentioned this pull request Mar 15, 2023

StackOverflow Error on get_backend(cu(rand(3))) JuliaGPU/KernelAbstractions.jl#366

Open

pxl-th mentioned this pull request Mar 29, 2023

Update rocBLAS wrapper #402

Closed

jpsamaroo and others added 10 commits March 29, 2023 23:05

Improve HIP support

44752b2

Remove semi_safe_load

abe8324

CI: Limit to 1.9 runners

0317f60

rocBLAS: Remove old wrappers

26eb429

Add generic at-check and check

bf7ca6b

Improve ROCm library handle integrations

bde0913

rocBLAS: Generate new bindings

30c5952

Fix rocBLAS wrapper & refactor it

eea2364

Fix queue(device)

3877234

pxl-th force-pushed the jps/tls-queue branch from 6414dc0 to 3877234 on March 29, 2023 21:45

pxl-th added 2 commits March 30, 2023 00:59

Disable test

0587d3b

Query stream priority when creating from raw hipStream

8b0acba

pxl-th added 2 commits March 31, 2023 15:37

Refactor stream priority

9c309c5

Replace usage of default device with TLS device

1c3a0d1

jpsamaroo mentioned this pull request Mar 31, 2023

Rocsparse Support for AMDGPU.jl #298

Closed

pxl-th and others added 5 commits March 31, 2023 22:29

Simplify queue pooling

e7f6c23

runtime: Add a LinkedList implementation

e30e816

TLS: Don't auto-reset queue, add reset helper

73abab8

fixup! Refactor queue monitoring

eb0c16e

pxl-th reviewed Mar 31, 2023

src/runtime/queue.jl (outdated, resolved)

pxl-th added 6 commits April 1, 2023 14:48

Reset dead queue in tests after signal timeout error

6ee577f

Update docs a bit

8370e99

Minor refactor

c18d44d

Add more docs

2fc85b7

Improve type-stability

c90e6fa

Refactor

cb59349

pxl-th mentioned this pull request Apr 3, 2023

Update to KA 0.9 & remove runtime dispatches JuliaNeuralGraphics/Nerf.jl#9

Merged

jpsamaroo marked this pull request as ready for review April 3, 2023 15:15

pxl-th added 3 commits April 3, 2023 19:54

Add ROCQueue docs

bbf7b22

Build documentation in CI first

ff0dc48

Remove AMDGPU from docs project

6f82440

pxl-th force-pushed the jps/tls-queue branch 3 times, most recently from 05c41ed to 6f82440, on April 3, 2023 17:59

pxl-th and others added 2 commits April 3, 2023 21:30

Trigger CI

dd60a1d

Update SECRET_DOCUMENTER_KEY

6ba1613

jpsamaroo removed the needs tests and needs docs labels on Apr 3, 2023

pxl-th merged commit ed61beb into master Apr 3, 2023

jpsamaroo deleted the jps/tls-queue branch April 3, 2023 20:18

luraess mentioned this pull request Apr 4, 2023

unsafe_copy3d! requires 2^4 alignment #330

Open
