Implement stream-ordered allocations in the HAL. #9572

Open · 3 of 4 tasks
benvanik opened this issue Jun 22, 2022 · 0 comments
Assignees: benvanik
Labels: hal/api (IREE's public C hardware abstraction layer API), performance ⚡ (performance/optimization related work across the compiler and runtime)

benvanik commented Jun 22, 2022

The idea is to add two HAL device methods:

// Allocates a transient |out_buffer| of the given |allocation_size| after
// |wait_fence| is reached and signals |signal_fence| when it is ready for use.
// The contents of the buffer are initially undefined.
IREE_API_EXPORT iree_status_t iree_hal_device_queue_alloca(
    iree_hal_device_t* device, iree_hal_queue_affinity_t queue_affinity,
    iree_hal_fence_t* wait_fence, iree_hal_fence_t* signal_fence,
    iree_hal_buffer_params_t params, iree_device_size_t allocation_size,
    iree_hal_buffer_t** out_buffer);

// Deallocates a transient |buffer| after |wait_fence| is reached and signals
// |signal_fence| once the memory is available for reuse. The contents of the
// buffer are undefined immediately after the |wait_fence| is reached and must
// not be accessed by the host or device.
IREE_API_EXPORT iree_status_t iree_hal_device_queue_dealloca(
    iree_hal_device_t* device, iree_hal_queue_affinity_t queue_affinity,
    iree_hal_fence_t* wait_fence, iree_hal_fence_t* signal_fence,
    iree_hal_buffer_t* buffer);

The current stream.resource.alloca/dealloca ops would lower into these methods, using fences to model the !stream.timepoints. We aren't yet caching command buffers, but when we do the buffer would become a key that invalidates the cache (if the underlying allocation changes we'd need to re-record), while the dynamic offset would need to be passed into dispatches via a uniform buffer that we update in the per-submission primary command buffer. So for an invocation spanning wait_fence->signal_fence it'd look like:

iree_hal_device_queue_alloca(wait_fence, temp_fence_0, &transient_buffer);
if (iree_hal_buffer_allocated_buffer(cached_transient_buffer) !=
    iree_hal_buffer_allocated_buffer(transient_buffer)) {
  // Note: anything that could invalidate the command buffer should be keyed on here.
  rebuild_secondary_command_buffer(...);
}
new_uniforms[0] = iree_hal_buffer_byte_offset(transient_buffer);
begin_primary_command_buffer();
iree_hal_command_buffer_update_buffer(uniform_buffer, new_uniforms);
iree_hal_command_buffer_execute(secondary_command_buffer);
end_primary_command_buffer();
submit(temp_fence_0, primary_command_buffer, temp_fence_1);
iree_hal_device_queue_dealloca(temp_fence_1, signal_fence, transient_buffer);

Conceptually this tracks which allocations from the device pool are available and when: when an alloca is requested, the wait_fence is compared with the signal fences of all prior deallocations to find a compatible slot, similar to what a normal allocator does. If the pool is out of memory but the request could be serviced after some deallocations, the wait fence of the pending deallocation and the provided alloca wait fence can be joined. The new buffer subspan range is recorded with the signal fence (temp_fence_0 above) for future use and the subspan is immediately returned.

On deallocation the range is marked as unused when the dealloca wait fence is hit (temp_fence_1 above), indicating that there are no more live users; if needed the signal fence can be used to block execution, for example when defragmentation is required.

There are some details to work through, but it's effectively just live-range analysis using timeline semaphores. The exact policies we want are up to the program, the user, and the devices, so the only thing we can really prescribe is the correctness semantics: implementations are allowed to block and synchronize with the device at every alloca/dealloca if needed.
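As a rough illustration of that bookkeeping (hypothetical names and types standing in for the real HAL API; sketch only), each pool range just remembers the dealloca fence of its prior user so a later alloca can join against it:

#include <stdbool.h>
#include <stddef.h>

// Stand-ins for the real fence type/queries; sketch only.
typedef struct fence_t fence_t;
bool fence_is_reached(const fence_t* fence);
fence_t* fence_join(fence_t* a, fence_t* b);

typedef struct {
  size_t offset;         // byte offset of the range in the backing slab
  size_t length;         // byte length of the range
  fence_t* reuse_fence;  // dealloca wait fence of the prior user, if any
  bool in_use;
} pool_range_t;

// alloca: find a free range; if its prior user has not yet retired, join that
// user's dealloca fence with the caller's wait fence so the new consumer
// waits on both timelines. The caller records its signal fence on the range
// and returns the subspan immediately.
static pool_range_t* pool_alloca(pool_range_t* ranges, size_t range_count,
                                 size_t size, fence_t** inout_wait_fence) {
  for (size_t i = 0; i < range_count; ++i) {
    pool_range_t* range = &ranges[i];
    if (range->in_use || range->length < size) continue;
    if (range->reuse_fence && !fence_is_reached(range->reuse_fence)) {
      *inout_wait_fence = fence_join(*inout_wait_fence, range->reuse_fence);
    }
    range->in_use = true;
    return range;
  }
  return NULL;  // out of memory: join against a pending dealloca or block
}

// dealloca: the range becomes reusable once |wait_fence| is reached (no more
// live users); the signal fence could additionally gate defragmentation.
static void pool_dealloca(pool_range_t* range, fence_t* wait_fence) {
  range->reuse_fence = wait_fence;
  range->in_use = false;
}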

The actual implementation of the methods is up to the HAL backend; CUDA could use the native support by wrapping cuMemAllocFromPoolAsync (taking care to order it with the submission) while pretty much everything else can use our own implementations. The reason to use the CUDA implementation would be driver-level sharing across multiple devices and hosts, which we could achieve otherwise but it'd be trickier. We may still want to allow the use of ours for testing/ease of analysis.
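For example, a CUDA backend might map the two methods onto the stream-ordered pool APIs roughly like this (a sketch only: error handling and the fence-to-stream plumbing are elided, and the helper names are hypothetical):

#include <stddef.h>
#include <cuda.h>

// Allocation runs in stream order: the memory is live only after all prior
// work on |stream| (the wait fence, expressed as preceding waits enqueued on
// the stream) has completed.
static CUresult queue_alloca_cuda(CUmemoryPool pool, CUstream stream,
                                  size_t size, CUdeviceptr* out_ptr) {
  return cuMemAllocFromPoolAsync(out_ptr, size, pool, stream);
}

// Deallocation is also stream-ordered: the memory returns to the pool once
// all prior work on |stream| has retired, after which the signal fence can be
// recorded on the same stream.
static CUresult queue_dealloca_cuda(CUstream stream, CUdeviceptr ptr) {
  return cuMemFreeAsync(ptr, stream);
}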

There are several implementations we could have depending on the scenario. On the local non-bare-metal CPU we'd likely want to just reserve a big slab of virtual address space (many GB) and then serve out of that, committing/decommitting as needed; where that's unavailable we'd need to implement it with a tighter ringbuffer or something similar. The requirement is that we can allocate as much memory as needs to be live at any single point in time during execution, and by design we know that through the queue forward-progress guarantees we should be OK: if at any point we run out of memory we can just block until some deallocations retire (with the goal being to not do that).
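A minimal sketch of that ringbuffer fallback (illustrative only, omitting the fence tracking shown earlier): allocations and deallocations both happen in queue order, so the live ranges always form one contiguous, possibly wrapped, FIFO region.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
  size_t capacity;  // arena size in bytes
  size_t head;      // end of the newest live allocation
  size_t tail;      // start of the oldest live allocation
  bool empty;       // disambiguates head == tail (full vs. empty)
} ring_allocator_t;

// Returns false when the request cannot be satisfied yet; the caller then
// blocks until pending deallocations retire and retries (queue forward
// progress guarantees they eventually will).
static bool ring_alloca(ring_allocator_t* r, size_t size, size_t* out_offset) {
  if (size == 0 || size > r->capacity) return false;
  if (r->empty) {
    r->head = r->tail = 0;  // reset so the allocation starts at the base
  } else if (r->head == r->tail) {
    return false;  // full
  }
  if (r->tail <= r->head) {
    // Live region is [tail, head); free space is [head, capacity) + [0, tail).
    if (r->capacity - r->head >= size) {
      *out_offset = r->head;  // fits at the end of the arena
    } else if (r->tail >= size) {
      *out_offset = 0;        // wrap, leaving [head, capacity) as padding
    } else {
      return false;           // neither end has room
    }
  } else if (r->tail - r->head >= size) {
    *out_offset = r->head;    // fits in the wrapped gap [head, tail)
  } else {
    return false;
  }
  r->head = *out_offset + size;
  r->empty = false;
  return true;
}

// FIFO retirement: the oldest live allocation starts at (or after) |tail|, so
// moving |tail| past it also reclaims any wrap padding that preceded it.
static void ring_dealloca(ring_allocator_t* r, size_t offset, size_t size) {
  r->tail = offset + size;
  if (r->tail == r->head) r->empty = true;
}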

Major tasks:

  • Complete fences and submission behavior
  • New iree_hal_device_queue_alloca/dealloca methods
  • No-op implementation that just blocks and waits (same behavior as allocations today; see the sketch after this list)
  • Reference implementation using a basic ringbuffer (will work everywhere since no MMU is required)
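For the no-op implementation, something like the following should suffice (a sketch assuming the fence wait/signal helpers from the fence work above; the allocator call signature is approximate):

#include "iree/hal/api.h"

// Block the host until the producer timeline is reached, allocate
// synchronously, and signal readiness immediately; same observable behavior
// as allocations today.
static iree_status_t noop_queue_alloca(
    iree_hal_device_t* device, iree_hal_fence_t* wait_fence,
    iree_hal_fence_t* signal_fence, iree_hal_buffer_params_t params,
    iree_device_size_t allocation_size, iree_hal_buffer_t** out_buffer) {
  IREE_RETURN_IF_ERROR(
      iree_hal_fence_wait(wait_fence, iree_infinite_timeout()));
  // Allocate from the device allocator as today (signature approximate).
  IREE_RETURN_IF_ERROR(iree_hal_allocator_allocate_buffer(
      iree_hal_device_allocator(device), params, allocation_size, out_buffer));
  return iree_hal_fence_signal(signal_fence);
}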
@benvanik benvanik added performance ⚡ Performance/optimization related work across the compiler and runtime hal/api IREE's public C hardware abstraction layer API labels Jun 22, 2022
@benvanik benvanik self-assigned this Jun 22, 2022
benvanik added a commit that referenced this issue Jul 3, 2022

This is a heap-allocated set of semaphores and payload values that
performs timeline joining by default. It will allow us to expose a safe
immutable semaphore list to the user-facing API and VM while still being
a struct-of-arrays that we can directly pass to lower-level driver APIs.
In the future we can extend this struct to hold additional internal
tracking like semaphore timepoints for resource management.

Progress on #9572.

benvanik added a commit that referenced this issue Jul 3, 2022

Waiting is currently stubbed out and will come in future changes.

Progress on #9572.
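For reference, the struct-of-arrays shape described in that commit looks roughly like the following (paraphrased from the commit description; the exact field names are assumptions, not a verbatim copy of the header):

typedef struct iree_hal_semaphore_list_t {
  iree_host_size_t count;             // number of timepoints in the list
  iree_hal_semaphore_t** semaphores;  // parallel array of semaphores
  uint64_t* payload_values;           // parallel array of payloads to wait/signal
} iree_hal_semaphore_list_t;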
benvanik added a commit that referenced this issue Aug 2, 2022
Adding compiler/runtime support for lowering the asynchronous stream dialect ops into HAL ops, materializing a timeline (today just one, but multiple in the future), and passing through to the runtime HAL module. This allows for the removal of the existing placeholder submit_and_wait op and enables queue-ordered allocations to be implemented in the HAL.

This is likely not the final design but unblocks work on coroutines, queue-ordered allocations, WebGPU, and plumbing fences through the user-facing API/native ABI. Future refinements may add variants that use semaphores instead of fences to avoid fence heap allocations when not required; for most single-function classic ML models, once we plumb fences through the ABI, no internal fences are required. The current timeline materialization also strictly orders all invocations, where instead we should be able to elide that ordering when there's no internal program state to protect.

Because the various HAL backends all need work (CUDA/ROCm in particular need massive work), nearly everything is synchronized exactly as it was before, but that synchronization now happens in the IR we emit and we can selectively start supporting async per target.

Progress on #1285 (just need to put fences on the ABI!).
Progress on #8093 (added yieldable fence waits).
Progress on #9572 (added compiler/runtime glue for queue-ordered allocs).