Implement stream-ordered allocations in the HAL. #9572

Open · 3 of 4 tasks
benvanik opened this issue Jun 22, 2022 · 0 comments
Assignees: benvanik
Labels: hal/api (IREE's public C hardware abstraction layer API), performance ⚡ (performance/optimization related work across the compiler and runtime)

benvanik commented Jun 22, 2022

The idea is to add two HAL device methods:

// Allocates a transient |out_buffer| of the given |allocation_size| after
// |wait_fence| is reached and signals |signal_fence| when it is ready for use.
// The contents of the buffer are initially undefined.
IREE_API_EXPORT iree_status_t iree_hal_device_queue_alloca(
    iree_hal_device_t* device, iree_hal_queue_affinity_t queue_affinity,
    iree_hal_fence_t* wait_fence, iree_hal_fence_t* signal_fence,
    iree_hal_buffer_params_t params, iree_device_size_t allocation_size,
    iree_hal_buffer_t** out_buffer);

// Deallocates a transient |buffer| after |wait_fence| is reached and signals
// |signal_fence| once the memory is available for reuse. The contents of the
// buffer are undefined immediately after the |wait_fence| is reached and must
// not be accessed by the host or device.
IREE_API_EXPORT iree_status_t iree_hal_device_queue_dealloca(
    iree_hal_device_t* device, iree_hal_queue_affinity_t queue_affinity,
    iree_hal_fence_t* wait_fence, iree_hal_fence_t* signal_fence,
    iree_hal_buffer_t* buffer);

The current stream.resource.alloca/dealloca ops would lower into these methods, using fences to model the !stream.timepoints. We aren't yet caching command buffers, but when we do the buffer would become a key that invalidates the cache (if the underlying allocation changes we'd need to re-record), while the dynamic offset would need to be passed into dispatches via a uniform buffer that we update in the per-submission primary command buffer. So for an invocation spanning wait_fence->signal_fence it'd look like:

iree_hal_device_queue_alloca(wait_fence, temp_fence_0, &transient_buffer);
if (iree_hal_buffer_allocated_buffer(cached_transient_buffer) !=
    iree_hal_buffer_allocated_buffer(transient_buffer)) {
  // Note: anything that could invalidate the command buffer should be keyed on here.
  rebuild_secondary_command_buffer(...);
}
new_uniforms[0] = iree_hal_buffer_byte_offset(transient_buffer);
begin_primary_command_buffer();
iree_hal_command_buffer_update_buffer(uniform_buffer, new_uniforms);
iree_hal_command_buffer_execute(secondary_command_buffer);
end_primary_command_buffer();
submit(temp_fence_0, primary_command_buffer, temp_fence_1);
iree_hal_device_queue_dealloca(temp_fence_1, signal_fence, transient_buffer);

Conceptually this tracks which allocations from the device pool are available and when: when an alloca is requested, the wait_fence is compared with the signal fences of all prior deallocations to find a compatible slot, similar to what a normal allocator does. If the pool is out of memory but the request could be serviced after some deallocations, the wait fence of the pending deallocation and the provided alloca wait fence can be joined. The new buffer subspan range is recorded with the signal fence (temp_fence_0 above) for future use and the subspan is immediately returned.

On deallocation the range is marked as unused when the dealloca wait fence is hit (temp_fence_1 above), indicating that there are no more live users; if needed the signal fence can be used to block execution, for example when defragmentation is required.

There are some details to work through, but it's effectively just live-range analysis using timeline semaphores. The exact policies we want are up to the program, the user, and the devices, so the only thing we can really prescribe is the correctness semantics: implementations are allowed to block and synchronize with the device at every alloca/dealloca if needed.
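As a rough illustration of that bookkeeping (hypothetical names and types standing in for the real HAL API; sketch only), each pool range just remembers the dealloca fence of its prior user so a later alloca can join against it:

#include <stdbool.h>
#include <stddef.h>

// Stand-ins for the real fence type/queries; sketch only.
typedef struct fence_t fence_t;
bool fence_is_reached(const fence_t* fence);
fence_t* fence_join(fence_t* a, fence_t* b);

typedef struct {
  size_t offset;         // byte offset of the range in the backing slab
  size_t length;         // byte length of the range
  fence_t* reuse_fence;  // dealloca wait fence of the prior user, if any
  bool in_use;
} pool_range_t;

// alloca: find a free range; if its prior user has not yet retired, join that
// user's dealloca fence with the caller's wait fence so the new consumer
// waits on both timelines. The caller records its signal fence on the range
// and returns the subspan immediately.
static pool_range_t* pool_alloca(pool_range_t* ranges, size_t range_count,
                                 size_t size, fence_t** inout_wait_fence) {
  for (size_t i = 0; i < range_count; ++i) {
    pool_range_t* range = &ranges[i];
    if (range->in_use || range->length < size) continue;
    if (range->reuse_fence && !fence_is_reached(range->reuse_fence)) {
      *inout_wait_fence = fence_join(*inout_wait_fence, range->reuse_fence);
    }
    range->in_use = true;
    return range;
  }
  return NULL;  // out of memory: join against a pending dealloca or block
}

// dealloca: the range becomes reusable once |wait_fence| is reached (no more
// live users); the signal fence could additionally gate defragmentation.
static void pool_dealloca(pool_range_t* range, fence_t* wait_fence) {
  range->reuse_fence = wait_fence;
  range->in_use = false;
}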

The actual implementation of the methods is up to the HAL backend; CUDA could use the native support by wrapping cuMemAllocFromPoolAsync (taking care to order it with the submission) while pretty much everything else can use our own implementations. The reason to use the CUDA implementation would be driver-level sharing across multiple devices and hosts, which we could achieve otherwise but it'd be trickier. We may still want to allow the use of ours for testing/ease of analysis.
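For example, a CUDA backend might map the two methods onto the stream-ordered pool APIs roughly like this (a sketch only: error handling and the fence-to-stream plumbing are elided, and the helper names are hypothetical):

#include <stddef.h>
#include <cuda.h>

// Allocation runs in stream order: the memory is live only after all prior
// work on |stream| (the wait fence, expressed as preceding waits enqueued on
// the stream) has completed.
static CUresult queue_alloca_cuda(CUmemoryPool pool, CUstream stream,
                                  size_t size, CUdeviceptr* out_ptr) {
  return cuMemAllocFromPoolAsync(out_ptr, size, pool, stream);
}

// Deallocation is also stream-ordered: the memory returns to the pool once
// all prior work on |stream| has retired, after which the signal fence can be
// recorded on the same stream.
static CUresult queue_dealloca_cuda(CUstream stream, CUdeviceptr ptr) {
  return cuMemFreeAsync(ptr, stream);
}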

There are several implementations we could have depending on the scenario. On the local non-bare-metal CPU we'd likely want to just reserve a big slab of virtual address space (many GB) and then serve out of that, committing/decommitting as needed; where that's unavailable we'd need to implement it with a tighter ringbuffer or something similar. The requirement is that we can allocate as much memory as needs to be live at any single point in time during execution, and by design we know that through the queue forward-progress guarantees we should be OK: if at any point we run out of memory we can just block until some deallocations retire (with the goal being to not do that).
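A minimal sketch of that ringbuffer fallback (illustrative only, omitting the fence tracking shown earlier): allocations and deallocations both happen in queue order, so the live ranges always form one contiguous, possibly wrapped, FIFO region.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
  size_t capacity;  // arena size in bytes
  size_t head;      // end of the newest live allocation
  size_t tail;      // start of the oldest live allocation
  bool empty;       // disambiguates head == tail (full vs. empty)
} ring_allocator_t;

// Returns false when the request cannot be satisfied yet; the caller then
// blocks until pending deallocations retire and retries (queue forward
// progress guarantees they eventually will).
static bool ring_alloca(ring_allocator_t* r, size_t size, size_t* out_offset) {
  if (size == 0 || size > r->capacity) return false;
  if (r->empty) {
    r->head = r->tail = 0;  // reset so the allocation starts at the base
  } else if (r->head == r->tail) {
    return false;  // full
  }
  if (r->tail <= r->head) {
    // Live region is [tail, head); free space is [head, capacity) + [0, tail).
    if (r->capacity - r->head >= size) {
      *out_offset = r->head;  // fits at the end of the arena
    } else if (r->tail >= size) {
      *out_offset = 0;        // wrap, leaving [head, capacity) as padding
    } else {
      return false;           // neither end has room
    }
  } else if (r->tail - r->head >= size) {
    *out_offset = r->head;    // fits in the wrapped gap [head, tail)
  } else {
    return false;
  }
  r->head = *out_offset + size;
  r->empty = false;
  return true;
}

// FIFO retirement: the oldest live allocation starts at (or after) |tail|, so
// moving |tail| past it also reclaims any wrap padding that preceded it.
static void ring_dealloca(ring_allocator_t* r, size_t offset, size_t size) {
  r->tail = offset + size;
  if (r->tail == r->head) r->empty = true;
}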

Major tasks:

  • Complete fences and submission behavior
  • New iree_hal_device_queue_alloca/dealloca methods
  • No-op implementation that just blocks and waits (same behavior as allocations today; see the sketch after this list)
  • Reference implementation using a basic ringbuffer (will work everywhere since no MMU is required)
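For the no-op implementation, something like the following should suffice (a sketch assuming the fence wait/signal helpers from the fence work above; the allocator call signature is approximate):

#include "iree/hal/api.h"

// Block the host until the producer timeline is reached, allocate
// synchronously, and signal readiness immediately; same observable behavior
// as allocations today.
static iree_status_t noop_queue_alloca(
    iree_hal_device_t* device, iree_hal_fence_t* wait_fence,
    iree_hal_fence_t* signal_fence, iree_hal_buffer_params_t params,
    iree_device_size_t allocation_size, iree_hal_buffer_t** out_buffer) {
  IREE_RETURN_IF_ERROR(
      iree_hal_fence_wait(wait_fence, iree_infinite_timeout()));
  // Allocate from the device allocator as today (signature approximate).
  IREE_RETURN_IF_ERROR(iree_hal_allocator_allocate_buffer(
      iree_hal_device_allocator(device), params, allocation_size, out_buffer));
  return iree_hal_fence_signal(signal_fence);
}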
@benvanik benvanik added performance ⚡ Performance/optimization related work across the compiler and runtime hal/api IREE's public C hardware abstraction layer API labels Jun 22, 2022
@benvanik benvanik self-assigned this Jun 22, 2022
benvanik added a commit that referenced this issue Jul 3, 2022

This is a heap-allocated set of semaphores and payload values that
performs timeline joining by default. It will allow us to expose a safe
immutable semaphore list to the user-facing API and VM while still being
a struct-of-arrays that we can directly pass to lower-level driver APIs.
In the future we can extend this struct to hold additional internal
tracking like semaphore timepoints for resource management.

Progress on #9572.

benvanik added a commit that referenced this issue Jul 3, 2022

Waiting is currently stubbed out and will come in future changes.

Progress on #9572.
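For reference, the struct-of-arrays shape described in that commit looks roughly like the following (paraphrased from the commit description; the exact field names are assumptions, not a verbatim copy of the header):

typedef struct iree_hal_semaphore_list_t {
  iree_host_size_t count;             // number of timepoints in the list
  iree_hal_semaphore_t** semaphores;  // parallel array of semaphores
  uint64_t* payload_values;           // parallel array of payloads to wait/signal
} iree_hal_semaphore_list_t;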
benvanik added a commit that referenced this issue Aug 2, 2022
Adding compiler/runtime support for lowering the asynchronous stream dialect ops into HAL ops, materializing a timeline (today just one, but multiple in the future), and passing through to the runtime HAL module. This allows for the removal of the existing placeholder submit_and_wait op and enables queue-ordered allocations to be implemented in the HAL.

This is likely not the final design but unblocks work on coroutines, queue-ordered allocations, WebGPU, and plumbing fences through the user-facing API/native ABI. Future refinements may add variants that use semaphores instead of fences to avoid fence heap allocations when not required; for most single-function classic ML models, once we plumb fences through the ABI, no internal fences are required. The current timeline materialization also strictly orders all invocations, where instead we should be able to elide that ordering when there's no internal program state to protect.

Because the various HAL backends all need work (CUDA/ROCm in particular need massive work), nearly everything is synchronized exactly as it was before, but that synchronization now happens in the IR we emit and we can selectively start supporting async per target.

Progress on #1285 (just need to put fences on the ABI!).
Progress on #8093 (added yieldable fence waits).
Progress on #9572 (added compiler/runtime glue for queue-ordered allocs).