
Implement initial simple command buffer memoization. #7679

Closed
benvanik opened this issue Nov 17, 2021 · 1 comment
Labels: compiler/dialects (Relating to the IREE compiler dialects (flow, hal, vm)), performance ⚡ (Performance/optimization related work across the compiler and runtime)
benvanik commented Nov 17, 2021

There are a great many ways we can improve the compiler to reduce the runtime overhead of command buffer recording and of expensive driver-level optimizations (looking at you, CUDA graphs). The long-term goal is to segment command buffers by frequency of change and work hard to cache the parts that change less frequently: the head/tail of a command buffer usually differ based on I/O, but the body is often just dealing with buffers we internally allocate and control. Secondary command buffers help with this:

cache on first use/startup:
  command_buffer_begin(static_cmd, SECONDARY | REUSABLE)  // optimize for reuse
  command_buffer_dispatch(static_cmd, body0)
  command_buffer_dispatch(static_cmd, body1)
  command_buffer_end(static_cmd)

per invocation:
  command_buffer_begin(dynamic_cmd, PRIMARY | ONE_SHOT)  //  don't optimize/etc
  command_buffer_dispatch(dynamic_cmd, input_handler)
  command_buffer_execute(static_cmd)  // call out to prerecorded static command buffer
  command_buffer_dispatch(dynamic_cmd, output_handler)
  command_buffer_end(dynamic_cmd)
  device_submit(dynamic_cmd)

In the CUDA case we'd use streams for the dynamic_cmd and graphs for static_cmd and in Vulkan we'd just pass the bits to vkAllocateCommandBuffers. In the CPU case we'd have the fully constructed task system DAG baked out and ready for fast execution.

Things like dynamic shapes and multi-chunk ringbuffers can complicate this. In the new streams dialect IR we have a good place to do the high level splitting by partitioning the stream.cmd.execute ops based on which resources/dynamic parameters are used. Once we convert to HAL and actually record the !hal.command_buffer instances we can rely on the granularity being established and just the caching remaining.
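As a toy model of that partitioning step (purely illustrative — hypothetical names and data layout, not the actual streams-dialect pass), splitting a recorded op list into a dynamic head/tail and a memoizable static body might look like:

```python
# Sketch: partition ops into "static" (touch only internal resources) and
# "dynamic" (touch external I/O) runs, so the static body can be memoized
# separately. All names here are illustrative, not IREE APIs.

def partition_ops(ops, external_resources):
    """Split ops into (dynamic_head, static_body, dynamic_tail)."""
    touches = [any(r in external_resources for r in op["resources"])
               for op in ops]
    # First and last ops that use only internal resources.
    first = next((i for i, t in enumerate(touches) if not t), len(ops))
    last = len(ops) - next(
        (i for i, t in enumerate(reversed(touches)) if not t), len(ops))
    if first >= last or any(touches[first:last]):
        # No memoizable middle run, or the middle still touches external
        # I/O: fall back to recording everything dynamically.
        return ops, [], []
    return ops[:first], ops[first:last], ops[last:]
```

A more faithful pass would partition by the full use-def structure of the `stream.cmd.execute` op rather than a flat list, but the contiguous-run heuristic shows the head/body/tail shape described above.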

As a first cut, one idea is to add a hal.command_buffer.memoize op that we lower into from stream.cmd.execute:

%memoized_cmd = hal.command_buffer.memoize(%buffer as %capture0: !hal.buffer, %value as %capture1: f32, ....) -> !hal.command_buffer {
  %cmd = hal.command_buffer.create ...
  hal.command_buffer.begin<%cmd>
  hal.command_buffer.dispatch<%cmd>
  hal.command_buffer.end<%cmd>
  hal.yield %cmd
}
hal.device.submit<%device> %memoized_cmd

If we don't want to memoize, we just inline the region out and end up with exactly what we have today (create+begin+record+end+submit); but if we do, the captured operands dictate whether any two command buffers are the same. The logic to expand the memoize op would create weak globals for all the captured resources and insert the code to update/compare, like:

util.global private mutable @memoized_cmd_capture0 : util.weak<!hal.buffer>
util.global private mutable @memoized_cmd_capture1 : f32
util.global private mutable @memoized_cmd : !hal.command_buffer
func private @memoize_cmd(%capture0: !hal.buffer, %capture1: f32) -> !hal.command_buffer {
  %eq = ... cmp %capture0 to @memoized_cmd_capture0 && %capture1 to @memoized_cmd_capture1
  %result = scf.if %eq -> !hal.command_buffer {
    %memoized_cmd = util.global.load @memoized_cmd : !hal.command_buffer
    scf.yield %memoized_cmd
  } else {
    // the contents of the original region
    %new_cmd = hal.command_buffer.create ...
    hal.command_buffer.begin<%new_cmd>
    hal.command_buffer.dispatch<%new_cmd>
    hal.command_buffer.end<%new_cmd>
    util.global.store %capture0, @memoized_cmd_capture0
    util.global.store %capture1, @memoized_cmd_capture1
    util.global.store %new_cmd, @memoized_cmd
    scf.yield %new_cmd
  }
  return %result : !hal.command_buffer
}
...
  %memoized_cmd = call @memoize_cmd(%buffer, %value)
  hal.device.submit<%device> %memoized_cmd

This way, if any buffer (either a user-provided one or our internal ringbuffers) or parameter (dynamic shape dims/etc) changes, we regenerate the command buffer. Moving dynamic push constants to uniform buffers would let us remove the invalidation on parameter changes and make dynamic shapes (mostly) work. Adding a small LRU could help with heavy code reuse that has different buffers. Doing the splitting based on frequency would let us remove captures that are likely to change (user input buffers/etc) so that the bulk is invalidated less frequently. Even so, the simple approach above would work well with most models, where the user provides consistent input buffers and shapes.
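The capture-compare-regenerate scheme, extended with the small LRU mentioned above, can be sketched in Python (a model of the semantics only — class and method names are made up, and the real thing lives in lowered IR and the runtime):

```python
from collections import OrderedDict

# Sketch of the invalidation scheme described above: a command buffer is
# re-recorded whenever any captured operand changes, and a small LRU keeps
# a few recent variants around for workloads that alternate between buffer
# sets. All names are illustrative; this is not IREE runtime code.

class CommandBufferCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # capture key -> recorded command buffer

    def get_or_record(self, captures, record_fn):
        # Key on buffer identity and scalar parameter values, mirroring the
        # "compare captured operands" check in the expanded memoize op.
        key = tuple(c if isinstance(c, (int, float)) else id(c)
                    for c in captures)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        cmd = record_fn()  # cache miss: re-record (create+begin+...+end)
        self.entries[key] = cmd
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return cmd
```

With `capacity=1` this degenerates to exactly the single weak-global scheme in the IR above; larger capacities cover the "same code, different buffers" reuse case.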

This approach does require weak references in the VM (#6909) as we don't want to hang on to buffers and keep them live just because we reference them in our cache.
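The weak-reference requirement can be illustrated with Python's `weakref` (purely to show the intended semantics; the actual mechanism is VM-level per #6909, and the class names here are hypothetical):

```python
import weakref

# Sketch: cache a recorded command buffer keyed on a *weak* reference to the
# captured buffer, so the cache entry never keeps the buffer alive on its
# own. Names are illustrative only.

class Buffer:  # stand-in for a device buffer object
    pass

class WeakCaptureCache:
    def __init__(self):
        self._capture = None   # weakref to the captured buffer, if any
        self._cmd = None       # memoized command buffer

    def lookup(self, buffer):
        # A dead weakref dereferences to None, so a freed buffer can never
        # match and the stale command buffer is simply not returned.
        captured = self._capture() if self._capture is not None else None
        return self._cmd if captured is buffer else None

    def store(self, buffer, cmd):
        self._capture = weakref.ref(buffer)
        self._cmd = cmd
```

The key property is that dropping the last strong reference to the buffer invalidates the cache entry automatically, instead of the cache pinning the buffer (and its device memory) forever.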

(hal.command_buffer.memoize could just become util.memoize, as that would be useful for other things as well like descriptor sets)

benvanik added the compiler/dialects and performance ⚡ labels on Nov 17, 2021
benvanik commented:
Duplicate of #10144.

benvanik closed this as not planned (duplicate) on Jul 11, 2024