
Implement initial simple command buffer memoization. #7679

Closed
benvanik opened this issue Nov 17, 2021 · 1 comment
Labels: compiler/dialects (Relating to the IREE compiler dialects (flow, hal, vm)), performance ⚡ (Performance/optimization related work across the compiler and runtime)
benvanik commented Nov 17, 2021

There are a great many ways we can improve the compiler to reduce the runtime overhead of command buffer recording and of expensive driver-level optimizations (looking at you, CUDA graphs). The long-term goal is to segment command buffers by frequency of change and work hard to cache the parts that change less frequently: the head/tail of a command buffer usually differ based on I/O, but the body is often just dealing with buffers we internally allocate and control. Secondary command buffers help with this:

cache on first use/startup:
  command_buffer_begin(static_cmd, SECONDARY | REUSABLE)  // optimize for reuse
  command_buffer_dispatch(static_cmd, body0)
  command_buffer_dispatch(static_cmd, body1)
  command_buffer_end(static_cmd)

per invocation:
  command_buffer_begin(dynamic_cmd, PRIMARY | ONE_SHOT)  //  don't optimize/etc
  command_buffer_dispatch(dynamic_cmd, input_handler)
  command_buffer_execute(static_cmd)  // call out to prerecorded static command buffer
  command_buffer_dispatch(dynamic_cmd, output_handler)
  command_buffer_end(dynamic_cmd)
  device_submit(dynamic_cmd)

In the CUDA case we'd use streams for the dynamic_cmd and graphs for static_cmd and in Vulkan we'd just pass the bits to vkAllocateCommandBuffers. In the CPU case we'd have the fully constructed task system DAG baked out and ready for fast execution.

Things like dynamic shapes and multi-chunk ringbuffers can complicate this. In the new streams dialect IR we have a good place to do the high level splitting by partitioning the stream.cmd.execute ops based on which resources/dynamic parameters are used. Once we convert to HAL and actually record the !hal.command_buffer instances we can rely on the granularity being established and just the caching remaining.
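As a toy model of that partitioning step (purely illustrative — hypothetical names and data layout, not the actual streams-dialect pass), splitting a recorded op list into a dynamic head/tail and a memoizable static body might look like:

```python
# Sketch: partition ops into "static" (touch only internal resources) and
# "dynamic" (touch external I/O) runs, so the static body can be memoized
# separately. All names here are illustrative, not IREE APIs.

def partition_ops(ops, external_resources):
    """Split ops into (dynamic_head, static_body, dynamic_tail)."""
    touches = [any(r in external_resources for r in op["resources"])
               for op in ops]
    # First and last ops that use only internal resources.
    first = next((i for i, t in enumerate(touches) if not t), len(ops))
    last = len(ops) - next(
        (i for i, t in enumerate(reversed(touches)) if not t), len(ops))
    if first >= last or any(touches[first:last]):
        # No memoizable middle run, or the middle still touches external
        # I/O: fall back to recording everything dynamically.
        return ops, [], []
    return ops[:first], ops[first:last], ops[last:]
```

A more faithful pass would partition by the full use-def structure of the `stream.cmd.execute` op rather than a flat list, but the contiguous-run heuristic shows the head/body/tail shape described above.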

As a first cut, one idea is to add a hal.command_buffer.memoize op that we lower into from stream.cmd.execute:

%memoized_cmd = hal.command_buffer.memoize(%buffer as %capture0: !hal.buffer, %value as %capture1: f32, ....) -> !hal.command_buffer {
  %cmd = hal.command_buffer.create ...
  hal.command_buffer.begin<%cmd>
  hal.command_buffer.dispatch<%cmd>
  hal.command_buffer.end<%cmd>
  hal.yield %cmd
}
hal.device.submit<%device> %memoized_cmd

If we don't want to memoize, we just inline the region out and end up with exactly what we have today (create+begin+record+end+submit); but if we do, the captured operands dictate whether any two command buffers are the same. The logic to expand the memoize op would create weak globals for all the captured resources and insert the code to update/compare, like:

util.global private mutable @memoized_cmd_capture0 : util.weak<!hal.buffer>
util.global private mutable @memoized_cmd_capture1 : f32
util.global private mutable @memoized_cmd : !hal.command_buffer
func private @memoize_cmd(%capture0: !hal.buffer, %capture1: f32) -> !hal.command_buffer {
  %eq = ... cmp %capture0 to @memoized_cmd_capture0 && %capture1 to @memoized_cmd_capture1
  %result = scf.if %eq -> !hal.command_buffer {
    %memoized_cmd = util.global.load @memoized_cmd : !hal.command_buffer
    scf.yield %memoized_cmd
  } else {
    // the contents of the original region
    %new_cmd = hal.command_buffer.create ...
    hal.command_buffer.begin<%new_cmd>
    hal.command_buffer.dispatch<%new_cmd>
    hal.command_buffer.end<%new_cmd>
    util.global.store %capture0, @memoized_cmd_capture0
    util.global.store %capture1, @memoized_cmd_capture1
    util.global.store %new_cmd, @memoized_cmd
    scf.yield %new_cmd
  }
  return %result : !hal.command_buffer
}
...
  %memoized_cmd = call @memoize_cmd(%buffer, %value)
  hal.device.submit<%device> %memoized_cmd

This way, if any buffer (either a user-provided one or our internal ringbuffers) or parameter (dynamic shape dims/etc) changes, we regenerate the command buffer. Moving dynamic push constants to uniform buffers would let us remove the invalidation on parameter changes and make dynamic shapes (mostly) work. Adding a small LRU could help with heavy code reuse that has different buffers. Doing the splitting based on frequency would let us remove captures that are likely to change (user input buffers/etc) so that the bulk is invalidated less frequently. Even so, the simple approach above would work well with most models, where the user provides consistent input buffers and shapes.
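The capture-compare-regenerate scheme, extended with the small LRU mentioned above, can be sketched in Python (a model of the semantics only — class and method names are made up, and the real thing lives in lowered IR and the runtime):

```python
from collections import OrderedDict

# Sketch of the invalidation scheme described above: a command buffer is
# re-recorded whenever any captured operand changes, and a small LRU keeps
# a few recent variants around for workloads that alternate between buffer
# sets. All names are illustrative; this is not IREE runtime code.

class CommandBufferCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # capture key -> recorded command buffer

    def get_or_record(self, captures, record_fn):
        # Key on buffer identity and scalar parameter values, mirroring the
        # "compare captured operands" check in the expanded memoize op.
        key = tuple(c if isinstance(c, (int, float)) else id(c)
                    for c in captures)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        cmd = record_fn()  # cache miss: re-record (create+begin+...+end)
        self.entries[key] = cmd
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return cmd
```

With `capacity=1` this degenerates to exactly the single weak-global scheme in the IR above; larger capacities cover the "same code, different buffers" reuse case.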

This approach does require weak references in the VM (#6909) as we don't want to hang on to buffers and keep them live just because we reference them in our cache.
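The weak-reference requirement can be illustrated with Python's `weakref` (purely to show the intended semantics; the actual mechanism is VM-level per #6909, and the class names here are hypothetical):

```python
import weakref

# Sketch: cache a recorded command buffer keyed on a *weak* reference to the
# captured buffer, so the cache entry never keeps the buffer alive on its
# own. Names are illustrative only.

class Buffer:  # stand-in for a device buffer object
    pass

class WeakCaptureCache:
    def __init__(self):
        self._capture = None   # weakref to the captured buffer, if any
        self._cmd = None       # memoized command buffer

    def lookup(self, buffer):
        # A dead weakref dereferences to None, so a freed buffer can never
        # match and the stale command buffer is simply not returned.
        captured = self._capture() if self._capture is not None else None
        return self._cmd if captured is buffer else None

    def store(self, buffer, cmd):
        self._capture = weakref.ref(buffer)
        self._cmd = cmd
```

The key property is that dropping the last strong reference to the buffer invalidates the cache entry automatically, instead of the cache pinning the buffer (and its device memory) forever.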

(hal.command_buffer.memoize could just become util.memoize, as that would be useful for other things as well like descriptor sets)

benvanik added the compiler/dialects and performance ⚡ labels on Nov 17, 2021
benvanik commented:
Duplicate of #10144.

benvanik closed this as not planned (duplicate) on Jul 11, 2024