Implement initial simple command buffer memoization. #7679
Labels:
- compiler/dialects: Relating to the IREE compiler dialects (flow, hal, vm)
- performance ⚡: Performance/optimization related work across the compiler and runtime
There are many ways we can improve the compiler to reduce the runtime overhead of command buffer recording and expensive driver-level optimizations (looking at you, CUDA graphs). The long-term goal is to segment command buffers by frequency of change and work hard to cache the parts that change less frequently (the head/tail of command buffers usually differ based on I/O, but the body is often just dealing with buffers we internally allocate and control). Secondary command buffers help with this.
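A sketch of what that split might look like (hypothetical: the names `dynamic_cmd`/`static_cmd` and the `Secondary`/`execute_commands` syntax below are illustrative, not actual HAL dialect ops):

```mlir
// Hypothetical split of one execution into a frequently-changing part and a
// memoizable body reused across invocations.
// Head/tail work that touches user I/O gets re-recorded every invocation:
%dynamic_cmd = hal.command_buffer.create device(%device) mode(OneShot) : !hal.command_buffer
// The body only touches internally allocated buffers and is recorded once:
%static_cmd = hal.command_buffer.create device(%device) mode(Secondary) : !hal.command_buffer
// The primary command buffer calls into the cached secondary one:
hal.command_buffer.execute_commands %dynamic_cmd, %static_cmd
```

The `dynamic_cmd`/`static_cmd` names correspond to the backend mapping described next: streams vs. graphs on CUDA, primary vs. secondary command buffers on Vulkan.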
In the CUDA case we'd use streams for the `dynamic_cmd` and graphs for the `static_cmd`, and in Vulkan we'd just pass the bits to `vkAllocateCommandBuffers`. In the CPU case we'd have the fully constructed task system DAG baked out and ready for fast execution.
Things like dynamic shapes and multi-chunk ringbuffers can complicate this. In the new stream dialect IR we have a good place to do the high-level splitting by partitioning the `stream.cmd.execute` ops based on which resources/dynamic parameters are used. Once we convert to HAL and actually record the `!hal.command_buffer` instances we can rely on the granularity already being established, leaving just the caching to do.

For a first shot, one idea is to add a `hal.command_buffer.memoize` op that we lower `stream.cmd.execute` into.
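A hypothetical sketch of the op (syntax illustrative; the key property is that everything the recorded commands depend on is listed as an explicit capture):

```mlir
// Hypothetical: captures make command buffer identity decidable. If all
// captured resources/values match a prior invocation, the previously
// recorded command buffer can be returned instead of re-recording.
%cmd = hal.command_buffer.memoize device(%device)
    captures(%storage : !hal.buffer, %dim : index) {
  // Region body records commands exactly as today:
  hal.command_buffer.dispatch ...
  hal.command_buffer.copy_buffer ...
} -> !hal.command_buffer
```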
If we don't want to memoize we just inline the region out and end up with exactly what we have today (create+begin+record+end+submit); if we do, the captured operands dictate whether any two command buffers are the same. The logic expanding the memoize op would create weak globals for all the captured resources and insert code to update/compare them.
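A hypothetical expansion of the memoize op (all names illustrative; `util.global` and `scf.if` are real ops, but the ref-equality comparison and the weak-global semantics are assumed here):

```mlir
// Hypothetical expansion: weak globals cache the recorded command buffer
// plus the captures it was recorded against.
util.global private mutable @_cached_cmd : !hal.command_buffer
util.global private mutable @_cached_buffer : !hal.buffer  // would be a weak ref (#6909)
util.global private mutable @_cached_dim : index

func.func @run(%buffer: !hal.buffer, %dim: index) {
  %old_buffer = util.global.load @_cached_buffer : !hal.buffer
  %old_dim = util.global.load @_cached_dim : index
  %same_buffer = util.cmp.eq %old_buffer, %buffer : !hal.buffer  // assumed ref-equality op
  %same_dim = arith.cmpi eq, %old_dim, %dim : index
  %hit = arith.andi %same_buffer, %same_dim : i1
  %cmd = scf.if %hit -> (!hal.command_buffer) {
    // Hit: all captures match, reuse the cached command buffer.
    %cached = util.global.load @_cached_cmd : !hal.command_buffer
    scf.yield %cached : !hal.command_buffer
  } else {
    // Miss: create+begin+record+end as today, then refresh the cache.
    %new = hal.command_buffer.create ...
    util.global.store %new, @_cached_cmd : !hal.command_buffer
    util.global.store %buffer, @_cached_buffer : !hal.buffer
    util.global.store %dim, @_cached_dim : index
    scf.yield %new : !hal.command_buffer
  }
  // ... submit %cmd ...
  return
}
```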
This way if any buffer (either a user-provided one or our internal ringbuffers) or parameter (dynamic shape dims/etc) changes we regenerate the command buffer. Moving dynamic push constants to uniform buffers would let us remove the invalidation on parameter changes and make dynamic shapes (mostly) work. Adding a small LRU could help with heavy code reuse that has different buffers. Doing the splitting based on frequency would let us remove captures that are likely to change (user input buffers/etc) so that the bulk is invalidated less frequently, etc etc. But the above simple approach would work well with most models where the user provides consistent input buffers and shapes.
This approach does require weak references in the VM (#6909) as we don't want to hang on to buffers and keep them live just because we reference them in our cache.
(`hal.command_buffer.memoize` could just become `util.memoize`, as that would be useful for other things as well, like descriptor sets.)