Improve Queue Phase parallelization and other small optimizations #4899

james7132 · 2022-06-02T06:41:21Z

Objective

Fixes #3548. A mutable reference to PipelineCache prevents a good chunk of the systems in queue from parallelizing with each other, even though its primary use with Specialized*Pipelines<T> should rarely require mutation. Likewise, &mut RenderPhase<I> on each render phase also requires exclusive access, which prevents multiple systems from enqueuing phase items at the same time.

Solution

Selectively leverage internal mutability and thread-local storage in a way that avoids adding per-entity overhead.

RenderPhase

Use ThreadLocal inside RenderPhase to create thread local queues of phase items.
Add a phase_scope for enqueuing on each thread separately without locking overhead.
Collect all of the thread-local phase items into one vec before sorting.
Shrink the size of IDs used to reduce the amount of memory being shuffled around.
Use Vec::sort_unstable_by_key for a slight sorting speedup.
Change all of the &mut RenderPhase<T> to &RenderPhase<T>, allowing for increased parallelism.
Can we just shove all of these into a std::collections::BinaryHeap and drain?
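The queue-then-collect flow described in the list above can be sketched roughly as follows. This is a std-only illustration, not the PR's code: `Item` and `queue_and_sort` are hypothetical names, and scoped threads that each own a local `Vec` stand in for the `ThreadLocal` queues inside `RenderPhase`.

```rust
use std::thread;

/// Hypothetical phase item: a sort key plus an entity index.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Item {
    pub sort_key: u32,
    pub entity: u32,
}

/// Each worker fills its own queue without locking; the queues are then
/// concatenated into one pre-reserved `Vec` and sorted once.
pub fn queue_and_sort(batches: &[Vec<Item>]) -> Vec<Item> {
    // One local queue per worker thread; no shared mutable state while queuing.
    let local_queues: Vec<Vec<Item>> = thread::scope(|scope| {
        let handles: Vec<_> = batches
            .iter()
            .map(|batch| scope.spawn(move || batch.iter().copied().collect::<Vec<Item>>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Reserve the full length up front to avoid reallocations while collecting.
    let total: usize = local_queues.iter().map(Vec::len).sum();
    let mut sorted = Vec::with_capacity(total);
    for mut queue in local_queues {
        sorted.append(&mut queue);
    }

    // Unstable sort: items with equal keys may end up in any relative order.
    sorted.sort_unstable_by_key(|item| item.sort_key);
    sorted
}
```

The extra copy into `sorted` is the cost noted in the Performance section; the benefit is that no worker ever waits on another while enqueuing.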

PipelineCache

Introduce LockablePipelineCache, a wrapper around RwLock<PipelineCache>.
Change Specialized*Pipelines to take a &LockablePipelineCache instead of &mut PipelineCache. Update systems to match.
Introduce two systems, lock_pipeline_cache (in Extract) and unlock_pipeline_cache (in PhaseSort), that take the existing PipelineCache resource and wrap it into a LockablePipelineCache and vice versa. This allows Render phase draw functions to read from the cache without any contention, at the cost of one small command buffer.
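As a rough illustration of the wrapper idea, the sketch below puts a toy cache behind `std::sync::RwLock` so that callers holding only a shared reference can look up entries and, rarely, insert. `Cache`, `LockableCache`, and `get_or_insert` are hypothetical stand-ins, not the actual `PipelineCache` API.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a pipeline cache: maps a specialization key to an id.
#[derive(Default)]
pub struct Cache {
    pub ids: HashMap<String, u32>,
}

/// Wrapper that lets callers holding `&self` look up or insert entries.
#[derive(Default)]
pub struct LockableCache {
    inner: RwLock<Cache>,
}

impl LockableCache {
    pub fn get_or_insert(&self, key: &str) -> u32 {
        // Fast path: shared read lock, expected to be the common case.
        // The read guard is a temporary and is released at the end of this statement.
        let found = self.inner.read().unwrap().ids.get(key).copied();
        if let Some(id) = found {
            return id;
        }
        // Slow path: exclusive lock. `entry` re-checks the key, since another
        // thread may have inserted it between the two lock acquisitions.
        let mut cache = self.inner.write().unwrap();
        let next = cache.ids.len() as u32;
        let id = *cache.ids.entry(key.to_owned()).or_insert(next);
        id
    }

    /// Unwrap back into the plain cache once queuing is done.
    pub fn into_inner(self) -> Cache {
        self.inner.into_inner().unwrap()
    }
}
```

Unwrapping with `into_inner` mirrors the unlock step: once the lock is gone, later readers pay no synchronization cost at all.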

Performance

I tested this on the default configuration of many_foxes, which has several heavy queue systems.

The direct effects are as expected. The queue phase results show a 30% speedup on my machine due to the increased parallelism. (yellow is this PR, red is main)

For the sort phase, there is a slight regression due to the additional copy into the sorted vec before sorting. This is slightly alleviated by shrinking the draw function type sizes. This can be further addressed by using more optimized sorting algorithms (e.g. voracious).

Overall, this sees a roughly 0.3ms improvement (2 FPS, 73 -> 75) on my machine.

Future Work

The changes to enable internal mutability in a thread-safe manner can be extended to also allow internal parallelism in heavy queue tasks.

Changelog

TODO

Migration Guide

TODO

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>

# Objective

Partially addresses #4291.

Speed up the sort phase for unbatched render phases.

## Solution

Split out one of the optimizations in #4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted, which correspond to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all, respectively. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass.

## Performance

This will not impact the performance of any batched phases, as they still use a stable sort. 2D's only phase is unchanged. All 3D phases are currently unbatched and will benefit from this change.

On `many_cubes`, where the primary phase is opaque, this change sees a speedup from 907.02us -> 477.62us, a 47.35% reduction.

![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png)

## Future Work

There were prior discussions to add support for faster radix sorts in #4291, which in theory should run in `O(n)` instead of `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimized for use cases with more than 30,000 items, which may be atypical for most systems.

Another optimization included in #4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations, shrinking the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`.

Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while.

---

## Changelog

Added: `PhaseItem::sort`

## Migration Guide

RenderPhases now default to an unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` with a stable sort (e.g. via `slice::sort_by_key`).

Co-authored-by: Federico Rinaldi <gisquerin@gmail.com>
Co-authored-by: Robert Swain <robert.swain@gmail.com>
Co-authored-by: colepoirier <colepoirier@gmail.com>
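A simplified sketch of what a per-item sort hook might look like is given below. The trait is a cut-down, hypothetical version written for illustration (the real `PhaseItem` trait has more members), and `Opaque` and `Batched` are made-up item types. The default method uses an unstable sort, and one type overrides it to keep a stable order.

```rust
/// Cut-down, hypothetical version of a phase item trait with a sort hook.
pub trait PhaseItem: Sized {
    type SortKey: Ord;

    fn sort_key(&self) -> Self::SortKey;

    /// Default: unstable sort, which does not preserve the order of equal keys.
    fn sort(items: &mut [Self]) {
        items.sort_unstable_by_key(|item| item.sort_key());
    }
}

/// Unbatched item: accepts the default unstable sort.
pub struct Opaque {
    pub distance: u32,
    pub entity: u32,
}

impl PhaseItem for Opaque {
    type SortKey = u32;

    fn sort_key(&self) -> u32 {
        self.distance
    }
}

/// Item whose batching relies on equal keys keeping their original order.
pub struct Batched {
    pub layer: u32,
    pub entity: u32,
}

impl PhaseItem for Batched {
    type SortKey = u32;

    fn sort_key(&self) -> u32 {
        self.layer
    }

    /// Override with a stable sort so that equal layers stay in queue order.
    fn sort(items: &mut [Self]) {
        items.sort_by_key(|item| item.sort_key());
    }
}
```

Calling `Batched::sort(&mut items)` keeps items that share a layer in the order they were queued, which is the property the migration guide warns about losing.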

# Objective

This includes one part of #4899. The aim is to improve CPU-side rendering performance by reducing the memory footprint and bandwidth required.

## Solution

Shrink `DrawFunctionId` to `u32`. Enforce that `u32 as usize` conversions are always safe by forbidding compilation on 16-bit platforms. This shouldn't be a breaking change since #4736 disabled compilation of `bevy_ecs` on those platforms.

Shrinking `DrawFunctionId` shrinks all of the `PhaseItem` types, which is integral to sort and render phase performance. Testing against `many_cubes`, the sort phase improved by 22% (174.21us -> 141.76us per frame).

![image](https://user-images.githubusercontent.com/3137680/207345422-a512b4cf-1680-46e0-9973-ea72494ebdfe.png)

The main opaque pass also improved by 9% (5.49ms -> 5.03ms).

![image](https://user-images.githubusercontent.com/3137680/207346436-cbee7209-6450-4964-b566-0b64cfa4b4ea.png)

Overall frame time improved by 5% (14.85ms -> 14.09ms).

![image](https://user-images.githubusercontent.com/3137680/207346895-9de8676b-ef37-4cb9-8445-8493f5f90003.png)

There will be a follow-up PR that likewise shrinks `CachedRenderPipelineId`, which should yield similar results on top of these improvements.
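To make the size argument concrete, here is a small sketch comparing two hypothetical item layouts, one with `usize` IDs and one with `u32` IDs, together with a compile-time guard of the kind described above. The struct names and fields are assumptions for illustration and do not match any actual Bevy type.

```rust
use std::mem::size_of;

// Refuse to build where `usize` is narrower than `u32`, so that the
// `u32 as usize` conversion below can never truncate.
#[cfg(target_pointer_width = "16")]
compile_error!("this sketch assumes usize is at least 32 bits wide");

/// Hypothetical phase item using pointer-sized IDs.
pub struct WideItem {
    pub distance: f32,
    pub entity: u64,
    pub pipeline: usize,
    pub draw_function: usize,
}

/// The same item with both IDs narrowed to `u32`.
pub struct NarrowItem {
    pub distance: f32,
    pub entity: u64,
    pub pipeline: u32,
    pub draw_function: u32,
}

/// Returns `(wide, narrow)` sizes in bytes for the current target.
pub fn item_sizes() -> (usize, usize) {
    (size_of::<WideItem>(), size_of::<NarrowItem>())
}

/// Lossless on every platform that passes the guard above.
pub fn index_of(id: u32) -> usize {
    id as usize
}
```

On a typical 64-bit target the narrow layout should come out smaller, which means fewer bytes moved per swap during sorting. On 32-bit targets the two layouts are expected to be the same size.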

alice-i-cecile · 2023-01-16T15:31:09Z

Closing in favor of #7205.

james7132 · 2023-01-16T15:54:43Z

There's still the other half of this that splays the RenderPhase into thread local queues. I'll open a separate PR for that.


# Objective

There's a repeating pattern of `ThreadLocal<Cell<Vec<T>>>`, which is very useful for low-overhead, low-contention multithreaded queues and has cropped up in a few places in the engine. This pattern is surprisingly useful when building deferred mutation across multiple threads, as noted by its use in `ParallelCommands`.

However, `ThreadLocal<Cell<Vec<T>>>` is not only a mouthful, it's also hard to ensure the thread-local queue is replaced after it's been temporarily removed from the `Cell`.

## Solution

Wrap the pattern into `bevy_utils::Parallel<T>`, which codifies the entire pattern and ensures the user follows the contract. Instead of fetching individual cells, removing the value, mutating it, and replacing it, `Parallel::get` returns a `ParRef<'a, T>`, which contains the temporarily removed value and a reference back to the cell, and will write the mutated value back to the cell upon being dropped.

I would like to use this to simplify the remaining part of #4899 that has not been adopted/merged.

---

## Changelog

TODO

---------

Co-authored-by: Joseph <21144246+JoJoJet@users.noreply.github.com>
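The take-and-write-back contract can be sketched with a small guard type. This is a std-only illustration of the idea on a single `Cell`, not the `bevy_utils` implementation: `TakenRef` and `push_twice` are hypothetical names, and the thread-local layer is left out.

```rust
use std::cell::Cell;
use std::ops::{Deref, DerefMut};

/// Guard that holds a value temporarily taken out of a `Cell` and writes it
/// back when dropped, so callers cannot forget the final step.
pub struct TakenRef<'a, T: Default> {
    slot: &'a Cell<T>,
    value: T,
}

impl<'a, T: Default> TakenRef<'a, T> {
    pub fn new(slot: &'a Cell<T>) -> Self {
        // `Cell::take` leaves `T::default()` behind while the value is borrowed out.
        Self {
            value: slot.take(),
            slot,
        }
    }
}

impl<T: Default> Deref for TakenRef<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.value
    }
}

impl<T: Default> DerefMut for TakenRef<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.value
    }
}

impl<T: Default> Drop for TakenRef<'_, T> {
    fn drop(&mut self) {
        // Write the (possibly mutated) value back into the cell.
        self.slot.set(std::mem::take(&mut self.value));
    }
}

/// Example use: push into the queue through the guard, with no manual write-back.
pub fn push_twice(slot: &Cell<Vec<u32>>) {
    let mut queue = TakenRef::new(slot);
    queue.push(1);
    queue.push(2);
} // `queue` is dropped here and the vector returns to the cell.
```

Because the write-back lives in `Drop`, it also runs on early returns, which is the failure mode the manual take/replace pattern makes easy to hit.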

james7132 added 2 commits June 1, 2022 15:36

Use thread local to improve render parallelism

7e2a6ec

Further optimizations

652a8b5

james7132 added the A-Rendering, C-Performance, and M-Needs-Migration-Guide labels Jun 2, 2022

james7132 added 2 commits June 1, 2022 23:58

Use copyless more

f898b62

Merge branch 'main' into render-phase-thread-local

cfdb056

infmagic2047 reviewed Jun 2, 2022

james7132 mentioned this pull request Jun 3, 2022

[Merged by Bors] - Add ParallelCommands system parameter #4749

Closed

james7132 added 2 commits June 9, 2022 05:00

Merge branch 'main' into render-phase-thread-local

50f4cd1

Add ALLOWS_UNSTABLE_SORT to allow each phase to choose how to sort

cb4680c

james7132 requested a review from superdump June 9, 2022 12:16

SkiFire13 reviewed Jun 9, 2022

james7132 and others added 2 commits June 9, 2022 18:18

Pre-reserve the length of sorted to avoid reallocations

c6b54ff

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>

Merge branch 'main' into render-phase-thread-local

891840a

james7132 mentioned this pull request Jun 19, 2022

[Merged by Bors] - Allow unbatched render phases to use unstable sorts #5049

Closed

james7132 added 3 commits June 26, 2022 16:55

Merge branch 'main' into render-phase-thread-local

0ce6987

Fix build

dafb2ef

Formatting

59bdf9c

Weibye added the S-Adopt-Me label Aug 10, 2022

james7132 removed the S-Adopt-Me label Sep 14, 2022

james7132 mentioned this pull request Dec 13, 2022

[Merged by Bors] - Shrink DrawFunctionId #6944

Closed

Merge branch 'main' into render-phase-thread-local

8fa4562

james7132 mentioned this pull request Jan 15, 2023

[Merged by Bors] - Make PipelineCache internally mutable. #7205

Closed

alice-i-cecile closed this Jan 16, 2023

james7132 mentioned this pull request Jan 24, 2023

refactor: Extract parallel queue abstraction #7348

Merged

james7132 mentioned this pull request Feb 19, 2024

Enable queue phase parallelization by using Parallel to back RenderPhase #11984

Closed
