-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use radix sort for sort phase and sprite sorting #4291
Comments
I added a I tried |
According to wikipedia radixsort is O(nw) where n is the amount of elements and w is the length of the element. It is provably impossible to have a sort with a worst case better than O(n log n). |
I think that is only true for comparison sorts. Radix sort is not a comparison sort. |
I tried
So it is at least consistently better than |
|
I realised that for the sort in |
I tried to rework
My assumption that z are relatively few compared to sprites was wrong and that wasn't a well-considered optimisation attempt. I was mostly following the path of the radix sort hammer working well for small sort keys. Ultimately, I dropped that path. I don't know of a good way to improve
I guess Summary
|
would it be possible to adapt the sort based on counts or previous execution? do one frame with each, then do the next 500 frames with the best one? |
I suppose you could, but looking at the results, we could implement the use of voracious_stable_sort for the sort phase, and rayon::par_sort_unstable_by for queue_sprites and probably obtain the fastest for all cases. However, I didn’t make a PR for this because the sort phase gains are small and scene-dependent for the additional dependency on voracious_radix_sort, and distaste was expressed for including rayon due to competing threadpools and in the supposedly more realistic case of bevymark, there was no benefit. Did I miss something in what you were asking? |
Cart linked to #3460 (review) on Discord as relevant investigation of and discussion about how batching works in general and for sprites. |
For an additional option, I've just released This had a huge impact on bloat: |
Nice, thanks! |
As discussed in discord, apart from the first few frames this is mostly just sorting already sorted Vecs. So it's really checking how fast you can detect that the Vec is already sorted. I just released In terms of bevy making use of radix sort, I think it would be good to work out what a good representative scene is. Trying to think up a worst-case scenario, some sort of sprite based particle system would probably be a pathological case for the sort phase. Bevymark and many_sprites both don't stress this system very much as after the initial spawning, they don't alter the sort order of the relevant items. |
One test would be to have a combination of sprites, plain 2d meshes, and 2d meshes with custom materials, at various z depths. This would result in items queued (pushed to the phase vec) from separate systems that then need sorting. I did like with rdst’s design that it operates byte-wise over the sort key and I think it somehow ignores bytes that have no impact on the order because they’re all the same? |
That would be ideal, it would also tell us the impact on batching etc. I realised that bevymark can actually test changing data at least, by simply using more waves! So maybe a better test would be to set waves to like 2000, so the new data never stops.
I've tried to keep it entirely agnostic to the underlying types by sticking religiously to sorting byte-by-byte. On the downside, rdst can't take shortcuts by truly comparison-sorting small arrays in the same way as voracious etc. because I don't require orderable types in rdst. I have a cludgy byte-by-byte comparison sort as a diversion for small arrays... But it will never be as fast as a single comparison of And yes, both rdst and voracious will generally skip sorting a "level" (nth byte) if all bytes in that level are the same. With this trick, you can't skip the If you want to skip counting too, you can just specify that there are less bytes to sort. This will usually require newtyping as the default sort key implementations naturally assume you're using the whole type :) I can't see this being useful for floats, but for i/usize data where only a few of the bytes can change, it's a huge speed boost. |
@nessex I don't know how the bit position radix sorting algorithm works, but would it be better to use u32 or u64 as the agnostic sort key data type given that CPU registers are 32/64 bits? Do you leverage SIMD already for packing the u8s into registers and operating on them? I imagine you could still do so for 32-/64-bit shuffles and masks and such too? |
rdst and voracious don't explicitly pack things into SIMD registers as it depends on the For sort_key, floats are fine I think. The operations are quite primitive and vectorize well: Actually, I've just pushed |
After #5049, we're in a position to slot any of these radix sorts, and change which algorithm we use depending on sort mode. We can handle the stable sorts separately from the unstable ones. For 3D use cases, where unstable sorts can be used for every phase, we can easily use radsort or voracious and get a 5-7x speedup, and decide on an appropriate approach for the stable/batched 2D cases. |
Probably worth retesting with latest versions of the three radix sort crates I tried. |
# Objective Partially addresses #4291. Speed up the sort phase for unbatched render phases. ## Solution Split out one of the optimizations in #4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass. ## Performance This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change. On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction. ![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png) ## Future Work There were prior discussions to add support for faster radix sorts in #4291, which in theory should be a `O(n)` instead of a `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimize for use cases with more than 30,000 items, which may be atypical for most systems. Another optimization included in #4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`. Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while. --- ## Changelog Added: `PhaseItem::sort` ## Migration Guide RenderPhases now default to a unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`). Co-authored-by: Federico Rinaldi <gisquerin@gmail.com> Co-authored-by: Robert Swain <robert.swain@gmail.com> Co-authored-by: colepoirier <colepoirier@gmail.com>
# Objective Partially addresses bevyengine#4291. Speed up the sort phase for unbatched render phases. ## Solution Split out one of the optimizations in bevyengine#4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass. ## Performance This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change. On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction. ![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png) ## Future Work There were prior discussions to add support for faster radix sorts in bevyengine#4291, which in theory should be a `O(n)` instead of a `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimize for use cases with more than 30,000 items, which may be atypical for most systems. Another optimization included in bevyengine#4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`. Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while. --- ## Changelog Added: `PhaseItem::sort` ## Migration Guide RenderPhases now default to a unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`). Co-authored-by: Federico Rinaldi <gisquerin@gmail.com> Co-authored-by: Robert Swain <robert.swain@gmail.com> Co-authored-by: colepoirier <colepoirier@gmail.com>
# Objective Partially addresses bevyengine#4291. Speed up the sort phase for unbatched render phases. ## Solution Split out one of the optimizations in bevyengine#4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass. ## Performance This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change. On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction. ![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png) ## Future Work There were prior discussions to add support for faster radix sorts in bevyengine#4291, which in theory should be a `O(n)` instead of a `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimize for use cases with more than 30,000 items, which may be atypical for most systems. Another optimization included in bevyengine#4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`. Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while. --- ## Changelog Added: `PhaseItem::sort` ## Migration Guide RenderPhases now default to a unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`). Co-authored-by: Federico Rinaldi <gisquerin@gmail.com> Co-authored-by: Robert Swain <robert.swain@gmail.com> Co-authored-by: colepoirier <colepoirier@gmail.com>
# Objective Partially addresses bevyengine#4291. Speed up the sort phase for unbatched render phases. ## Solution Split out one of the optimizations in bevyengine#4899 and allow implementors of `PhaseItem` to change what kind of sort is used when sorting the items in the phase. This currently includes Stable, Unstable, and Unsorted. Each of these corresponds to `Vec::sort_by_key`, `Vec::sort_unstable_by_key`, and no sorting at all. The default is `Unstable`. The last one can be used as a default if users introduce a preliminary depth prepass. ## Performance This will not impact the performance of any batched phases, as it is still using a stable sort. 2D's only phase is unchanged. All 3D phases are unbatched currently, and will benefit from this change. On `many_cubes`, where the primary phase is opaque, this change sees a speed up from 907.02us -> 477.62us, a 47.35% reduction. ![image](https://user-images.githubusercontent.com/3137680/174471253-22424874-30d5-4db5-b5b4-65fb2c612a9c.png) ## Future Work There were prior discussions to add support for faster radix sorts in bevyengine#4291, which in theory should be a `O(n)` instead of a `O(nlog(n))` time. [`voracious`](https://crates.io/crates/voracious_radix_sort) has been proposed, but it seems to be optimize for use cases with more than 30,000 items, which may be atypical for most systems. Another optimization included in bevyengine#4899 is to reduce the size of a few of the IDs commonly used in `PhaseItem` implementations to shrink the types to make swapping/sorting faster. Both `CachedPipelineId` and `DrawFunctionId` could be reduced to `u32` instead of `usize`. Ideally, this should automatically change to use stable sorts when `BatchedPhaseItem` is implemented on the same phase item type, but this requires specialization, which may not land in stable Rust for a short while. --- ## Changelog Added: `PhaseItem::sort` ## Migration Guide RenderPhases now default to a unstable sort (via `slice::sort_unstable_by_key`). This can typically improve sort phase performance, but may produce incorrect batching results when implementing `BatchedPhaseItem`. To revert to the older stable sort, manually implement `PhaseItem::sort` to implement a stable sort (i.e. via `slice::sort_by_key`). Co-authored-by: Federico Rinaldi <gisquerin@gmail.com> Co-authored-by: Robert Swain <robert.swain@gmail.com> Co-authored-by: colepoirier <colepoirier@gmail.com>
Looking at noisy M1 Mac, Chrome
Really fascinating. Not sure what's going on with rdst multi-threading. I did at least verify that on those runs, rayon was either in or out of the dep tree. Also re-ran the Didn't include voracious because it doesn't seem possible to write a comparable key function. |
I threw glidesort in there for fun. |
@rparrett I was revisiting this to see if it could be closed now. What I take from the above table is that glidesort is faster on ordered arrays and radsort is faster on random ones. In which case, I think sticking with radsort is the way to go. Did I understand correctly? |
Sounds about right / matches what I was told about radsort / sounds good to me. There may be reason to rebenchmark later. The author (of radsort) is apparently cooking up a new release. |
Background
We currently use
sort_by_key
to sortPhaseItem
s in the sort phase, andsort_unstable_by
to sort extracted sprites inqueue_sprites
.sort_by_key
- This sort is stable (i.e., does not reorder equal elements) and O(m * n * log(n)) worst-case, where the key function is O(m).sort_unstable_by
- This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.For large numbers of items to sort, such as in
bevymark
ormany_sprites
wheren > 100_000
, and given sort keys that are a small and fixed number of bits such as the f32 that we are commonly using, a radix sort could/should be much faster.crates
I investigated sorting 10M random
f32
s in the range0.1f32..1000.0f32
(the default perspective camera near and far planes) and observed the following with the available radix sort crates:rdst
andvoracious_mt_sort
results above are multi-threaded usingrayon
, all the rest of the results are single-threadedrdst
has a hard dependency onrayon
, though it can be configured to do single-threaded sorting. I created an issue about making multi-threading optional as we already have a threadpool in bevy and maybe we don't want to add another: Make rayon optional nessex/rdst#2voracious_radix_sort
has a non-default multi-threading featureradsort
is single-threaded onlyradsort
- 17.1kBrdst
- 34.3kBvoracious_radix_sort
- 178kBradsort
was the easiest to integrate using itssort_by_key
functionrdst
andvoracious_radix_sort
both require implementation of some traits on the type being sorted, and require that the type implementCopy
. This cannot be done for batched sprites currently becauseBatchedPhaseItem
contains aRange<f32>
for which it is not possible to implementCopy
Proof of Concept
I made a branch here that uses
radsort
: https://github.com/superdump/bevy/tree/render-radix-sort .The sorting of
ExtractedSprites
inqueue_sprites
is significantly improved by usingradsort::sort_by_key
like this:With that, on an M1 Max, the median execution time of
queue_sprites
inbevymark -- 20000 8
over 1500 frames increased from 9.82ms to 17.09ms, which makes no sense to me. Inmany_sprites
it decreased from 11.34ms to 9.47ms.The median execution time of
sort_phase_system
for the relevant phase (Transparent2d
for sprites,Opaque3d
for 3D meshes) inmany_cubes -- sphere
it decreased from 0.728ms to 0.094ms. Inbevymark -- 20000 8
it increased from 0.184ms to 1.95ms, which makes no sense. And inmany_sprites
it increased from 0.106ms to 1.19ms.This is quite confusing. I haven't been able to figure out why the sort performance gets worse.
radsort
claims both best and worst execution time of O(n), space complexity of O(n), and that it is a stable sort so it does not reorder equal items. On the branch, all thesort_key
implementations are inlined.Next steps
radsort
is much slower in many casesvoracious_radix_sort
orrdst
to see if they perform consistently better, thoughrdst
would likely not be approved unless its multi-threading were made optional.rdst
if it turns out to be a preferable solution due to the smaller crate footprintThe text was updated successfully, but these errors were encountered: