Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance on basic rectangle benchmark #8100

Open
SUPERCILEX opened this issue Mar 15, 2023 · 17 comments
Open

Poor performance on basic rectangle benchmark #8100

SUPERCILEX opened this issue Mar 15, 2023 · 17 comments
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times

Comments

@SUPERCILEX
Copy link
Contributor

Bevy version

0.10

[Optional] Relevant system information

AdapterInfo { name: "Intel(R) UHD Graphics (CML GT2)", vendor: 32902, device: 39876, device_type: IntegratedGpu, driver: "Intel open-source Mesa driver", driver_info: "Mesa 22.3.5", backend: Vulkan }
SystemInfo { os: "Linux 22.04 Pop!_OS", kernel: "6.2.0-76060200-generic", cpu: "Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz", core_count: "8", memory: "62.5 GiB" }

What you did

https://github.com/SUPERCILEX/bevy-vs-pixi

Web is broken right now, so just use native cargo r --release (and probably remove lto so the build is faster).

What went wrong

Performance is unacceptably bad (I can't even hit 30fps with 2000 rectangles even though Pixi can handle 8000 rectangles at ~50 fps) and I'm not sure why unfortunately. It seems like a lot of time in a trace is spent during extraction?

fn prepare_uniform_components<C: Component>(

I'd really appreciate if someone could investigate this. Otherwise, any pointers on what could be the root cause would be great.

@SUPERCILEX SUPERCILEX added C-Bug An unexpected or incorrect behavior S-Needs-Triage This issue needs to be labelled labels Mar 15, 2023
@alice-i-cecile alice-i-cecile added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times and removed C-Bug An unexpected or incorrect behavior S-Needs-Triage This issue needs to be labelled labels Mar 15, 2023
@rparrett
Copy link
Contributor

rparrett commented Mar 15, 2023

Bevy is lot faster at drawing rectangles with SpriteBundle because it can batch draw calls. That's not yet implemented (see #89) for Mesh2d which bevy_prototype_lyon uses under the hood.

But a single SpriteBundle wouldn't quite be apples-to-apples because the other benchmark is drawing rectangles with borders.

If I were trying to game this benchmark, I would try a version with two SpriteBundles for each object. One for the border and another just above for the fill.

@SUPERCILEX
Copy link
Contributor Author

Like this? SUPERCILEX/bevy-vs-pixi@e94e9a6

Sadly no dice. Now the trace item that seems to be causing stuttering is bevy_core_pipeline::core_2d::main_pass_2d_node.

@rparrett
Copy link
Contributor

rparrett commented Mar 15, 2023

Like this? SUPERCILEX/bevy-vs-pixi@e94e9a6

Precisely. That's a ~2x speedup on my machine, so "some dice," perhaps.

I think we're also seeing poor batching due to rectangles being given random z values over the entire range of f32 though.

edit: or perhaps as a result of the "throwing another sprite of a different color on top" thing, actually.

It seems like we're seeing poor batching for... some reason though.

@rparrett
Copy link
Contributor

rparrett commented Mar 16, 2023

Haha,

Bevy is batching poorly because one of the sprites is Color::WHITE, which is treated differently because (it's the default for textured sprites and so it doesn't need vertex colors?).

Make your rectangles GRAY for another ~2x speedup.

@rparrett
Copy link
Contributor

rparrett commented Mar 16, 2023

With that change, this is now (on my machine) 2x faster than pixi (native, no lto), but 4x slower than pixi (web)

But we are at a disadvantage due to drawing two things per rectangle.

@SUPERCILEX
Copy link
Contributor Author

over the entire range of f32 though.

gen is [0, 1) for floats. But this is actually it!

Make your rectangles GRAY and they should go brr.

Wat. Lol, this also fixes it.

But we are at a disadvantage due to drawing two things per rectangle.

Yeah, once instancing is a thing I'll go back to lyon.


Tweak comparisons:

  • Single black sprite, 0 Z: 140fps
  • Single black sprite, [0, 1) Z: 140fps
  • Layered sprites, 0 Z, WHITE: 110fps
  • Layered sprites, 0 Z, GRAY: 100fps
  • Layered sprites, 0 Z primary, epsilon Z overlay, WHITE: 105fps
  • Layered sprites, 0 Z primary, epsilon Z overlay, GRAY: 100fps
  • Layered sprites, [0, 1) Z primary, epsilon Z overlay, WHITE: 27fps
  • Layered sprites, [0, 1) Z primary, epsilon Z overlay, GRAY: 60fps with oddly high variance (brief spikes to 80 and drops to 55)
  • Layered sprites, [0, 1) Z primary, 0 Z overlay, WHITE: 30fps
  • Layered sprites, [0, 1) Z primary, 0 Z overlay, GRAY: inconsistent 60

All of the "Layered sprites, 0 Z" categories are about the same +- some noise, and same goes for Zs with GRAY.

Fixing the WHITE bug will bring uniformity to the Zs, but I do find the two tests I highlighted pretty odd. Why would offsetting the base sprite lead to such a significant drop in performance? Is there a better way to glue these sprites together?

@rparrett
Copy link
Contributor

gen is [0, 1) for floats. But this is actually it!

Yeah, I think that forcing fewer layers just masked the "white color" problem.

@SUPERCILEX
Copy link
Contributor Author

Does that explain the (less severe) perf drop with gray though?

@rparrett
Copy link
Contributor

I think there may be an additional interesting thing happening with regard to the "colored" vs "non-colored" sprites breaking up batches and that not being factored into the pre-batch sorting.

I know that sorting itself also usually comes up when profiling this, and that may have very different characteristics when the set of z values is mostly random vs. mostly the same.

@rparrett
Copy link
Contributor

rparrett commented Sep 20, 2023

Checking in on this in light of recent rendering changes. Situation doesn't seem great.

M1 Max (native), LTO disabled, 64k rects

version fps
0.11.2 52
main e8b3892 (pre-#9236) 52
main 4f1d9a6 (#9236) 30
main 87f7d01 (latest) 38
batching #9685 33

The good news is that the "white minus epsilon hack" is no longer needed. (same fps with white)

@superdump
Copy link
Contributor

I think I’m going to have to have a look at what this example does because bevymark ended up faster than main pre-#9236 from my previous testing.

@rparrett
Copy link
Contributor

I think that the difference can be attributed to sorting, although I don't understand how sorting behavior would have actually changed in 9236.

But if I modify bevymark to use a random z value instead of an incremental one, I see the same dive in performance after 9236.

@superdump
Copy link
Contributor

You mean you see in traces that it is due to sorting?

@superdump
Copy link
Contributor

Ok. #9236 did two things:

  • Instead of doing an unstable sort in queue_sprites before doing batching, and then doing a second stable sort in the sort system, it moved to only doing the stable sort.
    • This could be improved by using a faster unstable sort on the distance and then the Entity. I’d suggest both trying radsort and sort unstable.
  • Instead of queuing phase items for batches, it queues phase items for each sprite.
    • This is part of the current design. I will think about what can be done about it but we should test changing the sorting first.

github-merge-queue bot pushed a commit that referenced this issue Sep 21, 2023
# Objective

Fix a performance regression in the "[bevy vs
pixi](https://github.com/SUPERCILEX/bevy-vs-pixi)" benchmark.

This benchmark seems to have a slightly pathological distribution of `z`
values -- Sprites are spawned with a random `z` value with a child
sprite at `f32::EPSILON` relative to the parent.

See discussion here:
#8100 (comment)

## Solution

Use `radsort` for sorting `Transparent2d` `PhaseItem`s.

Use random `z` values in bevymark to stress the phase sort. Add an
`--ordered-z` option to `bevymark` that uses the old behavior.

## Benchmarks

mac m1 max

| benchmark | fps before | fps after | diff |
| - | - | - | - |
| bevymark --waves 120 --per-wave 1000 --random-z | 42.16 | 47.06 | 🟩
+11.6% |
| bevymark --waves 120 --per-wave 1000 | 52.50 | 52.29 | 🟥 -0.4% |
| bevymark --waves 120 --per-wave 1000 --mode mesh2d --random-z | 9.64 |
10.24 | 🟩 +6.2% |
| bevymark --waves 120 --per-wave 1000 --mode mesh2d | 15.83 | 15.59 | 🟥
-1.5% |
| bevy-vs-pixi | 39.71 | 59.88 | 🟩 +50.1% |

## Discussion

It's possible that `TransparentUi` should also change. We could probably
use `slice::sort_unstable_by_key` with the current sort key though, as
its items are always sorted and unique. I'd prefer to follow up later to
look into that.

Here's a survey of sorts used by other `PhaseItem`s

#### slice::sort_by_key
`Transparent2d`, `TransparentUi`

#### radsort
`Opaque3d`, `AlphaMask3d`, `Transparent3d`, `Opaque3dPrepass`,
`AlphaMask3dPrepass`, `Shadow`

I also tried `slice::sort_unstable_by_key` with a compound sort key
including `Entity`, but it didn't seem as promising and I didn't test it
as thoroughly.

---------

Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
Co-authored-by: Robert Swain <robert.swain@gmail.com>
@rparrett
Copy link
Contributor

rparrett commented Sep 21, 2023

Checking in again, bevy 0.12-dev is looking pretty solid!

64k rects, mac m1 max, chrome 118, no lto, no strip, no opt-s.

note: bevy is actually drawing 128k sprites here.

engine fps
pixi 20.8
bevy 11.2 webgl2 15.43
bevy 11.2 webgpu 14.30
bevy 11.2 native 46.31
bevy main webgl2 27.83
bevy main webgpu 29.15
bevy main native 59.07
bevy main native (single mesh2d) 11.08
bevy main webgpu (single mesh2d) 6.46*
bevy main webgl2 (single mesh2d) 3.19

*crashed

Unfortunately, it doesn't seem like automatic batching has helped out much with mesh2d for this particular benchmark.

For reference in my mesh2d tests I'm just building meshes on demand with this function and spawning a MaterialMesh2dBundle.

@rparrett
Copy link
Contributor

For reference in my mesh2d tests I'm just building meshes on demand with this function and spawning a MaterialMesh2dBundle.

It seems that this strategy was never going to work -- every entity having a unique mesh is a dealbreaker for batching. Maybe a custom material with a simple vertex shader would help, (or something more involved to do instancing?) but that's not really "stock bevy" anymore so perhaps two sprites per box is as good as it gets.

@SUPERCILEX
Copy link
Contributor Author

I think this is fixed?

rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this issue Jan 9, 2024
# Objective

Fix a performance regression in the "[bevy vs
pixi](https://github.com/SUPERCILEX/bevy-vs-pixi)" benchmark.

This benchmark seems to have a slightly pathological distribution of `z`
values -- Sprites are spawned with a random `z` value with a child
sprite at `f32::EPSILON` relative to the parent.

See discussion here:
bevyengine#8100 (comment)

## Solution

Use `radsort` for sorting `Transparent2d` `PhaseItem`s.

Use random `z` values in bevymark to stress the phase sort. Add an
`--ordered-z` option to `bevymark` that uses the old behavior.

## Benchmarks

mac m1 max

| benchmark | fps before | fps after | diff |
| - | - | - | - |
| bevymark --waves 120 --per-wave 1000 --random-z | 42.16 | 47.06 | 🟩
+11.6% |
| bevymark --waves 120 --per-wave 1000 | 52.50 | 52.29 | 🟥 -0.4% |
| bevymark --waves 120 --per-wave 1000 --mode mesh2d --random-z | 9.64 |
10.24 | 🟩 +6.2% |
| bevymark --waves 120 --per-wave 1000 --mode mesh2d | 15.83 | 15.59 | 🟥
-1.5% |
| bevy-vs-pixi | 39.71 | 59.88 | 🟩 +50.1% |

## Discussion

It's possible that `TransparentUi` should also change. We could probably
use `slice::sort_unstable_by_key` with the current sort key though, as
its items are always sorted and unique. I'd prefer to follow up later to
look into that.

Here's a survey of sorts used by other `PhaseItem`s

#### slice::sort_by_key
`Transparent2d`, `TransparentUi`

#### radsort
`Opaque3d`, `AlphaMask3d`, `Transparent3d`, `Opaque3dPrepass`,
`AlphaMask3dPrepass`, `Shadow`

I also tried `slice::sort_unstable_by_key` with a compound sort key
including `Entity`, but it didn't seem as promising and I didn't test it
as thoroughly.

---------

Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
Co-authored-by: Robert Swain <robert.swain@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times
Projects
None yet
Development

No branches or pull requests

4 participants