Declining performance with unreleased 0.3.x vs 0.2.2 #350

Closed
John-Nagle opened this issue Feb 5, 2022 · 15 comments
Labels
  • client: animats-viewer (Needed for Animats-Viewer)
  • module: core (Core issues with the renderer or interface)
  • module: routines (Issues with the render routines)
  • tracking (Tracks sets of issues to a larger end goal)

Comments

@John-Nagle
Contributor

(Related to #348, but not entirely about memory bloat.)

Frame rate has dropped since 0.2.2.

With Rend3 0.2.2, my program ran at about 58 FPS on this test case, which was fine. I've lost about 10 FPS with the new version. Frame rate drops as low as 23 FPS if peak memory usage exceeds the GPU's memory, and it also drops when other threads are loading textures and meshes.

This is the ideal case for frame rate: the entire scene, all textures and meshes, is resident in the GPU; the camera is not moving; the same image is being displayed over and over; no mipmapping. Rend3 unreleased (pre-0.3.x), Ubuntu 20.04 LTS, Nvidia 3070, Ryzen 5 with 6 cores / 12 hyperthreads.

Here's an initial Tracy profile. Call stacks are not being captured, so this is rather coarse data. My own code is barely doing anything here; it and the window event system are using 32us per frame.

The problems:

  • Basic refresh is too slow and is CPU-bound.
  • Adding meshes and textures from other threads impacts the refresh rate severely, although it's not supposed to.
  • Performance is getting worse as more features go into Rend3.

I'm attempting to build, in Rust, a client for Second Life / Open Simulator, because the existing C++ OpenGL clients are single-threaded and too slow. Rend3's performance used to be well above that of the existing programs, but it's now not much better, and in some cases it's worse. This is puzzling, because Rend3 is using Vulkan and has all the data resident in the GPU, while the C++ clients are using OpenGL and making huge numbers of draw calls.

I'm very concerned about this. If Rend3's performance doesn't improve substantially, my whole effort was a waste.

@John-Nagle
Contributor Author

[image: Tracy profile screenshot]

@cwfitzgerald added the client: animats-viewer, module: core, module: routines, and tracking labels on Feb 5, 2022
@cwfitzgerald
Member

cwfitzgerald commented Feb 5, 2022

This is a longer-form version of what I posted on Matrix:

I first want to emphasize that it may take a little bit to iron out all these issues, but they are all definitely fixable in one way or another.

Basic refresh is too slow and is CPU-bound.

This Tracy trace is actually quite helpful: it clearly shows that your program has different performance characteristics from the ones I've been testing, and it gives me some hints about what might be causing it.

How many materials and how many textures do you have? Material upload seems to take a while, and I've identified a texture-related bottleneck inside of wgpu's run_render_pass before, which could explain the performance with mipmapping (more mips = more textures to track).

Adding meshes and textures from other threads impacts the refresh rate severely, although it's not supposed to.

I currently still do the mesh work on the main thread because there would be synchronization issues otherwise. The mesh upload is not particularly optimized at the moment, and I have plans for how to improve that.

Even if I do split the mesh work out onto another thread, both this and textures becoming truly multithreaded are blocked on gfx-rs/wgpu#2272. This is something I want to get to, but it is quite a large task.

Performance is getting worse as more features go into Rend3.

I haven't noticed any performance regressions in my testing, so I want to put together a test case with similar traits to yours so that I can keep track of performance in this use case. This is something I need to do in general, as I have to ensure each particular way of using rend3 improves (and doesn't regress) in performance.

I'm very concerned about this. If Rend3's performance doesn't improve substantially, my whole effort was a waste.

Finally, I do want to say that these problems are all fixable. I can't promise it will get done immediately; I'm currently but one person (though @setzer22 has recently joined the project 👋🏻) and have a ton on my plate, but everything will be fixed. My goal, as progress goes on, is that rend3's performance should end up well above what it was in 0.2. I think this is totally achievable.

@cwfitzgerald
Member

cwfitzgerald commented Feb 5, 2022

Copying some numbers and conclusions from our discussion on Matrix here so I don't lose them. Scene stats:

Loaded: 14231 meshes, 19765372 vertices, 58364409 triangles, 12574 textures, 311291520 texture bytes.
Reused: meshes: 39054, textures: 32718
Prims: mesh generated: 0, mesh reused: 0
Textures in use: 12018, peak 12018, Texture bytes in use: 308087680, peak 308087680

Todos:

  1. Test a hunch: there's a single extra function call I make in 0.3 that could make a big difference for your case.
  2. Backport that PR so you can revert to 0.2 for the time being.
  3. Autogenerate a test scene with artificially high texture/material counts to replicate the symptoms.
  4. Make material updates sparse: currently I upload all materials to the GPU every frame, which is a lot of work for both the CPU and GPU with this many materials. It's non-trivial but not difficult to upload only the diffs (see the sketch after this list).
  5. Improve the texture tracking code in wgpu to be faster. I'm not sure how it'll be done yet, but it has to happen, and I've had my eye on it for a while.
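
Very roughly, the sparse upload in item 4 would look something like this on the CPU side. This is only a sketch of the idea; the names (MaterialStore, gpu_write) are illustrative, not rend3's actual internals:

```rust
use std::collections::HashSet;

// Illustrative only: a CPU-side material store that remembers which slots
// changed since the last flush, so only those slots get re-uploaded to the
// GPU instead of re-sending every material every frame.
struct MaterialStore {
    cpu_data: Vec<[u8; 64]>, // packed per-material GPU data (size is arbitrary here)
    dirty: HashSet<usize>,   // indices touched since the last flush
}

impl MaterialStore {
    fn update(&mut self, index: usize, data: [u8; 64]) {
        self.cpu_data[index] = data;
        self.dirty.insert(index);
    }

    // Write only the dirty slots through `gpu_write`, then clear the set.
    fn flush(&mut self, mut gpu_write: impl FnMut(usize, &[u8])) {
        for &index in &self.dirty {
            gpu_write(index, &self.cpu_data[index]);
        }
        self.dirty.clear();
    }
}
```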

@John-Nagle
Contributor Author

"I first want to emphasize that it may take a little bit to iron out all these issues, but they are all definitely fixable in one way or another."

That's good to hear.

"Material upload seems to take a bit"

Most changes to materials already in use only change the texture handles involved. While anything can change, most things usually don't. If an API call for changing only the texture handles would help performance, I could make such calls.
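
To be concrete about what I mean, something along these lines. Every name below is a hypothetical stand-in, not an actual rend3 type or call:

```rust
// Hypothetical sketch only: none of these types or methods are real rend3 API.
struct TextureHandle(u64);
struct MaterialHandle(u64);
struct Renderer;

impl Renderer {
    // Replace only the texture handles on an existing material, leaving the
    // other parameters untouched, so the renderer doesn't have to re-process
    // the whole material.
    fn update_material_textures(
        &self,
        _material: &MaterialHandle,
        _albedo: Option<TextureHandle>,
        _normal: Option<TextureHandle>,
    ) {
        // A real implementation would mark just the texture slots as dirty.
    }
}
```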

"I currently still do the mesh work on the main thread because there would be synchronization issues if I didn't do this. "

That explains some things. Back in October when I made that video, I was loading all the meshes with no textures, and then turned on concurrent texture loading. Performance looked good back then. Then I started loading meshes from one thread and textures from another, while refreshing from a third thread. Performance dropped to around 20 FPS at times while meshes and textures were being loaded. Loading textures still degrades the frame rate, but not, it seems, as badly as loading meshes.

"I haven't noticed any performance regressions in my testing, so I want to put together some kind of test case that has similar traits to yours so that I can keep track of performance in this use case."

All those non-reused textures and meshes are a problem. But that's user-created content for you. If the NFT metaverse crowd ever actually gets 3D worlds going, they'll face that. By the way, Unreal Engine 5's Nanite system is heavily dependent on reusing instances of objects. In their world, a mesh is a directed acyclic graph in which subsections of the mesh are shared. Something like a chain-link fence is represented by a very small number of unique mesh parts shared within a single data structure. It's very clever, but their demos rely heavily on instancing.
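
Schematically, it's something like this (purely illustrative; not how Nanite or rend3 actually store meshes):

```rust
use std::sync::Arc;

// Illustrative only: a mesh stored as a DAG whose leaf clusters can be shared
// by many parents, so something like a chain-link fence needs only a handful
// of unique clusters referenced over and over.
struct Cluster {
    vertices: Vec<[f32; 3]>,
    indices: Vec<u32>,
}

enum MeshNode {
    Leaf(Arc<Cluster>),        // shared geometry
    Group(Vec<Arc<MeshNode>>), // children may also appear under other groups
}
```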

@cwfitzgerald
Member

Good news: I can reproduce this with a simple code-based test case. 10k meshes/materials/textures repros nicely.

[image: test case screenshot]

By the way, Unreal Engine 5's Nanite system is heavily dependent on reusing instances of objects.

Interesting, I knew about the rendering tech but I never looked too much into how it's actually stored. That makes sense.

@John-Nagle
Contributor Author

Oh, good. A simple test case always helps.
Meanwhile, now that I have Tracy profiling running, I'm looking at how the threads are interacting. I'll have more to say on that soon.

@John-Nagle
Contributor Author

[screenshot: Tracy capture from 2022-02-06 20:25:51]

Profiling just as texture loading caught up. This shows the difference between frame times while textures are being loaded from other threads, and while they are not. Around 35ms/frame while textures are being loaded, down to 23ms/frame once loading is done.

"triage suspected" accounts for some of the difference, but not all of it.

@cwfitzgerald
Member

cwfitzgerald commented Feb 7, 2022

This problem has totally nerd-sniped me. I've been faffing about in wgpu trying to get performance improvements, and so far have gotten my demo from 39 fps up to 100 fps. I still need to upstream the changes, which will require making them less hacky, but that should all happen.

@John-Nagle
Contributor Author

That's great! I'm working on profiling my own stuff now.

Tracy isn't showing all my threads, even ones that are using substantial CPU. Not clear why. That capture above should have shown three more threads which do different things, not just the main thread and the multiple asset loader threads. Any ideas? I just started using Tracy and probably missed something.

@cwfitzgerald
Member

Tracy will only show threads that have spans on them, so if you want your threads to show up, you need to annotate the work done with spans (you can use the profiling crate for this).
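
Something along these lines, for example (the version number and feature name for the profiling crate are from memory, so double-check them):

```rust
// Cargo.toml (assumed): profiling = { version = "1", features = ["profile-with-tracy"] }

fn spawn_asset_fetcher() -> std::thread::JoinHandle<()> {
    std::thread::spawn(|| {
        // Give the thread a name Tracy can display.
        profiling::register_thread!("Asset fetcher");
        for _ in 0..100 {
            // Each unit of work gets a span so it shows up in the trace.
            profiling::scope!("load texture");
            // ... decode the PNG, hand it to rend3, etc. ...
        }
    })
}
```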

@John-Nagle
Contributor Author

Ah. That's it. Thanks. More profiling data soon.

@John-Nagle
Contributor Author

John-Nagle commented Feb 8, 2022

More profiling data. Large Tracy file:
babbagepalisade01.zip

This is the usual Babbage Palisade scene, from startup through loading to just sitting there refreshing. The part at the end, where the CPU load drops way down, is when the scene is just redrawing without changes.

What all those threads are doing:

  • Main thread - the window events and refreshing. Nothing else. This is Rend3's main thread.
  • Mesh loader - reading large JSON files, creating meshes, and feeding them to Rend3. If you zoom in far enough, you'll see "Add mesh", which is the actual call to renderer.add_mesh(). It's so fast that it's clear it's queuing an instruction for the main thread rather than doing the work in the calling thread (see the sketch after this list).
  • Asset loader - not doing all that much here, because its main job is to start the loading of textures at various LODs as the camera moves, and in this run, the camera is stationary.
  • Asset fetcher 0 .. 4. These are loading textures. For this run, they're all in the local file system cache, so there's little network I/O except for some that go out to the servers and get 404 errors. Mostly this is loading .PNG files from the cache. If you zoom in far enough you'll see "Add texture 2D" and "write texture" down in Rend3.
  • Priority queue - priority queue manager for the asset fetcher threads. Doesn't use much time.
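
The pattern I'm assuming behind add_mesh is roughly the following. This is just my guess at the shape of it, not Rend3's actual code:

```rust
use std::sync::Mutex;

// My guess at the shape of it, not Rend3's actual internals: the call from the
// loader thread just pushes an instruction; the main thread does the real work
// when it drains the queue during the next frame.
enum Instruction {
    AddMesh { /* vertex and index data would go here */ },
}

#[derive(Default)]
struct InstructionQueue {
    pending: Mutex<Vec<Instruction>>,
}

impl InstructionQueue {
    // Called from the mesh loader thread: cheap, just a push under a lock.
    fn add_mesh(&self) {
        self.pending.lock().unwrap().push(Instruction::AddMesh {});
    }

    // Called once per frame on the main thread: this is where the cost lands.
    fn drain(&self) -> Vec<Instruction> {
        std::mem::take(&mut *self.pending.lock().unwrap())
    }
}
```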

Notes:

  • Highest CPU usage is 59% over all 6 cores / 12 hyperthreads, so we're not out of compute power. I want to make the asset fetchers run their work at lower priority, but that's not in yet. There could be a priority inversion problem if they entered Rend3 at low priority and held locks there.
  • 46-48 FPS in the final stable state where the scene is not changing and all the loading code has gone idle.
  • Tracy Profiler 0.7.8 can read this, once unzipped.

So that's more detail.

@cwfitzgerald
Member

Just giving an update on the performance improvements: unfortunately this was a regression for other wgpu projects, so it couldn't be brought in as a whole. That being said, I have some good ideas for improving both cases. I would, for now, stick with 0.2 while these get sorted out; I can't promise they'll happen with any speed given how much is on my plate right now.

Will work on the backport shortly.

@John-Nagle
Contributor Author

Thanks.

I've converted over to "unreleased" from a few weeks back, and it's working well, although sluggish on big scenes. I'm working on another part of the system, concurrent mesh loading, and that's keeping me busy. So don't worry about the backport too much. The general speedup is more useful at this point.

@cwfitzgerald
Member

Closing due to #593
