Add PGO support #797

Closed
zamazan4ik opened this issue Apr 24, 2023 · 7 comments

Comments

@zamazan4ik

zamazan4ik commented Apr 24, 2023

To gain more performance: has anyone tried applying Profile-Guided Optimization (PGO) to Cemu? It could enable better compiler optimizations such as inlining and hot/cold code splitting. For projects like Rust, Clang, YDB, and CPython, PGO brings a good performance boost (usually up to 20%).

If it really helps, it would be nice to see PGO support upstream.

@Exzap
Member

Exzap commented Apr 24, 2023

I tested MSVC's PGO and it didn't make a difference. In Cemu most of the hot code is already hand-optimized to generate the most efficient assembly (e.g. avoiding branches, using SIMD where possible, even streamlining variable loads/stores). There just isn't enough wiggle room for PGO to really make a dent.

That said, I haven't specifically tested PGO on clang or gcc but I'd expect similar results.

If you believe there is performance to be gained, try it out for yourself and open a PR if you see any actual improvements.

Exzap closed this as not planned Apr 24, 2023
@zamazan4ik
Author

@Exzap is there any benchmark available for Cemu? Or is the best way to collect PGO profiles and compare non-PGO vs PGO versions to run games with Cemu and compare FPS/CPU utilization between the versions?

@Exzap
Member

Exzap commented Apr 25, 2023

Nah, we have no benchmark suite. Just run a bunch of games multiple times and write down the metrics. Make sure the conditions are as equal as possible. E.g. for some games like Super Mario 3D World or BotW you can load a save and get the same camera angle every time if you don't touch the controls after loading. This way of testing will give you some noise in the results, but if the improvement is smaller than the noise it's not worth it anyway.
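One way to make the "smaller than the noise" check concrete is to write down one average-FPS reading per run for each build and compare the difference of the means against the run-to-run spread. The sketch below is only an illustration: the function name, the noise rule (larger of the two standard deviations) and the numbers are all assumptions, and the readings are made-up placeholders rather than measurements.

```python
from statistics import mean, stdev


def compare_runs(baseline_fps, candidate_fps):
    """Return (mean difference, noise estimate, difference exceeds noise)."""
    diff = mean(candidate_fps) - mean(baseline_fps)
    # Crude noise estimate: the larger run-to-run standard deviation.
    noise = max(stdev(baseline_fps), stdev(candidate_fps))
    return diff, noise, abs(diff) > noise


# Placeholder readings: one average-FPS value per run of the same scene.
baseline = [25.0, 27.0, 26.0, 28.0, 26.0]
candidate = [29.0, 31.0, 30.0, 28.0, 32.0]

diff, noise, exceeds_noise = compare_runs(baseline, candidate)
print(f"diff={diff:.2f} FPS, noise={noise:.2f} FPS, exceeds noise: {exceeds_noise}")
```

With only a handful of runs this is a rough filter, not a statistical test; more runs per build tighten the estimate.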

@zamazan4ik
Author

zamazan4ik commented Apr 25, 2023

Well, I did some benchmarks.

My setup:

  • OS: Fedora 38
  • Linux kernel: 6.2.11-300 (mainline Fedora 38 kernel right now)
  • Video card: Nvidia 3080 Ti with Nvidia proprietary drivers
  • CPU: Ryzen 9 5900X 12 core
  • RAM: 48 GiB
  • SSD: Samsung 980 Pro 2 TiB
  • Compiler: Clang 16 (from Fedora repo)
  • llvm-bolt, perf2bolt versions: from Fedora 38 repositories - 16.0.0
  • Cemu: master branch
  • Tested game: Mario Kart 8

First, I cheated a little and chose the CPU render mode since, AFAIK, it is the only execution mode limited by the CPU rather than by the video card or the 60 FPS cap.

PGO with instrumentation makes Cemu run about twice as slowly (a drop from 25 FPS to 10-12). However, even after collecting the profile with instrumentation and recompiling an optimized build with it, I found that FPS dropped: to about 20 FPS, down from 25-28 FPS in the usual release build.
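For anyone who wants to repeat the instrumented run with Clang, the usual stages are: build with -fprofile-generate, play the game to produce raw profiles, merge them with llvm-profdata, then rebuild with -fprofile-use. The sketch below only prints the commands and runs nothing. The CMake invocation, the build directory names and the /tmp/cemu-pgo paths are assumptions for illustration, not the exact commands used in this thread.

```python
import shlex


def cmake_configure(build_dir, extra_flags):
    """Build a CMake configure command that passes extra_flags to Clang."""
    flags = " ".join(extra_flags)
    return [
        "cmake", "-S", ".", "-B", build_dir,
        "-DCMAKE_BUILD_TYPE=Release",
        "-DCMAKE_C_COMPILER=clang",
        "-DCMAKE_CXX_COMPILER=clang++",
        f"-DCMAKE_C_FLAGS={flags}",
        f"-DCMAKE_CXX_FLAGS={flags}",
    ]


def instrumented_pgo_steps(profile_dir="/tmp/cemu-pgo/profiles",
                           merged="/tmp/cemu-pgo/cemu.profdata"):
    """Return the command lines for an instrumentation-based PGO cycle."""
    return [
        # 1. Instrumented build: raw profiles are written to profile_dir at exit.
        cmake_configure("build-instr", [f"-fprofile-generate={profile_dir}"]),
        ["cmake", "--build", "build-instr"],
        # (Run the instrumented binary by hand here, e.g. play one race.)
        # 2. Merge the raw .profraw files into a single indexed profile.
        ["llvm-profdata", "merge", f"-output={merged}", profile_dir],
        # 3. Optimized build that consumes the merged profile.
        cmake_configure("build-pgo", [f"-fprofile-use={merged}"]),
        ["cmake", "--build", "build-pgo"],
    ]


steps = instrumented_pgo_steps()
for step in steps:
    print(shlex.join(step))
```

Absolute paths are used for the profile locations so that they do not depend on the directory the game or the compiler is started from.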

Next, I tried PGO with sampling (AutoFDO). I recorded a profile with perf record -b while playing Mario Kart 8 (finishing one race), then compiled Cemu again with the profile and the -gline-tables-only -Wl,--emit-relocs flags. Good news: average FPS improved. I didn't find a built-in option to measure the average FPS of a play session, so I estimated it by eye. I would say the average improvement is around 2-5 FPS, from 25-28 FPS to 28-32 FPS. I measured it while keeping the camera angle as similar as possible and the amount of rendered content the same (number of AI players, etc.). Not huge, of course, but the result was stable (reproduced multiple times on the same hardware). Important note: I didn't measure the average/median CPU load in non-PGO vs PGO modes.

As a last step, I tried to apply BOLT to the AutoFDO-optimized build. Here the experiment was short: Cemu segfaults after being optimized by BOLT. I will report it upstream later.

Not sure what we should do next with this result, but at least it is now written down here for the record :)
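The sampling flow described above can be sketched as three pieces: record branch samples with perf, convert perf.data into a profile Clang can read, and rebuild with that profile. The comment does not say which converter was used; the sketch assumes the AutoFDO tool create_llvm_prof, and the binary and file paths are hypothetical. It only assembles the command lines and flag lists and does not execute anything.

```python
import shlex


def autofdo_steps(binary, perf_data, profile):
    """Return (record command, convert command, compile flags, link flags)."""
    # Sample taken branches (-b) while the game is being played.
    record = ["perf", "record", "-b", "-o", perf_data, "--", binary]
    # Assumed converter: AutoFDO's create_llvm_prof (not named in the thread).
    convert = [
        "create_llvm_prof",
        f"--binary={binary}",
        f"--profile={perf_data}",
        f"--out={profile}",
    ]
    # Flags for the rebuild; the last two are the ones quoted in the comment.
    compile_flags = [f"-fprofile-sample-use={profile}", "-gline-tables-only"]
    link_flags = ["-Wl,--emit-relocs"]
    return record, convert, compile_flags, link_flags


record, convert, compile_flags, link_flags = autofdo_steps(
    binary="/tmp/cemu-pgo/Cemu",        # hypothetical path
    perf_data="/tmp/cemu-pgo/perf.data",  # hypothetical path
    profile="/tmp/cemu-pgo/cemu.afdo",    # hypothetical path
)
print(shlex.join(record))
print(shlex.join(convert))
print("compile flags:", " ".join(compile_flags))
print("link flags:", " ".join(link_flags))
```

The binary that is profiled generally needs line-table debug info as well, so that samples can be mapped back to source lines; whether the first build here had it is not stated.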

@zamazan4ik
Author

I also tried to test with Bayonetta 2, but it crashes with SIGSEGV in both Cemu binaries (usual release and AutoFDO-optimized) at the loading stage.

@jn64
Contributor

jn64 commented Apr 25, 2023

> Also tried to test with Bayonetta 2 but it SIGSEGV's in both (usual release and AutoFDO-optimized) Cemu binaries on a load stage.

That sounds like #781. By default, Bayonetta 2 uses the multi-core recompiler, while Mario Kart 8 uses the single-core recompiler.

@Squall-Leonhart
Contributor

> That sounds like #781. By default Bayonetta 2 uses multi-core recompiler, while Mario Kart 8 use single-core recompiler.

Mario Kart 8 had a bad game profile; it has since been updated to allow multi-core.
