Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #741
Comments
Using PGO to speed up some specific tasks by up to 12% doesn't seem worth the potential degradation of the non-optimized ones, which is inherent to PGO. |
Right now there is no proof that PGO degrades performance in the scenarios that matter to users. You can check how PGO is integrated into other projects like Clang, Rustc, Python, and others (more integrations are here - https://github.com/zamazan4ik/awesome-pgo#pgo-showcases). If you have good coverage of all scenarios, you can collect multiple profiles and merge them, and PGO will then optimize for all of those scenarios. Even such a generic merged profile can help optimize the program in general (e.g. Rustc does exactly the same thing in its PGO pipeline). If you think PGO profiles from If we are not able to collect generic-enough profiles - okay. We can run PGO benchmarks, document the results in the documentation, and integrate a PGO build mode into the build scripts, so that users/maintainers can decide on their own whether they want to optimize Broot with PGO or not. |
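The "collect multiple profiles, merge them" step above can be sketched with the plain rustc PGO flags and llvm-profdata. This is only a rough sketch, not Broot's build setup: the /tmp/pgo-data directory and the run helper are assumptions made for illustration, and the training runs themselves are left as a comment because which scenarios to exercise is exactly the open question. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical profile directory, chosen for this sketch only.
PGO_DIR="${PGO_DIR:-/tmp/pgo-data}"

# Print the command in dry-run mode (default), execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pgo_merge_workflow() {
    # 1. Build an instrumented binary that writes .profraw files to PGO_DIR.
    run env RUSTFLAGS="-Cprofile-generate=$PGO_DIR" cargo build --release

    # 2. Exercise the instrumented binary once per scenario here.
    #    Every run adds its own .profraw file to PGO_DIR.

    # 3. Merge all raw profiles found in PGO_DIR into a single profile.
    run llvm-profdata merge -o "$PGO_DIR/merged.profdata" "$PGO_DIR"

    # 4. Rebuild using the merged profile.
    run env RUSTFLAGS="-Cprofile-use=$PGO_DIR/merged.profdata" cargo build --release
}

pgo_merge_workflow
```

llvm-profdata has to match the LLVM version used by the Rust toolchain; it can be installed with rustup component add llvm-tools-preview.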
I don't see why broot should use an experimental technology that is hard to maintain for a single dev, for a small hypothetical speedup, when broot is more likely to be bottlenecked by the OS/hardware anyway. I wonder whether the benchmarks even do any IO. |
Same feeling. I've never seen impressive results in my tests of PGO and it never seemed worth the pain. So I'm not going to invest here unless I see new results. |
It depends on your definition of "experimental" :) If "experimental" means "new to Broot" - I agree. But PGO itself is not a novel technique at all. E.g. PGO was implemented in GCC somewhere near the 4.5 version (I am too young to remember such releases in practice), and Clang has also supported PGO for a while. I cannot quickly find when PGO was implemented in Rustc, but Rust's implementation fully relies on the LLVM one. From the usage perspective, PGO has been used for years as an optimization technique for the projects themselves (good examples are all Chromium-based browsers, Clang/GCC/Rust itself, CPython). From the companies' perspective, Google and Facebook are major users of PGO. E.g. Google uses PGO (in Sampling mode aka AutoFDO, but that's just an implementation detail). You can read about Google's experience here. So I do not agree that PGO is an experimental technology across the industry, but I agree that PGO adoption overall is lower than that of the "-O3" and "LTO" optimization options. Update: I forgot to mention LLVM BOLT. I agree that this technology can be considered "experimental", even though Facebook/Meta has huge experience with adopting it on their servers. According to my tests, there are a lot of caveats with BOLT in practice, like bugs, ridiculous memory consumption, etc.
Of course, I cannot estimate how hard the maintenance of this thing would be on your side. You can see how PGO is integrated into other projects here. You have multiple options for integrating PGO into a project, with different maintenance costs:
That's why I showed you the PGO improvement results on the Broot benchmarks :) If you think that these improvements are not important - okay, but in that case I do not understand why you have such benchmarks :D If you have CPU-bound benchmarks for something, that means they are important to you. Also, here you can see PGO improvements on other projects; some of them seem IO-bound at first but still get interesting improvements from PGO, like However, I agree that testing PGO directly on the Broot binary itself would be more interesting to see. I haven't done it yet. The issue is just an idea of how to (possibly) improve the performance - maybe someone will find this idea worth trying. |
In general or in Broot? If we are talking in general, I have all PGO results for real-life applications here. For every showcase you can follow the link and read about the PGO effects on the software's performance. Sometimes the effect is large (usually around 20% in compiler-like workloads), sometimes much smaller (like DragonflyDB). If we are talking about Broot: yes, right now we only see improvements in the project benchmarks, not directly in the performance of Broot itself - that needs to be tested as well. |
I decided not to pursue this ATM. This might be revised later. |
Hi!
Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects - the results are available here. Here you can find different applications from different domains that were accelerated with PGO: virtual machines (like QEMU and CrosVM), compilers, gRPC workloads, benchmark tools, databases, and much more. That's why I think it's worth trying to apply PGO to Broot. I ran some benchmarks and want to share my results.
Test environment
main branch (1b5c1838b3a533cab390def547ef5cfb892c47f3 commit)
Benchmark
As an evaluation and training set, I used these benchmarks https://github.com/Canop/broot/tree/main/benches via cargo bench. PGO was also trained on these benchmarks with cargo pgo bench (see the link to this awesome tool below). All measurements were done with the same background noise (as far as I can guarantee on this OS).
Results
The results are presented in the cargo bench format. Since I do not know the correct way to copy these fancy tables properly, I will attach screenshots (sorry for that).
Release run:
Instrumented compared to Release (here you can evaluate how much slower the benchmarks are with Instrumentation enabled):
Then I ran cargo bench once again with the Release version to restore the benchmarks state to a baseline Release.
Release + PGO optimized compared to Release:
As you can see, PGO helps achieve better performance, at least in the benchmarks provided by the project.
Possible future steps
I can suggest the following things to do:
I can also suggest evaluating LLVM BOLT as an additional optimization step after PGO.
For Rust projects, I recommend starting with cargo-pgo.
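For reference, the measurement sequence described in this issue can be sketched as a small script. It is a minimal sketch rather than the exact commands used for the screenshots: cargo bench and cargo pgo bench come from the text above, while cargo pgo optimize bench is my reading of the cargo-pgo README for benchmarking the PGO-optimized build and should be checked against the installed version. The run helper is hypothetical and only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# One-time setup (not executed by this script):
#   cargo install cargo-pgo
#   rustup component add llvm-tools-preview

# Print the command in dry-run mode (default), execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pgo_bench_workflow() {
    # 1. Baseline: benchmarks on a regular Release build.
    run cargo bench
    # 2. Training: run the same benchmarks on an instrumented build
    #    to collect PGO profiles.
    run cargo pgo bench
    # 3. Restore the Release baseline so the next comparison is against it.
    run cargo bench
    # 4. Rebuild with the collected profiles and benchmark the result.
    run cargo pgo optimize bench
}

pgo_bench_workflow
```

Run it from the root of a Broot checkout with DRY_RUN=0 to execute the steps for real.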