Enable frame pointers on all targets except x86. #107689
While omitting frame pointers improves performance, on most targets there are enough general purpose registers that dedicating one to the frame pointer barely has any effect. 32-bit x86 is one of the few exceptions: it has so few general purpose registers that using the frame pointer register as a general purpose register has a significant performance benefit.

Keeping the frame pointer makes it significantly easier to profile rustc, hopefully resulting in more perf improvements than keeping frame pointers costs. While DWARF unwind tables allow getting stack traces during profiling, there are significant limitations that make this not very useful for rustc:

* DWARF unwind table based unwinding is slow. This means it can't be done in real time and slows down offline analysis.
* As a consequence of not being able to do DWARF based unwinding in real time, it becomes necessary to capture the stack on every sample. This significantly increases profile sizes.
* Perf doesn't allow capturing more than 50k of the stack on every sample. Given that in rustc the stack is generally more than 50k, this results in partial stack traces, which frequently split a stack frame into two parts in flamegraphs: one part is at the root of the flamegraph, while the other is somewhere in the middle. This makes it really hard to determine how much time is spent in which function.
* Capturing large parts of the stack causes CPU overload on slower systems, resulting in samples getting dropped.

Another option is to use a tool like valgrind's callgrind. This is however significantly slower than a sampling profiler like perf, which barely has a performance impact at all.

All of this means that using frame pointer based unwinding is significantly better than any of the alternatives.
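To illustrate why frame pointer based unwinding is cheap enough to run on every sample, here is a minimal sketch of the idea in Rust. It does not touch real stack memory or reflect how perf is implemented. It models the stack as a slice of words and assumes a simplified layout where each frame stores the caller's frame pointer followed by the return address, with a saved frame pointer of 0 marking the outermost frame. The names `unwind_with_frame_pointers` and `example_stack` are hypothetical and exist only for this example.

```rust
/// Walk a chain of saved frame pointers in a simulated stack.
///
/// Assumed layout (illustrative only): `stack[fp]` holds the caller's frame
/// pointer and `stack[fp + 1]` holds the return address. A saved frame
/// pointer of 0 marks the outermost frame.
fn unwind_with_frame_pointers(stack: &[usize], mut fp: usize, max_depth: usize) -> Vec<usize> {
    let mut return_addrs = Vec::new();
    while fp != 0 && return_addrs.len() < max_depth {
        // Stop on a corrupt chain instead of reading out of bounds.
        let saved_fp = stack.get(fp).copied();
        let ret = fp.checked_add(1).and_then(|i| stack.get(i)).copied();
        let (Some(saved_fp), Some(ret)) = (saved_fp, ret) else {
            break;
        };
        return_addrs.push(ret);
        // Callers live at higher addresses in this model, so a saved frame
        // pointer that does not increase indicates a loop or corruption.
        if saved_fp != 0 && saved_fp <= fp {
            break;
        }
        fp = saved_fp;
    }
    return_addrs
}

/// Build a three-frame example stack: `leaf` called by `middle` called by `main`.
fn example_stack() -> Vec<usize> {
    vec![
        0,      // word 0: unused, so that 0 can act as the null frame pointer
        4,      // word 1: leaf frame, saved frame pointer of middle
        0x3000, // word 2: leaf frame, return address into middle
        0xAAAA, // word 3: a local variable, skipped by the unwinder
        7,      // word 4: middle frame, saved frame pointer of main
        0x2000, // word 5: middle frame, return address into main
        0xBBBB, // word 6: a local variable, skipped by the unwinder
        0,      // word 7: main frame, no caller frame
        0x1000, // word 8: main frame, return address into the runtime
    ]
}

fn main() {
    let stack = example_stack();
    // Start from the innermost frame, as a profiler would on a sample.
    for ret in unwind_with_frame_pointers(&stack, 1, 64) {
        println!("return address: {ret:#x}");
    }
}
```

Each step is two loads and a comparison, with no table lookup and no need to copy the stack out of the sampled process. That is the property the description above relies on when it contrasts this approach with DWARF based unwinding.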
r? @ozkanonur (rustbot has picked a reviewer for you, use r? to override)
@bors try @rust-timer queue
⌛ Trying commit 3e20e9c with merge 5b9166cf87b24623137256450b4545d86fc99a3f...
See also https://lwn.net/Articles/919940/.
☀️ Try build successful - checks-actions
Finished benchmarking commit (5b9166cf87b24623137256450b4545d86fc99a3f): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
The regression in terms of cycles seems to be only half of the regression in terms of instructions for primary benchmarks. Even so, it is higher than I hoped. Maybe allowing the frame pointer to be omitted for leaf functions only would reduce the impact? No clue if perf will still be able to unwind the stack in that case though.
Given that it's rather easy to build a local copy of rustc, should we really enable this by default for production builds? If someone wants to spend the time and effort to profile That being said, the perf. hit is not that terrible.
A local build still takes a while though, and missing PGO and BOLT can also skew profiles. It also prevents just quickly profiling rustc while compiling that one crate that takes a while if you didn't plan to do much profiling in advance. As for my own use case, I want to profile cg_clif, but on my laptop it takes quite a while to compile rustc and the dev desktops don't allow usage of perf for security reasons.
By the way, is
For
Rustc is shipped with both symbol names and line info as can be observed by triggering a panic using
Any progress here? It would be really nice to get this and allow easier profiling of rustc dist binaries.
Haven't found the time to work on this. I want to benchmark omitting frame pointers only for leaf functions too.
As a small data point, I just spent way too much time on rust-lang/rust-clippy#10532 enabling frame pointers on my rustc for use with clippy so I could get an actual profile of the situation (without which I wasn't able to figure out the cause). If the dist toolchain had frame pointers already, that would have been way easier.
Closing in favor of #114323 as I don't have time to work on this.