Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable frame pointers on all targets except x86. #107689

Closed
wants to merge 1 commit into from

Conversation

bjorn3
Copy link
Member

@bjorn3 bjorn3 commented Feb 5, 2023

While omitting frame pointers improves performance, on most targets there are enough general purpose registers that dedicating one to the frame pointer barely has any effect. 32-bit x86 is one of the few exceptions where there are so little general purpose registers that using the register meant for the frame pointer as general purpose register has a significant performance benefit. Keeping the frame pointer makes it significantly easier to profile rustc, hopefully resulting in more perf improvements than keeping frame pointers costs. While DWARF unwind tables allow getting stack traces during profiling, there are significant limitations that make this not very useful for rustc:

  • DWARF unwind table based unwinding is slow. This means it can't be done in real time and slows down offline analysis.
  • As a concequence of not being able to do DWARF based unwinding it becomes necessary to capture the stack on every sample. This significantly increases profile sizes.
  • Perf doesn't allow capturing more than 50k of the stack on every sample. Given that in rustc the stack is generally more than 50k, this results in partial stack traces which frequently splits a stack frame in two parts in flamegraphs where one part is at the root of the flamegraph, while the other is somewhere in the middle. This makes it really hard to determine how much time is spent in which function.
  • Capturing large parts of the stack causes cpu overload on slower systems, resulting in samples getting dropped.

Another option is to use a tool like valgrind's callgrind. This is however significantly slower than a sampling profiler like perf which barely has a performance impact at all. All of this means that using frame pointer based unwinding is significantly better than any of the alternatives.

While omitting frame pointers improves performance, on most targets
there are enough general purpose registers that dedicating one to the
frame pointer barely has any effect. 32-bit x86 is one of the few
exceptions where there are so little general purpose registers that
using the register meant for the frame pointer as general purpose
register has a significant performance benefit. Keeping the frame
pointer makes it significantly easier to profile rustc, hopefully
resulting in more perf improvements than keeping frame pointers costs.
While DWARF unwind tables allow getting stack traces during profiling,
there are significant limitations that make this not very useful for rustc:
* DWARF unwind table based unwinding is slow. This means it can't be
  done in real time and slows down offline analysis.
* As a concequence of not being able to do DWARF based unwinding it
  becomes necessary to capture the stack on every sample. This
  significantly increases profile sizes.
* Perf doesn't allow capturing more than 50k of the stack on every
  sample. Given that in rustc the stack is generally more than 50k,
  this results in partial stack traces which frequently splits a stack
  frame in two parts in flamegraphs where one part is at the root of
  the flamegraph, while the other is somewhere in the middle. This
  makes it really hard to determine how much time is spent in which
  function.
* Capturing large parts of the stack causes cpu overload on slower
  systems, resulting in samples getting dropped.
Another option is to use a tool like valgrind's callgrind. This is
however significantly slower than a sampling profiler like perf which
barely has a performance impact at all. All of this means that using
frame pointer based unwinding is significantly better than any of the
alternatives.
@rustbot
Copy link
Collaborator

rustbot commented Feb 5, 2023

r? @ozkanonur

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Feb 5, 2023
@bjorn3
Copy link
Member Author

bjorn3 commented Feb 5, 2023

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 5, 2023
@bors
Copy link
Contributor

bors commented Feb 5, 2023

⌛ Trying commit 3e20e9c with merge 5b9166cf87b24623137256450b4545d86fc99a3f...

@bjorn3
Copy link
Member Author

bjorn3 commented Feb 5, 2023

See also https://lwn.net/Articles/919940/.

@bors
Copy link
Contributor

bors commented Feb 5, 2023

☀️ Try build successful - checks-actions
Build commit: 5b9166cf87b24623137256450b4545d86fc99a3f (5b9166cf87b24623137256450b4545d86fc99a3f)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (5b9166cf87b24623137256450b4545d86fc99a3f): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.6% [0.4%, 3.9%] 199
Regressions ❌
(secondary)
1.7% [0.2%, 3.5%] 178
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.6% [0.4%, 3.9%] 199

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
3.4% [2.4%, 6.1%] 4
Improvements ✅
(primary)
-1.2% [-1.2%, -1.2%] 1
Improvements ✅
(secondary)
-2.8% [-3.1%, -2.3%] 4
All ❌✅ (primary) -1.2% [-1.2%, -1.2%] 1

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.4% [1.4%, 1.4%] 1
Regressions ❌
(secondary)
3.0% [3.0%, 3.0%] 1
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.4% [1.4%, 1.4%] 1

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 5, 2023
@bjorn3
Copy link
Member Author

bjorn3 commented Feb 5, 2023

The regression in terms of cycles seems to be only half of the regression in terms of instructions for primary benchmarks. Even so it is higher than I hoped. Maybe allowing the frame pointer to be omitted for leaf function only would reduce the impact? No clue if perf will still be able to unwind the stack in that case though.

@Kobzol
Copy link
Contributor

Kobzol commented Feb 6, 2023

Given that it's rather easy to build a local copy of rustc, should we really enable this by default for production builds? If someone wants to spend the time and effort to profile rustc (and actually understand what's happening in the stack traces), they will probably need to download the rust repo anyway, and then a local version can be built with one command (and something like build.force-frame-pointers = true in config.toml). These builds wouldn't be PGO and BOLT optimized though.

That being said, the perf. hit is not that terrible.

@bjorn3
Copy link
Member Author

bjorn3 commented Feb 6, 2023

A local build still takes a while though and missing PGO and BOLT can also skew profiles. It also prevents just quickly profiling rustc while compiling that one crate that takes a while if you didn't plan to do much profiling in advance. As for my own use case, I want to profile cg_clif, but on my laptop it takes quite a while to compile rustc and the dev desktops don't allow usage of perf for security reasons.

1 similar comment
@bjorn3
Copy link
Member Author

bjorn3 commented Feb 6, 2023

A local build still takes a while though and missing PGO and BOLT can also skew profiles. It also prevents just quickly profiling rustc while compiling that one crate that takes a while if you didn't plan to do much profiling in advance. As for my own use case, I want to profile cg_clif, but on my laptop it takes quite a while to compile rustc and the dev desktops don't allow usage of perf for security reasons.

@Kobzol
Copy link
Contributor

Kobzol commented Feb 6, 2023

By the way, is rustc distributed with debug symbols? If not, then even if you enable frame pointers, the profiling output won't actually be very useful/readable, right?

@klensy
Copy link
Contributor

klensy commented Feb 6, 2023

rustc distributed with debug symbols?

For x86_64-pc-windows-msvc i see pdb files, so i guess answer is yes.

@bjorn3
Copy link
Member Author

bjorn3 commented Feb 6, 2023

Rustc is shipped with both symbol names and line info as can be observed by triggering a panic using -Ztreat-err-as-bug.

@Noratrieb
Copy link
Member

any progress here? It would be really nice to get this and allow easier profiling of rustc dist binaries

@bjorn3
Copy link
Member Author

bjorn3 commented Mar 22, 2023

Haven't found the time to work on this. I want to benchmark omitting frame pointers only for leaf functions too.

@Noratrieb
Copy link
Member

As a small data point, I just spent way too much time on rust-lang/rust-clippy#10532 enabling frame pointers on my rustc for use with clippy so I could get an actual profile of the situation (without which I wasn't able to figure out the cause). If the dist toolchain had frame pointers already, that would have been way easier.

@bjorn3
Copy link
Member Author

bjorn3 commented Aug 3, 2023

Closing in favor of #114323 as I don't have time to work on this.

@bjorn3 bjorn3 closed this Aug 3, 2023
@bjorn3 bjorn3 deleted the rustc_frame_pointers branch August 3, 2023 22:06
@onur-ozkan onur-ozkan removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perf-regression Performance regression. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants