[experimental] Allow for RLIBs to only contain metadata. #48373
r? @cramertj (rust_highfive has picked a reviewer for you, use r? to override)
@bors try
⌛ Trying commit 0c034ad50096b3fcb2d1c29aa538ef5ad22d6603 with merge 1f12186df3cc7f2c3af0b55b9089a7b22fead331...
☀️ Test successful - status-travis
er, well, not entirely sure this is even somewhat meaningful, but... http://perf.rust-lang.org/compare.html?start=27a046e9338fb0455c33b13e8fe28da78212dedc&end=1f12186df3cc7f2c3af0b55b9089a7b22fead331&stat=instructions%3Au
Alright, interesting. I think this mainly shows that we are not measuring quite the right thing with our benchmarks. MIR-only RLIBs postpone any LLVM work until it is actually needed. In many of the benchmarks we only compile RLIBs, though, so we never need to generate code. However, I don't think that's a realistic use case. If you are compiling just for the diagnostics, [...] I'll try to gather some more useful data.
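To make the measurement issue concrete, here is a toy model of where LLVM work happens. The types and names are hypothetical, not rustc's data structures. With eager codegen every RLIB hands its functions to LLVM as it is compiled; with MIR-only RLIBs that work is deferred until a binary is built, so a benchmark that only compiles RLIBs records no LLVM work at all. The model deliberately ignores deduplication of generic instances.

```rust
/// Hypothetical build modes for the toy model.
#[derive(Clone, Copy)]
enum Mode {
    /// Every RLIB is codegened (handed to LLVM) when it is compiled.
    Eager,
    /// RLIBs only carry metadata/MIR; codegen is deferred to the final binary.
    MirOnly,
}

/// Hypothetical crate description: is it a binary, and how many functions does it define?
struct Crate {
    is_binary: bool,
    local_fns: usize,
}

/// Number of function definitions handed to LLVM when building `crates`.
fn llvm_work(crates: &[Crate], mode: Mode) -> usize {
    let total: usize = crates.iter().map(|c| c.local_fns).sum();
    match mode {
        Mode::Eager => total,
        // Deferred work only materializes if some crate in the graph is a binary.
        Mode::MirOnly if crates.iter().any(|c| c.is_binary) => total,
        Mode::MirOnly => 0,
    }
}

fn main() {
    // An "RLIBs only" benchmark: nothing is ever linked into a binary.
    let libs_only = [
        Crate { is_binary: false, local_fns: 10 },
        Crate { is_binary: false, local_fns: 5 },
    ];
    println!("eager:    {}", llvm_work(&libs_only, Mode::Eager));
    println!("MIR-only: {}", llvm_work(&libs_only, Mode::MirOnly));
}
```

In this model the "RLIBs only" case reports 15 definitions for eager codegen and 0 for MIR-only, even though no real work was saved once a binary has to be built.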
(force-pushed from 0c034ad to 9614746)
So, I ran some benchmarks and the numbers are not all that encouraging, at least for now. While we generate a lot less code overall, it seems that the single-threaded [...]

Here are some of the numbers I collected. The tables show the aggregate time spent on various tasks while compiling the whole crate graph. In many cases we do less work overall, but due to worse parallelization, wall-clock time increases. The last table shows the number of functions we instantiate for the whole crate graph; it becomes substantially smaller in most cases. Maybe we should look into re-using generic instances separately if MIR-only RLIBs are blocked on parallelizing trans.

Tables:
- ripgrep - cargo build
- encoding-rs - cargo test --no-run
- webrender - cargo build
- futures-rs - cargo test --no-run
- tokio-webpush-simple - cargo build
- Number of LLVM function definitions generated for whole crate graph
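Why the whole-crate-graph function count shrinks can be seen with a toy calculation. With per-crate codegen, each crate emits its own copy of every generic instance it uses, while instantiating once per crate graph emits each distinct instance a single time. The instance names are hypothetical, and the sketch ignores inlining and the other details of how rustc actually decides where instances are emitted.

```rust
use std::collections::HashSet;

/// Each inner slice lists the generic instances one crate needs
/// (hypothetical names such as "Vec<u32>::push").
/// Per-crate codegen: every crate emits its own copy of each instance it uses.
fn per_crate_definitions(crates: &[&[&str]]) -> usize {
    crates
        .iter()
        .map(|c| c.iter().collect::<HashSet<_>>().len())
        .sum()
}

/// Crate-graph-wide codegen: each distinct instance is emitted once.
fn crate_graph_definitions(crates: &[&[&str]]) -> usize {
    crates.iter().flat_map(|c| c.iter()).collect::<HashSet<_>>().len()
}

fn main() {
    let crates: [&[&str]; 3] = [
        &["Vec<u32>::push", "Option<u8>::map"],
        &["Vec<u32>::push"],
        &["Vec<u32>::push", "HashMap<String, u32>::insert"],
    ];
    let per_crate = per_crate_definitions(&crates);
    let once = crate_graph_definitions(&crates);
    println!("per crate: {per_crate}, once per graph: {once}");
}
```

For this example data the program prints 5 definitions per crate versus 3 once per graph, because `Vec<u32>::push` is only generated a single time.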
cc @rust-lang/compiler
Huh? LLVM IR generation and optimization are parallelized with multiple CGUs (and indeed the times you report are generally smaller for MIR-only rlibs). Are you referring to the steps before that, like translation item collection, CGU partitioning, etc.?
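For readers unfamiliar with the two steps named here, a toy version: translation item collection as a worklist walk from the roots over a hypothetical call graph of already-monomorphic functions, followed by a naive round-robin split into codegen units. rustc's real collector and partitioner are considerably more involved (they resolve generics and use partitioning heuristics); all names here are made up.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Walks a hypothetical call graph from `roots`, returning every reachable
/// (already monomorphic) function exactly once, in discovery order.
fn collect_items<'a>(roots: &[&'a str], calls: &HashMap<&'a str, Vec<&'a str>>) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    let mut items = Vec::new();
    let mut queue: VecDeque<&'a str> = roots.iter().copied().collect();
    while let Some(f) = queue.pop_front() {
        if !seen.insert(f) {
            continue; // already collected
        }
        items.push(f);
        if let Some(callees) = calls.get(f) {
            queue.extend(callees.iter().copied());
        }
    }
    items
}

/// Splits the collected items into `n` codegen units, round-robin.
fn partition<'a>(items: &[&'a str], n: usize) -> Vec<Vec<&'a str>> {
    assert!(n > 0, "need at least one codegen unit");
    let mut cgus = vec![Vec::new(); n];
    for (i, item) in items.iter().enumerate() {
        cgus[i % n].push(*item);
    }
    cgus
}

fn main() {
    let calls = HashMap::from([
        ("main", vec!["foo<u32>", "bar"]),
        ("foo<u32>", vec!["bar"]),
        ("bar", vec![]),
    ]);
    let items = collect_items(&["main"], &calls);
    println!("{:?}", partition(&items, 2));
}
```

Both steps run before any LLVM module exists, which is why they are not covered by the per-CGU parallelism mentioned above.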
@rkruppe I think @michaelwoerister is referring to the rustc code for generating the LLVM IR, not the LLVM execution.
Seems like a good reason to push harder on parallel rustc =)
Do you have any explanation for this?
Why would it make more with mir-only rlibs?
@nikomatsakis Oh! I was under the impression that's parallelized as well. Is it not?
Yes.
My theory is that it's because [...]
It's not, unfortunately. Translation reads the tcx, which cannot be accessed from multiple threads.
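A simplified model of the split described here, with hypothetical names rather than rustc's actual code. Translation needs a context that, like the tcx, cannot be shared across threads (modeled with `Rc<RefCell<..>>`, which is neither `Send` nor `Sync`), so it runs serially on one thread. The per-CGU LLVM work only needs the finished module, so it can fan out across threads. In this model, anything that moves work into the first phase lengthens the serial part of the build.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::thread;

/// Hypothetical stand-in for the type context. `Rc<RefCell<..>>` is neither
/// `Send` nor `Sync`, so the compiler rejects any attempt to use it from another thread.
struct Ctxt {
    queries: Rc<RefCell<Vec<String>>>,
}

impl Ctxt {
    /// "Translation": reads (and here, records in) the context and produces
    /// textual IR for one codegen unit.
    fn translate_cgu(&self, cgu: &[&str]) -> Vec<String> {
        cgu.iter()
            .map(|f| {
                self.queries.borrow_mut().push(f.to_string());
                format!("define {f}")
            })
            .collect()
    }
}

/// "LLVM optimization": only needs the finished module, not the context.
fn optimize(module: Vec<String>) -> usize {
    module.len() // stand-in for real work: report how many definitions were processed
}

fn codegen(ctxt: &Ctxt, cgus: &[Vec<&str>]) -> Vec<usize> {
    // Phase 1 (serial): every CGU is translated on the current thread.
    let modules: Vec<Vec<String>> = cgus.iter().map(|c| ctxt.translate_cgu(c)).collect();
    // Phase 2 (parallel): each module is optimized on its own scoped thread.
    thread::scope(|s| {
        let handles: Vec<_> = modules
            .into_iter()
            .map(|m| s.spawn(move || optimize(m)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let ctxt = Ctxt { queries: Rc::new(RefCell::new(Vec::new())) };
    let cgus = vec![vec!["main", "bar"], vec!["foo<u32>"]];
    println!("{:?}", codegen(&ctxt, &cgus));
}
```

Moving the `ctxt.translate_cgu` call into the spawned closure would not compile, which is the same constraint the tcx imposes.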
So what you're saying is we need a workspace-wide incremental codegen cache? So if one leaf crate does codegen, the result gets cached and other leaf crates can reuse the existing work?
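One way to picture the idea: a hypothetical cache shared by all crates in a workspace, keyed by the monomorphized instance. The first leaf crate that needs an instance pays for codegen; later leaf crates get a hit. This is a sketch of the concept only, with made-up names; whether it can be made to work in rustc is exactly the open question, and keying, invalidation, and symbol naming are not modeled.

```rust
use std::collections::HashMap;

/// Hypothetical key for a monomorphized function: (generic item, type arguments).
type InstanceKey = (&'static str, &'static str);

/// Hypothetical workspace-wide cache of generated object code.
#[derive(Default)]
struct CodegenCache {
    objects: HashMap<InstanceKey, String>,
    hits: usize,
    misses: usize,
}

impl CodegenCache {
    /// Returns the cached object for `key`, generating (and caching) it on a miss.
    fn get_or_codegen(&mut self, key: InstanceKey) -> &str {
        if self.objects.contains_key(&key) {
            self.hits += 1;
        } else {
            self.misses += 1;
        }
        self.objects
            .entry(key)
            .or_insert_with(|| format!("obj[{}<{}>]", key.0, key.1))
    }
}

/// Compiles one leaf crate: every instance it needs goes through the shared cache.
fn compile_leaf(cache: &mut CodegenCache, needed: &[InstanceKey]) {
    for &key in needed {
        cache.get_or_codegen(key);
    }
}

fn main() {
    let mut cache = CodegenCache::default();
    compile_leaf(&mut cache, &[("Vec::push", "u32"), ("Option::map", "u8")]);
    compile_leaf(&mut cache, &[("Vec::push", "u32"), ("HashMap::insert", "String")]);
    println!("hits: {}, misses: {}", cache.hits, cache.misses);
}
```

Here the second leaf crate reuses `Vec::push<u32>`, giving 1 hit and 3 misses instead of 4 separate codegen runs.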
I'm not sure if we could make this work. Maybe. It would solve part of the problem.
☔ The latest upstream changes (presumably #48449) made this pull request unmergeable. Please resolve the merge conflicts.
Posted the results of this experiment in #38913. Closing.
Don't recompute SymbolExportLevel for upstream crates. The data collected in #48373 suggests that we can avoid generating up to 30% of the LLVM definitions by instantiating each function monomorphization only once per crate graph. Some more data, collected with a [proof-of-concept implementation](https://github.com/michaelwoerister/rust/commits/share-generics) of re-using monomorphizations (which is less efficient than the MIR-only RLIB approach), suggests that we can still save around 25% of LLVM definitions. So far, this PR only cleans up the handling of symbol export status. It's still too early to review.
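Re-using monomorphizations roughly means that a downstream crate checks whether an upstream crate already exports a given instance before instantiating its own copy. The sketch below is a hypothetical model of that decision only: the types and names are made up, real symbol export levels and linkage are not modeled, and it does not claim to reflect how the linked branch implements it.

```rust
use std::collections::HashSet;

/// Hypothetical summary of an upstream crate: the monomorphizations it
/// already contains and exports to downstream crates.
struct UpstreamCrate {
    exported_instances: HashSet<String>,
}

/// What a downstream crate does for one generic instance it needs.
#[derive(Debug, PartialEq)]
enum Action {
    /// Link against the copy an upstream crate already exports.
    ReuseUpstream,
    /// No upstream copy is available, so instantiate a local one.
    Instantiate,
}

fn plan(upstreams: &[UpstreamCrate], needed: &[&str]) -> Vec<Action> {
    needed
        .iter()
        .map(|inst| {
            if upstreams.iter().any(|u| u.exported_instances.contains(*inst)) {
                Action::ReuseUpstream
            } else {
                Action::Instantiate
            }
        })
        .collect()
}

fn main() {
    let upstream = UpstreamCrate {
        exported_instances: HashSet::from(["Vec<u32>::push".to_string()]),
    };
    let actions = plan(&[upstream], &["Vec<u32>::push", "Vec<MyType>::push"]);
    println!("{actions:?}");
}
```

This prints `[ReuseUpstream, Instantiate]`: only the instance that no upstream crate exports is generated locally, which is where the saved LLVM definitions would come from.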
This is an experimental implementation of #38913. I want to try and see if I can get it through a perf.rlo run. No real need to review this yet.