parallelize LLVM optimization and codegen passes #16367
Have you clocked the resulting code performance difference? This seems like it would be great for Servo.
Very excited about this. I don't see the words 'codegen-threads' in this patch. Are you sure it exists? What happens when you specify |
Replace the two with --codegen-tasks ? On Aug 9, 2014, at 7:45 AM, "Brian Anderson" notifications@github.com wrote:
// For LTO purposes, the bytecode of this library is also
// inserted into the archive. We currently do this only when
// codegen_units == 1, so we don't have to deal with multiple
// bitcode files per crate.
Could each module be linked into one module to be emitted? (is that too timing-intensive?)
Otherwise, could we name the files `bytecodeN.deflate` and list how many bytecodes are in the metadata?
> Could each module be linked into one module to be emitted? (is that too timing-intensive?)

I haven't tried this yet. I don't think it would take much longer than we already spend linking the object files together. The only problem is, all the LLVM modules are in separate contexts, so we would need to serialize each one and then deserialize into a shared context for linking.

> Otherwise, could we name the files bytecodeN.deflate and list how many bytecodes are in the metadata?

Yeah, this is my preferred solution (which I also haven't tried implementing yet).
Could you describe some of the difficulties with sharing the |
This is also some super amazing work, I'm incredibly excited to see where this goes! Major props @epdtry! 🐟
}

match config.opt_level {
Some(opt_level) => {
While you're at it, could you 4-space tab this match?
I don't know if it's applicable, but the way we do it in Crystal is to have one LLVM module for each "logical unit". In our case each logical unit is a class or a module. Maybe in Rust a "logical unit" is a struct, an array, etc., together with all its impls. Then you can also have another logical unit for the top-level functions. Then we fire up N threads and each one takes a task (an LLVM module) and compiles it. This greatly reduces the compilation time. When you split your whole program into N modules and fire up N threads to compile those (as you are proposing here), a thread that finishes early is left without a job to do, so it becomes idle. With M smaller modules and N threads, with M > N, a thread that finishes can start working on another module, reducing the idle time. Additionally, before compiling each module we write its bitcode to a .bc file in a hidden directory (.crystal, in our case). We then compare that .bc file to the .bc file generated by the previous run. If they turn out to be the same (and this will be true as long as you don't modify any impl of that logical unit), we can safely reuse the .o file of the previous run. This, again, dramatically reduces the time to recompile a project that had minimal changes. Bits of the source code implementing this behaviour are here and here, in case you want to take a look.
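For readers following along, the "many small modules, few threads" scheme in the comment above can be sketched with a shared work queue. This is only an illustration in present-day Rust, not code from Crystal or from this branch; `compile_module` and `run_workers` are hypothetical names, and each module is reduced to a number standing for its compile cost.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for compiling one small module.
/// A real compiler would run optimization and codegen here.
fn compile_module(cost: u64) -> u64 {
    cost
}

/// Run `n_threads` workers over a shared queue of module costs.
/// Returns the total work each worker ended up doing.
fn run_workers(module_costs: Vec<u64>, n_threads: usize) -> Vec<u64> {
    let queue = Arc::new(Mutex::new(VecDeque::from(module_costs)));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                let mut done = 0;
                loop {
                    // Hold the lock only long enough to pop one task.
                    let task = queue.lock().unwrap().pop_front();
                    match task {
                        Some(cost) => done += compile_module(cost),
                        // Queue is empty: this worker is finished.
                        None => break,
                    }
                }
                done
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `run_workers(vec![5, 1, 1, 1, 1, 1, 3, 2], 3)` hands eight tasks to three threads. A thread that finishes a cheap task simply pops the next one, which is the idle-time reduction the comment describes.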
^THIS^ !!! Please, please implement incremental compilation! Though I wonder if (transitively) unchanged modules could be culled right after the resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.
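The bitcode comparison from the Crystal comment (reuse the previous `.o` when the new bitcode is byte-for-byte identical to the last run's) might look roughly like this. It is a sketch under assumptions: the `previous` map is a hypothetical in-memory stand-in for the saved `.bc` files, and `Action` and `plan_unit` are illustrative names.

```rust
use std::collections::HashMap;

/// Outcome of the reuse check for one logical unit.
#[derive(Debug, PartialEq)]
enum Action {
    /// Bitcode is unchanged, so the previous object file can be kept.
    ReuseObject,
    /// Bitcode is new or different, so the unit must be compiled again.
    Recompile,
}

/// `previous` maps a unit name to the bitcode bytes saved by the last build
/// (hypothetical stand-in for reading the old `.bc` file from disk).
fn plan_unit(previous: &HashMap<String, Vec<u8>>, name: &str, new_bitcode: &[u8]) -> Action {
    match previous.get(name) {
        Some(old) if old.as_slice() == new_bitcode => Action::ReuseObject,
        _ => Action::Recompile,
    }
}
```

Comparing output bitcode sidesteps the question of which source edits matter, at the price of still translating every unit before the check can run.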
I would say that the 'logical unit' in Rust is a module, they tend to be fairly small (at least relative to crate size) and are naturally self-contained. It is probably worth getting data (at least) for smaller units - thanks for the idea! Incremental compilation is the next part of the project - looking forward to what comes out of that :-)
Compiling
It's a codegen flag, so the flag name
Like most of
One way I tried to address this problem was by adding some basic load balancing:
This is basically my next project. Translation is the #2 time sink in |
@alexcrichton: Regarding |
I would definitely expect an It looked like it would make parts of this much nicer to have access to the raw session rather than duplicating some logic here and there, but it may not be too worth it in the end.
I don't think that any of our tests actually use the object file emitted, they just emit it. I also recall that the linker always succeeded in creating an object, but the object itself was just unusable (for one reason or another). Again though, this could all just be misremembering, or some bug which has since been fixed! |
The |
Oh dear, I must be overlooking a test! I only see two instances of |
OK, let me back up. I think the relevant part of the design was unclear. On the master branch, On this branch, So, on this branch, any test that involves compiling and running Rust code will end up using |
Oh wow, I missed that entirely, I thought it was only used for In that case, I'm definitely willing to trust |
codegen_units: uint = (1, parse_uint,
"divide crate into N units for optimization and codegen"),
codegen_threads: uint = (1, parse_uint,
"number of worker threads to use when running codegen"),
Given there was no benefit to having different values here, let's just have one option.
OK, looks good! r=me with all the changes (most of which are nits, TBH) and with Alex's review. @alexcrichton r? (specifically the stuff in back and concerning linking, about which I have no idea).
@vadimcn: Wow, nice detective work! I would never have expected Here's a second reason why the Windows bot should never have had a problem to begin with: Anyway, I've added a workaround that should handle these inconsistencies between Windows toolchains. Now (on Windows) |
Well then, that's a new segfault I've never seen before!
Break up `CrateContext` into `SharedCrateContext` and `LocalCrateContext`. The local piece corresponds to a single compilation unit, and contains all LLVM-related components. (LLVM data structures are tied to a specific `LLVMContext`, and we will need separate `LLVMContext`s to safely run multithreaded optimization.) The shared piece contains data structures that need to be shared across all compilation units, such as the `ty::ctxt` and some tables related to crate metadata.
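As a rough picture of the shared/local split described in this commit message, the sketch below uses present-day Rust with hypothetical fields. The real structures hold `ty::ctxt`, metadata tables and LLVM handles, none of which are modelled here; a crate name and a list of symbols stand in for them.

```rust
/// Data needed by every compilation unit (stand-in for type tables and
/// crate metadata). The field is hypothetical.
struct SharedCrateContext {
    crate_name: String,
}

/// State belonging to exactly one compilation unit. In the compiler this is
/// where the LLVM context and module would live; a symbol list stands in.
struct LocalCrateContext {
    unit_index: usize,
    symbols: Vec<String>,
}

/// Translate one item: read from the shared piece, write to the local piece.
fn translate_item(shared: &SharedCrateContext, local: &mut LocalCrateContext, item: &str) {
    let symbol = format!("{}::{}", shared.crate_name, item);
    local.symbols.push(symbol);
}

/// Create `units` empty local contexts, numbered from zero.
fn new_local_contexts(units: usize) -> Vec<LocalCrateContext> {
    (0..units)
        .map(|unit_index| LocalCrateContext { unit_index, symbols: Vec::new() })
        .collect()
}
```

The point of the shape is that everything tied to one unit sits behind a single `&mut LocalCrateContext`, so it can later be handed to a worker without touching the shared tables.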
Refactor the code in `llvm::back` that invokes LLVM optimization and codegen passes so that it can be called from worker threads. (Previously, it used `&Session` extensively, and `Session` is not `Share`.) The new code can handle multiple compilation units, by compiling each unit to `crate.0.o`, `crate.1.o`, etc., and linking together all the `crate.N.o` files into a single `crate.o` using `ld -r`. The later linking steps can then be run unchanged. The new code preserves the behavior of `--emit`/`-o` when building a single compilation unit. With multiple compilation units, the `--emit=asm/ir/bc` options produce multiple files, so combinations like `--emit=ir -o foo.ll` will not actually produce `foo.ll` (they instead produce several `foo.N.ll` files). The new code supports `-Z lto` only when using a single compilation unit. Compiling with multiple compilation units and `-Z lto` will produce an error. (I can't think of any good reason to do such a thing.) Linking with `-Z lto` against a library that was built as multiple compilation units will also fail, because the rlib does not contain a `crate.bytecode.deflate` file. This could be supported in the future by linking together the `crate.N.bc` files produced when compiling the library into a single `crate.bc`, or by making the LTO code support multiple `crate.N.bytecode.deflate` files.
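The per-unit file naming and the `ld -r` step described in this commit message can be illustrated as follows. The helper names are hypothetical, and the exact argument order passed to the linker is an assumption made for illustration, not taken from the branch.

```rust
/// Object file name for compilation unit `index`, e.g. `crate.0.o`.
fn unit_object_name(crate_name: &str, index: usize) -> String {
    format!("{}.{}.o", crate_name, index)
}

/// Arguments for a relocatable link (`ld -r`) that merges the per-unit
/// objects into `<crate_name>.o`. Argument order is an assumption.
fn partial_link_args(crate_name: &str, units: usize) -> Vec<String> {
    let mut args = vec![
        "-r".to_string(),
        "-o".to_string(),
        format!("{}.o", crate_name),
    ];
    // Append crate.0.o, crate.1.o, ... as the linker inputs.
    args.extend((0..units).map(|i| unit_object_name(crate_name, i)));
    args
}
```

Because the output of this step is again a single `crate.o`, the later linking steps see the same input they saw before the change.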
When inlining an item from another crate, use the original symbol from that crate's metadata instead of generating a new symbol using the `ast::NodeId` of the inlined copy. This requires exporting symbols in the crate metadata in a few additional cases. Having predictable symbols for inlined items will be useful later to avoid generating duplicate object code for inlined items.
Rotate between compilation units while translating. The "worker threads" commit added support for multiple compilation units, but only translated into one, leaving the rest empty. With this commit, `trans` rotates between various compilation units while translating, using a simple strategy: upon entering a module, switch to translating into whichever compilation unit currently contains the fewest LLVM instructions. Most of the actual changes here involve getting symbol linkage right, so that items translated into different compilation units will link together properly at the end.
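The "fewest instructions" rule in this commit message amounts to a greedy least-loaded choice. A minimal sketch, assuming each module's instruction count is known up front (in `trans` the counts only grow as translation proceeds); `pick_unit` and `distribute` are hypothetical names:

```rust
/// Index of the compilation unit that currently holds the fewest
/// instructions. Ties go to the lowest index; `None` if there are no units.
fn pick_unit(instruction_counts: &[usize]) -> Option<usize> {
    instruction_counts
        .iter()
        .enumerate()
        .min_by_key(|&(_, &count)| count)
        .map(|(index, _)| index)
}

/// Assign each module (given by its instruction count) to a unit, in order,
/// always choosing the least-loaded unit at the moment the module is entered.
fn distribute(module_sizes: &[usize], units: usize) -> Vec<usize> {
    let mut counts = vec![0usize; units];
    let mut assignment = Vec::with_capacity(module_sizes.len());
    for &size in module_sizes {
        let unit = pick_unit(&counts).expect("at least one compilation unit");
        counts[unit] += size;
        assignment.push(unit);
    }
    assignment
}
```

For module sizes `[10, 3, 4, 2]` and two units, the first module goes to unit 0 and the remaining three all land in unit 1, ending with 10 and 9 instructions respectively.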
…t glue Use a shared lookup table of previously-translated monomorphizations/glue functions to avoid translating those functions in every compilation unit where they're used. Instead, the function will be translated in whichever compilation unit uses it first, and the remaining compilation units will link against that original definition.
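The shared lookup table from this commit message could be modelled as a map from a monomorphization's name to the unit that first translated it. The sketch below is an assumption-laden illustration: `MonoTable`, `Resolution` and the string keys are hypothetical, not the compiler's actual types.

```rust
use std::collections::HashMap;

/// What a compilation unit should do when it needs a monomorphized function.
#[derive(Debug, PartialEq)]
enum Resolution {
    /// First use anywhere: translate the body into the current unit.
    DefineHere,
    /// Already translated: link against the definition in `defined_in`.
    UseExisting { defined_in: usize },
}

/// Table shared by all compilation units.
#[derive(Default)]
struct MonoTable {
    // Key is a hypothetical name such as "Vec<i32>::push".
    defined_in: HashMap<String, usize>,
}

impl MonoTable {
    fn resolve(&mut self, key: &str, current_unit: usize) -> Resolution {
        if let Some(&unit) = self.defined_in.get(key) {
            return Resolution::UseExisting { defined_in: unit };
        }
        // Record that the current unit owns the one and only definition.
        self.defined_in.insert(key.to_string(), current_unit);
        Resolution::DefineHere
    }
}
```

Only the first caller pays for translation; every later unit gets a declaration that the final link resolves.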
Add a post-processing pass to `trans` that converts symbols from external to internal when possible. Translation with multiple compilation units initially makes most symbols external, since it is not clear when translating a definition whether that symbol will need to be accessed from another compilation unit. This final pass internalizes symbols that are not reachable from other crates and not referenced from other compilation units, so that LLVM can perform more aggressive optimizations on those symbols.
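The rule stated in this commit message (internalize a symbol when it is neither reachable from other crates nor referenced from another compilation unit) can be written as a small set computation. The `Symbol` struct and the `(unit, name)` reference list are hypothetical simplifications of what `trans` tracks.

```rust
use std::collections::{HashMap, HashSet};

/// A translated symbol and the facts the pass needs about it.
struct Symbol {
    name: String,
    defined_in: usize,
    reachable_from_other_crates: bool,
}

/// `references` lists (referencing unit, symbol name) pairs.
/// Returns the names that can safely be switched to internal linkage.
fn internalizable(symbols: &[Symbol], references: &[(usize, &str)]) -> HashSet<String> {
    let definer: HashMap<&str, usize> = symbols
        .iter()
        .map(|s| (s.name.as_str(), s.defined_in))
        .collect();

    // Symbols used from a unit other than the one that defines them.
    let mut cross_unit: HashSet<&str> = HashSet::new();
    for &(unit, name) in references {
        if let Some(&def) = definer.get(name) {
            if def != unit {
                cross_unit.insert(name);
            }
        }
    }

    symbols
        .iter()
        .filter(|s| !s.reachable_from_other_crates && !cross_unit.contains(s.name.as_str()))
        .map(|s| s.name.clone())
        .collect()
}
```

Running this after all units are translated is what lets the translator start from the conservative "everything external" default.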
Adjust the handling of `#[inline]` items so that they get translated into every compilation unit that uses them. This is necessary to preserve the semantics of `#[inline(always)]`. Crate-local `#[inline]` functions and statics are blindly translated into every compilation unit. Cross-crate inlined items and monomorphizations of `#[inline]` functions are translated the first time a reference is seen in each compilation unit. When using multiple compilation units, inlined items are given `available_externally` linkage whenever possible to avoid duplicating object code.
Older versions of OSX's On master, the Newer versions of The latest commit on this branch avoids running |
@epdtry, oh my, that is quite the investigation! It's quite unfortunate that we'll segfault on older versions of OSX. It looks like there's not a whole lot we can do right now though. I'm sad that this may mean that we have to turn off parallel codegen for rustc itself by default (at least for osx), but we can cross that bridge later!
Also, major major props for that investigation, that must have been quite a beast to track down!
This branch adds support for running LLVM optimization and codegen on different parts of a crate in parallel. Instead of translating the crate into a single LLVM compilation unit, `rustc` now distributes items in the crate among several compilation units, and spawns worker threads to optimize and codegen each compilation unit independently. This improves compile times on multicore machines, at the cost of worse performance in the compiled code. The intent is to speed up build times during development without sacrificing too much optimization.

On the machine I tested this on, `librustc` build time with `-O` went from 265 seconds (master branch, single-threaded) to 115s (this branch, with 4 threads), a speedup of 2.3x. For comparison, the build time without `-O` was 90s (single-threaded). Bootstrapping `rustc` using 4 threads gets a 1.6x speedup over the default settings (870s vs. 1380s), and building `librustc` with the resulting stage2 compiler takes 1.3x as long as the master branch (44s vs. 55s, single threaded, ignoring time spent in LLVM codegen).

The user-visible changes from this branch are two new codegen flags:

* `-C codegen-units=N`: Distribute items across `N` compilation units.
* `-C codegen-threads=N`: Spawn `N` worker threads for running optimization and codegen. (It is possible to set `codegen-threads` larger than `codegen-units`, but this is not very useful.)

Internal changes to the compiler are described in detail on the individual commit messages.

Note: The first commit on this branch is copied from #16359, which this branch depends on.

r? @nick29581
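As a quick check on the timings quoted in the description, the ratios can be recomputed directly. The helper names are hypothetical; the inputs are the figures given above.

```rust
/// Ratio of the slower time to the faster one, e.g. 265 s vs. 115 s.
fn ratio(slower_secs: f64, faster_secs: f64) -> f64 {
    slower_secs / faster_secs
}

/// Round to one decimal place, matching the precision used in the description.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

`round1(ratio(265.0, 115.0))` gives 2.3 and `round1(ratio(1380.0, 870.0))` gives 1.6, matching the stated speedups. The stage2 slowdown is `ratio(55.0, 44.0)`, which is exactly 1.25 and rounds to the quoted 1.3x.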
Awesome work! Looks great for an efficient #2369.
I think the Ninja build system uses hashes from the compilation commands and source files (all dependencies) instead of relying on timestamps. @bors should like it ;)
Q: Is this flag ignored if I just tried Did I do something wrong? (Also |
The flag is not ignored, it's just that your library is small enough that it doesn't get much benefit from this patch (especially when optimization is turned off). With
And with
Since
I removed that flag because in my testing I found no benefit from setting |
@epdtry Thanks for the detailed info! It seems that the bottleneck is the type checking phase in my particular case, and now I'm wondering if spending 50%+ of the time in that phase is normal (but that's off-topic).