parallelize LLVM optimization and codegen passes #16367

Merged
merged 14 commits into from
Sep 6, 2014

spernsteiner
Contributor

This branch adds support for running LLVM optimization and codegen on different parts of a crate in parallel. Instead of translating the crate into a single LLVM compilation unit, rustc now distributes items in the crate among several compilation units, and spawns worker threads to optimize and codegen each compilation unit independently. This improves compile times on multicore machines, at the cost of worse performance in the compiled code. The intent is to speed up build times during development without sacrificing too much optimization.

On the machine I tested this on, librustc build time with -O went from 265 seconds (master branch, single-threaded) to 115s (this branch, with 4 threads), a speedup of 2.3x. For comparison, the build time without -O was 90s (single-threaded). Bootstrapping rustc using 4 threads gets a 1.6x speedup over the default settings (870s vs. 1380s), and building librustc with the resulting stage2 compiler takes about 1.3x as long as with a compiler built from the master branch (55s vs. 44s, single-threaded, ignoring time spent in LLVM codegen).

The user-visible changes from this branch are two new codegen flags:

  • -C codegen-units=N: Distribute items across N compilation units.
  • -C codegen-threads=N: Spawn N worker threads for running optimization and codegen. (It is possible to set codegen-threads larger than codegen-units, but this is not very useful.)
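
As a rough illustration of how the two flags relate, the sketch below splits a list of items into N units and lets a fixed number of worker threads take units from a shared queue. It is not rustc's code: `distribute` and `run_workers` are hypothetical names, the round-robin split is a placeholder for the real distribution logic, and "optimization and codegen" is simulated by counting the items in a unit.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper: split `items` into `units` groups, round-robin.
fn distribute(items: Vec<String>, units: usize) -> Vec<Vec<String>> {
    let mut out: Vec<Vec<String>> = vec![Vec::new(); units.max(1)];
    let n = out.len();
    for (i, item) in items.into_iter().enumerate() {
        out[i % n].push(item);
    }
    out
}

/// Hypothetical helper: spawn `threads` workers that pop units from a shared
/// queue until it is empty. Returns (unit index, items processed), sorted.
fn run_workers(units: Vec<Vec<String>>, threads: usize) -> Vec<(usize, usize)> {
    let queue: Arc<Mutex<Vec<(usize, Vec<String>)>>> =
        Arc::new(Mutex::new(units.into_iter().enumerate().collect()));
    let results = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..threads.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            thread::spawn(move || loop {
                // The lock is released at the end of this statement.
                let job = queue.lock().unwrap().pop();
                match job {
                    Some((index, unit)) => {
                        // Stand-in for "optimize and codegen this unit".
                        let processed = unit.len();
                        results.lock().unwrap().push((index, processed));
                    }
                    None => break,
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let mut done = results.lock().unwrap().clone();
    done.sort();
    done
}
```

With more threads than units, the extra workers find the queue empty and exit immediately, which matches the remark that setting codegen-threads larger than codegen-units is not very useful.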

Internal changes to the compiler are described in detail on the individual commit messages.

Note: The first commit on this branch is copied from #16359, which this branch depends on.

r? @nick29581

@metajack
Contributor

metajack commented Aug 8, 2014

Have you clocked the resulting code performance difference? This seems like it would be great for servo.

@brson
Contributor

brson commented Aug 8, 2014

Very excited about this.

I don't see the words 'codegen-threads' in this patch. Are you sure it exists? What happens when you specify --codegen-units but not --codegen-threads?

@liigo
Contributor

liigo commented Aug 9, 2014

Replace the two with --codegen-tasks ?

On August 9, 2014 at 7:45 AM, "Brian Anderson" notifications@github.com wrote:

Very excited about this.

I don't see the words 'codegen-threads' in this patch. Are you sure it
exists? What happens when you specify --codegen-units but not
--codegen-threads?

// For LTO purposes, the bytecode of this library is also
// inserted into the archive. We currently do this only when
// codegen_units == 1, so we don't have to deal with multiple
// bitcode files per crate.
Member

Could each module be linked into one module to be emitted? (is that too timing-intensive?)

Otherwise, could we name the files bytecodeN.deflate and list how many bytecodes are in the metadata?

Contributor Author

Could each module be linked into one module to be emitted? (is that too timing-intensive?)

I haven't tried this yet. I don't think it would take much longer than we already spend linking the object files together. The only problem is, all the LLVM modules are in separate contexts, so we would need to serialize each one and then deserialize into a shared context for linking.

Otherwise, could we name the files bytecodeN.deflate and list how many bytecodes are in the metadata?

Yeah, this is my preferred solution (which I also haven't tried implementing yet).

@alexcrichton
Member

Could you describe some of the difficulties with sharing the Session across worker threads? Was it mainly that Rc is used liberally inside of it? If so, do you think it would ever be feasible to share the Session in the worker threads?

@alexcrichton
Member

This is also some super amazing work, I'm incredibly excited to see where this goes! Major props @epdtry! 🐟

}

match config.opt_level {
Some(opt_level) => {
Member

While you're at it, could you 4-space tab this match?

@asterite

asterite commented Aug 9, 2014

I don't know if it's applicable, but the way we do it in Crystal is to have one llvm module for each "logical unit". In our case each logical unit is a class or a module. Maybe in Rust a "logical unit" is a struct, an array, etc., together with all its impls. Then you can also have another logical unit for the top-level functions.

Then we fire up N threads and each one takes a task (an llvm module) to compile it. This greatly reduces the compilation time. When you split your whole program into N modules and fire up N threads to compile those (as you are proposing here), a thread that finishes early is left without a job to do and becomes idle. With M smaller modules and N threads, where M > N, a thread that finishes can start working on another module, which reduces the idle time.
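
The idle-time argument can be checked with a small simulation. This is only a sketch under stated assumptions: `makespan` is a hypothetical helper, the per-module costs are made-up numbers, and each task is handed in order to whichever thread becomes free first.

```rust
/// Hypothetical helper: total wall-clock time when tasks with the given
/// costs are handed, in order, to whichever thread becomes free first.
fn makespan(costs: &[u64], threads: usize) -> u64 {
    let mut free_at = vec![0u64; threads.max(1)];
    for &cost in costs {
        // Pick the thread that finishes its current work earliest
        // (the first such thread on ties).
        let earliest = free_at.iter_mut().min_by_key(|t| **t).unwrap();
        *earliest += cost;
    }
    free_at.into_iter().max().unwrap_or(0)
}
```

For example, two uneven modules with costs 9 and 5 on two threads finish at time 9, with one thread idle for 4 units. The same 14 units of work split into five modules with costs 3, 3, 3, 3, 2 finish at time 8.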

Additionally, before compiling each module we write its bitcode to a .bc file in a hidden directory (.crystal, in our case). We then compare that .bc file to the .bc file generated by the previous run. If they turn out to be the same (and this will be true as long as you don't modify any impl of that logical unit), we can safely reuse the .o file of the previous run. This, again, reduces dramatically the times to recompile a project that had minimal changes.
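
A minimal sketch of that reuse check, assuming the previous run's bitcode is kept in an in-memory map rather than in .bc files on disk. `BitcodeCache`, `Action`, and `check` are hypothetical names for illustration, not Crystal's or rustc's API.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Action {
    /// Bitcode is identical to the previous run: keep the old object file.
    ReuseObject,
    /// Bitcode is new or changed: compile it and remember the new bitcode.
    Recompile,
}

/// Hypothetical cache keyed by logical-unit name.
#[derive(Default)]
struct BitcodeCache {
    previous: HashMap<String, Vec<u8>>,
}

impl BitcodeCache {
    fn check(&mut self, unit: &str, bitcode: &[u8]) -> Action {
        let unchanged = self
            .previous
            .get(unit)
            .map_or(false, |old| old.as_slice() == bitcode);
        if unchanged {
            Action::ReuseObject
        } else {
            self.previous.insert(unit.to_string(), bitcode.to_vec());
            Action::Recompile
        }
    }
}
```

A real implementation would persist the bitcode between runs and would probably compare hashes or file contents on disk; the map here only keeps the example self-contained.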

Bits of the source code implementing this behaviour are here and here, in case you want to take a look.

@vadimcn
Contributor

vadimcn commented Aug 9, 2014

^THIS^ !!! Please, please implement incremental compilation!

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.
But even if they are culled after translation, it would be a major boon in day-to-day development.

@nrc
Member

nrc commented Aug 9, 2014

I would say that the 'logical unit' in Rust is a module; modules tend to be fairly small (at least relative to crate size) and are naturally self-contained. It is probably worth getting data (at least) for smaller units - thanks for the idea!

Incremental compilation is the next part of the project - looking forward to what comes out of that :-)

@spernsteiner
Contributor Author

@metajack:

Have you clocked the resulting code performance difference?

Compiling rustc and all libraries using 4 compilation units produces a rustc that takes about 25% longer to run.

@brson:

I don't see the words 'codegen-threads' in this patch. Are you sure it exists?

It's a codegen flag, so the flag name codegen-threads is generated by a macro from the variable name codegen_threads.
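
To show why the flag name never appears literally in the patch, here is a small sketch of a macro that derives flag names from field names. It is written in current Rust and is not the macro rustc uses; `codegen_options!`, `CodegenOptions`, `flag_names`, and `set` are hypothetical names.

```rust
// Hypothetical macro: declares an options struct and derives each flag name
// from the field name by replacing '_' with '-'.
macro_rules! codegen_options {
    ($($name:ident = $default:expr),* $(,)?) => {
        #[derive(Debug)]
        struct CodegenOptions { $($name: usize),* }

        impl CodegenOptions {
            fn defaults() -> Self {
                CodegenOptions { $($name: $default),* }
            }

            /// Flag names as they would be spelled on the command line.
            fn flag_names() -> Vec<String> {
                vec![$(stringify!($name).replace('_', "-")),*]
            }

            /// Sets the option matching `flag`; returns false if unknown.
            fn set(&mut self, flag: &str, value: usize) -> bool {
                $(
                    if flag == stringify!($name).replace('_', "-") {
                        self.$name = value;
                        return true;
                    }
                )*
                false
            }
        }
    };
}

codegen_options! { codegen_units = 1, codegen_threads = 1 }
```

Searching the source for `codegen-threads` finds nothing in a setup like this, because only the identifier `codegen_threads` is ever written out.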

What happens when you specify --codegen-units but not --codegen-threads?

rustc generates several compilation units, then runs optimization and codegen for them all sequentially.

@alexcrichton:

Could you describe some of the difficulties with sharing the Session and across worker threads?

Like most of rustc's major data structures, Session uses RefCell all over the place. I suppose we could share it using a Mutex, if we changed how the ownership is handled and were careful about the lifetimes of the mutex guards.

@asterite:

When you split your whole program in N modules and fire up N threads to compile those (as you are proposing here), if a thread finishes early it's left without a job to do, so a thread becomes idle.

One way I tried to address this problem was by adding some basic load balancing: rustc tries to make each LLVM module roughly the same size, so that each worker thread gets the same amount of work to do. I also made codegen-units and codegen-threads separate flags so that you can have several smaller modules per worker thread. (Though in the testing I've done so far, it doesn't seem to help.)

@vadimcn:

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.

This is basically my next project. Translation is the #2 time sink in rustc (after LLVM passes), so culling modules (or finer-grained items) before translation seems like the way to go.

@spernsteiner
Contributor Author

@alexcrichton:
I think I've fixed all the things you mentioned, except that I haven't implemented LTO against separately compiled libraries yet.

Regarding ld -r (since Github has unhelpfully collapsed that line comment), I haven't seen any problems yet on Linux or OSX. On both I have bootstrapped rustc and run the test suite normally with no problems. On Linux I have also run the test suite with codegen-units > 1, also with no problems. I haven't tested it on Windows yet.

@alexcrichton
Member

Like most of rustc's major data structures, Session uses RefCell all over the place. I suppose we could share it using a Mutex, if we changed how the ownership is handled and were careful about the lifetimes of the mutex guards.

I would definitely expect an Arc<Mutex<Session>> to be passed around (maybe Option<Session> so it could be unwrapped). I'm not entirely sure if this could be done because Rc<T> isn't Send, and I think that the session has a bunch of Rc pointers, but I'm not sure how hard it would be to get rid of those.
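
A minimal sketch of the `Arc<Mutex<...>>` idea in current Rust. The `Session` here is a hypothetical stand-in holding only `Send` data; it compiles only because it contains no `Rc`, which is exactly the obstacle raised above. Instead of wrapping the session in an `Option`, this sketch recovers it with `Arc::try_unwrap` once the workers have finished.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for rustc's `Session`; it holds only `Send` data.
#[derive(Default)]
struct Session {
    diagnostics: Vec<String>,
}

/// Shares one session across `units` worker threads, then hands it back.
fn run_units(session: Session, units: usize) -> Session {
    let shared = Arc::new(Mutex::new(session));
    let handles: Vec<_> = (0..units)
        .map(|i| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // Keep the guard's lifetime short so other workers can proceed.
                let mut guard = shared.lock().unwrap();
                guard.diagnostics.push(format!("unit {} done", i));
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Every worker has dropped its clone, so the Arc can be unwrapped.
    Arc::try_unwrap(shared)
        .ok()
        .expect("a worker still holds the session")
        .into_inner()
        .unwrap()
}
```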

It looked like it would make parts of this much nicer to have access to the raw session rather than duplicating some logic here and there, but it may not be too worth it in the end.

Regarding ld -r (since Github has unhelpfully collapsed that line comment), I haven't seen any problems yet on Linux or OSX

I don't think that any of our tests actually use the object file emitted, they just emit it. I also recall that the linker always succeeded in creating an object, but the object itself was just unusable (for one reason or another). Again though, this could all just be misremembering, or some bug which has since been fixed!

@spernsteiner
Contributor Author

I don't think that any of our tests actually use the object file emitted, they just emit it.

The run-pass tests link the object into an executable, run the resulting binary, and check that it works. At least one step in that process should fail if ld -r emits a bad object file.

@alexcrichton
Member

Oh dear, I must be overlooking a test! I only see two instances of emit=.*obj in the codebase, one is the output-type-permutations run-make test (no linking involved there), and the other is the codegen tests (no linking involved either). What was the test that uses the output of ld -r?

@spernsteiner
Contributor Author

OK, let me back up. I think the relevant part of the design was unclear.

On the master branch, rustc produces a single object file crate.o. Then it feeds crate.o into the linker to produce an executable or shared object.

On this branch, rustc produces several object files crate.0.o, crate.1.o, etc. It feeds those into ld -r to produce a combined object file crate.o. Then crate.o is used to produce the final executable/library just like before. (That's why this branch does not need any changes to link_dylib and such.)

So, on this branch, any test that involves compiling and running Rust code will end up using ld -r as part of the linking process.
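
The file naming and the partial-link step described above can be sketched as follows. `unit_objects` and `partial_link_args` are hypothetical helpers, and the sketch only builds the argument list for `ld` (using its `-r` and `-o` options) without running the linker.

```rust
/// Hypothetical helper: per-unit object names `crate.0.o`, `crate.1.o`, ...
fn unit_objects(crate_name: &str, units: usize) -> Vec<String> {
    (0..units)
        .map(|n| format!("{}.{}.o", crate_name, n))
        .collect()
}

/// Hypothetical helper: arguments for `ld` that combine the per-unit
/// objects into a single relocatable `crate.o`.
fn partial_link_args(crate_name: &str, units: usize) -> Vec<String> {
    let mut args = vec![
        "-r".to_string(),
        "-o".to_string(),
        format!("{}.o", crate_name),
    ];
    args.extend(unit_objects(crate_name, units));
    args
}
```

The result could be passed to `std::process::Command::new("ld").args(...)`; later link steps would then see only the combined `crate.o`, as on the master branch.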

@alexcrichton
Member

Oh wow, I missed that entirely, I thought it was only used for OutputTypeObject! Sorry I missed that!

In that case, I'm definitely willing to trust ld -r.

codegen_units: uint = (1, parse_uint,
"divide crate into N units for optimization and codegen"),
codegen_threads: uint = (1, parse_uint,
"number of worker threads to use when running codegen"),
Member

Given there was no benefit to having different values here, let's just have one option.

@nrc
Member

nrc commented Aug 23, 2014

OK, looks good! r=me with all the changes (most of which are nits, TBH) and with Alex's review. @alexcrichton r? (specifically the stuff in back and concerning linking, about which I have no idea).

@spernsteiner
Contributor Author

@vadimcn: Wow, nice detective work! I would never have expected rm to behave like that.

Here's a second reason why the Windows bot should never have had a problem to begin with: ld is supposed to ignore --force-exe-suffix if -r is specified - and it's apparently worked that way since binutils version 2.7, released in 1996. I don't know what kind of mingw build we've got installed on those Windows bots, but it's definitely not an up-to-date MSYS2, and on top of that it seems to have picked up some strange patches at some point.

Anyway, I've added a workaround that should handle these inconsistencies between Windows toolchains. Now (on Windows) rustc always adds .exe to the output file name when running ld -r, and then after linking renames the output file to the actual desired name. This should give the correct behavior no matter how ld handles --force-exe-suffix -r.
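
A sketch of that workaround in current Rust, under the assumption that it amounts to "link to `<name>.exe`, then rename". `linker_output_path` and `finalize_output` are hypothetical names, and the `on_windows` parameter stands in for whatever platform check the compiler performs.

```rust
use std::ffi::OsString;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: the name handed to `ld -r`. On Windows, `.exe` is
/// always appended so the result is the same whether or not the linker
/// honors `--force-exe-suffix` together with `-r`.
fn linker_output_path(desired: &Path, on_windows: bool) -> PathBuf {
    if !on_windows {
        return desired.to_path_buf();
    }
    let mut name: OsString = desired.as_os_str().to_os_string();
    name.push(".exe");
    PathBuf::from(name)
}

/// Hypothetical helper: after linking, move the output to the desired name.
fn finalize_output(linked: &Path, desired: &Path) -> io::Result<()> {
    if linked != desired {
        fs::rename(linked, desired)?;
    }
    Ok(())
}
```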

r? @alexcrichton

@alexcrichton
Member

Well then, that's a new segfault I've never seen before!

@l0kod l0kod mentioned this pull request Aug 31, 2014
Break up `CrateContext` into `SharedCrateContext` and `LocalCrateContext`.  The
local piece corresponds to a single compilation unit, and contains all
LLVM-related components.  (LLVM data structures are tied to a specific
`LLVMContext`, and we will need separate `LLVMContext`s to safely run
multithreaded optimization.)  The shared piece contains data structures that
need to be shared across all compilation units, such as the `ty::ctxt` and some
tables related to crate metadata.
Refactor the code in `llvm::back` that invokes LLVM optimization and codegen
passes so that it can be called from worker threads.  (Previously, it used
`&Session` extensively, and `Session` is not `Share`.)  The new code can handle
multiple compilation units, by compiling each unit to `crate.0.o`, `crate.1.o`,
etc., and linking together all the `crate.N.o` files into a single `crate.o`
using `ld -r`.  The later linking steps can then be run unchanged.

The new code preserves the behavior of `--emit`/`-o` when building a single
compilation unit.  With multiple compilation units, the `--emit=asm/ir/bc`
options produce multiple files, so combinations like `--emit=ir -o foo.ll` will
not actually produce `foo.ll` (they instead produce several `foo.N.ll` files).

The new code supports `-Z lto` only when using a single compilation unit.
Compiling with multiple compilation units and `-Z lto` will produce an error.
(I can't think of any good reason to do such a thing.)  Linking with `-Z lto`
against a library that was built as multiple compilation units will also fail,
because the rlib does not contain a `crate.bytecode.deflate` file.  This could
be supported in the future by linking together the `crate.N.bc` files produced
when compiling the library into a single `crate.bc`, or by making the LTO code
support multiple `crate.N.bytecode.deflate` files.
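
The output naming and the LTO restriction described in this commit message can be sketched as below. `emit_outputs` and `check_lto` are hypothetical helpers written in current Rust, and the error text is invented for the example.

```rust
/// Hypothetical helper: files produced for `--emit=<ext> -o <stem>.<ext>`.
/// One unit keeps the requested name; several units produce `<stem>.N.<ext>`.
fn emit_outputs(stem: &str, ext: &str, units: usize) -> Vec<String> {
    if units <= 1 {
        vec![format!("{}.{}", stem, ext)]
    } else {
        (0..units)
            .map(|n| format!("{}.{}.{}", stem, n, ext))
            .collect()
    }
}

/// Hypothetical check: LTO is accepted only with a single compilation unit.
fn check_lto(units: usize, lto: bool) -> Result<(), String> {
    if lto && units > 1 {
        // Invented message, not rustc's actual diagnostic.
        Err(format!("LTO is not supported with {} compilation units", units))
    } else {
        Ok(())
    }
}
```
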
When inlining an item from another crate, use the original symbol from that
crate's metadata instead of generating a new symbol using the `ast::NodeId` of
the inlined copy.  This requires exporting symbols in the crate metadata in a
few additional cases.  Having predictable symbols for inlined items will be
useful later to avoid generating duplicate object code for inlined items.
Rotate between compilation units while translating.  The "worker threads"
commit added support for multiple compilation units, but only translated into
one, leaving the rest empty.  With this commit, `trans` rotates between various
compilation units while translating, using a simple strategy: upon entering a
module, switch to translating into whichever compilation unit currently
contains the fewest LLVM instructions.

Most of the actual changes here involve getting symbol linkage right, so that
items translated into different compilation units will link together properly
at the end.
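
The rotation strategy can be sketched as a greedy assignment. This is a simplification: it assumes each module's instruction count is known before assignment, whereas the commit describes picking a unit on entry to a module based on what has been translated so far. `Unit` and `assign_modules` are hypothetical names.

```rust
/// Hypothetical model of a compilation unit.
#[derive(Debug, Default, Clone)]
struct Unit {
    instructions: usize,
    modules: Vec<String>,
}

/// Assigns each module, in order, to the unit that currently contains the
/// fewest instructions (the first such unit on ties).
fn assign_modules(modules: &[(&str, usize)], unit_count: usize) -> Vec<Unit> {
    let mut units = vec![Unit::default(); unit_count.max(1)];
    for &(name, size) in modules {
        let target = units
            .iter_mut()
            .min_by_key(|u| u.instructions)
            .unwrap();
        target.modules.push(name.to_string());
        target.instructions += size;
    }
    units
}
```

With modules of sizes 10, 4, 3, and 5 and two units, the first unit ends up with the single large module (10 instructions) and the second with the other three (12 instructions).
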
…t glue

Use a shared lookup table of previously-translated monomorphizations/glue
functions to avoid translating those functions in every compilation unit where
they're used.  Instead, the function will be translated in whichever
compilation unit uses it first, and the remaining compilation units will link
against that original definition.
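
A minimal sketch of such a lookup table, assuming symbols are identified by name and units by index. `MonoTable` and `owner_of` are hypothetical names, not the data structure used in the commit.

```rust
use std::collections::HashMap;

/// Hypothetical shared table: symbol name -> unit holding its definition.
#[derive(Default)]
struct MonoTable {
    owner: HashMap<String, usize>,
}

impl MonoTable {
    /// Returns the unit that defines `symbol`. The first unit to ask
    /// becomes the owner; later units get the existing owner back and
    /// would declare the symbol and link against that definition.
    fn owner_of(&mut self, symbol: &str, requesting_unit: usize) -> usize {
        *self
            .owner
            .entry(symbol.to_string())
            .or_insert(requesting_unit)
    }
}
```

A caller would translate the function only when the returned owner equals its own unit index.
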
Add a post-processing pass to `trans` that converts symbols from external to
internal when possible.  Translation with multiple compilation units initially
makes most symbols external, since it is not clear when translating a
definition whether that symbol will need to be accessed from another
compilation unit.  This final pass internalizes symbols that are not reachable
from other crates and not referenced from other compilation units, so that LLVM
can perform more aggressive optimizations on those symbols.
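
The internalization rule can be written as a small predicate. This sketch assumes the pass knows, for each symbol, where it is defined, which units reference it, and whether other crates can reach it. `Symbol`, `Linkage`, and `final_linkage` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Linkage {
    External,
    Internal,
}

/// Hypothetical summary of what the pass knows about one symbol.
struct Symbol {
    defined_in: usize,
    reachable_from_other_crates: bool,
    referenced_from: Vec<usize>,
}

/// A symbol stays external if another crate can reach it or if any other
/// compilation unit references it; otherwise it can be made internal.
fn final_linkage(sym: &Symbol) -> Linkage {
    let used_by_other_unit = sym
        .referenced_from
        .iter()
        .any(|&unit| unit != sym.defined_in);
    if sym.reachable_from_other_crates || used_by_other_unit {
        Linkage::External
    } else {
        Linkage::Internal
    }
}
```
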
Adjust the handling of `#[inline]` items so that they get translated into every
compilation unit that uses them.  This is necessary to preserve the semantics
of `#[inline(always)]`.

Crate-local `#[inline]` functions and statics are blindly translated into every
compilation unit.  Cross-crate inlined items and monomorphizations of
`#[inline]` functions are translated the first time a reference is seen in each
compilation unit.  When using multiple compilation units, inlined items are
given `available_externally` linkage whenever possible to avoid duplicating
object code.
@spernsteiner
Contributor Author

Older versions of OSX's ld64 linker parse object files using variable-size stack-allocated buffers for some temporary data structures. The bus error seen on 6a60448 occurs because the object file contains too much stuff (mainly, too many unwinding table entries), and those stack allocated buffers overflow the 8MB stack limit of the parser thread. This 8MB stack size is hard-coded inside ld64, so we can't work around the bug by bumping up stack size with ulimit -s.

On master, the librustc build works fine because rustc.o requires about 5MB of stack to parse. This branch triggers a stack overflow because it uses ld -r to generate rustc.o (even with -C codegen-units=1), and ld -r adds a __compact_unwind section to the generated object file. Parsing librustc's __compact_unwind section uses an additional 4MB of stack, which puts the parser thread over its 8MB limit. There is an undocumented flag -no_compact_unwind which is supposed to suppress the generation of the __compact_unwind section, but this flag is ignored when passed in combination with -r.

Newer versions of ld64 fix the stack overflow bug by having the object file parser use malloc when the required buffer size is large. Unfortunately, according to Wikipedia, the fixed ld64 versions (224.1+) are available only with XCode 5+, for OSX 10.8+, while Rust is supposed to support building on OSX 10.7. I'm not sure if there is any way to install newer ld64 on older versions of OSX.

The latest commit on this branch avoids running ld -r when building with only a single compilation unit (which is probably a good idea regardless of the ld64 bug). This will let librustc build without errors (giant object file, but no ld -r doubling its stack use), and the separate compilation tests should also pass (ld -r, but tiny object files). It doesn't fix the underlying problem, though - if anyone using XCode 4 tries to build a large crate with parallel codegen enabled, they will get a nasty segfault from the linker. (Though note that rustc master can already trigger the same error without ld -r, for crates with about twice as many functions as librustc.)

@alexcrichton
Member

@epdtry, oh my, that is quite the investigation! That's quite unfortunate that we'll segfault on older versions of OSX. It looks like there's not a whole lot we can do right now though. I'm sad that this may mean that we have to turn off parallel codegen for rustc itself by default (at least for osx), but we can cross that bridge later!

@alexcrichton
Member

Also, major major props for that investigation, that must have been quite a beast to track down!

bors added a commit that referenced this pull request Sep 6, 2014
@bors bors closed this Sep 6, 2014
@bors bors merged commit 6d2d47b into rust-lang:master Sep 6, 2014
@l0kod
Contributor

l0kod commented Sep 6, 2014

Awesome work! Looks great for an efficient #2369.

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.

I think the Ninja build system uses hashes of the compilation commands and source files (all dependencies) instead of relying on timestamps. @bors should like it ;)
Shake can do it as well for source files: http://neilmitchell.blogspot.fr/2014/06/shake-file-hashesdigests.html
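
A sketch of the general idea of fingerprinting a build step by its command line and input contents rather than by timestamps. This is not Ninja's or Shake's actual algorithm; `build_fingerprint` is a hypothetical helper built on the standard library's `DefaultHasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: one hash covering the command line and the
/// contents of every input file.
fn build_fingerprint(command: &[&str], sources: &[&[u8]]) -> u64 {
    let mut hasher = DefaultHasher::new();
    command.hash(&mut hasher);
    sources.hash(&mut hasher);
    hasher.finish()
}
```

A step is rerun only when its fingerprint differs from the one recorded on the previous build. `DefaultHasher`'s algorithm is not guaranteed to stay the same across Rust releases, so a tool that stores fingerprints on disk would want a hash with a fixed specification.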

@japaric
Member

japaric commented Sep 6, 2014

Q: Is this flag ignored if --test is passed to the compiler?

I just tried rustc --test -L target/deps -C codegen-units=8 src/lib.rs on my library that has 300+ tests and the compile time is still 20 seconds, and CPU usage remained at 100% (one thread).

Did I do something wrong? (Also -C codegen-threads=8 returns error: unknown codegen option)

@spernsteiner
Contributor Author

@japaric,

The flag is not ignored, it's just that your library is small enough that it doesn't get much benefit from this patch (especially when optimization is turned off).

With -C codegen-units=1 (the default):

time: 1.879 s   translation
  time: 0.142 s llvm function passes
  time: 0.067 s llvm module passes
  time: 3.547 s codegen passes
  time: 0.000 s codegen passes
time: 4.264 s   LLVM passes
  time: 0.408 s running linker
time: 0.409 s   linking

real    0m16.718s

And with -C codegen-units=4:

time: 2.927 s   translation
time: 0.054 s   llvm function passes
time: 0.055 s   llvm function passes
time: 0.056 s   llvm function passes
time: 0.055 s   llvm function passes
time: 0.022 s   llvm module passes
time: 0.025 s   llvm module passes
time: 0.025 s   llvm module passes
time: 0.026 s   llvm module passes
time: 1.422 s   codegen passes
time: 1.443 s   codegen passes
time: 1.448 s   codegen passes
time: 0.000 s   codegen passes
time: 1.474 s   codegen passes
time: 1.875 s   LLVM passes
  time: 0.489 s running linker
time: 0.492 s   linking

real    0m15.472s

Since rustc spends only 4 seconds in LLVM passes to begin with, there is not much room for improvement. Setting codegen-units=4 reduces the time by about 2.5s, but also slows down translation and linking, so the overall benefit is tiny.
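
The arithmetic behind that conclusion, using the top-level phase times from the two listings above. `Phases` and `net_saving` are hypothetical names used only for this worked example.

```rust
/// Top-level phase times in seconds, taken from the `time:` output above.
struct Phases {
    translation: f64,
    llvm_passes: f64,
    linking: f64,
}

/// Seconds saved across the three phases when going from `before` to `after`.
fn net_saving(before: &Phases, after: &Phases) -> f64 {
    let total = |p: &Phases| p.translation + p.llvm_passes + p.linking;
    total(before) - total(after)
}

/// codegen-units=1 (the default).
const ONE_UNIT: Phases = Phases { translation: 1.879, llvm_passes: 4.264, linking: 0.409 };
/// codegen-units=4.
const FOUR_UNITS: Phases = Phases { translation: 2.927, llvm_passes: 1.875, linking: 0.492 };
```

LLVM passes drop by 4.264 - 1.875 = 2.389 s, but translation grows by 1.048 s and linking by 0.083 s, so the three phases together save about 1.26 s. That is close to the 1.25 s difference between the two `real` times (16.718 s vs. 15.472 s).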

Also -C codegen-threads=8 returns error: unknown codegen option

I removed that flag because in my testing I found no benefit from setting codegen-threads != codegen-units.

@japaric
Member

japaric commented Sep 6, 2014

@epdtry Thanks for the detailed info!

It seems that the bottleneck is the type checking phase in my particular case, and now I'm wondering if spending 50%+ of the time in that phase is normal (but that's off-topic).

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 21, 2024
fix: Make `value_ty` query fallible