32 codegen units may not always be better at -O0 #44941

alexcrichton · 2017-09-30T08:08:33Z

Continuing discussion from #44853 (comment)

alexcrichton · 2017-09-30T08:12:29Z

@mersinvald unfortunately the perf files you mentioned I can't read locally, and the perf diff output isn't as illuminating as I hoped it would be.

I wonder, could you try investigating if one crate in particular takes much longer? Maybe take some of the crates that take longest to compile (sorting the rlibs by size is typically a rough measurement for this) and see how long they take?

I'm hoping that this is one crate hitting an accidentally pathological case, but it may also just be a case where everything slows down with 32 codegen units (even though we don't intend that).

One other possiblility may be the linker? (like ld) That should show up though as the final crate taking much longer to finish in 32 cgus than 2 in theory.

crumblingstatue · 2017-09-30T09:34:41Z

I tested compiling regex with different number of codegen units on my 2 core Pentium E2200.
Here are the results:

1:  33.4
2:  28.76
4:  28.90
8:  29.57
16: 29.97
32: 31.0

It gets a decent win with the number of codegen units equal the number of cores, but gets increasingly slower with more than that.

Here are the results for rustfmt. These are a bit more wild. It's actually faster on 4 units than 2, but gets increasingly slower after that, up to the point where 32 units is actually slower than 1 unit.

retep998 · 2017-09-30T17:50:39Z

Testing with winapi 0.3 on a Ryzen 7 1700X.

codegen-units	Time
1	15.17
2	14.63
4	15.18
8	15.44
16	14.68
32	14.89
64	15.22
128	15.22
256	15.92
512	17.34
1024	20.24
2048	27.12
4096	40.74
8192	69.23

oyvindln · 2017-10-01T12:31:20Z

There is a pretty noticable change in compile times on perf.rust-lang.org with the commit that changed the default number as well.

alexcrichton · 2017-10-04T21:03:03Z

I'm going to tag this as a regression until we can figure out what the spike is

alexcrichton · 2017-10-04T23:22:50Z

Ok looking more into the data from perf.rust-lang.org. We're measuring instructions executed and the "cpu-clock" measurement. Instructions certainly don't have much bearing on wall time (but serve as great indicators of when to look for a regression) and apparently cpu-clock is not the same as wall time. I believe the cpu-clock stat is the total sum of all threads on all cpus.

I'm trying to rationalize the numbers we're seeing and it's not quite making sense. The regressions shown on perf.rust-lang.org imply to me that if I pin the process to one CPU we'll see that much increase in compile time, but I see virtually no increase in compile time locally. In fact, if I pin cargo/rustc to one CPU it typically gets faster with more cgus presumably because the objects are a little quicker to create overall.

Basically I'm having a lot of difficulty reproducing any slowdowns here. Assistance investigating why things are slowing down as opposed to just "things are slowing down" would be much appreciated.

alexcrichton · 2017-10-05T20:03:37Z

I believe these are all the reported regressions from this change

alexcrichton · 2017-10-05T20:31:39Z

Some more investigation

I found locally I was able to reproduce some slowdowns with the syntex_syntax crate:

cgus	1	2	4	8	16	32
wall time	32.9	24.9	22.8	21.4	22.0	21.8
cpu-clock	33064	35218	40584	44581	48542	51788
1 core wall time	34.3	35.2	36.4	38.1	40.3	41.7

Here "wall time" is the wall time it took to compile the syntex_syntax crate on an 8 core cpu. The "cpu-clock" is the statistic reported by perf state, and "1 core wall time" is the amount of time the compilation takes when pinned to one cpu (taskset -c 0).

There's a clear tend here that parallelization is helping but also taking up more CPU time. Namely increasing the number of codegen units is not purely dividing the work, it's also increasing the total amount of work being done. This becomes particularly noticeable with more codegen units limited to one core on this big crate.

So on one hand there's certainly some overhead from more codegen units. We're shuffling more files around (not copying, just working with), spawning more threads, more messages, more synchronization, etc. I don't think that these factors would cause such a drastic increase in compile times, however. Instead what I think may be happening here (that I have just now remembered) is how we treat #[inline].

@michaelwoerister correct me if I'm wrong, but I believe that we'll inline any #[inline] function into all codegen units instead of just one. This means that if you're using String you're probably going to get a copy of each of its functions in all of your codegen units. This to me seems like the root cause of the timing regressions here.

As a result my conclusion is that increasing the number of codegen units increases the amount of work the compiler does, and hence parallelism isn't always enough to recover this loss in the increase of work done. In the worst case with N codegen units you can probably do N times more work, and with less than N cores you'll notice the increase in compile times.

@michaelwoerister do you agree with the conclusions here? If so, do you think we could treat #[inline] differently in debug mode? Is it worth it to solve this? Can you think of another possible solution?

I'm personally hesitant to set the number of codegen units based on the amount of cores on the machine. I think that may return optimal performance (perhaps 2x the number of cores cgus) because the extra work needed for each codegen unit I think should come out in the wash because of parallelization. This, however, would make builds across machines nondeterministic if they've got different numbers of cpus, forcing deterministic compiles to explicitly specify the number of codegen units.

My personal preference for a solution right now is:

Don't inline #[inline] into all codegen units in debug mode, only inline into one codegen unit.
Reduce the number of codegen units from 32 to 16 here if we continue to see problems. I think most machines compiling Rust today have fewer than 16 cores so we should still get some benefit of using as much as possible, while also this should help reduce any extra costs from extra work needed for each codegen unit.

michaelwoerister · 2017-10-06T08:56:00Z

@alexcrichton, Yes I think your assumptions and conclusions are correct. I also think that #[inline] functions are the main source of work duplication here. There's also debuginfo, which cannot be shared between codegen units, but that is usually rather cheap afaik.

Don't inline #[inline] into all codegen units in debug mode, only inline into one codegen unit.

We probably mean the same thing here: I would suggest treating them the same as regular polymorphic functions:

Instantiate only "on-demand", and
have exactly one instance and link to that.

That might require some special handling in the symbol hash to avoid conflicts between crates, just as we do for generic functions at the moment:

rust/src/librustc_trans/back/symbol_names.rs

Lines 194 to 201 in ed1cffd

    
           // If this is an instance of a generic function, we also hash in 
        
           // the ID of the instantiating crate. This avoids symbol conflicts 
        
           // in case the same instances is emitted in two crates of the same 
        
           // project. 
        
           if substs.types().next().is_some() { 
        
               hasher.hash(tcx.crate_name.as_str()); 
        
               hasher.hash(tcx.sess.local_crate_disambiguator().as_str()); 
        
           }

If performance turns out to be unbearable, we maybe could have a heuristic to do some inlining:

if the function body is small, and
if the function does not call other inline functions.

Reduce the number of codegen units from 32 to 16 here if we continue to see problems.

Seems like a sensible approach.

Other solutions?

I wonder if ThinLTO would allow us to completely get rid of duplicating bodies of inline functions across codegen units during trans. That would be awesome for incremental compilation and it would also help in the non-incremental case.

Rationale explained in the included comment as well as rust-lang#44941

alexcrichton · 2017-10-06T22:03:06Z

Some excellent news!

I implemented the strategy of "only inline functions into one codegen unit" and the updating timings I get are:

cgus	1	2	4	8	16	32
wall time	30.7	25.2	21.0	19.0	18.4	18.7
cpu-clock	30864	31387	31876	32525	34341	35370
1 core wall time	31.2	30.8	30.8	30.6	30.7	30.6

In other words, I think this indeed makes a huge dent in the issue of "more cgus causes rustc to do more work" problem. I'm going to send a PR shortly implementing this.

@michaelwoerister I also think that you're correct in that ThinLTO should eliminate the need for inlining functions into all codegen units unconditionally. Just gotta land ThinLTO and test it out first :)

alexcrichton · 2017-10-06T23:06:13Z

Ok I've created a PR at #45075

@mersinvald and @crumblingstatue, if y'all could test out that PR locally that'd be great! My hope is that it should make 32 codegen units continue to be better than 1, and perhaps give us flexibility to increase the number of cgus in the future.

This commit tweaks the behavior of inlining functions into multiple codegen units when rustc is compiling in debug mode. Today rustc will unconditionally treat `#[inline]` functions by translating them into all codegen units that they're needed within, marking the linkage as `internal`. This commit changes the behavior so that in debug mode (compiling at `-O0`) rustc will instead only translate `#[inline]` functions into *one* codegen unit, forcing all other codegen units to reference this one copy. The goal here is to improve debug compile times by reducing the amount of translation that happens on behalf of multiple codegen units. It was discovered in rust-lang#44941 that increasing the number of codegen units had the adverse side effect of increasing the overal work done by the compiler, and the suspicion here was that the compiler was inlining, translating, and codegen'ing more functions with more codegen units (for example `String` would be basically inlined into all codegen units if used). The strategy in this commit should reduce the cost of `#[inline]` functions to being equivalent to one codegen unit, which is only translating and codegen'ing inline functions once. Collected [data] shows that this does indeed improve the situation from [before] as the overall cpu-clock time increases at a much slower rate and when pinned to one core rustc does not consume significantly more wall clock time than with one codegen unit. One caveat of this commit is that the symbol names for inlined functions that are only translated once needed some slight tweaking. These inline functions could be translated into multiple crates and we need to make sure the symbols don't collideA so the crate name/disambiguator is mixed in to the symbol name hash in these situations. [data]: rust-lang#44941 (comment) [before]: rust-lang#44941 (comment)

…erister rustc: Reduce default CGUs to 16 Rationale explained in the included comment as well as #44941

rustc: Don't inline in CGUs at -O0 This commit tweaks the behavior of inlining functions into multiple codegen units when rustc is compiling in debug mode. Today rustc will unconditionally treat `#[inline]` functions by translating them into all codegen units that they're needed within, marking the linkage as `internal`. This commit changes the behavior so that in debug mode (compiling at `-O0`) rustc will instead only translate `#[inline]` functions into *one* codegen unit, forcing all other codegen units to reference this one copy. The goal here is to improve debug compile times by reducing the amount of translation that happens on behalf of multiple codegen units. It was discovered in #44941 that increasing the number of codegen units had the adverse side effect of increasing the overal work done by the compiler, and the suspicion here was that the compiler was inlining, translating, and codegen'ing more functions with more codegen units (for example `String` would be basically inlined into all codegen units if used). The strategy in this commit should reduce the cost of `#[inline]` functions to being equivalent to one codegen unit, which is only translating and codegen'ing inline functions once. Collected [data] shows that this does indeed improve the situation from [before] as the overall cpu-clock time increases at a much slower rate and when pinned to one core rustc does not consume significantly more wall clock time than with one codegen unit. One caveat of this commit is that the symbol names for inlined functions that are only translated once needed some slight tweaking. These inline functions could be translated into multiple crates and we need to make sure the symbols don't collideA so the crate name/disambiguator is mixed in to the symbol name hash in these situations. [data]: #44941 (comment) [before]: #44941 (comment)

alexcrichton · 2017-10-10T22:27:38Z

@mersinvald can you try re-testing with tonight's nightly which should have some improvements?

mersinvald · 2017-10-10T22:58:25Z

@alexcrichton sure, will do it tomorrow.

mersinvald · 2017-10-12T10:49:41Z

@alexcrichton I've ran the tests on the rustc 1.22.0-nightly (a47c9f870 2017-10-11)

I've also measured the hot build time, i.e only 200LOC crate + link-time.

Cold Build
4:  63.13 63.60 62.84 (avg 63.19)
32: 69.70 70.32 69.76 (avg 69.92)

Hot Build
4:  5.39 5.28 5.28 (avg 5.31)
32: 6.70 6.10 6.10 (avg 6.30)

alexcrichton · 2017-10-18T16:08:51Z

@mersinvald ok I've done some further investigation into your benchmark. The setup was:

script that was run
DevHeap/circles-server@3f32a49 with this patch, I notably found that enabling LTO was messing with the timings in release mode.
rustc 1.22.0-nightly (4e9527cf6 2017-10-16)
8-core (4x cores + hyperthread) + SSD + 32GB RAM
Ubuntu Linux

Results:

Build from scratch

Here I compiled the positions-api crate from scratch with a clean target directory with various settings

cgus	1	2	4	8	16	32	64
debug #0	43.368	37.003	34.790	34.254	34.376	34.287	34.849
debug #1	43.911	37.401	34.974	33.668	34.612	34.199	34.904
release thinlto=no #0	91.289	70.930	63.766	59.373	57.378	56.717	56.278
release thinlto=no #1	91.516	71.976	63.714	58.143	58.023	57.228	57.023
release thinlto=yes #0	90.799	96.447	91.650	80.653	81.806	82.241	85.829
release thinlto=yes #1	90.615	97.803	89.126	83.018	81.862	82.962	84.413

There's a suspicious jump at 2 CGUs where the compilation takes longer (with ThinLTO) but it quickly flattens out, the time improving with more and more codegen units thrown at it. Additionally in debug mode we see wins across the board as soon as codegen units are enabled.

Incremental build

Here I compiled the positions-api crate with a preexisting target directory. For these timings only the positions-api crate was rebuilt, effectively measuring the time it takes to compile just that one crate.

cgus	1	2	4	8	16	32	64
debug #0	5.147	3.945	3.326	3.129	3.257	3.265	3.604
debug #1	5.310	3.927	3.399	3.148	3.308	3.228	3.415
release thinlto=no #0	14.114	11.626	7.808	6.741	5.145	4.751	4.438
release thinlto=no #1	14.032	11.648	7.865	6.635	4.871	4.657	4.421
release thinlto=yes #0	14.121	15.107	11.088	8.836	6.876	6.368	6.366
release thinlto=yes #1	14.054	14.960	10.448	7.283	7.119	6.345	6.280

Here like before debug builds leveled off pretty quickly, benefitting a good amount from multiple CGUs. Release builds didn't fare so well after 8 CGUs, however, but still show improvements across the board from the original build time with just one CGU. Again though there's a jump at 2 CGUs in release mode which is a little suspicious

@mersinvald can you try running with that script and similar settings to guarantee that the only difference in the data collection here is hardware?

lnicola · 2017-10-18T16:43:44Z

Are the incremental crate-only builds 10-15x slower than full non-incremental builds?

alexcrichton · 2017-10-18T16:45:57Z

oh oops, I mixed up the tables

mersinvald · 2017-10-19T20:54:50Z

@alexcrichton
I've run tests with the exact same revisions of the compiler and circles-server.

rustc 1.22.0-nightly (4e9527c 2017-10-16)
4-core (2x cores + hyperthread) Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
SSD + 8GB RAM
Arch Linux

cgus	1	2	4	8	16	32	64
debug #0	7.197	5.804	5.561	5.512	5.766	5.832	6.607
debug #1	7.170	5.834	5.533	5.490	5.782	5.913	6.277
release thinlto=no #0	18.490	15.631	13.683	10.496	9.092	9.368	9.457
release thinlto=no #1	18.496	15.671	13.846	9.143	9.124	9.391	9.496
release thinlto=yes #0	18.502	20.199	18.267	12.506	13.133	13.334	13.704
release thinlto=yes #1	18.647	20.059	15.457	12.428	12.731	13.368	13.660

cgus	1	2	4	8	16	32	64
debug #0	77.231	72.293	69.728	70.492	71.270	72.801	73.855
debug #1	76.678	70.609	70.246	69.892	72.097	72.374	73.683
release thinlto=no #0	162.873	139.603	130.557	128.175	130.154	131.654	132.660
release thinlto=no #1	160.766	140.791	129.430	126.258	130.905	132.073	134.501
release thinlto=yes #0	159.064	184.526	179.635	180.172	184.242	189.370	192.278
release thinlto=yes #1	157.843	181.420	180.135	178.202	186.637	191.419	192.772

alexcrichton · 2017-10-19T22:56:24Z

Awesome, thanks @mersinvald!

So reading that I'd conclude that in debug mode 16cgus is a unversal win over 1 cgu (although slightly worse than 8/4 cgus in some cases). In that sense, I think we can close this issue?

Otherwise it looks like the proposal for turning on multiple cgus by default in release mode means that 16 cgus is universally better than 1 cgu in terms of compile time if you only compile the last crate. If you compile the entire crate graph it looks like multiple cgus regreses the build time?

michaelwoerister · 2017-10-20T09:06:00Z

So reading that I'd conclude that in debug mode 16cgus is a unversal win over 1 cgu (although slightly worse than 8/4 cgus in some cases). In that sense, I think we can close this issue?

👍

alexcrichton · 2017-10-20T14:56:45Z

Ok!

alexcrichton changed the title ~~32 codegen units may not always be better~~ 32 codegen units may not always be better at -O0 Sep 30, 2017

alexcrichton added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 4, 2017

alexcrichton added this to the 1.22 milestone Oct 4, 2017

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 6, 2017

rustc: Reduce default CGUs to 16

1988447

Rationale explained in the included comment as well as rust-lang#44941

alexcrichton mentioned this issue Oct 6, 2017

rustc: Reduce default CGUs to 16 #45064

Merged

hanna-kruppe mentioned this issue Oct 6, 2017

Tracking issue for MIR-only RLIBs #38913

Closed

alexcrichton mentioned this issue Oct 6, 2017

rustc: Don't inline in CGUs at -O0 #45075

Merged

bors added a commit that referenced this issue Oct 9, 2017

Auto merge of #45064 - alexcrichton:reduce-codegen-units, r=michaelwo…

d7acd29

…erister rustc: Reduce default CGUs to 16 Rationale explained in the included comment as well as #44941

alexcrichton mentioned this issue Oct 16, 2017

Tracking issue for enabling multiple CGUs in release mode by default #45320

Closed

11 tasks

alexcrichton closed this as completed Oct 20, 2017

alexcrichton mentioned this issue Jan 29, 2019

Tracking Issue for making incremental compilation the default for Release Builds #57968

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

32 codegen units may not always be better at -O0 #44941

32 codegen units may not always be better at -O0 #44941

alexcrichton commented Sep 30, 2017

alexcrichton commented Sep 30, 2017

crumblingstatue commented Sep 30, 2017 •

edited

Loading

retep998 commented Sep 30, 2017 •

edited

Loading

oyvindln commented Oct 1, 2017

alexcrichton commented Oct 4, 2017

alexcrichton commented Oct 4, 2017

alexcrichton commented Oct 5, 2017

alexcrichton commented Oct 5, 2017 •

edited

Loading

michaelwoerister commented Oct 6, 2017

alexcrichton commented Oct 6, 2017 •

edited

Loading

alexcrichton commented Oct 6, 2017

alexcrichton commented Oct 10, 2017

mersinvald commented Oct 10, 2017

mersinvald commented Oct 12, 2017 •

edited

Loading

alexcrichton commented Oct 18, 2017 •

edited

Loading

lnicola commented Oct 18, 2017

alexcrichton commented Oct 18, 2017

mersinvald commented Oct 19, 2017 •

edited

Loading

alexcrichton commented Oct 19, 2017

michaelwoerister commented Oct 20, 2017

alexcrichton commented Oct 20, 2017

32 codegen units may not always be better at -O0 #44941

32 codegen units may not always be better at -O0 #44941

Comments

alexcrichton commented Sep 30, 2017

alexcrichton commented Sep 30, 2017

crumblingstatue commented Sep 30, 2017 • edited Loading

retep998 commented Sep 30, 2017 • edited Loading

oyvindln commented Oct 1, 2017

alexcrichton commented Oct 4, 2017

alexcrichton commented Oct 4, 2017

alexcrichton commented Oct 5, 2017

alexcrichton commented Oct 5, 2017 • edited Loading

michaelwoerister commented Oct 6, 2017

Other solutions?

alexcrichton commented Oct 6, 2017 • edited Loading

alexcrichton commented Oct 6, 2017

alexcrichton commented Oct 10, 2017

mersinvald commented Oct 10, 2017

mersinvald commented Oct 12, 2017 • edited Loading

alexcrichton commented Oct 18, 2017 • edited Loading

lnicola commented Oct 18, 2017

alexcrichton commented Oct 18, 2017

mersinvald commented Oct 19, 2017 • edited Loading

alexcrichton commented Oct 19, 2017

michaelwoerister commented Oct 20, 2017

alexcrichton commented Oct 20, 2017

crumblingstatue commented Sep 30, 2017 •

edited

Loading

retep998 commented Sep 30, 2017 •

edited

Loading

alexcrichton commented Oct 5, 2017 •

edited

Loading

alexcrichton commented Oct 6, 2017 •

edited

Loading

mersinvald commented Oct 12, 2017 •

edited

Loading

alexcrichton commented Oct 18, 2017 •

edited

Loading

mersinvald commented Oct 19, 2017 •

edited

Loading