Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow compilation #68926

Open
dragostis opened this issue Feb 7, 2020 · 4 comments
Open

Slow compilation #68926

dragostis opened this issue Feb 7, 2020 · 4 comments
Labels
E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@dragostis
Copy link

dragostis commented Feb 7, 2020

Using nested tuples in this example leads to painfully long compile times:

use rayon::prelude::*;

fn main() {
    let vec0 = vec![];
    let vec1 = vec![];
    let vec2 = vec![];
    let vec3 = vec![];
    let vec4 = vec![];
    let vec5 = vec![];
    let vec6 = vec![];
    let vec7 = vec![];
    
    (vec0,
    (vec1,
    (vec2,
    (vec3,
    (vec4,
    (vec5,
    (vec6,
    (vec7)))))))).par_extend((1u32..10).into_par_iter().map(|n| (n, (n, (n, (n, (n, (n, (n, n)))))))));
}
@jonas-schievink jonas-schievink added E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 7, 2020
@cuviper
Copy link
Member

cuviper commented Oct 1, 2021

This is kind of impressive! Running nightly-2021-10-01 on my Ryzen 3800X, with cached dependencies, cargo check for this example code takes only 0.36s, while cargo build takes 13.55s.

So I ran cargo +nightly rustc -- -Cno-prepopulate-passes --emit=llvm-ir to see what we're feeding LLVM, and this took 1m 06s to produce a 3.7G file! There are 3.3 million lines in there, the longest having 67K characters thanks to some monster type names. Getting bitcode with --emit=llvm-bc is only 27.02s for a 274M file.

The debug binary is also large at 245M, but most of that is debuginfo. The actual text+data is about 12M.

cuviper added a commit to cuviper/rayon that referenced this issue Oct 7, 2021
For unindexed parallel itererators, we've implemented `ParallelExtend`
for most collections using an intermediate `LinkedList<Vec<T>>` like:

```rust
    par_iter
        .into_par_iter()
        .fold(Vec::new, vec_push)
        .map(as_list)
        .reduce(LinkedList::new, list_append)
```

However, this introduces `Fold`, `Map`, and `Reduce` types that are all
dependent on the input iterator type. When it comes to very complicated
cases like nested tuple unzips, this can add up quickly. For example, in
rust-lang/rust#68926 an 8-way unzip leads to 3.7GB of LLVM IR, with
lines up to 67K characters in long generic types.

Now we add a new `ListVecConsumer` that is not generic at all itself,
and implements `Consumer<T>` etc. generic only on the item type. So each
collection now gets the same `LinkedList<Vec<T>>` as before with:

```rust
    par_iter.into_par_iter().drive_unindexed(ListVecConsumer);
```

Each implementation now also separates the code that doesn't need to be
iterator-specific to a separate function, for their `reserve` and final
`extend` from the list data.

That 8-way unzip is now _only_ 1.5GB with lines up to 17K characters.
Compile time drops from 12.8s to 7.7s debug, 32.1s to 26.9s release.
cuviper added a commit to cuviper/rayon that referenced this issue Oct 7, 2021
For unindexed parallel itererators, we've implemented `ParallelExtend`
for most collections using an intermediate `LinkedList<Vec<T>>` like:

```rust
    par_iter
        .into_par_iter()
        .fold(Vec::new, vec_push)
        .map(as_list)
        .reduce(LinkedList::new, list_append)
```

However, this introduces `Fold`, `Map`, and `Reduce` types that are all
dependent on the input iterator type. When it comes to very complicated
cases like nested tuple unzips, this can add up quickly. For example, in
rust-lang/rust#68926 an 8-way unzip leads to 3.7GB of LLVM IR, with
lines up to 67K characters in long generic types.

Now we add a new `ListVecConsumer` that is not generic at all itself,
and implements `Consumer<T>` etc. generic only on the item type. So each
collection now gets the same `LinkedList<Vec<T>>` as before with:

```rust
    par_iter.into_par_iter().drive_unindexed(ListVecConsumer);
```

Each implementation now also separates the code that doesn't need to be
iterator-specific to a separate function, for their `reserve` and final
`extend` from the list data.

That 8-way unzip is now _only_ 1.5GB with lines up to 17K characters.
Compile time drops from 12.8s to 7.7s debug, 32.1s to 26.9s release.
@cuviper
Copy link
Member

cuviper commented Oct 7, 2021

I've reduced it somewhat in rayon-rs/rayon#887:

That 8-way unzip is now only 1.5GB with lines up to 17K characters.
Compile time drops from 12.8s to 7.7s debug, 32.1s to 26.9s release.

I'm not sure if there's anything the compiler could do though -- there's just a lot of code to expand here.

In ParallelExtend<T> for Vec<T> in particular, there's a condition on opt_len for two different implementation paths, writing directly for indexed inputs or using intermediate collections otherwise. The indexed case is important for both memory use and performance, but we can't do that for every type of iterator. When nesting unzip-like behavior, that branch means we'll get basically exponential code generation for everything that depends on the iterator type.

Specialization might avoid that, but in the meantime maybe we could add a par_extend_indexed method for statically-known cases like your example. I'm not sure what that will look like for tuple extends yet.

@dragostis
Copy link
Author

dragostis commented Oct 7, 2021

Thank you for looking into this and writing a patch to improve things.

The par_extend_indexed solution you mentioned, how would that work?

@cuviper
Copy link
Member

cuviper commented Oct 7, 2021

If you look at the current implementation that matches opt_len, the suggested par_extend_indexed would just be the Some(len) side of that. That None side is probably getting optimized away in LLVM right now, if opt_len gets inlined and it can see when an iterator is always Some, but that's still a heap of code that gets generated first, especially with nested unzips.

I took a look though, and I don't see how to do indexed unzip in rayon's design. The Producer has to create a normal iterator for serial processing after all the parallel splits, but I don't see any way to do that for separate parts of unzipped tuples. It only works for unzip_into_vecs because we know that code never calls with_producer, but we don't know that for generic ParExtend.

bors bot added a commit to rayon-rs/rayon that referenced this issue Apr 1, 2022
887: Reduce the amount of generic code for ParallelExtend r=cuviper a=cuviper

For unindexed parallel itererators, we've implemented `ParallelExtend`
for most collections using an intermediate `LinkedList<Vec<T>>` like:

```rust
    par_iter
        .into_par_iter()
        .fold(Vec::new, vec_push)
        .map(as_list)
        .reduce(LinkedList::new, list_append)
```

However, this introduces `Fold`, `Map`, and `Reduce` types that are all
dependent on the input iterator type. When it comes to very complicated
cases like nested tuple unzips, this can add up quickly. For example, in
rust-lang/rust#68926 an 8-way unzip leads to 3.7GB of LLVM IR, with
lines up to 67K characters in long generic types.

Now we add a new `ListVecConsumer` that is not generic at all itself,
and implements `Consumer<T>` etc. generic only on the item type. So each
collection now gets the same `LinkedList<Vec<T>>` as before with:

```rust
    par_iter.into_par_iter().drive_unindexed(ListVecConsumer);
```

Each implementation now also separates the code that doesn't need to be
iterator-specific to a separate function, for their `reserve` and final
`extend` from the list data.

That 8-way unzip is now _only_ 1.5GB with lines up to 17K characters.
Compile time drops from 12.8s to 7.7s debug, 32.1s to 26.9s release.


Co-authored-by: Josh Stone <cuviper@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

3 participants