Cache LLVM between try runs #112011

Open
jyn514 opened this issue May 26, 2023 · 13 comments
Labels
A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
T-infra: Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

jyn514 (Member) commented May 26, 2023

We currently spend an inordinate amount of time rebuilding LLVM on every try run. See for example https://github.com/rust-lang-ci/rust/actions/runs/5073748000/jobs/9113150676#step:25:38905, where building LLVM with PGO and BOLT takes up 111 out of 131 total minutes of the run.

We should cache LLVM between runs so perf runs don't take so long. It should be fine to cache LLVM until the next time the src/llvm-project submodule is updated, the same way download-ci-llvm works today. To avoid storing three different LLVM artifacts in S3 per commit, we'd replace the current artifacts with the BOLT artifacts, leaving the -alt builds as the unoptimized + debug-assertions build.

@rust-lang/wg-llvm @Kobzol @rust-lang/infra

@jyn514 jyn514 added T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels May 26, 2023
Kobzol (Contributor) commented May 26, 2023

I have been planning to do this for some time. I actually started implementing it this week, but I realized that I'd need some Python dependencies for this in the PGO script, which was a bit annoying, because it is a rather "locked down" script using Python 3.6 with everything in a single file (it's executed from different Linux/Windows environments, and while definitely possible, it wouldn't be completely trivial to add dependencies to it).

After discussing this with @Mark-Simulacrum, we have decided that it would be better to move the PGO script to Rust (not to bootstrap, but to src/tools!) in order to get better dependency management and hopefully also make it a bit more robust. So I'm now in the process of porting the script to Rust, and then I'll start on LLVM caching.

I think that we don't actually need to cache LLVM itself. We can just cache the PGO and BOLT profiles (two files, ~50 MiB each); these should be much easier to cache, and above all much easier to "inject" into the builds. With these files available, we don't need to run the LLVM PGO and LLVM BOLT instrumentation steps: we can just run rustc PGO and then go straight to the final dist build with `--llvm-profile-use` and `--llvm-bolt-profile-use`.
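
A rough sketch of what that final step could look like from the planned Rust tool (the exact x.py invocation, flag handling and profile paths here are assumptions of this sketch, not the real pipeline):

```rust
use std::process::Command;

/// Hypothetical helper: run the final dist build, reusing downloaded
/// LLVM PGO and BOLT profiles instead of re-gathering them.
fn final_dist_build(llvm_pgo_profile: &str, llvm_bolt_profile: &str) -> std::io::Result<()> {
    let status = Command::new("python3")
        .arg("x.py")
        .arg("dist")
        // Point bootstrap at the cached profiles so the LLVM
        // instrumentation stages can be skipped entirely.
        .arg(format!("--llvm-profile-use={llvm_pgo_profile}"))
        .arg(format!("--llvm-bolt-profile-use={llvm_bolt_profile}"))
        .status()?;
    assert!(status.success(), "final dist build failed");
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Hypothetical local paths of the downloaded, cached profiles.
    final_dist_build("/tmp/llvm-pgo.profdata", "/tmp/llvm-bolt.fdata")
}
```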

My plan is basically to hash src/llvm-project, src/bootstrap and src/ci (these should be approximately all the locations that can change how we build LLVM). Then, to avoid a separate asynchronous CI job that would periodically build LLVM, I suggest the following: during a dist/try build, if we find that profiles for the LLVM/bootstrap/CI hash are already on S3, we download them and use them. If they are not, we generate them, upload them to S3, and then continue with the dist build.
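
A minimal sketch of that decision logic, assuming hypothetical S3 helpers, a simple non-cryptographic hash and no error handling beyond propagation; the real tool would differ in the details:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::Path;

// Hash the contents of a directory tree. A real implementation would want a
// stable cryptographic hash (e.g. SHA-256) rather than DefaultHasher.
fn hash_dir(dir: &Path, hasher: &mut DefaultHasher) -> std::io::Result<()> {
    let mut entries: Vec<_> = fs::read_dir(dir)?.collect::<Result<_, _>>()?;
    entries.sort_by_key(|entry| entry.path());
    for entry in entries {
        let path = entry.path();
        if path.is_dir() {
            hash_dir(&path, hasher)?;
        } else {
            path.hash(hasher);
            fs::read(&path)?.hash(hasher);
        }
    }
    Ok(())
}

// Cache key over everything that can change how LLVM is built.
fn cache_key() -> std::io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    for dir in ["src/llvm-project", "src/bootstrap", "src/ci"] {
        hash_dir(Path::new(dir), &mut hasher)?;
    }
    Ok(hasher.finish())
}

// Hypothetical stubs standing in for S3 access and profile generation.
fn profiles_exist_in_s3(_key: u64) -> bool { false }
fn download_profiles(_key: u64) {}
fn generate_profiles() {}
fn upload_profiles(_key: u64) {}

fn main() -> std::io::Result<()> {
    let key = cache_key()?;
    if profiles_exist_in_s3(key) {
        // Reuse the cached PGO/BOLT profiles and skip LLVM instrumentation.
        download_profiles(key);
    } else {
        // Gather profiles as today, then cache them for future try builds.
        generate_profiles();
        upload_profiles(key);
    }
    // ...continue with the dist build...
    Ok(())
}
```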

jyn514 (Member, Author) commented May 26, 2023

Then, to avoid a separate asynchronous CI job that would periodically build LLVM, I suggest the following: during a dist/try build, if we find that profiles for the LLVM/bootstrap/CI hash are already on S3, we download them and use them. If they are not, we generate them, upload them to S3, and then continue with the dist build.

This seems broadly good and is how our other caching works, but I do want to mention that this has caused issues in the past, where CI times out if the cache gets busted (e.g. #49278). I don't have a solution in mind, but we should be careful not to get close to GitHub's time limit.

Kobzol (Contributor) commented May 26, 2023

I don't think that this would be a problem for try builds. Currently, we do three builds of LLVM in each try build, and it takes ~1h 30m (after my PR which disables some unneeded targets). With caching, it should be strictly faster: the plan I described performs nothing "extra" compared to the current setup (apart from hashing a few directories and uploading two files to S3, which should be very fast).

nikic (Contributor) commented May 26, 2023

This makes me somewhat uneasy in terms of how it will affect benchmark stability at the points where LLVM does get updated. The profiles depend on the IR processed by LLVM, which in turn depends on the IR produced by rustc, which depends on lots of things like libstd code, MIR opts, etc. I would rather not have every commit that touches LLVM have spurious perf diffs because that's the only point where we gather a new profile.

FWIW, the current profile bootstrap sequence still has a lot of room for improvement. We currently rebuild rustc three times, but should only build it twice. Two of those builds are also done with PGO/BOLT instrumented LLVM, which makes the rustc build much slower. This should shave off at least 30 minutes from try builds.

The ideal sequence would look something like:

  1. Build rustc with PGO instrumentation, LLVM without.
  2. Gather profiles.
  3. Build rustc with PGO use (reuse old LLVM).
  4. Build LLVM with PGO instrumentation (reuse old rustc).
  5. Gather profiles.
  6. Build LLVM with PGO use (reuse old rustc).
  7. Instrument LLVM with BOLT.
  8. Gather profiles.
  9. Optimize LLVM with BOLT.
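
For illustration, here is the same sequence as it might be written down in the planned Rust orchestration tool; every function and type below is a hypothetical placeholder standing in for the corresponding x.py invocation or profile-gathering step, not an existing bootstrap API:

```rust
// Hypothetical placeholders, not bootstrap APIs.
enum Pgo { Instrument, Use }

fn build_rustc(_pgo: Pgo) { /* x.py build with rustc PGO instrumentation or use */ }
fn build_llvm(_pgo: Pgo) { /* x.py build llvm with LLVM PGO instrumentation or use */ }
fn bolt_instrument_llvm() {}
fn bolt_optimize_llvm() {}
fn gather_profiles(_stage: &str) {}

fn main() {
    build_rustc(Pgo::Instrument);   // 1. rustc PGO-instrumented, plain LLVM
    gather_profiles("rustc");       // 2.
    build_rustc(Pgo::Use);          // 3. reuse the LLVM built in step 1
    build_llvm(Pgo::Instrument);    // 4. reuse the rustc built in step 3
    gather_profiles("llvm-pgo");    // 5.
    build_llvm(Pgo::Use);           // 6. reuse the rustc built in step 3
    bolt_instrument_llvm();         // 7.
    gather_profiles("llvm-bolt");   // 8.
    bolt_optimize_llvm();           // 9.
}
```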

Kobzol (Contributor) commented May 26, 2023

Btw @jyn514 where did you get that 111/131 figure? From the log that you have posted, it seems to me that LLVM takes ~35 minutes:

------------------------------------------------
Stage 1 (LLVM PGO):            1951.20s (25.15%)
  Build rustc and LLVM:        1486.38s (19.16%)
    LLVM:                       454.43s ( 5.86%)
    Rustc:                     1014.26s (13.07%)
  Gather profiles:              464.81s ( 5.99%)
Stage 2 (rustc PGO):           1398.60s (18.03%)
  Build rustc and LLVM:         758.77s ( 9.78%)
    LLVM:                       403.33s ( 5.20%)
    Rustc:                      338.59s ( 4.36%)
  Gather profiles:              639.84s ( 8.25%)
Stage 3 (LLVM BOLT):           3027.82s (39.02%)
  Build rustc and LLVM:        2291.47s (29.53%)
    LLVM:                      1235.00s (15.92%)
    Rustc:                      927.71s (11.96%)
  Gather profiles:              736.35s ( 9.49%)
Stage 4 (final build):         1381.54s (17.81%)
  LLVM:                           0.00s ( 0.00%)
  Rustc:                          1.01s ( 0.01%)

Total duration:                        2h 9m 19s
------------------------------------------------

jyn514 (Member, Author) commented May 26, 2023

To be clear, the problem I'm worried about is:

  1. We add caching. Most builds get a lot faster (say it now takes 40 minutes instead of 130).
  2. Emboldened by the additional time, we add more tests to try and dist builds. Now it takes 100 minutes with caching.
  3. When the cache gets busted, it takes 230 minutes and we hit a GitHub time limit.

I think try builds are limited enough in scope that this isn't a giant risk in practice, just something to keep in mind.

where did you get that 111/131 figure? From the log that you have posted, it seems to me that LLVM takes ~35 minutes:

Ah, that's my bad: I missed the PGO rustc build in the middle and was counting it as part of LLVM.

I think it takes more than 35 minutes though; the time taken to build rustc for the PGO/BOLT instrumentation also counts.

Kobzol (Contributor) commented May 26, 2023

This makes me somewhat uneasy in terms of how it will affect benchmark stability at the points where LLVM does get updated.

Yeah, this also worries me. My plan was basically to "try it and see what happens". I'm not sure how important the changes to the generated IR patterns are, but it's possible that we would see large perf swings, which would be bad.

Regarding the schedule that you have described, I agree that this would be the ideal scenario (we could even pre-cache the initial LLVM build without any PGO instrumentation/use in S3; that shouldn't cause any problems with profile noise). However, I'm not sure how to currently use bootstrap to achieve this sequence - e.g. is it (easily) possible to take an existing rustc and "relink" it to an updated version of LLVM? This direction is definitely worth exploring IMO, as it could possibly be more stable and even easier than the caching approach.

jyn514 (Member, Author) commented May 26, 2023

is it (easily) possible to take an existing rustc and "relink" it to an updated version of LLVM?

I'm not sure what you mean by "relink" here - LLVM is a shared object file, so rustc will use whichever .so is in rpath/LD_LIBRARY_PATH at runtime; rustc itself doesn't need to be rebuilt.

In general, you can set a custom llvm-config so that LLVM is managed by the PGO script instead of by bootstrap.
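
For context, a sketch of one way a tool could point bootstrap at an externally managed LLVM via the config.toml `llvm-config` target option (the target triple, the llvm-config path and the naive append to config.toml are assumptions of this sketch):

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical: make bootstrap use an LLVM that the PGO tool built itself,
// by pointing config.toml at its llvm-config binary. Naive append; assumes
// config.toml has no conflicting [target.*] section yet.
fn use_external_llvm(llvm_config: &str) -> std::io::Result<()> {
    let mut config = OpenOptions::new().append(true).create(true).open("config.toml")?;
    writeln!(config, "[target.x86_64-unknown-linux-gnu]")?;
    writeln!(config, "llvm-config = \"{llvm_config}\"")?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Hypothetical path to an llvm-config managed outside of bootstrap.
    use_external_llvm("/tmp/llvm-pgo/bin/llvm-config")
}
```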

nikic (Contributor) commented May 26, 2023

Regarding the schedule that you have described, I agree that this would be the ideal scenario (we could even pre-cache the initial LLVM build without any PGO instrumentation/use in S3; that shouldn't cause any problems with profile noise).

Yes, caching the initial build should be fine -- though maybe it would already be mostly free due to sccache? (Which doesn't help us for PGO builds, but should help for plain ones.)

However, I'm not sure how to currently use bootstrap to achieve this sequence - e.g. is it (easily) possible to take an existing rustc and "relink" it to an updated version of LLVM? This direction is definitely worth exploring IMO, as it could possibly be more stable and even easier than the caching approach.

In theory, "relinking" rustc should just be a matter of copying the new libLLVM.so in the right place. I'd expect the main tricky part here would be allowing bootstrap to build just libLLVM.so without doing anything else.
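
A minimal illustration of that "copy it into the right place" idea; both paths below are assumptions about where a freshly built and a previously installed libLLVM.so might live, not the actual build layout:

```rust
use std::fs;

// Hypothetical: overwrite the libLLVM.so that a previously built rustc loads
// at runtime with a freshly (re)built one, so rustc picks it up without a rebuild.
fn relink_llvm(new_lib: &str, installed_lib: &str) -> std::io::Result<()> {
    fs::copy(new_lib, installed_lib)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    relink_llvm(
        "build/x86_64-unknown-linux-gnu/llvm/lib/libLLVM.so",   // assumed fresh build output
        "build/x86_64-unknown-linux-gnu/stage1/lib/libLLVM.so", // assumed location rustc loads from
    )
}
```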

Kobzol (Contributor) commented May 26, 2023

Yes, caching the initial build should be fine -- though maybe it would already be mostly free due to sccache? (Which doesn't help us for PGO builds, but should help for plain ones.)

With sccache it's now at about 400s to build the initial LLVM. Not free, but also not terribly long.

In theory, "relinking" rustc should just be a matter of copying the new libLLVM.so in the right place. I'd expect the main tricky part here would be allowing bootstrap to build just libLLVM.so without doing anything else.

Yeah, that was what I was alluding to.

Steps 1-3 are easy. At step 4, we need to (first delete and then) rebuild LLVM, but normally in bootstrap this also causes further rebuilds of rustc. So we basically need bootstrap to let us rebuild/modify libLLVM.so without touching anything else, and to do that in place so that it always generates the shared library at the same location, allowing a previously built rustc to use it. It doesn't sound that hard, I just haven't personally examined this possibility yet (if it's a one-liner, I'll be mad :D).

cuviper (Member) commented May 26, 2023

It should be fine to cache LLVM until the next time the src/llvm-project submodule is updated, the same way download-ci-llvm works today.

Can we include src/ci/docker/host-x86_64/dist-x86_64-linux/ in that? It's not just what LLVM sources we build, but how we build it. Maybe use the docker image ID somehow in the caching decision?

Kobzol (Contributor) commented May 27, 2023

Yeah, as I wrote in my first comment here, I'd hash (at least) src/llvm-project, src/ci and src/bootstrap.

Kobzol (Contributor) commented May 28, 2023

So, it looks like it is possible to just run `x.py build llvm lld` to rebuild only the (PGO-instrumented) LLVM. However, from the CI logs, it looks like only a fraction of the profiles is generated now (around 1% of what we had before). I have to investigate more. Currently, I suspect that the difference is that we no longer gather profiles when compiling stage 2 of rustc. If that is indeed the case, and perf gets worse because of the difference in profiles, we might need to rethink this approach (maybe cache only stage 1 of rustc?). But it's still speculation at this point; I need to get it fully working first and then run perf.

(For BOLT it's more complicated, but we can deal with that later; I want to get it working for PGO first.)

bors added a commit to rust-lang-ci/rust that referenced this issue Jun 8, 2023
Avoid one `rustc` rebuild in the optimized build pipeline

This PR changes the optimized build pipeline to avoid one `rustc` rebuild, inspired by [this comment](rust-lang#112011 (comment)). This speeds up the pipeline by 5-10 minutes. After this change, we **no longer gather LLVM PGO profiles from compiling stage 2 of `rustc`**.

Now we build `rustc` two times (1x PGO instrumented, 1x PGO optimized) and LLVM three times (1x normal, 1x PGO instrumented, 1x PGO optimized). It should be possible to cache the normal LLVM build, but I'll leave that for another PR.