
Scalable jobserver for rustc #7731

Merged: 26 commits, Jan 22, 2020

Conversation

@Mark-Simulacrum (Member) commented Dec 21, 2019

This refactors the job queue code for support of per-rustc process jobservers. Furthermore, it also cleans up the code and refactors the main loop to be more amenable to understanding (splitting into methods and such).

Assignment of tokens to either rustc "threads" or processes is dedicated to the main loop, which proceeds in a strict "least recently requested" fashion among both thread and process token requests. Specifically, we will first allocate tokens to all pending process token requests (i.e., high-level units of work), and then (in per-rustc jobserver mode) follow up by assigning any remaining tokens to rustcs, again in the order that requests came into cargo (first request served first).

It's not entirely clear that this model is a good one (no modeling or measurement has been done). On the other hand, this strategy should mean that once we bottom out in terms of rustc parallelism, long-running crates will get more thread tokens than short-running crates, so crates like syn, which start early on but finish pretty late, should hopefully pick up more parallelism without any more complex heuristics.

One plausible change worth exploring is making the assignment prefer earlier rustcs globally, rather than first attempting to spawn new crates and only then increasing parallelism for old ones. syn, for example, frequently gets compiled in the early storm of dozens of crates, so it is somewhat unlikely to get extra parallelism until fairly late in its compilation.

Under this model we also currently conflate rayon threads and codegen threads. Eventually those will probably(?) become a single thing inside rustc anyway, and the rustc side of this implementation provides no information about what a token request is for, so we can't do better here yet.
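
A minimal sketch of the assignment order described above, for illustration only -- the queue names and helper functions here are hypothetical stand-ins, not Cargo's actual data structures:

use std::collections::VecDeque;

// Stand-in for `jobserver::Acquired`.
struct Token;

struct Assignment {
    // Pending requests to start whole rustc processes, oldest first.
    pending_process_requests: VecDeque<u32>,
    // Pending per-rustc thread-token requests (per-rustc jobserver mode), oldest first.
    pending_thread_requests: VecDeque<u32>,
}

impl Assignment {
    // Hand out freshly acquired tokens: process requests first, then rustc
    // thread requests, each in "least recently requested" order.
    fn assign(&mut self, mut tokens: Vec<Token>) {
        while let Some(id) = self.pending_process_requests.pop_front() {
            match tokens.pop() {
                Some(token) => start_process(id, token),
                None => return self.pending_process_requests.push_front(id),
            }
        }
        while let Some(id) = self.pending_thread_requests.pop_front() {
            match tokens.pop() {
                Some(token) => give_thread_token(id, token),
                None => return self.pending_thread_requests.push_front(id),
            }
        }
    }
}

fn start_process(_id: u32, _token: Token) { /* spawn a new rustc */ }
fn give_thread_token(_id: u32, _token: Token) { /* forward the token to a running rustc */ }

fn main() {
    let mut a = Assignment {
        pending_process_requests: VecDeque::from(vec![0, 1]),
        pending_thread_requests: VecDeque::from(vec![0]),
    };
    // With two tokens available, both go to the pending process requests;
    // the thread request keeps waiting for the next batch.
    a.assign(vec![Token, Token]);
}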

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Dec 21, 2019
@Mark-Simulacrum (Member Author)

CI here looks to be mostly failing because the current branch expects rustc to have -Zjobserver-token-requests.

@bors (Contributor) commented Dec 22, 2019

☔ The latest upstream changes (presumably #7737) made this pull request unmergeable. Please resolve the merge conflicts.

@ehuss (Contributor) commented Dec 22, 2019

Can this get a doc-comment in job_server.rs pointing to the documentation in src/librustc_jobserver/lib.rs in rustc? It's pretty hard for me to follow in isolation.

@Mark-Simulacrum (Member Author)

Hm, do you mean job_queue.rs? job_server.rs doesn't appear to exist in this repository.

I would be happy to provide such a doc comment -- it's worth noting that I'm guessing we'll be landing this behind some sort of -Z flag as well initially (as we test it out and such) just because it's a non-trivial overhaul of the protocol we use with rustc.

@ehuss (Contributor) commented Dec 22, 2019

Oops, yea, typo. Should be job_queue.rs.

@Mark-Simulacrum changed the title from "Scalabale jobserver for rustc" to "Scalable jobserver for rustc" on Dec 23, 2019
@Mark-Simulacrum (Member Author)

I've added some (somewhat) high level docs to job_queue and rebased.

@alexcrichton (Member)

OK, I've read over this and it looks pretty good to me; certainly for the purposes of testing it looks like it'll cover all our bases.

@Mark-Simulacrum do you want to update your PR in rust-lang/rust to have a Cargo submodule with this commit, and then make a try build so we can test that out?

Longer-term I think I'd want to refactor the jobserver management in Cargo here; it feels a bit spaghetti-like, and I'd like to make it a bit more concise (if we can) and a bit more robust. Nothing concrete comes to mind, just a general feeling of "this feels bolted on" -- which is not at all your fault, just something for me to stew on :)

Also, for the vecs, I think they can be relatively trivially converted to hash maps? I'd expect something like:

let rustcs_with_a_token: HashMap<u32, Vec<Token>>; 
let rustcs_wanting_a_token: HashMap<u32, (usize, Client)>; // `usize` == how many tokens requested
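
For illustration, a rough sketch of how those two maps might get updated when a token is granted; the types below are placeholders for the jobserver crate's Acquired and Client, and grant_token is a hypothetical helper:

use std::collections::HashMap;

// Placeholders for `jobserver::Acquired` and `jobserver::Client`.
struct Token;
#[derive(Clone)]
struct Client;

// Hypothetical helper: move one acquired token to the rustc with id `id`.
fn grant_token(
    id: u32,
    token: Token,
    rustcs_with_a_token: &mut HashMap<u32, Vec<Token>>,
    rustcs_wanting_a_token: &mut HashMap<u32, (usize, Client)>,
) {
    if let Some((wanted, _client)) = rustcs_wanting_a_token.get_mut(&id) {
        *wanted -= 1;
        if *wanted == 0 {
            rustcs_wanting_a_token.remove(&id);
        }
        rustcs_with_a_token.entry(id).or_default().push(token);
    }
}

fn main() {
    let mut with_a_token = HashMap::new();
    let mut wanting_a_token = HashMap::from([(7u32, (1usize, Client))]);
    grant_token(7, Token, &mut with_a_token, &mut wanting_a_token);
    assert!(wanting_a_token.is_empty());
    assert_eq!(with_a_token[&7].len(), 1);
}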

@Mark-Simulacrum (Member Author)

Yeah, I think the HashMaps would work. I moved between a few different schemes during development and had started out with Vecs.

I'll go ahead and get the rustc branch to point at this branch (and probably rebase both, while I'm at it) and kick off a try build.

@Mark-Simulacrum (Member Author)

Okay I've force pushed this and the rust-lang/rust branch and kicked off a try build over there, hopefully that all works out.

I didn't look at implementing the HashMap stuff as I'm also a bit strapped for time :)

@bors (Contributor) commented Jan 8, 2020

☔ The latest upstream changes (presumably #7776) made this pull request unmergeable. Please resolve the merge conflicts.

@Mark-Simulacrum (Member Author)

The CI failure is a bit scary: "E: Failed to fetch https://packages.microsoft.com/ubuntu/16.04/prod/dists/xenial/main/binary-amd64/Packages Writing more data than expected (622228 > 606754)"

But it seems unrelated.

r? @alexcrichton

@alexcrichton (Member) left a comment

This overall looks very good to me now, thanks!

On closer reading I think the only "major" thing I'd like to see is a refactoring of the job queue code to have all of its assorted local variables wrapped up in a struct with a name, methods, etc. I think that'll make edits a bit easier to track and also easier to document things.

Resolved review comments on: src/cargo/core/compiler/context/mod.rs, src/cargo/core/compiler/job_queue.rs (×3), src/cargo/core/compiler/mod.rs
@@ -1081,8 +1101,9 @@ fn on_stderr_line(
package_id: PackageId,
target: &Target,
options: &mut OutputOptions,
rustc_client: Option<&Client>,
Review comment (Member):

I think it may be a bit cleaner to store rustc_client elsewhere instead of passing it around and sending it through messages. The processes spawned by the job queue should, I think, all have an optional Client associated with them (stored internally in ProcessBuilder), and job_queue could then keep track of a map from id to Client.
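
A minimal sketch of that suggestion; ProcessBuilder, JobId, and the field names below are simplified stand-ins for illustration, not Cargo's actual types:

use std::collections::HashMap;

// Stand-in for `jobserver::Client`.
#[derive(Clone)]
struct Client;

type JobId = u32;

// Simplified stand-in for Cargo's ProcessBuilder: the spawned process
// optionally carries the per-rustc jobserver client with it.
struct ProcessBuilder {
    program: String,
    jobserver_client: Option<Client>,
}

// The job queue keeps a map from job id to client, so `on_stderr_line`
// no longer needs a `rustc_client` parameter threaded through it.
struct JobQueueClients {
    clients: HashMap<JobId, Client>,
}

impl JobQueueClients {
    fn register(&mut self, id: JobId, cmd: &ProcessBuilder) {
        if let Some(client) = &cmd.jobserver_client {
            self.clients.insert(id, client.clone());
        }
    }

    fn client_for(&self, id: JobId) -> Option<&Client> {
        self.clients.get(&id)
    }
}

fn main() {
    let rustc = ProcessBuilder {
        program: "rustc".to_string(),
        jobserver_client: Some(Client),
    };
    let mut queue = JobQueueClients { clients: HashMap::new() };
    queue.register(0, &rustc);
    assert!(queue.client_for(0).is_some());
    let _ = rustc.program; // silence the unused-field warning
}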

@@ -1210,6 +1232,36 @@ fn on_stderr_line_inner(
}
}

#[derive(serde::Deserialize)]
struct JobserverNotification {
jobserver_event: Event,
}
Review comment (Member):

I think we'll probably want to bikeshed this a bit, but the json messages coming out of rustc aren't exactly the most structured things in terms of precedent anyway.

I'd ideally prefer not to ignore errors when Cargo is sure the event has to do with the jobserver, whereas right now, if there's a slight mismatch, Cargo will ignore everything coming from rustc. In any case this is probably a concern best left for stabilization.
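
For illustration, a sketch of the "don't silently ignore malformed jobserver messages" idea; the Event variant names and exact JSON shape are assumptions made up for this example (it also assumes the serde and serde_json crates), not the actual rustc/Cargo protocol:

use serde::Deserialize;

// Assumed event names, purely for illustration.
#[derive(Deserialize, Debug)]
#[serde(rename_all = "snake_case")]
enum Event {
    WillAcquire,
    Release,
}

#[derive(Deserialize, Debug)]
struct JobserverNotification {
    jobserver_event: Event,
}

// If the line clearly claims to be a jobserver notification, a parse
// failure becomes a hard error instead of being silently dropped;
// anything else is passed through as a normal diagnostic line.
fn handle_line(line: &str) -> Result<Option<Event>, serde_json::Error> {
    if !line.contains("jobserver_event") {
        return Ok(None);
    }
    let note: JobserverNotification = serde_json::from_str(line)?;
    Ok(Some(note.jobserver_event))
}

fn main() {
    println!("{:?}", handle_line(r#"{"jobserver_event":"will_acquire"}"#));
}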

@Mark-Simulacrum force-pushed the chatty-jobserver branch 3 times, most recently from babeba6 to 5cb8b08, on January 16, 2020 02:29
@Mark-Simulacrum (Member Author)

Okay I've updated the PR, I think I addressed most of your comments.

I opted not to implement the rustc_client out-of-band storage -- that looked pretty complicated; it seems like there's a semi-opaque wall around "Work" / "Job" units and it was unclear to me how to thread state through that... maybe I missed something obvious though.

Furthermore, we eventually do need to keep track of the number of requests for a given JobId, so while we could e.g. do a usize for that, it seems simpler to just send around clients.

Of course if you disagree then happy to try and dig in more :)

I've also -- in the process of refactoring the main loop and extracting things out into methods -- tried to add documentation here and there, but since I've been fairly embedded in this code for a few hours now, I'm not sure where the confusion lies anymore :) I think someone with a semi-fresh set of eyes (you) would be more helpful in saying "this could use a comment" -- I'm happy to write those up then.

The latest set also changes the algorithm for jobserver token assignment under the per-rustc model. Before, we were eagerly giving tokens to the first rustc that requested them, which plausibly led to pretty poor "multi-rustc" parallelism in practice (the first rustc would likely gobble up tokens, then the second would do the same, and so forth, presuming rustcs that are highly parallel and quick to request tokens). I'm not sure we were actually observing that behavior, but it seemed possibly suboptimal.

The new strategy is to dedicate assignment to the main loop, which now proceeds in a strict "least recently requested" fashion among both types of requests. Specifically, we will first allocate tokens to all pending process token requests (i.e., high-level units of work), and then (in per-rustc jobserver mode) follow up by assigning any remaining tokens to rustcs, again in the order that requests came into cargo (first request served first).

It's not entirely clear that this model is good -- at least for the latter bit, maybe randomly spreading tokens around, or some other heuristic, would be better. On the other hand, it means long-running crates are likely to get more tokens than short-running crates, so crates like syn, which start early on but finish pretty late, should hopefully pick up more parallelism without any more complex heuristics.


As a fairly orthogonal note, I'm also increasingly unhappy with the jobserver crate's API -- I keep hitting the problem that the "request a token" interface does not provide for cancellation, and we've also wanted e.g. Acquired::forget in rustc -- so I'll probably put some work into that before moving on to fixing up the rustc PR. Generally my plan is to get this landed, or into a state ready for landing, and then move on to the rustc PR :)

@Mark-Simulacrum (Member Author)

Tests here are all going to fail: I missed a minor detail. Previously cargo would always immediately spawn a rustc process ("sharing" cargo's implicit jobserver token with that rustc), but the new implementation makes that pretty painful -- ideally we'd synthesize the token out of nowhere, but the jobserver crate makes that hard, and in any case it's not quite what we want.

I think to get the correct behavior we'd basically end up with a lot of code shuffling the implicit token around -- loosely, I imagine it would look something like what the current codegen code does in rustc (though less complicated, since everything here is thread-safe).

I'm going to try and think some more on this, maybe we can discuss synchronously tomorrow too. Don't have any great ideas here :/
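
For context, a rough sketch of the implicit-token bookkeeping being discussed -- every jobserver participant implicitly holds one token (its right to run work at all), so Cargo can run one rustc without acquiring anything; field names here are hypothetical:

// Sketch: Cargo's own implicit token plus explicitly acquired tokens
// bound the number of concurrently running rustc processes.
struct Scheduler {
    acquired_tokens: usize,
    running_rustcs: usize,
}

impl Scheduler {
    fn capacity(&self) -> usize {
        // One implicit token (Cargo's own right to run work) plus whatever
        // has been acquired from the shared jobserver.
        1 + self.acquired_tokens
    }

    fn can_spawn(&self) -> bool {
        self.running_rustcs < self.capacity()
    }
}

fn main() {
    let s = Scheduler { acquired_tokens: 0, running_rustcs: 0 };
    // Even with zero acquired tokens, the implicit token lets Cargo spawn
    // the first rustc immediately -- the behavior the existing tests rely on.
    assert!(s.can_spawn());
}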

@Mark-Simulacrum (Member Author)

Oh, I also forgot to mention: I updated the PR description with a summary of the PR, though it loosely duplicates what I already said in a comment.

/// We should only have tokens here "in-between" loops of the
/// `drain_the_queue` function, i.e., when we've acquired a bunch of tokens
/// and are about to give them to rustcs.
unused_tokens: Vec<Acquired>,
Review comment (Member):

Oh sorry, to clarify, can this be a separate struct which only lives for the one stack frame where we're running stuff?

Review comment (Member):

Also, as to this field specifically, I don't think we're going to want it. There should never be any long-term storage for unused tokens.
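
A minimal sketch of the "separate struct for one stack frame" shape being asked for, with hypothetical names: anything held while draining is dropped when the drain call returns, so nothing stores unused tokens long-term.

// Stand-in for `jobserver::Acquired`.
struct Token;

struct JobQueue;

// Exists only for the duration of `execute`'s call to `drain_the_queue`,
// so tokens held here never outlive that one stack frame.
struct DrainState {
    unassigned_tokens: Vec<Token>,
}

impl JobQueue {
    fn execute(self) {
        let mut state = DrainState { unassigned_tokens: Vec::new() };
        state.drain_the_queue();
        // `state` (and anything it still held) is dropped here.
    }
}

impl DrainState {
    fn drain_the_queue(&mut self) {
        // Main loop: acquire tokens, assign them to processes/threads,
        // run finished-job callbacks, tick progress, ...
        self.unassigned_tokens.clear();
    }
}

fn main() {
    JobQueue.execute();
}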

.len()
.saturating_sub(self.pending_queue.len());
for _ in 0..tokens_needed {
jobserver_helper.request_token();
Review comment (Member):

I'm a little worried about this since it will request far more tokens than before. I don't think we want to make these requests on each turn of the loop; we should still strive to match requests to acquires 1:1.
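
A sketch of the 1:1 request/acquire bookkeeping being asked for here, with hypothetical names: request a token only when a unit of work actually becomes ready, instead of re-requesting the whole deficit on every turn of the loop.

struct TokenBookkeeping {
    // Requests sent to the jobserver helper that have not yet been
    // answered with an acquired token.
    outstanding_requests: usize,
}

impl TokenBookkeeping {
    // Called once per unit of work that becomes ready: exactly one request.
    fn unit_ready(&mut self, request_token: impl FnOnce()) {
        request_token();
        self.outstanding_requests += 1;
    }

    // Called when the jobserver helper reports an acquired token.
    fn token_acquired(&mut self) {
        self.outstanding_requests -= 1;
    }
}

fn main() {
    let mut b = TokenBookkeeping { outstanding_requests: 0 };
    b.unit_ready(|| { /* jobserver_helper.request_token() in Cargo */ });
    b.token_acquired();
    assert_eq!(b.outstanding_requests, 0);
}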

Resolved review comment on: src/cargo/core/compiler/job_queue.rs
@Mark-Simulacrum (Member Author)

Okay, I dropped the commit that refactored things into unused_tokens and so forth, and then restored a bit of it.

I also moved the state into a new struct which only exists while we're draining the job queue.

@Mark-Simulacrum (Member Author)

Turns out the crossbeam scope was needed, despite what the comment indicated...

Resolved review comments on: src/cargo/core/compiler/job_queue.rs (×2)
}
}
} else {
self.tick_progress();
Review comment (Member):

I think the purpose of this loop is to not call this when the events are not empty, so was this perhaps a stray refactoring error?

Review comment (Member Author):

Especially in the case where rustc is using this channel for the jobserver protocol, events may not be empty for a long time (i.e., we're constantly waiting for threads, releasing tokens, etc.). I can drop this from this PR though and file it separately, since it's a drive-by change if anything.

Review comment (Member):

Hm ok, mind moving this to a separate commit? I think it'd be best to handle this with a sort of heartbeat message if it becomes problematic. Did you see extraneous stuttering you weren't expecting when building though?
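
For reference, a heavily simplified sketch of the loop shape under discussion (everything here is a stand-in, not the actual drain_the_queue code): the progress bar only ticks when no events arrived within the timeout, which is why a rustc chatting constantly over this channel could keep it from ticking.

use std::time::Duration;

enum Event {
    // Finished jobs, diagnostics, jobserver notifications, ...
    Placeholder,
}

// Stubbed out: in Cargo this blocks on a channel with a timeout.
fn wait_for_events(_timeout: Duration) -> Vec<Event> {
    Vec::new()
}

fn tick_progress() {
    // Redraw the progress bar.
}

fn handle(_event: Event) {}

fn main() {
    for _ in 0..3 {
        let events = wait_for_events(Duration::from_millis(500));
        if events.is_empty() {
            // Only reached when nothing happened before the timeout.
            tick_progress();
        } else {
            for event in events {
                handle(event);
            }
        }
    }
}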

Resolved review comments on: src/cargo/core/compiler/job_queue.rs (×6)
@Mark-Simulacrum force-pushed the chatty-jobserver branch 2 times, most recently from f247393 to eee14c0, on January 22, 2020 19:26
@Mark-Simulacrum (Member Author)

I think this latest push resolves your comments (I dropped the tick_progress change entirely and will file a follow-up PR once this lands).

I also realize this has a ton of commits now -- do you want me to squash them down?

I'll look at the CI failures later.

This also moves and enhances the message that logs the state before blocking.
@Mark-Simulacrum (Member Author)

CI is now passing.

@alexcrichton (Member)

@bors: r+

Very nice, thanks so much! Commits are fine :)

@bors (Contributor) commented Jan 22, 2020

📌 Commit 7e3e1b1 has been approved by alexcrichton

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests; bors will change the label on completion) and removed the S-waiting-on-review label on Jan 22, 2020
@bors (Contributor) commented Jan 22, 2020

⌛ Testing commit 7e3e1b1 with merge 8ee3d92...

bors added a commit that referenced this pull request Jan 22, 2020
Scalable jobserver for rustc

(The commit message repeats the PR description above.)
@bors (Contributor) commented Jan 22, 2020

☀️ Test successful - checks-azure
Approved by: alexcrichton
Pushing 8ee3d92 to master...

@bors merged commit 7e3e1b1 into rust-lang:master on Jan 22, 2020
@bors mentioned this pull request on Jan 22, 2020
@ehuss (Contributor) commented Jan 25, 2020

This PR seems to be causing panics with debug builds of Cargo. @Mark-Simulacrum can you check? The repro I have is:

  1. cargo new foo
  2. cd foo
  3. cargo add bitflags
  4. cargo test

It panics at progress.rs:259 where in this case that function is called with cur=2 and max=1. Here's the full backtrace:

backtrace
   Compiling bitflags v1.2.1
thread 'main' panicked at 'attempt to subtract with overflow', src/cargo/util/progress.rs:259:21
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /Users/runner/.cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /Users/runner/.cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:77
   3: ::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1057
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:195
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:215
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:463
  11: rust_begin_unwind
             at src/libstd/panicking.rs:371
  12: std::panicking::begin_panic
  13: std::panicking::begin_panic
  14: cargo::util::progress::Format::progress
             at src/cargo/util/progress.rs:259
  15: cargo::util::progress::State::tick
             at src/cargo/util/progress.rs:176
  16: cargo::util::progress::Progress::tick_now
             at src/cargo/util/progress.rs:107
  17: cargo::core::compiler::job_queue::DrainState::tick_progress
             at src/cargo/core/compiler/job_queue.rs:714
  18: cargo::core::compiler::job_queue::DrainState::wait_for_events
             at src/cargo/core/compiler/job_queue.rs:600
  19: cargo::core::compiler::job_queue::DrainState::drain_the_queue
             at src/cargo/core/compiler/job_queue.rs:648
  20: cargo::core::compiler::job_queue::JobQueue::execute::{{closure}}
             at src/cargo/core/compiler/job_queue.rs:384
  21: crossbeam_utils::thread::scope::{{closure}}
             at /Users/eric/.cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/crossbeam-utils-0.7.0/src/thread.rs:161
  22:  as core::ops::function::FnOnce<()>>::call_once
             at /rustc/85976442558bf2d09cec3aa49c9c9ba86fb15c1f/src/libstd/panic.rs:318
  23: std::panicking::try::do_call
             at /rustc/85976442558bf2d09cec3aa49c9c9ba86fb15c1f/src/libstd/panicking.rs:296
  24: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:79
  25: std::panicking::try
             at /rustc/85976442558bf2d09cec3aa49c9c9ba86fb15c1f/src/libstd/panicking.rs:272
  26: std::panic::catch_unwind
             at /rustc/85976442558bf2d09cec3aa49c9c9ba86fb15c1f/src/libstd/panic.rs:394
  27: crossbeam_utils::thread::scope
             at /Users/eric/.cargo/registry/src/gh.neting.cc-1ecc6299db9ec823/crossbeam-utils-0.7.0/src/thread.rs:161
  28: cargo::core::compiler::job_queue::JobQueue::execute
             at src/cargo/core/compiler/job_queue.rs:384
  29: cargo::util::config::clippy_driver::{{closure}}
  30: cargo::ops::cargo_compile::compile_ws
             at src/cargo/ops/cargo_compile.rs:469
  31: cargo::ops::cargo_compile::compile_with_exec
             at src/cargo/ops/cargo_compile.rs:259
  32: cargo::ops::cargo_compile::compile
             at src/cargo/ops/cargo_compile.rs:248
  33: cargo::ops::cargo_test::compile_tests
             at src/cargo/ops/cargo_test.rs:69
  34: cargo::ops::cargo_test::run_tests
             at src/cargo/ops/cargo_test.rs:21
  35: cargo::commands::test::exec
             at src/bin/cargo/commands/test.rs:162
  36: cargo::cli::execute_subcommand
             at src/bin/cargo/cli.rs:197
  37: cargo::cli::main
             at src/bin/cargo/cli.rs:107
  38: cargo::main
             at src/bin/cargo/main.rs:39
  39: std::rt::lang_start::{{closure}}
             at /rustc/85976442558bf2d09cec3aa49c9c9ba86fb15c1f/src/libstd/rt.rs:67
  40: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  41: std::panicking::try::do_call
             at src/libstd/panicking.rs:296
  42: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:79
  43: std::panicking::try
             at src/libstd/panicking.rs:272
  44: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  45: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  46: std::rt::lang_start
             at /rustc/85976442558bf2d09cec3aa49c9c9ba86fb15c1f/src/libstd/rt.rs:67
  47: cargo::init_git_transports

@Mark-Simulacrum (Member Author)

Hm, not impossible, will take a look. I guess we're generating too many completed jobs?

It'll probably be tomorrow at the earliest, though; if necessary, please do revert.

@Mark-Simulacrum deleted the chatty-jobserver branch on January 25, 2020 19:12
@Mark-Simulacrum (Member Author)

Filed #7829 to resolve this bug.

bors added a commit that referenced this pull request Jan 26, 2020
Store maximum queue length

Previously, the queue length was constantly decreasing as we built crates, which
meant that we were incorrectly displaying the progress bar. In debug builds,
this even led to panics (due to underflow on subtraction).

Not sure if we can add a test case for this. I have made the panic unconditional on release/debug though by explicitly checking that current is less than the maximum for the progress bar.

Fixes #7731 (comment).
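
A minimal sketch of the kind of guard that commit message describes (not the actual progress.rs code): keep the maximum fixed and panic in all build profiles if the current value ever exceeds it.

fn progress_fraction(cur: usize, max: usize) -> f64 {
    // Unconditional check (not just a debug-build overflow): with the
    // values from the bug report (cur=2, max=1) this fails loudly.
    assert!(cur <= max, "progress bar cur ({}) exceeds max ({})", cur, max);
    cur as f64 / max as f64
}

fn main() {
    // With the maximum queue length stored up front, cur stays <= max.
    println!("{:.0}%", 100.0 * progress_fraction(3, 4));
}
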
Comment on lines +507 to +514
while let Err(e) = client.acquire_raw() {
anyhow::bail!(
"failed to fully drain {}/{} token from jobserver at startup: {:?}",
i,
tokens,
e,
);
}
Review comment (Contributor):

What is the intent of this loop? Should it maybe be something like this?

client.acquire_raw().chain_err(|| {
    format!(
        "failed to fully drain {}/{} token from jobserver at startup",
        i, tokens
    )
})?;

Review comment (Member Author):

Hm, I think so -- it doesn't really make any difference (bail! returns from the method anyway), but I'll send a PR cleaning it up. I suspect I had intended this as "loop until you error" but that's clearly not what this does :)

bors added a commit that referenced this pull request Jan 31, 2020
[beta] Revert scalable jobserver.

This reverts #7731, #7829, and #7836.

This should prevent #7840 on beta. I feel more comfortable reverting this than merging #7844, since the impact is still unknown.
@ehuss added this to the 1.42.0 milestone on Feb 6, 2022
@weihanglo mentioned this pull request on Feb 24, 2023