Limiting the parallelism automatically #12912

weihanglo · 2023-11-03T05:17:21Z

tl;dr: Introduce a simple mechanism for limiting parallelism automatically in Cargo, to avoid consuming all system resources during the compilation.

Problem

Cargo by default uses all cores (std::thread::available_parallelism) and spawns off rustc or build scripts onto each core. This is not an issue when compiling on a decent machine. When working on low-end machines or large-scale codebase, developers often encounter issue like extremely high CPU loads or out-of-memory errors.

To solve these problem, developers can set --jobs from command line, or build.jobs in .cargo/config.toml to control the maximum parallelism Cargo can use. This is not ideal because

build.jobs is bound to number of core in use. It is not immediately correlated with memory usage. Executing parallel builds might lead to out of memory before any CPU throttling happens, especially when several linker invocations happens.
build.jobs assigns core fairly to each unit of work (i.e. a crate build"). However, some crate builds might consume more computing resources. If those crate builds are bottlenecks of the entire compilation, we might want to throw more resources to complete them to unblock other crate builds.
Developers need to set build.jobs explicitly to control the parallelism. However, it's often a long trial-and-error to figure out a proper value. The value also varies in different environments. Not really user friendly.
Developers might want a full control of every dependency build. build.jobs is too coarse-grained.

An "ideal" approach (but not now)

There are a couple of existing proposals trying to improve the situation. Some of them want to define a weight on a certain job, or tag jobs into a group. With weights and tags, job scheduler understands whether it should allocate a job. This is pretty much the ideal solution, as it maximizes the control of parallelism for developers, and the system could be extend to the job scheduling optimization.

However, such a system requires developers to fully understand the entire compilation of their projects. For now, the data is either missing or hard to get from Cargo. To incrementally build the system, there are prerequisites:

Cargo can monitor the resource usage of the system and each unit of work during a build.
Cargo can persist the resource usage of each unit of work for each build.

Start small

We should start small, and focus on monitoring the resource usage, additionally limiting the parallelism when the usage exceeds a threshold.

Some options we can do:

Assign the maximum amount of resources that Cargo can use. This is how build.jobs works now. We might need an equivalent for memory usage. Something like
```
[build.limit]
local-memory = "3GiB" # or "95%" or "100% - 200MiB"
```
Set a system threshold. Cargo won't allocate any new job and wait for the entire system usage going down, even when the usage of Cargo itself is still under the assigned maximum.
```
[build.limit]
system = "3GiB" # or "95%" or "100% - 200MiB"
cpu = "100%"
```

To minimize the impact of bad data points, these metrics will be sampled and averaged out within a period of time.

Instead of "usage", we can also leverage the concept "load average" from Unix-like, which might make more sense to manage computing resource loads.

I entirely don't know which one we want, or both, or none.

Library to use

procfs — used by wider Rust web-dev community, via promethues and other metrics crates.
sysinfo — another popular crate for inspecting system info.

Both of then introduce an excessive amount of code Cargo doesn't need at this moment.

Alternatively, we can use syscall lib directly to get these info.

Prior arts

Bazel
- --jobs
- --local_{ram,cpu}_resources to assign resources a build can use
Buck
- --jobs
- link_weight to config how many job a link job consumes.
Cabel
- --jobs
- Got the same linker invocation issue Add option to limit number of concurrent calls to linker when building with -j haskell/cabal#1529.
CMake
- -j to set max number of concurrent processes
GitHub Actions
- has concurrency.group
Go
- go build -p limits the number of programs, such as build commands or test binaries, that can be run in parallel.
- GOMAXPROCS to limit the number of OS threads that can execute user-level Go code simultaneously.
Gradle
- --max-workers — like --jobs
- Has a SharedResourceLeaseRegistry for registering a resource with its maximum lease numbers. Like a semaphore.
- Parallelism can be configured per-project on demand.
make
- -j to set max number of concurrent jobs
- --max-load to limit the start of a new job if load average goes above the value
- Read Parallel for more
Ninja
- has a pool concept that user can assign some stage of build to a pool with more restricted parallelism rules.
Nix
- max-jobs
sbt
- tasks are tagged, and each tag get a default weight of resource restriction.

Related issues

There are more issues regaring scheduling but I dont want to link to them here. These are issue of people trying to tell Cargo not to be that greedy.

And sorry I opened a new issue instead. Feel free to close and move to any existing one.

The text was updated successfully, but these errors were encountered:

epage · 2023-11-03T13:40:28Z

#9250 is an interesting alternative for the CPU load aspect. I've not done enough with nice to know how cross platform the concept is or if there are restrictions that might get in the way.

In general, with all of the security and docker-like technologies out these days, I wonder if there is more we can delegate to the operating system for this which would likely reduce complexity and overhead within cargo.

epage · 2023-11-03T13:43:12Z

local-memory = "3GiB" # or "95%" or "100% - 200MiB"

On the surface, percentages seem nice because you don't have to worry about the exact configuration of the local system. However, 90% of 64GB is a lot more usable of a system than 90% of 8GB. I feel like what will be most useful is "all except". We covered this with --jobs by allowing negative numbers. We could do similar here where -3GB means "all but 3GB"

epage · 2023-11-03T13:45:50Z

To minimize the impact of bad data points, these metrics will be sampled and averaged out within a period of time.

The meaning of such an average dramatically changes whenever a job finishes and a new one starts, especially if there are jobs or categories of jobs (e.g. linking) with dramatically different characteristics.

epage · 2023-11-03T13:47:44Z

With the parallel frontend rearing its head again, we should probably consider how that affects this.

the8472 · 2023-11-04T15:37:53Z

We should start small, and focus on monitoring the resource usage, additionally limiting the parallelism when the usage exceeds a threshold.

On linux specifically it might be better to monitor pressure rather than utilization. The downside is that that's even less portable.

clarfonthey · 2023-11-05T18:43:53Z

FWIW, I personally was looking into a way of cross-platform measuring load average when looking for a solution to this, since it's not just Linux that would benefit from that metric. It is doable, but annoying, and I personally don't know or care enough about Windows to formulate a proper solution.

It would be nice if whatever solution is made could be applied generically to projects, since I have a strong feeling more than just Cargo could benefit from it.

epage · 2023-11-06T15:21:02Z

btw doctests are another place where we can hit this (see #12916). We should keep in mind a way to carry this information forward to those.

clarfonthey · 2023-11-06T17:09:42Z

Part of the reason why I mention a general solution is because, although the implementation would be complicated, the actual API probably wouldn't. Right now, things just check the number of concurrent threads the CPU can run and only let that many threads run at a time. The biggest change is that, before spawning a thread, you have to verify that both the number of threads is low enough and the system load is low enough. The API could be something as simple as fn system_load() -> f64 in theory and you'd just verify that it's below the desired number.

Of course, the tricky thing is making sure that you compute that number and fetch it quickly, that it's consistent across platforms, and that you set the right load limit, which could very well depend on the system.

the8472 · 2023-11-06T17:43:12Z

I don't think an in-process API to query load would be sufficient because builds can involve multiple processes (cargo spawning rustc instances). To throttle entire process trees, potentially after processes have been spawned (e.g. to keep rustc from spawning even more threads after it has already started) we need IPC.
So that basically means a more powerful jobserver protocol because currently the jobserver just consists of pouring some tokens (bytes) into a pipe and then having processes contend for them in a decentralized fashion.

If we had each process connect to a jobserver through a separate connection and signalled intent (spawn thread vs. spawn child process) then the jobserver could dole out tokens more intelligently, withhold them under system load and even ask child processes to go quiescent for a while.
Supporting that and the make jobserver protocol at the same time (by shuffling tokens back and forth between them) would be quite complicated.

weihanglo · 2023-11-06T17:57:54Z

A zulip topic is opened for a similar discussion as well: https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/parallel.20.28link.29.20jobs.20and.20OOM.20cargo.2312912.

luser · 2023-11-07T14:31:40Z

Instead of "usage", we can also leverage the concept "load average" from Unix-like, which might make more sense to manage computing resource loads.

FYI GNU make has long supported a basic version of this with its --max-load option: https://www.gnu.org/software/make/manual/make.html#Parallel . However, in practice it never seemed to work well, since by the time you measure that the load average is above your target, it's too late and the system is overloaded.

the8472 · 2023-11-07T14:53:17Z

I wonder how thread priorities interact with concurrent pipe reads. I.e. do the highest-priority readers get woken up first? If so cargo could run at a higher priority than rustc or the linkers and reclaim tokens under pressure.
And rustc could release and reacquire a token every second to see if cargo wants it back.

This would allow more granularity than just not starting a new rustc/link job when load is high.

weihanglo · 2023-11-09T21:14:20Z

do the highest-priority readers get woken up first? If so cargo could run at a higher priority than rustc or the linkers and reclaim tokens under pressure.

Thanks for the suggestion. That's definitely and thing we can look into. I remember people did something similar in the past rust-lang/rust#67398, but rustc turns to go the other way for parallelism story recently.

I tend to avoid a general mechanism touching every components under Rust project, which is harder to move forward. And in any case, cargo might still need a way to monitor some indicators to control the overall parallelism, from a build system perspective. They seems somehow independent and can be done separately.

weihanglo · 2023-11-09T21:20:55Z

However, in practice it never seemed to work well, since by the time you measure that the load average is above your target, it's too late and the system is overloaded.

Okay, load average sounds like a lagging indicator here. @luser do you know any other indicators might help? I've done a survey for some major build tool listed in the issue descritption, but can't see any other interesting indicator they expose in the CLI interface. Or if you know any of their implementation has a fancy resource monitoring logic, please let us know. Personally I am looking for automatic way without user interference first, then we can start thinking the interface and scheduling issues.

the8472 · 2023-11-09T22:11:31Z

Memory pressure could work to some extent because it includes page reclaims. If build processes gobble up enough ram that it forces the kernel to synchronously clean up caches or even paging that's an indication that memory reserves are running low some time before OOM conditions are reached. The question is whether it's early enough.

If a single linker job eats half the available memory but only counts as one job token then even 1 token too many can be problematic if a linker job is already running and that remaining token would be used to start another one.
Ultimately job tokens are intend to regulate core utilization, not memory utilization so there's an impedance mismatch.

Core utilization is kinda easy to regulate and predict. 1 compute-bound process/thread = 1 token.

Memory utilization is more difficult because we lack estimators for that.

Some ideas:

tell users to add copious amounts of swap. Swap allows processes to live (even if under painful conditions), which gives us room to react to swap pressure if we can't proact, e.g. by temporarily quiescing processes until the remaining ones can make progress without thrashing. once pressure subsides the other processes can resume work
automatically recover from OOMs by reserving more job tokens and then restarting the child process, under the assumption that it'll succeed if it has fewer siblings
- variation: instead of waiting for system-wide OOM conditions we can set per-process memory limits (or cgroups or whatever) and let them die earlier and then retry them later with a higher limit
try to dial things back (by reclaiming tokens) at the earliest sign of pressure/high load/whatever. this won't help in worst-case scenarios where a single job needs all the available memory
develop memory predictors and then have a 2nd jobserver that doles out tokens where 1 token = 256MB or something like that

weihanglo · 2023-11-16T05:33:21Z

Just post what I found from https://gcc.gnu.org/wiki/DebugFission:

As a rule of thumb, the link job total memory requirements can be estimated at about 200% of the total size of its input files.

Might help to predict/analyze possible memory consumption for linking.

soloturn · 2023-12-30T11:55:36Z

here a ticket towards rust-lang, when memory consumption was unpredictable when cargo used all threads to link binaries, and caused OOM when compiling cosmic: rust-lang/rust#114037 . not sure what happened. but it seems that the situation improved in august 2023.

keeping the system responsive is a different matter, and we solve it by using "nice cargo ..." or "nice paru ...", in arch linux. giving lower prio to processes is operating specific and, at least in my opinion, needs to stay OUTSIDE of cargo. because what is "nice" in linux, is "start /low" in windows: https://stackoverflow.com/questions/4208/windows-equivalent-of-nice .

sunshowers · 2024-04-03T02:16:31Z

Wanted to add that nextest also has several knobs for this:

heavy tests, some tests take up more concurrency slots than others: https://nexte.st/book/threads-required
test groups and mutual exclusion: https://nexte.st/book/test-groups
more generally, profiles to control runner behavior: https://nexte.st/book/configuration.html#profiles

The context is that in nextest we also wanted to try and avoid test contention in high-core situations (e.g. oxidecomputer/omicron#5380) -- we were looking to see if there was prior art for using an expression language to define concurrency limits, or other static/dynamic behavior. @epage kindly linked me to this thread -- thanks!

sunshowers · 2024-04-03T02:18:50Z

Memory utilization is more difficult because we lack estimators for that.

A practical approach may be to record and store historical metrics, and use them to predict future performance.

Summary: Cargo doesn't have good mechanism to limit parallerism to avoid OOMs: rust-lang/cargo#12912 Reduce number of parallel jobs to save some memory. Default is number of logical processors, so this will run twice less jobs. If it keeps failing, we may need to reduce it further as it's still using over 25 GB . Reviewed By: blackm00n Differential Revision: D59054227 fbshipit-source-id: 708265891d69c0264392d0b24e29ac1f6b9818dd

what I am seeing with the move to Linode VMs is essentially this: rust-lang/cargo#9157 rust-lang/cargo#12912 B/c these new VMs have more CPU cores, 16 (new) vs 4(old), compilation is faster however this causes cargo to be overzealous and spawn too many linker processes which consume all of the available memory (on a 64 GB VM) and causes an OOM error forcing the kernel to kill the linker process and causing cargo to fail! Another alternative, which works, is using `--jobs 8`, however that is less optimal b/c it leaves unused CPU capacity and also affects the number of parallel threads when executing the test suite! WARNING: using `--release` is not an option because it breaks tests. The polkadot-sdk code uses the macro defensive! which is designed to panic when running in debug mode and multiple test scenarios rely on this behavior via #[should_panic]! WARNING: we still need the 64 GB memory!

epage mentioned this issue Nov 6, 2023

cargo test renders device unresponsive due to MSVC linker RAM usage #12916

Open

weihanglo mentioned this issue Nov 10, 2023

Provide scheduling hints? #7437

Open

soloturn mentioned this issue Dec 30, 2023

Introduce 'nice' value under cargo.toml -> [build] #9250

Open

dionysius mentioned this issue Jun 6, 2024

process didn't exit successfully dionysius/vaultwarden-deb#13

Closed

weihanglo mentioned this issue Jul 4, 2024

Memory leak/spike during doc-tests #14190

Closed

weihanglo added S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. and removed S-triage Status: This issue is waiting on initial triage. labels Jul 11, 2024

weihanglo mentioned this issue Oct 7, 2024

Job server strongly conflicts with multithreaded linkers #14651

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limiting the parallelism automatically #12912

Limiting the parallelism automatically #12912

weihanglo commented Nov 3, 2023 •

edited

Loading

epage commented Nov 3, 2023 •

edited

Loading

epage commented Nov 3, 2023

epage commented Nov 3, 2023

epage commented Nov 3, 2023

the8472 commented Nov 4, 2023

clarfonthey commented Nov 5, 2023 •

edited

Loading

epage commented Nov 6, 2023

clarfonthey commented Nov 6, 2023

the8472 commented Nov 6, 2023 •

edited

Loading

weihanglo commented Nov 6, 2023 •

edited

Loading

luser commented Nov 7, 2023

the8472 commented Nov 7, 2023 •

edited

Loading

weihanglo commented Nov 9, 2023

weihanglo commented Nov 9, 2023

the8472 commented Nov 9, 2023

weihanglo commented Nov 16, 2023

soloturn commented Dec 30, 2023 •

edited

Loading

sunshowers commented Apr 3, 2024

sunshowers commented Apr 3, 2024 •

edited

Loading

Limiting the parallelism automatically #12912

Limiting the parallelism automatically #12912

Comments

weihanglo commented Nov 3, 2023 • edited Loading

Problem

An "ideal" approach (but not now)

Start small

Library to use

Prior arts

Related issues

epage commented Nov 3, 2023 • edited Loading

epage commented Nov 3, 2023

epage commented Nov 3, 2023

epage commented Nov 3, 2023

the8472 commented Nov 4, 2023

clarfonthey commented Nov 5, 2023 • edited Loading

epage commented Nov 6, 2023

clarfonthey commented Nov 6, 2023

the8472 commented Nov 6, 2023 • edited Loading

weihanglo commented Nov 6, 2023 • edited Loading

luser commented Nov 7, 2023

the8472 commented Nov 7, 2023 • edited Loading

weihanglo commented Nov 9, 2023

weihanglo commented Nov 9, 2023

the8472 commented Nov 9, 2023

weihanglo commented Nov 16, 2023

soloturn commented Dec 30, 2023 • edited Loading

sunshowers commented Apr 3, 2024

sunshowers commented Apr 3, 2024 • edited Loading

weihanglo commented Nov 3, 2023 •

edited

Loading

epage commented Nov 3, 2023 •

edited

Loading

clarfonthey commented Nov 5, 2023 •

edited

Loading

the8472 commented Nov 6, 2023 •

edited

Loading

weihanglo commented Nov 6, 2023 •

edited

Loading

the8472 commented Nov 7, 2023 •

edited

Loading

soloturn commented Dec 30, 2023 •

edited

Loading

sunshowers commented Apr 3, 2024 •

edited

Loading