
Limiting the parallelism automatically #12912

Open
weihanglo opened this issue Nov 3, 2023 · 19 comments
Labels
A-jobserver Area: jobserver, concurrency, parallelism C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted.

Comments

@weihanglo
Member

weihanglo commented Nov 3, 2023

tl;dr: Introduce a simple mechanism for limiting parallelism automatically in Cargo, to avoid consuming all system resources during compilation.

Problem

Cargo by default uses all cores (std::thread::available_parallelism) and spawns off rustc or build scripts onto each core. This is not an issue when compiling on a decent machine. When working on low-end machines or large-scale codebases, developers often encounter issues like extremely high CPU load or out-of-memory errors.

To solve these problems, developers can set --jobs from the command line, or build.jobs in .cargo/config.toml, to control the maximum parallelism Cargo can use. This is not ideal because

  • build.jobs is bound to the number of cores in use. It is not directly correlated with memory usage. Parallel builds might run out of memory before any CPU throttling happens, especially when several linker invocations happen at once.
  • build.jobs assigns cores fairly to each unit of work (i.e. a crate build). However, some crate builds consume more computing resources than others. If those crate builds are bottlenecks of the entire compilation, we might want to throw more resources at them to unblock other crate builds.
  • Developers need to set build.jobs explicitly to control the parallelism. However, figuring out a proper value often takes long trial-and-error, and the value varies across environments. Not really user friendly.
  • Developers might want full control over every dependency build. build.jobs is too coarse-grained.

An "ideal" approach (but not now)

There are a couple of existing proposals trying to improve the situation. Some of them want to define a weight for a certain job, or tag jobs into a group. With weights and tags, the job scheduler understands whether it should allocate a job. This is pretty much the ideal solution, as it maximizes developers' control over parallelism, and the system could be extended to job-scheduling optimization.

However, such a system requires developers to fully understand the entire compilation of their projects. For now, the data is either missing or hard to get from Cargo. To incrementally build the system, there are prerequisites:

  • Cargo can monitor the resource usage of the system and each unit of work during a build.
  • Cargo can persist the resource usage of each unit of work for each build.

Start small

We should start small, and focus on monitoring the resource usage, additionally limiting the parallelism when the usage exceeds a threshold.

Some options we can do:

  • Assign the maximum amount of resources that Cargo can use. This is how build.jobs works now. We might need an equivalent for memory usage. Something like
    [build.limit]
    local-memory = "3GiB" # or "95%" or "100% - 200MiB"
  • Set a system threshold. Cargo won't allocate any new job and will wait for the overall system usage to go down, even when Cargo's own usage is still under the assigned maximum.
    [build.limit]
    system = "3GiB" # or "95%" or "100% - 200MiB"
    cpu = "100%"

To minimize the impact of bad data points, these metrics will be sampled and averaged out within a period of time.
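As a rough illustration of the sampling-and-averaging idea (not a concrete design), the gate before dispatching a new unit of work could look like the sketch below; `limit_bytes` would come from a hypothetical `[build.limit]` setting, and the probe value from whatever system query we end up using.

```rust
// Minimal sketch, not Cargo's actual scheduler: keep a moving average of
// sampled memory usage and only allow a new job while the average is below
// the configured limit. The probe value is supplied by the caller.
use std::collections::VecDeque;

struct MemorySampler {
    window: VecDeque<u64>, // most recent samples, in bytes
    capacity: usize,       // how many samples to average over
}

impl MemorySampler {
    fn new(capacity: usize) -> Self {
        Self { window: VecDeque::with_capacity(capacity), capacity }
    }

    /// Record one sample and return the average over the window.
    fn record(&mut self, sample_bytes: u64) -> u64 {
        if self.window.len() == self.capacity {
            self.window.pop_front();
        }
        self.window.push_back(sample_bytes);
        self.window.iter().sum::<u64>() / self.window.len() as u64
    }
}

/// Decide whether the scheduler may start another unit of work.
fn may_spawn_new_job(sampler: &mut MemorySampler, limit_bytes: u64, probe_bytes: u64) -> bool {
    sampler.record(probe_bytes) < limit_bytes
}
```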

Instead of "usage", we can also leverage the concept "load average" from Unix-like, which might make more sense to manage computing resource loads.

I don't know yet which one we want, or both, or neither.

Library to use

  • procfs — used by the wider Rust web-dev community, via prometheus and other metrics crates.
  • sysinfo — another popular crate for inspecting system info.

Both of them introduce an excessive amount of code Cargo doesn't need at this moment.

Alternatively, we can call the relevant syscalls or kernel interfaces directly to get this info.
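For example, on Linux the memory side could be a single read of /proc/meminfo rather than a whole crate. A minimal, Linux-only sketch with error handling elided:

```rust
// Linux-only sketch: query available memory by parsing /proc/meminfo directly,
// without pulling in the procfs or sysinfo crates.
use std::fs;

fn mem_available_bytes() -> Option<u64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;
    for line in meminfo.lines() {
        // Example line: "MemAvailable:   16263088 kB"
        if let Some(rest) = line.strip_prefix("MemAvailable:") {
            let kib: u64 = rest.trim().trim_end_matches("kB").trim().parse().ok()?;
            return Some(kib * 1024);
        }
    }
    None
}
```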

Prior art

  • Bazel

  • Buck

    • --jobs
    • link_weight to configure how many jobs a link job consumes.
  • Cabal

  • CMake

    • -j to set max number of concurrent processes
  • GitHub Actions

    • has concurrency.group
  • Go

    • go build -p limits the number of programs, such as build commands or test binaries, that can be run in parallel.
    • GOMAXPROCS to limit the number of OS threads that can execute user-level Go code simultaneously.
  • Gradle

    • --max-workers — like --jobs
    • Has a SharedResourceLeaseRegistry for registering a resource with its maximum number of leases. Like a semaphore.
    • Parallelism can be configured per-project on demand.
  • make

    • -j to set max number of concurrent jobs
    • --max-load to limit the start of a new job if load average goes above the value
    • Read Parallel for more
  • Ninja

    • has a pool concept: users can assign some stages of the build to a pool with more restrictive parallelism rules.
  • Nix

  • sbt

Related issues

There are more issues regarding scheduling, but I don't want to link them all here. These are issues from people trying to tell Cargo not to be that greedy.

And sorry I opened a new issue instead. Feel free to close and move to any existing one.

@weihanglo weihanglo added C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` A-jobserver Area: jobserver, concurrency, parallelism S-triage Status: This issue is waiting on initial triage. S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. and removed S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. labels Nov 3, 2023
@epage
Contributor

epage commented Nov 3, 2023

#9250 is an interesting alternative for the CPU load aspect. I've not done enough with nice to know how cross platform the concept is or if there are restrictions that might get in the way.

In general, with all of the security and docker-like technologies out these days, I wonder if there is more we can delegate to the operating system for this which would likely reduce complexity and overhead within cargo.

@epage
Contributor

epage commented Nov 3, 2023

local-memory = "3GiB" # or "95%" or "100% - 200MiB"

On the surface, percentages seem nice because you don't have to worry about the exact configuration of the local system. However, 90% of 64GB leaves a lot more of a usable system than 90% of 8GB. I feel like what will be most useful is "all except". We covered this with --jobs by allowing negative numbers. We could do something similar here, where -3GB means "all but 3GB".
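A minimal sketch of resolving such a spec, assuming a hypothetical syntax where a negative value means "total minus N":

```rust
// Sketch only: resolve a configured memory limit against the machine's total.
// A non-negative value is an absolute cap; a negative value means "all but N".
fn resolve_memory_limit(configured_bytes: i64, total_bytes: u64) -> u64 {
    if configured_bytes >= 0 {
        (configured_bytes as u64).min(total_bytes)
    } else {
        // e.g. -3 GiB on a 16 GiB machine resolves to 13 GiB
        total_bytes.saturating_sub(configured_bytes.unsigned_abs())
    }
}
```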

@epage
Contributor

epage commented Nov 3, 2023

To minimize the impact of bad data points, these metrics will be sampled and averaged out within a period of time.

The meaning of such an average dramatically changes whenever a job finishes and a new one starts, especially if there are jobs or categories of jobs (e.g. linking) with dramatically different characteristics.

@epage
Contributor

epage commented Nov 3, 2023

With the parallel frontend rearing its head again, we should probably consider how that affects this.

@the8472
Member

the8472 commented Nov 4, 2023

We should start small, and focus on monitoring the resource usage, additionally limiting the parallelism when the usage exceeds a threshold.

On Linux specifically it might be better to monitor pressure rather than utilization. The downside is that that's even less portable.
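For reference, a minimal Linux-only sketch of reading PSI (pressure stall information) for memory; this parses /proc/pressure/memory, which is only available on kernels with PSI enabled:

```rust
// Linux-only sketch: read the "some" avg10 value from /proc/pressure/memory,
// i.e. the share of the last 10 seconds in which at least one task stalled
// waiting on memory.
use std::fs;

fn memory_pressure_avg10() -> Option<f64> {
    let psi = fs::read_to_string("/proc/pressure/memory").ok()?;
    // Example line: "some avg10=1.23 avg60=0.50 avg300=0.10 total=12345"
    let line = psi.lines().find(|l| l.starts_with("some"))?;
    let field = line.split_whitespace().find(|f| f.starts_with("avg10="))?;
    field.trim_start_matches("avg10=").parse().ok()
}
```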

@clarfonthey
Contributor

clarfonthey commented Nov 5, 2023

FWIW, I personally was looking into a way of cross-platform measuring load average when looking for a solution to this, since it's not just Linux that would benefit from that metric. It is doable, but annoying, and I personally don't know or care enough about Windows to formulate a proper solution.

It would be nice if whatever solution is made could be applied generically to projects, since I have a strong feeling more than just Cargo could benefit from it.

@epage
Contributor

epage commented Nov 6, 2023

btw doctests are another place where we can hit this (see #12916). We should keep in mind a way to carry this information forward to those.

@clarfonthey
Contributor

Part of the reason why I mention a general solution is because, although the implementation would be complicated, the actual API probably wouldn't. Right now, things just check the number of concurrent threads the CPU can run and only let that many threads run at a time. The biggest change is that, before spawning a thread, you have to verify that both the number of threads is low enough and the system load is low enough. The API could be something as simple as fn system_load() -> f64 in theory and you'd just verify that it's below the desired number.

Of course, the tricky thing is making sure that you compute that number and fetch it quickly, that it's consistent across platforms, and that you set the right load limit, which could very well depend on the system.
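A rough sketch of what that could look like at the call site, assuming a hypothetical `system_load()` whose cross-platform implementation is the hard part:

```rust
// Sketch of the proposed API shape: spawn a new worker only when both the
// worker count and the (hypothetical) system load are below their limits.
use std::thread;

fn system_load() -> f64 {
    // Placeholder: would query the OS (load average, PSI, perf counters, ...).
    // A value of 1.0 could mean "all cores busy".
    0.0
}

fn try_spawn_worker<F>(active_workers: usize, max_workers: usize, max_load: f64, work: F) -> bool
where
    F: FnOnce() + Send + 'static,
{
    if active_workers < max_workers && system_load() < max_load {
        thread::spawn(work);
        true
    } else {
        false
    }
}
```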

@the8472
Member

the8472 commented Nov 6, 2023

I don't think an in-process API to query load would be sufficient because builds can involve multiple processes (cargo spawning rustc instances). To throttle entire process trees, potentially after processes have been spawned (e.g. to keep rustc from spawning even more threads after it has already started) we need IPC.
So that basically means a more powerful jobserver protocol because currently the jobserver just consists of pouring some tokens (bytes) into a pipe and then having processes contend for them in a decentralized fashion.

If we had each process connect to a jobserver through a separate connection and signalled intent (spawn thread vs. spawn child process) then the jobserver could dole out tokens more intelligently, withhold them under system load and even ask child processes to go quiescent for a while.
Supporting that and the make jobserver protocol at the same time (by shuffling tokens back and forth between them) would be quite complicated.
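To make the shape of such a protocol concrete, here is a purely illustrative sketch of the messages it might exchange; this is not the existing make jobserver protocol, and all names are made up:

```rust
// Illustrative only: a connection-oriented jobserver protocol where clients
// signal intent and the server can actively ask for tokens back.
enum ClientRequest {
    /// Client wants a token to spawn a new thread.
    WantThread,
    /// Client wants a token to spawn a child process.
    WantProcess,
    /// Client returns a token it no longer needs.
    Release,
}

enum ServerReply {
    /// A token was granted.
    Granted,
    /// No token right now; try again later.
    Denied,
    /// The server asks the client to pause work and return tokens,
    /// e.g. because system load or memory pressure is high.
    GoQuiescent,
}
```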

@weihanglo
Member Author

weihanglo commented Nov 6, 2023

@luser
Contributor

luser commented Nov 7, 2023

Instead of "usage", we can also leverage the concept "load average" from Unix-like, which might make more sense to manage computing resource loads.

FYI GNU make has long supported a basic version of this with its --max-load option: https://www.gnu.org/software/make/manual/make.html#Parallel . However, in practice it never seemed to work well, since by the time you measure that the load average is above your target, it's too late and the system is overloaded.

@the8472
Member

the8472 commented Nov 7, 2023

I wonder how thread priorities interact with concurrent pipe reads. I.e. do the highest-priority readers get woken up first? If so cargo could run at a higher priority than rustc or the linkers and reclaim tokens under pressure.
And rustc could release and reacquire a token every second to see if cargo wants it back.

This would allow more granularity than just not starting a new rustc/link job when load is high.
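A rough sketch of that release/reacquire loop from the worker's point of view, with a hypothetical `Jobserver` trait standing in for the real protocol (not the jobserver crate's API):

```rust
// Sketch of the "release and reacquire periodically" idea: between chunks of
// work, offer the token back; if cargo reads the pipe at higher priority, it
// can win the race and keep the token while the system is under pressure.
use std::time::Duration;

trait Jobserver {
    fn release_token(&self);
    fn acquire_token(&self); // blocks until a token is available
}

fn cooperative_worker(js: &dyn Jobserver, mut do_some_work: impl FnMut() -> bool) {
    loop {
        // Do roughly a second's worth of work; `false` means we're done.
        if !do_some_work() {
            break;
        }
        // Offer the token back, give higher-priority readers a chance,
        // then block until we get a token again.
        js.release_token();
        std::thread::sleep(Duration::from_millis(1));
        js.acquire_token();
    }
}
```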

@weihanglo
Member Author

do the highest-priority readers get woken up first? If so cargo could run at a higher priority than rustc or the linkers and reclaim tokens under pressure.

Thanks for the suggestion. That's definitely a thing we can look into. I remember people did something similar in the past (rust-lang/rust#67398), but rustc seems to have gone the other way with the parallelism story recently.

I tend to avoid a general mechanism that touches every component under the Rust project, as that is harder to move forward. In any case, Cargo might still need a way to monitor some indicators to control the overall parallelism from a build-system perspective. These seem somewhat independent and can be done separately.

@weihanglo
Member Author

However, in practice it never seemed to work well, since by the time you measure that the load average is above your target, it's too late and the system is overloaded.

Okay, load average sounds like a lagging indicator here. @luser, do you know any other indicators that might help? I've done a survey of the major build tools listed in the issue description, but I can't see any other interesting indicator they expose in their CLI interfaces. If you know that any of their implementations has fancier resource-monitoring logic, please let us know. Personally, I am looking for an automatic way without user intervention first; then we can start thinking about the interface and scheduling issues.

@the8472
Member

the8472 commented Nov 9, 2023

Memory pressure could work to some extent because it includes page reclaims. If build processes gobble up enough RAM that the kernel is forced to synchronously clean up caches or even start paging, that's an indication that memory reserves are running low some time before OOM conditions are reached. The question is whether it's early enough.

If a single linker job eats half the available memory but only counts as one job token, then even one token too many can be problematic, if a linker job is already running and that remaining token would be used to start another one.
Ultimately, job tokens are intended to regulate core utilization, not memory utilization, so there's an impedance mismatch.

Core utilization is kinda easy to regulate and predict. 1 compute-bound process/thread = 1 token.

Memory utilization is more difficult because we lack estimators for that.

Some ideas:

  • tell users to add copious amounts of swap. Swap allows processes to live (even if under painful conditions), which gives us room to react to swap pressure if we can't act preemptively, e.g. by temporarily quiescing processes until the remaining ones can make progress without thrashing. Once pressure subsides, the other processes can resume work.
  • automatically recover from OOMs by reserving more job tokens and then restarting the child process, under the assumption that it'll succeed if it has fewer siblings
    • variation: instead of waiting for system-wide OOM conditions we can set per-process memory limits (or cgroups or whatever) and let them die earlier and then retry them later with a higher limit
  • try to dial things back (by reclaiming tokens) at the earliest sign of pressure/high load/whatever. this won't help in worst-case scenarios where a single job needs all the available memory
  • develop memory predictors and then have a 2nd jobserver that doles out tokens where 1 token = 256MB or something like that (a rough sketch of this follows below)
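For the last bullet, a rough sketch of what such a memory-token "second jobserver" could look like, as a plain counting semaphore; the token size and the rounding are arbitrary choices:

```rust
// Illustrative only: a memory budget measured in 256 MB tokens. Each unit of
// work acquires tokens proportional to its predicted memory use and returns
// them when it finishes.
use std::sync::{Condvar, Mutex};

const TOKEN_BYTES: u64 = 256 * 1024 * 1024;

struct MemoryBudget {
    free_tokens: Mutex<u64>,
    cond: Condvar,
}

impl MemoryBudget {
    fn new(total_bytes: u64) -> Self {
        Self { free_tokens: Mutex::new(total_bytes / TOKEN_BYTES), cond: Condvar::new() }
    }

    /// Block until enough memory tokens are available for this job.
    fn acquire(&self, predicted_bytes: u64) {
        let needed = predicted_bytes.div_ceil(TOKEN_BYTES).max(1);
        let mut free = self.free_tokens.lock().unwrap();
        while *free < needed {
            free = self.cond.wait(free).unwrap();
        }
        *free -= needed;
    }

    /// Return the tokens taken for this job.
    fn release(&self, predicted_bytes: u64) {
        let needed = predicted_bytes.div_ceil(TOKEN_BYTES).max(1);
        *self.free_tokens.lock().unwrap() += needed;
        self.cond.notify_all();
    }
}
```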

@weihanglo
Member Author

Just posting what I found at https://gcc.gnu.org/wiki/DebugFission:

As a rule of thumb, the link job total memory requirements can be estimated at about 200% of the total size of its input files.

Might help to predict/analyze possible memory consumption for linking.
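A trivial sketch of applying that rule of thumb to a link job's inputs; this is a heuristic only, and the 2x factor comes straight from the quote above:

```rust
// Sketch: estimate a link job's peak memory as roughly twice the total size
// of its input files (objects, rlibs, static libs). Missing files are skipped.
use std::fs;
use std::path::Path;

fn estimate_link_memory_bytes(inputs: &[&Path]) -> u64 {
    let total_input: u64 = inputs
        .iter()
        .filter_map(|p| fs::metadata(p).ok())
        .map(|m| m.len())
        .sum();
    total_input * 2
}
```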

@soloturn
Contributor

soloturn commented Dec 30, 2023

Here is a ticket against rust-lang from when memory consumption was unpredictable because cargo used all threads to link binaries, causing OOM when compiling COSMIC: rust-lang/rust#114037. Not sure what happened, but it seems that the situation improved in August 2023.

Keeping the system responsive is a different matter, and we solve it by using "nice cargo ..." or "nice paru ..." on Arch Linux. Giving lower priority to processes is operating-system specific and, at least in my opinion, needs to stay OUTSIDE of cargo, because what is "nice" on Linux is "start /low" on Windows: https://stackoverflow.com/questions/4208/windows-equivalent-of-nice .

@sunshowers
Contributor

Wanted to add that nextest also has several knobs for this:

The context is that in nextest we also wanted to try and avoid test contention in high-core situations (e.g. oxidecomputer/omicron#5380) -- we were looking to see if there was prior art for using an expression language to define concurrency limits, or other static/dynamic behavior. @epage kindly linked me to this thread -- thanks!

@sunshowers
Contributor

sunshowers commented Apr 3, 2024

Memory utilization is more difficult because we lack estimators for that.

A practical approach may be to record and store historical metrics, and use them to predict future performance.
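A minimal sketch of that idea: remember the peak memory each unit of work used in previous builds and use it (plus headroom) as the prediction for the next build. The storage format, the headroom factor, and the fallback value are all placeholders:

```rust
// Sketch only: keep per-unit peak memory observations across builds and use
// the historical maximum, plus some headroom, as the next build's prediction.
use std::collections::HashMap;

#[derive(Default)]
struct BuildHistory {
    // unit (e.g. crate) name -> peak resident memory observed, in bytes
    peak_memory: HashMap<String, u64>,
}

impl BuildHistory {
    fn record(&mut self, unit: &str, observed_peak: u64) {
        let entry = self.peak_memory.entry(unit.to_string()).or_insert(0);
        *entry = (*entry).max(observed_peak);
    }

    /// Predicted memory for the next build of `unit`, with 25% headroom and a
    /// fallback guess for units we have never measured.
    fn predict(&self, unit: &str) -> u64 {
        const FALLBACK: u64 = 512 * 1024 * 1024;
        self.peak_memory
            .get(unit)
            .map(|&peak| peak + peak / 4)
            .unwrap_or(FALLBACK)
    }
}
```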

facebook-github-bot pushed a commit to facebook/buck2 that referenced this issue Jun 26, 2024
Summary:
Cargo doesn't have a good mechanism to limit parallelism to avoid OOMs: rust-lang/cargo#12912

Reduce the number of parallel jobs to save some memory. The default is the number of logical processors, so this will run half as many jobs.

If it keeps failing, we may need to reduce it further, as it's still using over 25 GB.

Reviewed By: blackm00n

Differential Revision: D59054227

fbshipit-source-id: 708265891d69c0264392d0b24e29ac1f6b9818dd
@weihanglo weihanglo added S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. and removed S-triage Status: This issue is waiting on initial triage. labels Jul 11, 2024
atodorov added a commit to gluwa/polkadot-sdk that referenced this issue Nov 14, 2024
what I am seeing with the move to Linode VMs is essentially this:
rust-lang/cargo#9157
rust-lang/cargo#12912

Because these new VMs have more CPU cores, 16 (new) vs 4 (old),
compilation is faster; however, this causes cargo to be overzealous and
spawn too many linker processes, which consume all of the available
memory (on a 64 GB VM) and cause an OOM error, forcing the kernel to
kill the linker process and causing cargo to fail!

Another alternative, which works, is using `--jobs 8`; however, that is
less optimal because it leaves unused CPU capacity and also affects the
number of parallel threads when executing the test suite!

WARNING: using `--release` is not an option because it breaks tests. The
polkadot-sdk code uses the macro defensive! which is designed to panic
when running in debug mode and multiple test scenarios rely on this
behavior via #[should_panic]!

WARNING: we still need the 64 GB memory!
atodorov added a commit to gluwa/polkadot-sdk that referenced this issue Nov 15, 2024