Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace jemalloc with mimalloc config.toml option #81782

Closed
wants to merge 6 commits into from

Conversation

jq-rs
Copy link

@jq-rs jq-rs commented Feb 5, 2021

  • This is a test PR for swapping jemalloc to mimalloc.
  • The performance results in my own environment are promising.
  • This is not intended to be merged upstream as such, just to provide initial performance results which can lead to further discuss what to do next.

Edit: After the results, changed mimalloc to be introduced as its own option in config.toml.example and added support via 22f306c. Changed title accordingly and dropped WIP status.

Edit: After review comments, changed mimalloc to replace jemalloc as config.toml option.

* Swap jemalloc to mimalloc
* Fix aligned alloc.
* Bump libmimalloc-sys.
* Include lock-file.
@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Mark-Simulacrum (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 5, 2021
@rust-log-analyzer

This comment has been minimized.

@jq-rs jq-rs marked this pull request as ready for review February 5, 2021 11:54
@jq-rs
Copy link
Author

jq-rs commented Feb 5, 2021

We probably would need bors for benchmarks?

@Aaron1011
Copy link
Member

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 5, 2021
@bors
Copy link
Contributor

bors commented Feb 5, 2021

⌛ Trying commit 35d9352 with merge 86f3a470dd7fa102be01dc6d84ecc6920f8af32a...

@bors
Copy link
Contributor

bors commented Feb 5, 2021

☀️ Try build successful - checks-actions
Build commit: 86f3a470dd7fa102be01dc6d84ecc6920f8af32a (86f3a470dd7fa102be01dc6d84ecc6920f8af32a)

@rust-timer
Copy link
Collaborator

Queued 86f3a470dd7fa102be01dc6d84ecc6920f8af32a with parent f9435f4, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (86f3a470dd7fa102be01dc6d84ecc6920f8af32a): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 5, 2021
@nagisa
Copy link
Member

nagisa commented Feb 5, 2021

While there are major decrease in the instruction counts (yay!), the wall-clock time does not share a similarly stark change (though bootstrap times are all universally better) and there's a large increase in max-rss on some benchmarks. Pretty interesting tradeoff, overall.

@jq-rs
Copy link
Author

jq-rs commented Feb 6, 2021

Pretty good results, I think! How well the benchmarks measure multi-threaded performance? I believe I see even better results in such a setup.

If we want to add mimalloc support, maybe the best option would be to add mimalloc as a new config.toml option? Mimalloc is in active development, so results may improve even more in the future.

@jyn514
Copy link
Member

jyn514 commented Feb 6, 2021

the wall-clock time does not share a similarly stark change

Hmm, why do you say that? I see up to -11.9% decreases, and large decreases across the board.

image

@jyn514
Copy link
Member

jyn514 commented Feb 6, 2021

Pretty good results, I think! How well the benchmarks measure multi-threaded performance? I believe I see even better results in such a setup.

Not at all currently I think, cfg(parallel_compiler) is off by default (and isn't currently seeing active work: https://github.com/rust-lang/compiler-team/blob/master/content/minutes/design-meeting/2020-05-29-Roadmap-2020-2021.md#key-notes-and-points-from-the-meeting-itself).

Remove misleading OSX from config.toml option comments as
it is not really supported.
@jq-rs jq-rs changed the title [WIP] Swap jemalloc to mimalloc Add mimalloc confog.toml option support Feb 9, 2021
@jq-rs jq-rs changed the title Add mimalloc confog.toml option support Add mimalloc config.toml option support Feb 9, 2021
@jq-rs
Copy link
Author

jq-rs commented Feb 9, 2021

Added mimalloc option support for config.toml as a reference via 22f306c. Removed misleading OSX mention from config.toml.example comments of jemalloc.

@Mark-Simulacrum
Copy link
Member

Is there a reason this intends to add mimalloc, instead of replacing jemalloc? I'm not seeing clear reasons we would want to have both as options. I am also not seeing any huge increases in max-rss - 20-25% is pretty typical variability I think.

@Mark-Simulacrum Mark-Simulacrum removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 12, 2021
@jq-rs
Copy link
Author

jq-rs commented Feb 23, 2021

By difficult I mean perhaps infeasible. The compiler allocator system is currently designed for drop-in allocator use, as we need to support system allocators on some architectures. It seems reasonable and elegant design to me where function override can be leveraged. Any drop-in allocator has to support these paths as well as they can to be able to compete with other drop-in allocators. This is a fair game as such.

For ad-hoc testing we can change the free call to non-standard version for rustc, but I don't think it provides a path forward here. AFAIK it does not help with LLVM override either. So, if we do it, we should do it for completeness, and perhaps for curiosity.

Edit: Even though I tried quickly, adding new dependencies to alloc.rs, so that we would be able to call sdallocx, seems tricky:

   Compiling tikv-jemalloc-sys v0.4.1+5.2.1-patched
   Compiling std v0.0.0 (/home/jkettune/rust/library/std)
error[E0463]: can't find crate for `core`

Other crates there have a special rustc-dep-for-std feature, maybe for this reason exactly, which I am not familiar with. Seems that this is a dead end for me, and my best at this point. Referring to the above, it does not matter much as it would not help with LLVM override anyhow. I still think we should run bors try for #82389, so that we see what kind of reference results latest jemalloc stable has as a drop-in allocator.

@pcwalton
Copy link
Contributor

It's disappointing if we've boxed ourselves into a corner where we can't use sized deallocation APIs anymore. Many years ago we actually made a design decision to used sized deallocation wherever possible. I think we should try hard to be able to use sized deallocation again, because I don't think we ever officially went back on that decision.

@pcwalton
Copy link
Contributor

cc @alexcrichton -- did we actually decide to drop sized deallocation at any point? :(

@alexcrichton
Copy link
Member

The problem, IIRC, has nothing to do with Rust-the-language or whether we've boxed ourselves out of this, but rather it has everything to do with how we distribute and build the compiler itself. The compiler is built as rustc which links to librustc_driver.so which links to libstd.so. The standard library must have an allocator picked by the time we link it and we can't really reasonably say that all users of libstd.so must use whatever works for rustc. As a result we pick the system malloc/free symbols for libstd.so, and then overriding the allocator in rustc involves somehow overriding the system allocator (which is pretty platform-specific AFAIK).

This means that rustc, due to how its built, has no way of using #[global_allocator] in Rust which would allow routing all Rust allocation/deallocation to custom APIs (e.g. a sized deallocation API). This has historically been coupled with the fact that there was no way (or no one knew a way) to hook up LLVM into sized deallocation APIs.

To leverage sized deallocation I think what needs to happen is:

  • First probably measure the impact to see whether it's at all worth it.
  • Decide that rustc: Move jemalloc from rustc_driver to rustc #56986 should "be reverted" in that we want to force all tools using librustc_driver.so (e.g. RLS, rustdoc, rustc, etc) to use the allocator we pick. rustc: Move jemalloc from rustc_driver to rustc #56986 originally concluded that jemalloc was not appropriate for the RLS at the time.
  • Decouple librustc_driver.so from libstd.so. I forget if it's just legacy reasons why we still do this or whether there's a technical reason. It's probably more of the former. This is likely a large-ish undertaking.
  • Use #[global_allocator] in librustc_driver.so, hooking into custom allocator API.
  • Either make sure LLVM's allocations are routed to the custom allocator used via some platform-specific means or figure out how to route LLVM's allocations to the same custom allocator API.

AFAIK none of this is a light undertaking. Depending on the allocator is also may only benefit one platform instead of all.

@craterbot
Copy link
Collaborator

🚧 Experiment pr-81782 is now running

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot
Copy link
Collaborator

🎉 Experiment pr-81782 is completed!
📊 13 regressed and 7 fixed (145879 total)
📰 Open the full report.

⚠️ If you notice any spurious failure please add them to the blacklist!
ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot craterbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-crater Status: Waiting on a crater run to be completed. labels Feb 28, 2021
@jq-rs
Copy link
Author

jq-rs commented Feb 28, 2021

@Mark-Simulacrum Regarding your earlier question, mimalloc does not benefit from sized deallocation, free is the only fast path as can be seen here.

@bors
Copy link
Contributor

bors commented Mar 8, 2021

☔ The latest upstream changes (presumably #82896) made this pull request unmergeable. Please resolve the merge conflicts.

@jq-rs
Copy link
Author

jq-rs commented Mar 9, 2021

jemalloc just got macOS fixes in, which is grand. mimalloc would need similar fixes for macOS, so they would need to be included for this PR before merging. But where are we with this one, are these macOS fixes worth working for?

In addition to the above, my view is that the discussion so far has all the input right and we would just need the decision shall we switch allocator at this point. mimalloc seems perhaps faster against the latest jemalloc dev free path and is faster with rustc compared to latest stable 5.2.1 (see #82389 measurements). Sized deallocation seems a bit faster approach on some allocators (especially jemalloc and tcmalloc) but that does not apply for mimalloc at the moment.

How I see this is that if we decide to stay with the drop-in allocator approach for now or for some time, mimalloc is certainly worth considering. If we plan to use sized deallocation in the near future, then changing to mimalloc now would perhaps not be worth it.

@jq-rs
Copy link
Author

jq-rs commented Mar 9, 2021

And for details why I think "mimalloc seems perhaps faster against the latest jemalloc dev free path" is this mimalloc-bench between mi 1.7.0 and je dev (73ca4b8) where in most tests mimalloc is faster. It would be better to test this with the real application, but as it is difficult, this is the best I could do. The larson-sized test is an extension I did for the benchmarks to have at least one measurement data point for sized deallocation (the benchmark can be found here):

# benchmark allocator elapsed rss user sys page-faults page-reclaims

cfrac mi 08.19 3104 8.18 0.00 0 362
cfrac je 08.31 4836 8.29 0.00 0 470

espresso mi 05.97 4096 5.93 0.02 0 678
espresso je 06.17 5212 6.12 0.03 0 648

barnes mi 04.34 58276 4.31 0.02 0 15604
barnes je 04.34 60276 4.30 0.02 0 15732

leanN mi 34.29 502904 121.55 1.65 0 371062
leanN je 34.44 464624 120.82 1.56 0 349706

redis mi 7.467 30064 2.92 0.82 0 9830
redis je 9.077 33672 3.91 0.64 0 8406

alloc-test1 mi 04.08 12744 4.06 0.01 1 7122
alloc-test1 je 04.29 12296 4.27 0.00 0 6802

alloc-testN mi 04.76 14552 70.95 0.02 0 5792
alloc-testN je 05.76 15440 77.03 0.03 0 9390

larsonN mi 2.889 312180 199.19 0.56 0 80847
larsonN je 2.962 256788 197.21 1.17 0 140290

larsonN-sized mi 2.829 318884 199.01 0.60 0 82602
larsonN-sized je 2.758 259904 196.79 1.34 0 148061

sh6benchN mi 00.16 288940 3.27 0.22 0 71998
sh6benchN je 00.20 284776 4.50 0.33 0 73431

sh8benchN mi 00.38 208100 7.72 0.15 0 51784
sh8benchN je 00.65 171168 19.64 0.26 0 51550

xmalloc-testN mi 0.457 134344 190.65 1.69 1 60977
xmalloc-testN je 1.804 110368 165.07 17.96 0 57251

cache-scratch1 mi 02.53 3972 2.52 0.00 0 265
cache-scratch1 je 02.52 4588 2.51 0.00 0 345

cache-scratchN mi 00.17 4608 4.80 0.00 0 467
cache-scratchN je 00.16 6724 5.39 0.00 0 992

mstressN mi 06.67 2583856 64.01 3.57 0 648395
mstressN je 13.00 695004 33.78 39.55 0 8733729

rptestN mi 3.529 296848 27.70 5.77 1 73968
rptestN je 4.801 201188 37.23 5.94 0 105776

@joshtriplett
Copy link
Member

@alexcrichton wrote:

Either make sure LLVM's allocations are routed to the custom allocator used via some platform-specific means or figure out how to route LLVM's allocations to the same custom allocator API.

As long as we never assume we can call our free function on something that LLVM allocated, couldn't we just make LLVM support calling an allocator other than the system allocator, and then assume that if someone uses a shared LLVM, they want to use whatever allocator that LLVM is built to use? That way we don't have to do any kind of interposition on the system allocator at all.

@jq-rs
Copy link
Author

jq-rs commented Mar 30, 2021

As this one needs macOS updates, closing this for now. After #83152 is merged, this can be compared again with macOS support. Thanks all!

@jq-rs jq-rs closed this Mar 30, 2021
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 5, 2021
…mulacrum

Use tikv-jemallocator in rustc/rustdoc in addition to jemalloc-sys when enabled.

In rust-lang#81782 it was mentioned that one reason rustc may benefit from minimalloc is it doesn't use the `sdallocx` api from jemalloc.

Currently, on unix, rust uses jemalloc by importing its symbols to use them with the default, System (libc) global allocator.
This PR switches its global alloc to `tikv-jemallocator`, which correctly uses sized deallocation (https://docs.rs/tikv-jemallocator/0.4.1/src/tikv_jemallocator/lib.rs.html#121-126). `tikv-jemallocator`, as far as I can tell, is a more up-to-date set of bindings to jemalloc than `jemallocator`

The perf results of this pr are in large part due to the version upgrade of jemalloc, but sized deallocation has a non-trivial improvement, particularly to rustdoc.

This pr also includes changes to bootstrap to correctly pass the jemalloc feature through to the rustdoc build
@nothingnesses
Copy link

@jq-rs #83152 has been merged. Any updates on this?

@jq-rs
Copy link
Author

jq-rs commented Sep 25, 2021

@nothingnesses AFAIK macOS is not supported by mimalloc fully, so this is not progressing.

@jq-rs
Copy link
Author

jq-rs commented Nov 10, 2021

And I verified this, the macOS override part is not there yet for mimalloc in v1.7.2 at least. If some macOS specialist wants to take a look, I am happy to give a patch for the other replace parts.

@fee1-dead
Copy link
Member

It looks like they improved macOS overriding support in v1.7.3: microsoft/mimalloc@c47de7e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.