Add missing inline annotations to Cell #48905

Closed · wants to merge 1 commit

Conversation

@Marwes (Contributor) commented Mar 10, 2018

We're seeing some odd performance problems when using incremental compilation where `Rc` pointers were actually slower than `Arc` pointers (the problem goes away when using non-incremental compilation). I haven't been able to build rustc locally to verify that this fixes it, but these missing inline annotations seem to be the only thing that could affect performance (to this extent).

```
test vector_push_back        ... bench:  11,668,015 ns/iter (+/- 772,861)
test vector_push_back_mut    ... bench:   1,423,771 ns/iter (+/- 22,011)
test vector_push_back_mut_rc ... bench:   1,181,765 ns/iter (+/- 123,724)
test vector_push_back_rc     ... bench:  17,141,746 ns/iter (+/- 203,048)
```
(Source and non-incremental benchmarks: orium/rpds#7 (comment))
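
For readers unfamiliar with the change: the PR only adds `#[inline]` attributes to a few `Cell` methods that lacked them. Below is a minimal sketch of the kind of annotation involved; the wrapper type and method bodies are illustrative stand-ins, not the actual diff against libcore.

```rust
use std::cell::UnsafeCell;

// Simplified stand-in for std's Cell<T>, restricted to Copy types.
struct MyCell<T> {
    value: UnsafeCell<T>,
}

impl<T: Copy> MyCell<T> {
    #[inline]
    fn new(value: T) -> Self {
        MyCell { value: UnsafeCell::new(value) }
    }

    // #[inline] makes the body available for inlining into callers in other
    // crates/codegen units, even when it would not otherwise be exported.
    #[inline]
    fn get(&self) -> T {
        unsafe { *self.value.get() }
    }

    #[inline]
    fn set(&self, value: T) {
        unsafe { *self.value.get() = value }
    }
}

fn main() {
    let c = MyCell::new(1u32);
    c.set(c.get() + 1);
    assert_eq!(c.get(), 2);
}
```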
@rust-highfive (Collaborator)

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @kennytm (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Mar 10, 2018
@kennytm (Member) commented Mar 10, 2018

@bors try

Trying to get a compiled libstd, and going to compare the benchmark result with the un-inlined version as wanted in #42716 (comment).

@bors (Contributor) commented Mar 10, 2018

⌛ Trying commit 58fced4 with merge 3a1fd61...

bors added a commit that referenced this pull request Mar 10, 2018
Add missing inline annotations to Cell

@bors (Contributor) commented Mar 10, 2018

☀️ Test successful - status-travis
State: approved= try=True

@kennytm (Member) commented Mar 10, 2018

Benchmark results based on Marwes/rpds@e482d5a; the first run is without `#[inline]`, the second one is with this PR.

```
$ cargo +87344aa59af2ebb868253228e2b558d701573dff bench vector
...
test vector_drop_last        ... bench:  10,546,095 ns/iter (+/- 35,342)
test vector_drop_last_mut    ... bench:     768,934 ns/iter (+/- 2,609)
test vector_get              ... bench:     322,871 ns/iter (+/- 621)
test vector_iterate          ... bench:     144,636 ns/iter (+/- 2,131)
test vector_push_back        ... bench:  11,965,770 ns/iter (+/- 100,901)
test vector_push_back_mut    ... bench:   1,576,195 ns/iter (+/- 41,488)
test vector_push_back_mut_rc ... bench:   1,371,017 ns/iter (+/- 42,805)
test vector_push_back_rc     ... bench:   7,474,717 ns/iter (+/- 120,604)
...

$ cargo +3a1fd611fc2ad3085a9298b46a85cd055724c45e bench vector
...
test vector_drop_last        ... bench:  10,771,358 ns/iter (+/- 19,155)
test vector_drop_last_mut    ... bench:     777,418 ns/iter (+/- 28,679)
test vector_get              ... bench:     322,735 ns/iter (+/- 713)
test vector_iterate          ... bench:     144,504 ns/iter (+/- 568)
test vector_push_back        ... bench:  12,188,718 ns/iter (+/- 51,151)
test vector_push_back_mut    ... bench:   1,567,152 ns/iter (+/- 34,553)
test vector_push_back_mut_rc ... bench:   1,367,240 ns/iter (+/- 39,629)
test vector_push_back_rc     ... bench:   7,471,407 ns/iter (+/- 32,327)
...
```

I cannot see any significant change in timing. Also, contrary to the OP's measurement, the `Rc` version is much faster than the `Arc` version.

You may install the toolchains yourself to check:

```
$ cargo install rustup-toolchain-install-master
$ rustup-toolchain-install-master 3a1fd611fc2ad3085a9298b46a85cd055724c45e 87344aa59af2ebb868253228e2b558d701573dff
$ cargo +87344aa59af2ebb868253228e2b558d701573dff bench vector
$ cargo +3a1fd611fc2ad3085a9298b46a85cd055724c45e bench vector
```

@Marwes (Contributor, Author) commented Mar 10, 2018

I'm getting an error when trying to download the artifact, unfortunately (I guess it does not include Windows binaries; the Linux subsystem does not work either, since I haven't managed to get OpenSSL installed in it...).

```
rustup-toolchain-install-master 3a1fd611fc2ad3085a9298b46a85cd055724c45e
downloading <https://s3-us-west-1.amazonaws.com/rust-lang-ci2/rustc-builds/3a1fd611fc2ad3085a9298b46a85cd055724c45e/rustc-nightly-x86_64-pc-windows-msvc.tar.xz>...
skipping 3a1fd611fc2ad3085a9298b46a85cd055724c45e due to failure:
Error { kind: ClientError(NotFound), url: Some("https://s3-us-west-1.amazonaws.com/rust-lang-ci2/rustc-builds/3a1fd611fc2ad3085a9298b46a85cd055724c45e/rustc-nightly-x86_64-pc-windows-msvc.tar.xz") }
```

@kennytm Did you run with CARGO_INCREMENTAL=1? The problem does not appear in normal release builds, since those do not use incremental compilation. While one wouldn't use incremental compilation in a release build (I had the variable set by default, but I have since removed it), I'd expect a similar slowdown to happen for debug builds.

If this slowdown is due to these missing inline annotations and that is acceptable, then I'd assume that we should remove the annotations from the other functions instead to keep things consistent?

@kennytm (Member) commented Mar 10, 2018

@Marwes Unfortunately, only the 64-bit Linux build is available before merging 😞 You can force a Linux build with:

```
rustup-toolchain-install-master --host x86_64-unknown-linux-gnu 3a1fd611fc2ad3085a9298b46a85cd055724c45e
```

I've retried using CARGO_INCREMENTAL=1; the results are:

```
$ CARGO_INCREMENTAL=1 cargo +87344aa59af2ebb868253228e2b558d701573dff bench vector
...
test vector_drop_last        ... bench:  13,697,880 ns/iter (+/- 136,696)
test vector_drop_last_mut    ... bench:   1,176,880 ns/iter (+/- 2,256)
test vector_get              ... bench:     493,647 ns/iter (+/- 2,101)
test vector_iterate          ... bench:     506,707 ns/iter (+/- 1,191)
test vector_push_back        ... bench:  15,053,624 ns/iter (+/- 35,672)
test vector_push_back_mut    ... bench:   2,035,044 ns/iter (+/- 42,841)
test vector_push_back_mut_rc ... bench:   1,768,721 ns/iter (+/- 43,902)
test vector_push_back_rc     ... bench:  20,100,645 ns/iter (+/- 34,784)
...

$ CARGO_INCREMENTAL=1 cargo +3a1fd611fc2ad3085a9298b46a85cd055724c45e bench vector
...
test vector_drop_last        ... bench:  13,644,513 ns/iter (+/- 30,618)
test vector_drop_last_mut    ... bench:   1,149,987 ns/iter (+/- 5,317)
test vector_get              ... bench:     511,327 ns/iter (+/- 1,883)
test vector_iterate          ... bench:     502,438 ns/iter (+/- 2,337)
test vector_push_back        ... bench:  15,010,549 ns/iter (+/- 51,013)
test vector_push_back_mut    ... bench:   1,995,922 ns/iter (+/- 44,019)
test vector_push_back_mut_rc ... bench:   1,763,790 ns/iter (+/- 44,449)
test vector_push_back_rc     ... bench:  18,610,890 ns/iter (+/- 45,462)
...
```

So there is a 7% improvement in the vector_push_back_rc test case, though it is still significantly slower than the `Arc` version 🤔.

@Marwes (Contributor, Author) commented Mar 10, 2018

:/ The only other real difference is that `Rc` uses `unwrap_or_else`, but I'd expect that to be trivially inlined:

```rust
self.inner().strong.set(self.strong().checked_add(1).unwrap_or_else(|| unsafe { abort() }));
```
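
For context, a self-contained sketch of that pattern (the helper name `increment_strong` is made up for the example): the overflow guard aborts instead of panicking, and the closure passed to `unwrap_or_else` is small enough that one would expect it to be inlined whenever optimizations are enabled.

```rust
use std::cell::Cell;
use std::process::abort;

// Hypothetical helper mirroring Rc's strong-count bump: checked_add guards
// against overflow, and the guard aborts the process rather than panicking.
fn increment_strong(count: &Cell<usize>) {
    count.set(count.get().checked_add(1).unwrap_or_else(|| abort()));
}

fn main() {
    let strong = Cell::new(1usize);
    increment_strong(&strong);
    assert_eq!(strong.get(), 2);
}
```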

@kennytm (Member) commented Mar 10, 2018

I've checked a debug build (opt-level=0) as well; no improvement is seen, because `#[inline]` won't take effect without optimization 🙃

```
$ CARGO_INCREMENTAL=1 cargo +87344aa59af2ebb868253228e2b558d701573dff bench vector
...
test vector_drop_last        ... bench:  77,054,997 ns/iter (+/- 1,630,919)
test vector_drop_last_mut    ... bench:   7,855,141 ns/iter (+/- 68,446)
test vector_get              ... bench:   3,356,728 ns/iter (+/- 22,170)
test vector_iterate          ... bench:   3,758,851 ns/iter (+/- 4,583)
test vector_push_back        ... bench:  82,371,195 ns/iter (+/- 686,300)
test vector_push_back_mut    ... bench:   8,274,384 ns/iter (+/- 50,500)
test vector_push_back_mut_rc ... bench:   6,840,017 ns/iter (+/- 53,296)
test vector_push_back_rc     ... bench: 138,269,887 ns/iter (+/- 150,359)
...

$ CARGO_INCREMENTAL=1 cargo +3a1fd611fc2ad3085a9298b46a85cd055724c45e bench vector
...
test vector_drop_last        ... bench:  77,315,873 ns/iter (+/- 597,236)
test vector_drop_last_mut    ... bench:   7,880,990 ns/iter (+/- 7,060)
test vector_get              ... bench:   3,341,212 ns/iter (+/- 4,199)
test vector_iterate          ... bench:   3,780,185 ns/iter (+/- 3,689)
test vector_push_back        ... bench:  81,823,875 ns/iter (+/- 318,751)
test vector_push_back_mut    ... bench:   8,294,160 ns/iter (+/- 51,599)
test vector_push_back_mut_rc ... bench:   6,813,080 ns/iter (+/- 53,938)
test vector_push_back_rc     ... bench: 137,634,856 ns/iter (+/- 788,074)
...
```

I'm now going to defer this to someone from the libs team to decide whether to merge or not.

r? @BurntSushi

Summary: 7% improvement in release mode with incremental. No improvement otherwise.

@rust-highfive assigned BurntSushi and unassigned kennytm on Mar 10, 2018
@BurntSushi (Member)

@alexcrichton has opinions on inline annotations, so I defer to him. :-)

r? @alexcrichton

@alexcrichton (Member)

Thanks for the PR! I think, though, that this PR may not be necessary, in the sense that incremental compilation in release mode is not at all tuned for performance. These methods are all inlined as necessary (due to them being generic), so the extra `#[inline]` here isn't needed to get LLVM to inline them in release mode.

Due to how codegen units are partitioned and how we do incremental compilation, though, you may not be able to inline these across codegen units in incremental mode. This will eventually be fixed if we implement incremental ThinLTO, but that has yet to be done at this point!
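
To illustrate the distinction being drawn (a rough sketch; the function names are made up for the example): a generic function is monomorphized in the crate that uses it, so LLVM always sees its body, whereas a non-generic function defined in another codegen unit needs `#[inline]` (or LTO) before it can be inlined across that boundary.

```rust
use std::ops::Add;

// Generic: instantiated wherever it is used, so it is inlinable
// without an explicit #[inline] attribute.
pub fn double_generic<T: Add<Output = T> + Copy>(x: T) -> T {
    x + x
}

// Non-generic: without #[inline], callers in other codegen units may
// only see an opaque symbol and cannot inline the body.
#[inline]
pub fn double_u32(x: u32) -> u32 {
    x + x
}

fn main() {
    assert_eq!(double_generic(21u32), 42);
    assert_eq!(double_u32(21), 42);
}
```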

In that sense I would be inclined to not merge this PR as extra #[inline] annotations can be harmful to code size and compile time, and otherwise it looks like this is only related to performance in incremental + release mode, right?

@Marwes (Contributor, Author) commented Mar 11, 2018

> In that sense I would be inclined to not merge this PR as extra #[inline] annotations can be harmful to code size and compile time, and otherwise it looks like this is only related to performance in incremental + release mode, right?

I didn't think to check incremental + debug, but it appears that the problem also occurs there:

```
running 8 tests
test vector_drop_last        ... bench:  72,176,998 ns/iter (+/- 5,394,864)
test vector_drop_last_mut    ... bench:   6,730,589 ns/iter (+/- 3,288,727)
test vector_get              ... bench:   2,451,287 ns/iter (+/- 322,803)
test vector_iterate          ... bench:   3,149,383 ns/iter (+/- 94,201)
test vector_push_back        ... bench:  75,140,253 ns/iter (+/- 5,813,070)
test vector_push_back_mut    ... bench:   7,294,028 ns/iter (+/- 439,593)
test vector_push_back_mut_rc ... bench:   6,153,137 ns/iter (+/- 240,858)
test vector_push_back_rc     ... bench: 115,874,988 ns/iter (+/- 6,698,258)
```

The inline annotations added here do not seem to help much, though, so it seems prudent to just close this (though I'd argue that the existing inline annotations should then be removed, since there is no reason to single out these 4 methods as the only ones without inline annotations).

@alexcrichton (Member)

Er, wait, is that running benchmarks in debug mode? If so, we've never optimized for that, so I wouldn't put much weight in those numbers!

@Marwes (Contributor, Author) commented Mar 11, 2018

The numbers in the first comment were release + incremental; the ones in #48905 (comment) were debug (opt-level = 0) + incremental. While I don't expect debug-mode code to be fast, I wanted to point out that it is rather unexpected that the `Rc` code is so much slower than the `Arc` code, given that the only reason to use `Rc` is that it is supposed to be faster than `Arc`.
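
(As a point of reference, a minimal sketch of that usual expectation, using nothing beyond the standard library: `Rc::clone` bumps a plain `Cell<usize>` counter, while `Arc::clone` performs an atomic increment, which is why `Rc` is normally the faster choice in single-threaded code.)

```rust
use std::rc::Rc;
use std::sync::Arc;

fn main() {
    let rc = Rc::new(vec![1, 2, 3]);
    let arc = Arc::new(vec![1, 2, 3]);

    // Both clones only copy a pointer and bump a reference count, but Rc's
    // count is a plain Cell<usize> while Arc's is an AtomicUsize.
    let rc2 = Rc::clone(&rc);
    let arc2 = Arc::clone(&arc);

    assert_eq!(Rc::strong_count(&rc2), 2);
    assert_eq!(Arc::strong_count(&arc2), 2);
}
```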

Still, that might be fine.

@alexcrichton (Member)

@Marwes would you be ok closing this or would you still prefer to land this?
