
Specialize Iterator for &mut I where I: Sized #82185

Closed
wants to merge 5 commits

Conversation

SkiFire13
Contributor

@SkiFire13 SkiFire13 commented Feb 16, 2021

This PR aims to improve the implementation of Iterator and DoubleEndedIterator for &mut I where I: Iterator. This is accomplished by forwarding try_fold to the implementation for I (since it takes &mut self) and implementing fold in terms of try_fold (since fold takes self).

Since try_fold is not available if I: !Sized, I needed to specialize on the I: Sized case (that's also why I chose to keep the + Sized bound, to be explicit about what I'm specializing). This means this optimization won't apply to &mut dyn Iterator.
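The core idea can be sketched on a standalone wrapper type (ByRef below is purely illustrative, not the PR's actual code, which changes the impl for &mut I inside core): fold consumes self, but since the wrapped value is a &mut I we can hand the whole loop to I::try_fold, which only needs &mut self and is where many adapters keep their optimized loops.

```rust
use std::convert::Infallible;

// Illustrative stand-in for `&mut I` (not the PR's actual code).
struct ByRef<'a, I>(&'a mut I);

impl<'a, I: Iterator> Iterator for ByRef<'a, I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.0.next()
    }

    // `fold` takes `self`, but `self.0` is `&mut I`, so the whole loop can
    // be forwarded to the inner iterator's `try_fold`.
    fn fold<B, F>(self, init: B, mut f: F) -> B
    where
        F: FnMut(B, Self::Item) -> B,
    {
        // `try_fold` with an error type that has no values behaves like `fold`.
        match self.0.try_fold(init, |acc, x| Ok::<_, Infallible>(f(acc, x))) {
            Ok(acc) => acc,
            Err(e) => match e {}, // `Infallible` is uninhabited
        }
    }
}

fn main() {
    let mut inner = (1..=5).map(|x| x * 2);
    let sum = ByRef(&mut inner).fold(0, |acc, x| acc + x);
    assert_eq!(sum, 30); // 2 + 4 + 6 + 8 + 10
    println!("{sum}");
}
```

This mirrors the trick std adapters use internally: any iterator whose try_fold is specialized gets that benefit for free when iterated by mutable reference.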

While writing this PR I realized that it adds a considerable chunk of code for &mut I where I: Iterator, so it may make sense to move it to a separate file, just like the other adapters. Tell me what you think.

Edit: Note that the following benchmark results are relatively old (they were done before the LLVM 12 update for instance) and should probably be updated.

Here's a comparison (run with cargo benchcmp master pr --threshold 2) of the benchmark (run with py x.py bench --stage 0 library/core --test-args iter::):

 name                                         master.txt ns/iter  pr.txt ns/iter  diff ns/iter   diff %  speedup
 iter::bench_cycle_skip_take_ref_sum          2,702,500           1,037,230         -1,665,270  -61.62%   x 2.61
 iter::bench_cycle_take_ref_sum               950,040             738,915             -211,125  -22.22%   x 1.29
 iter::bench_cycle_take_skip_ref_sum          2,117,830           666,940           -1,450,890  -68.51%   x 3.18
 iter::bench_enumerate_chain_ref_sum          1,381,700           1,476,370             94,670    6.85%   x 0.94
 iter::bench_enumerate_ref_sum                459,605             436,760              -22,845   -4.97%   x 1.05
 iter::bench_filter_chain_ref_count           1,389,580           1,095,382           -294,198  -21.17%   x 1.27
 iter::bench_filter_chain_ref_sum             2,007,972           893,280           -1,114,692  -55.51%   x 2.25
 iter::bench_filter_map_chain_ref_sum         3,908,650           1,040,480         -2,868,170  -73.38%   x 3.76
 iter::bench_filter_map_ref_sum               806,280             469,225             -337,055  -41.80%   x 1.72
 iter::bench_filter_ref_sum                   604,827             566,450              -38,377   -6.35%   x 1.07
 iter::bench_flat_map_chain_ref_sum           4,467,065           1,641,630         -2,825,435  -63.25%   x 2.72
 iter::bench_flat_map_ref_sum                 1,849,160           474,605           -1,374,555  -74.33%   x 3.90
 iter::bench_flat_map_sum                     446,975             483,936               36,961    8.27%   x 0.92
 iter::bench_for_each_chain_ref_fold          1,608,805           1,378,055           -230,750  -14.34%   x 1.17
 iter::bench_fuse_chain_ref_sum               1,612,977           942,280             -670,697  -41.58%   x 1.71
 iter::bench_fuse_chain_sum                   461,980             707,125              245,145   53.06%   x 0.65
 iter::bench_fuse_ref_sum                     460,043             352,918             -107,125  -23.29%   x 1.30
 iter::bench_inspect_chain_ref_sum            1,612,792           941,940             -670,852  -41.60%   x 1.71
 iter::bench_inspect_chain_sum                460,000             706,720              246,720   53.63%   x 0.65
 iter::bench_inspect_ref_sum                  459,523             358,241             -101,282  -22.04%   x 1.28
 iter::bench_multiple_take                    63                  58                        -5   -7.94%   x 1.09
 iter::bench_peekable_chain_ref_sum           1,378,620           924,195             -454,425  -32.96%   x 1.49
 iter::bench_peekable_chain_sum               463,480             692,655              229,175   49.45%   x 0.67
 iter::bench_peekable_ref_sum                 459,677             346,225             -113,452  -24.68%   x 1.33
 iter::bench_rposition                        66                  74                         8   12.12%   x 0.89
 iter::bench_skip_chain_ref_sum               3,911,640           921,185           -2,990,455  -76.45%   x 4.25
 iter::bench_skip_cycle_skip_zip_add_ref_sum  4,439,160           3,252,870         -1,186,290  -26.72%   x 1.36
 iter::bench_skip_sum                         344,301             460,145              115,844   33.65%   x 0.75
 iter::bench_skip_while_chain_ref_sum         3,947,070           921,160           -3,025,910  -76.66%   x 4.28
 iter::bench_skip_while_ref_sum               344,387             460,540              116,153   33.73%   x 0.75
 iter::bench_take_while_chain_ref_sum         760,888             511,980             -248,908  -32.71%   x 1.49

There are a lot of promising improvements, but surprisingly there are also a couple of strange regressions:

 name                                         master.txt ns/iter  pr.txt ns/iter  diff ns/iter   diff %  speedup
 iter::bench_flat_map_sum                     446,975             483,936               36,961    8.27%   x 0.92
 iter::bench_fuse_chain_sum                   461,980             707,125              245,145   53.06%   x 0.65
 iter::bench_inspect_chain_sum                460,000             706,720              246,720   53.63%   x 0.65
 iter::bench_peekable_chain_sum               463,480             692,655              229,175   49.45%   x 0.67
 iter::bench_rposition                        66                  74                         8   12.12%   x 0.89
 iter::bench_skip_sum                         344,301             460,145              115,844   33.65%   x 0.75

These benchmarks shouldn't have been impacted by my changes, because they never use &mut impl Iterator, but somehow they were. My guess is that this is the result of some LLVM shenanigans, because if I remove any benchmark containing .by_ref() then most of them no longer regress. Since these look like codegen problems specific to the current benchmarks, I'm not too worried about them.

 name                                         master.txt ns/iter  pr.txt ns/iter  diff ns/iter   diff %  speedup
 iter::bench_enumerate_chain_ref_sum          1,381,700           1,476,370             94,670    6.85%   x 0.94
 iter::bench_skip_while_ref_sum               344,387             460,540              116,153   33.73%   x 0.75

These benchmarks, however, were directly impacted by my changes, and they give the same results even if I remove all the other benchmarks, so I guess they're genuine regressions.

All these regressions are present even if I copy the default fold implementation but directly call I::next instead of passing through <&mut I>::next (which should just call I::next anyway). This makes me think these are also codegen problems.

I would like not to have these regressions, but the other improvements probably outweigh them.
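For context, every benchmark with _ref_ in its name drives the iterator through by_ref(), which returns &mut I, so those benchmarks exercise exactly the impl this PR changes. A minimal stable example of that pattern (the values here are just for illustration):

```rust
fn main() {
    let mut evens = (1..=10).filter(|x| x % 2 == 0);

    // `by_ref()` returns `&mut` to the iterator: `take(2)` and `collect`
    // then drive it through the `Iterator for &mut I` impl.
    let first_two: Vec<i32> = evens.by_ref().take(2).collect();
    assert_eq!(first_two, [2, 4]);

    // The original iterator keeps its position afterwards.
    assert_eq!(evens.sum::<i32>(), 6 + 8 + 10);
}
```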

@rust-highfive
Collaborator

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 16, 2021

@bluss
Member

bluss commented Feb 16, 2021

Is the intermediate trait SpecSizedIterator a workaround? Couldn't the plain Iterator impl for &mut I be specialized on the Sized bound directly? I know I tried that years ago (and at that point it ICE'd, so it couldn't be merged). Just specializing the Iterator impl would avoid duplicating the default method implementations, but I admit I don't know whether the min-specialization rules allow specializing on I: Sized.

@the8472
Member

the8472 commented Feb 16, 2021

Current std policy says to use private helper traits. https://std-dev-guide.rust-lang.org/code-considerations/using-unstable-lang/specialization.html
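The shape that policy leads to is roughly the following. This is an illustrative nightly-only sketch with made-up names (SpecFold is hypothetical, not the PR's actual code), and whether the min-specialization rules accept specializing on Sized like this is exactly the open question raised above:

```rust
#![feature(min_specialization)] // nightly-only; std uses this internally

use std::convert::Infallible;

// Private helper trait: the specialization machinery lives here,
// not on the public `Iterator for &mut I` impl.
trait SpecFold: Iterator {
    fn spec_fold<B, F: FnMut(B, Self::Item) -> B>(&mut self, init: B, f: F) -> B;
}

impl<I: Iterator + ?Sized> SpecFold for I {
    // Fallback: a plain `next()` loop, which also works for `&mut dyn Iterator`.
    default fn spec_fold<B, F: FnMut(B, Self::Item) -> B>(&mut self, init: B, mut f: F) -> B {
        let mut acc = init;
        while let Some(x) = self.next() {
            acc = f(acc, x);
        }
        acc
    }
}

impl<I: Iterator> SpecFold for I {
    // Specialized for `I: Sized`: forward to `try_fold`, which many
    // adapters override with a faster internal loop.
    fn spec_fold<B, F: FnMut(B, Self::Item) -> B>(&mut self, init: B, mut f: F) -> B {
        match self.try_fold(init, |acc, x| Ok::<_, Infallible>(f(acc, x))) {
            Ok(acc) => acc,
            Err(e) => match e {},
        }
    }
}
```

The public impl for &mut I would then just delegate its fold to spec_fold, keeping the unstable specialization details private to core.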

@Mark-Simulacrum
Member

r? @cuviper (or someone else more familiar with iterator code)

@cuviper
Member

cuviper commented Feb 21, 2021

Let's check the general perf effect:

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2021
@bors
Contributor

bors commented Feb 21, 2021

⌛ Trying commit a09399b1f1ba90d72fdc4eadf65d0ab459a96287 with merge 4bf6a12cc7384e44f61e55b762c4efb6612c5aca...

@bors
Contributor

bors commented Feb 21, 2021

☀️ Try build successful - checks-actions
Build commit: 4bf6a12cc7384e44f61e55b762c4efb6612c5aca (4bf6a12cc7384e44f61e55b762c4efb6612c5aca)

@rust-timer
Collaborator

Queued 4bf6a12cc7384e44f61e55b762c4efb6612c5aca with parent d2b38d6, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit (4bf6a12cc7384e44f61e55b762c4efb6612c5aca): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2021
@cuviper
Member

cuviper commented Mar 3, 2021

@SkiFire13 those perf results don't look promising -- what's your take on it?

@SkiFire13
Contributor Author

I'm not quite sure but my guesses are:

  • Calling I::try_fold instead of just calling next in a loop may result in a bigger function, which probably translates into longer compile times. Since println somehow ends up calling for_each on a &mut vec::Drain (through BufWriter::flush_buf and then vec::Drain::drop), this might explain why the incr-patched: println builds were the most affected. Maybe we could change vec::Drain::drop to not call that for_each, or maybe discard the idea of specializing &mut Iterator and instead add another iterator adapter that does the same job but is opt-in (this feels like duplicating by_ref, but it's not like we can change its return type).

  • default_fold and friends being separate functions may result in more time spent deciding whether to inline them. I don't know exactly how this affects compile times, but we could try manually inlining them (e.g. reverting the commit that introduced default_fold) or using #[inline(always)]. As a side note, a quick benchmark shows that after adding default_fold and friends some benchmarks stopped regressing, but others started, which is pretty weird to me.

@crlf0710 crlf0710 added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 20, 2021
@crlf0710
Member

@SkiFire13 Ping from triage, what's next steps here?

@SkiFire13
Contributor Author

@SkiFire13 Ping from triage, what's next steps here?

After the perf run I decided to focus on solving the regressing benchmarks, hoping that their cause is the same as the regressions in the perf run. I tried inlining more methods and changing vec::Drain::drop to not use .by_ref(), but unfortunately in both cases some benches regress and others improve. Now I'm a bit lost. I'm not too familiar with how codegen works or how to debug it, and the fact that the regressions behave pretty weirdly doesn't help at all.

@the8472
Member

the8472 commented Mar 21, 2021

If you click on the percentages you'll see a breakdown of what's consuming more time. It's mostly spending more time in LLVM, so one thing you could investigate is whether these changes cause more LLVM IR to be emitted for those benchmarked crates. cargo llvm-lines can be useful there.

In the past, optimization attempts in Vec have shown that while internal-iteration methods such as fold() can generate better code, they involve a lot more duplicate IR due to all the helper functions in the iterator adapters, which makes compilation more expensive than what the speedups save inside the compiler itself. E.g. a 3rd-party crate might become faster when executed (-0.2%), but the compiler spends more time in LLVM (+0.4%) while its own Rust code only gets a little more efficient (-0.1%) since it doesn't use that particular code path often, so it shows up as a net +0.3% regression. Perf runs only capture the +0.3%, not the -0.2%.

Since println somehow ends up calling for_each on a &mut vec::Drain (through BufWriter::flush_buf and then vec::Drain::drop) this might explain why the incr-patched: println builds were the most affected.

Just run the experiment?

Another idea would be measuring the shuffling of the default implementations into separate traits (Spec...) and the additional Iterator + Sized specialization separately. Then we could see whether it's the outlining or the specialization of interest that causes the regression.

@crlf0710
Member

crlf0710 commented Apr 9, 2021

@SkiFire13 Ping from triage, any updates on this?

@crlf0710 crlf0710 removed the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Apr 9, 2021
@SkiFire13 SkiFire13 changed the title Improve implementation of Iterator for &mut I Specialize Iterator for &mut I where I: Sized May 11, 2021
@SkiFire13 SkiFire13 marked this pull request as draft May 11, 2021 08:36
@SkiFire13
Contributor Author

SkiFire13 commented May 11, 2021

I've rebased onto #85157; let's see how it compares. It might be nice to have a perf-run comparison against both master and #85157. I've also converted this PR into a draft since it's clearly not ready for merging.

@Mark-Simulacrum
Member

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 11, 2021
@bors
Contributor

bors commented May 11, 2021

⌛ Trying commit 17350de with merge 5d8726c5e6e2c8a635a0178c21932f6634271e89...

@bors
Contributor

bors commented May 11, 2021

☀️ Try build successful - checks-actions
Build commit: 5d8726c5e6e2c8a635a0178c21932f6634271e89 (5d8726c5e6e2c8a635a0178c21932f6634271e89)

@rust-timer
Collaborator

Queued 5d8726c5e6e2c8a635a0178c21932f6634271e89 with parent 2bafe96, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit (5d8726c5e6e2c8a635a0178c21932f6634271e89): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 11, 2021
@SkiFire13
Contributor Author

Still pretty bad results

@the8472
Member

the8472 commented May 11, 2021

The clap-rs-opt full and clap-rs-opt incr-patched: println benchmarks are still spending additional time in LLVM, as they did in the previous perf run, and they're way outside the noise level in the perf charts. So that's probably unrelated to Drain::drop.

Running that benchmark locally for master and for the branch and comparing the llvm-lines output might help. Maybe there are lots of generic instantiations of iterator code somewhere else.

deeply-nested-async-opt incr-unchanged is interesting in a different way: it spends more time in Rust passes, not LLVM. That might be a genuine regression in iterator performance. Running that benchmark with perf output and then doing a perf diff might shed some light on it. Currently this PR optimizes 4 different iterator methods; maybe we can trim the set down to the more beneficial ones by rejecting those that cause regressions.

@inquisitivecrystal inquisitivecrystal added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Aug 24, 2021
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 9, 2021
…imulacrum

replace vec::Drain drop loops with drop_in_place

The `Drain::drop` implementation came up in rust-lang#82185 (comment) as potentially interfering with other optimization work due to its widespread use somewhere in `println!`

`@rustbot` label T-libs-impl
@Dylan-DPC Dylan-DPC removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 19, 2022
@Dylan-DPC
Member

@SkiFire13 any updates on this?

@SkiFire13
Contributor Author

I left this PR in draft mode in case I got new ideas, but nothing came up and eventually I forgot about it.

Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.