Specialize Iterator for &mut I where I: Sized #82185

SkiFire13 · 2021-02-16T14:16:20Z

This PR aims to improve the implementation of Iterator and DoubleEndedIterator for &mut I where I: Iterator. This is accomplished by forwarding try_fold to the implementation for I (since it takes &mut self) and implementing fold in terms of try_fold (since fold takes self).

Since try_fold is not available if I: !Sized I needed to specialize the case I: Sized (that's also why I choose to keep the + Sized bound, to be explicit about what I'm specializing). This means this optimization won't apply to &mut dyn Iterator.

While coding this pr I realized that it adds a considerable chunk of code for &mut I where I: Iterator and it may make sense to move it to a separate file just like other adapters. Tell me what you think about it.

Edit: Note that the following benchmark results are relatively old (they were done before the LLVM 12 update for instance) and should probably be updated.

Here's a comparison (run with cargo benchcmp master pr --threshold 2) of the benchmark (run with py x.py bench --stage 0 library/core --test-args iter::):

 name                                         master.txt ns/iter  pr.txt ns/iter  diff ns/iter   diff %  speedup
 iter::bench_cycle_skip_take_ref_sum          2,702,500           1,037,230         -1,665,270  -61.62%   x 2.61
 iter::bench_cycle_take_ref_sum               950,040             738,915             -211,125  -22.22%   x 1.29
 iter::bench_cycle_take_skip_ref_sum          2,117,830           666,940           -1,450,890  -68.51%   x 3.18
 iter::bench_enumerate_chain_ref_sum          1,381,700           1,476,370             94,670    6.85%   x 0.94
 iter::bench_enumerate_ref_sum                459,605             436,760              -22,845   -4.97%   x 1.05
 iter::bench_filter_chain_ref_count           1,389,580           1,095,382           -294,198  -21.17%   x 1.27
 iter::bench_filter_chain_ref_sum             2,007,972           893,280           -1,114,692  -55.51%   x 2.25
 iter::bench_filter_map_chain_ref_sum         3,908,650           1,040,480         -2,868,170  -73.38%   x 3.76
 iter::bench_filter_map_ref_sum               806,280             469,225             -337,055  -41.80%   x 1.72
 iter::bench_filter_ref_sum                   604,827             566,450              -38,377   -6.35%   x 1.07
 iter::bench_flat_map_chain_ref_sum           4,467,065           1,641,630         -2,825,435  -63.25%   x 2.72
 iter::bench_flat_map_ref_sum                 1,849,160           474,605           -1,374,555  -74.33%   x 3.90
 iter::bench_flat_map_sum                     446,975             483,936               36,961    8.27%   x 0.92
 iter::bench_for_each_chain_ref_fold          1,608,805           1,378,055           -230,750  -14.34%   x 1.17
 iter::bench_fuse_chain_ref_sum               1,612,977           942,280             -670,697  -41.58%   x 1.71
 iter::bench_fuse_chain_sum                   461,980             707,125              245,145   53.06%   x 0.65
 iter::bench_fuse_ref_sum                     460,043             352,918             -107,125  -23.29%   x 1.30
 iter::bench_inspect_chain_ref_sum            1,612,792           941,940             -670,852  -41.60%   x 1.71
 iter::bench_inspect_chain_sum                460,000             706,720              246,720   53.63%   x 0.65
 iter::bench_inspect_ref_sum                  459,523             358,241             -101,282  -22.04%   x 1.28
 iter::bench_multiple_take                    63                  58                        -5   -7.94%   x 1.09
 iter::bench_peekable_chain_ref_sum           1,378,620           924,195             -454,425  -32.96%   x 1.49
 iter::bench_peekable_chain_sum               463,480             692,655              229,175   49.45%   x 0.67
 iter::bench_peekable_ref_sum                 459,677             346,225             -113,452  -24.68%   x 1.33
 iter::bench_rposition                        66                  74                         8   12.12%   x 0.89
 iter::bench_skip_chain_ref_sum               3,911,640           921,185           -2,990,455  -76.45%   x 4.25
 iter::bench_skip_cycle_skip_zip_add_ref_sum  4,439,160           3,252,870         -1,186,290  -26.72%   x 1.36
 iter::bench_skip_sum                         344,301             460,145              115,844   33.65%   x 0.75
 iter::bench_skip_while_chain_ref_sum         3,947,070           921,160           -3,025,910  -76.66%   x 4.28
 iter::bench_skip_while_ref_sum               344,387             460,540              116,153   33.73%   x 0.75
 iter::bench_take_while_chain_ref_sum         760,888             511,980             -248,908  -32.71%   x 1.49

There are a lot of promising improvements, but surprisingly there are also a couple of strange regressions:

 name                                         master.txt ns/iter  pr.txt ns/iter  diff ns/iter   diff %  speedup
 iter::bench_flat_map_sum                     446,975             483,936               36,961    8.27%   x 0.92
 iter::bench_fuse_chain_sum                   461,980             707,125              245,145   53.06%   x 0.65
 iter::bench_inspect_chain_sum                460,000             706,720              246,720   53.63%   x 0.65
 iter::bench_peekable_chain_sum               463,480             692,655              229,175   49.45%   x 0.67
 iter::bench_rposition                        66                  74                         8   12.12%   x 0.89
 iter::bench_skip_sum                         344,301             460,145              115,844   33.65%   x 0.75

These benchmarks shouldn't have be impacted by my changes because they never use &mut impl Iterator but somehow they were. My guess is that this is the result of some LLVM shenanigans because if I remove any benchmark containing .by_ref() then most of them don't regress anymore. Since these look like codegen problems and specific to the current benchmark I'm not too worried about them.

 name                                         master.txt ns/iter  pr.txt ns/iter  diff ns/iter   diff %  speedup
 iter::bench_enumerate_chain_ref_sum          1,381,700           1,476,370             94,670    6.85%   x 0.94
 iter::bench_skip_while_ref_sum               344,387             460,540              116,153   33.73%   x 0.75

These benchmarks however were directly impacted by my changes and the give the same results even if I remove all the other benchmarks, so I guess they're genuine regressions.

All these regressions are present even if I copy the default fold implementation but directly call I::next instead of passing through <&mut I>::next (which should just call I::next anyway). This makes me think these are also codegen problem.

I would like to not have these regressions but the other improvements probably outweights them.

rust-highfive · 2021-02-16T14:16:23Z

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

library/core/src/iter/traits/double_ended.rs

library/core/src/iter/traits/iterator.rs

bluss · 2021-02-16T21:01:44Z

Is the intermediate trait SpecSizedIterator a workaround? Couldn't it be specialized on the Sized bound for the plain Iterator trait impl for &mut I? I know I tried that years ago (and at that point it ICE'd, so it was not possible to merge it). Just specializing the Iterator impl would avoid having to duplicate the default method implementations, but I admit I don't know the min specialization rules, if they allow specializing on I: Sized.

the8472 · 2021-02-16T21:07:36Z

Current std policy says to use private helper traits. https://std-dev-guide.rust-lang.org/code-considerations/using-unstable-lang/specialization.html

Mark-Simulacrum · 2021-02-20T22:54:49Z

r? @cuviper (or someone else more familiar with iterator code)

cuviper · 2021-02-21T01:36:39Z

Let's check the general perf effect:

@bors try @rust-timer queue

rust-timer · 2021-02-21T01:36:40Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2021-02-21T01:36:51Z

⌛ Trying commit a09399b1f1ba90d72fdc4eadf65d0ab459a96287 with merge 4bf6a12cc7384e44f61e55b762c4efb6612c5aca...

bors · 2021-02-21T02:27:06Z

☀️ Try build successful - checks-actions
Build commit: 4bf6a12cc7384e44f61e55b762c4efb6612c5aca (4bf6a12cc7384e44f61e55b762c4efb6612c5aca)

rust-timer · 2021-02-21T02:27:07Z

Queued 4bf6a12cc7384e44f61e55b762c4efb6612c5aca with parent d2b38d6, future comparison URL.

rust-timer · 2021-02-21T05:40:51Z

Finished benchmarking try commit (4bf6a12cc7384e44f61e55b762c4efb6612c5aca): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

cuviper · 2021-03-03T01:03:13Z

@SkiFire13 those perf results don't look promising -- what's your take on it?

SkiFire13 · 2021-03-03T09:49:09Z

I'm not quite sure but my guesses are:

Calling I::try_fold instead of just calling next in a loop may result in a bigger function, which probably translates in bigger compile times. Since println somehow ends up calling for_each on a &mut vec::Drain (through BufWriter::flush_buf and then vec::Drain::drop) this might explain why the incr-patched: println builds were the most affected. Maybe we could change vec::Drain::drop to not call that for_each, or maybe discard the idea of specializing &mut Iterator and instead add another iterator adaptor that does the same job but is opt-in (this feels like duplicating by_ref but it's not like we can change its return type).
default_fold and friends being separate functions may result in more time spent deciding whether to inline them or not. I don't know exactly how this may affect compile times, but we may try manually inling (e.g. reverting the commit that introduced default_fold) or using #[inline(always)]. As a side note, a quick benchmark shows that after adding default_fold and friends some benchmarks didn't regress anymore, but others started regressing, which is pretty weird to me.

crlf0710 · 2021-03-20T05:50:47Z

@SkiFire13 Ping from triage, what's next steps here?

SkiFire13 · 2021-03-21T11:02:14Z

@SkiFire13 Ping from triage, what's next steps here?

After the perf run I decided to focus on solving the regressing benchmarks, hoping that their cause is the same as the regressions in the perf run. I tried inling more methods and changing vec::Drain::drop to not use .by_ref() but unfortunately the result in both cases was that some benches regress and other improve. Now I'm a bit lost. I'm not too familiar with how codegen works and how to debug it and the fact that the regressions behaves pretty weirdly doesn't help at all.

the8472 · 2021-03-21T12:33:46Z

If you click on the percentages you'll see a breakdown by what's consuming more time. It's mostly spending more time in LLVM, so one thing you could investigate is whether these changes caused more llvm IR to be emitted for those benchmarked crates. cargo llvm-lines can be useful there.

In the past optimization attempts in Vec have shown that while internal iteration methods such as fold() can generate better code they involve a lot more duplicate IR due to all the helper functions in the iterator adapters which makes compilation more expensive than the speed gains in the compiler itself. E.g. a 3rd party crate might become faster when executed (-0.2%) but the compiler spends more time in LLVM (+0.4%) and its own rust code only gets a little more efficient ( -0.1%) since it doesn't use that particular code path often which means it shows a +0.3% regression. Perf runs only capture the +0.3%, not the -0.2%.

Since println somehow ends up calling for_each on a &mut vec::Drain (through BufWriter::flush_buf and then vec::Drain::drop) this might explain why the incr-patched: println builds were the most affected.

Just run the experiment?

Another idea would be measuring the shuffling of the default implementations into separate traits (Spec...) and the additional Iterator + Sized specialization separately. Then we could see whether it's the outlining or the specialization of interest that causes the regression.

crlf0710 · 2021-04-09T13:20:12Z

@SkiFire13 Ping from triage, any updates on this?

…&mut impl Iterator

…old implementations for &mut impl DoubleEndedIterator

…o common functions

SkiFire13 · 2021-05-11T08:38:13Z

I've rebased onto #85157, let's see how it compares. It might be nice to have a perf-run comparison with both master and #85157. I've also converted this PR into a draft since it's clearly not ready for merging.

Mark-Simulacrum · 2021-05-11T17:23:53Z

@bors try @rust-timer queue

rust-timer · 2021-05-11T17:23:54Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2021-05-11T17:24:02Z

⌛ Trying commit 17350de with merge 5d8726c5e6e2c8a635a0178c21932f6634271e89...

bors · 2021-05-11T18:13:54Z

☀️ Try build successful - checks-actions
Build commit: 5d8726c5e6e2c8a635a0178c21932f6634271e89 (5d8726c5e6e2c8a635a0178c21932f6634271e89)

rust-timer · 2021-05-11T18:13:57Z

Queued 5d8726c5e6e2c8a635a0178c21932f6634271e89 with parent 2bafe96, future comparison URL.

rust-timer · 2021-05-11T21:41:11Z

Finished benchmarking try commit (5d8726c5e6e2c8a635a0178c21932f6634271e89): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

SkiFire13 · 2021-05-11T21:45:02Z

Still pretty bad results

the8472 · 2021-05-11T22:20:51Z

The clap-rs-opt full and clap-rs-opt incr-patched: println benchmarks still are spending additional time in llvm, as they did in the previous perf run. And they're way outside the noise level in the perf charts. So that's probably unrelated to Drain::drop.

Running that benchmark locally for master and for the branch and comparing the llvm-lines output might help. Maybe there are lots of generic instantiations of iterator code somewhere else.

deeply-nested-async-opt incr-unchanged is interesting in a different way. It spends more time in rust passes, not llvm. That might be a genuine regression in iterator performance. Running that benchmark with perf output and then doing a perf diff might shed some light on that. Currently this PR optimizes 4 different iterator methods. Maybe we can just trim down the set to the more beneficial ones by rejecting the ones which cause regressions.

…imulacrum replace vec::Drain drop loops with drop_in_place The `Drain::drop` implementation came up in rust-lang#82185 (comment) as potentially interfering with other optimization work due its widespread use somewhere in `println!` `@rustbot` label T-libs-impl

Dylan-DPC · 2022-03-30T14:04:31Z

@SkiFire13 any updates on this?

SkiFire13 · 2022-03-30T20:37:20Z

I left this PR in draft mode in case I got new ideas but nothing came up and eventually I forgot about it.

rust-highfive assigned Mark-Simulacrum Feb 16, 2021

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 16, 2021

pickfire reviewed Feb 16, 2021

View reviewed changes

library/core/src/iter/traits/double_ended.rs Show resolved Hide resolved

pickfire reviewed Feb 16, 2021

View reviewed changes

library/core/src/iter/traits/iterator.rs Show resolved Hide resolved

the8472 reviewed Feb 16, 2021

View reviewed changes

library/core/src/iter/traits/iterator.rs Outdated Show resolved Hide resolved

This comment has been minimized.

Sign in to view

rust-highfive assigned cuviper and unassigned Mark-Simulacrum Feb 20, 2021

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2021

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2021

crlf0710 added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 20, 2021

crlf0710 removed the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Apr 9, 2021

replace vec::Drain drop loops with drop_in_place

032fa35

SkiFire13 added 4 commits May 11, 2021 10:18

Add better Iterator::try_fold and Iterator::fold implementations for …

16da81c

…&mut impl Iterator

Add better DoubleEndedIterator::try_rfold and DoubleEndedIterator::rf…

f78d59a

…old implementations for &mut impl DoubleEndedIterator

refactor default fold, try_fold, rfold and try_rfold implementation t…

ade2cf8

…o common functions

Update ui test stderr

17350de

SkiFire13 force-pushed the improve-mut-ref-iter branch from 03cf754 to 17350de Compare May 11, 2021 08:35

SkiFire13 changed the title ~~Improve implementation of Iterator for &mut I~~ Specialize Iterator for &mut I where I: Sized May 11, 2021

SkiFire13 marked this pull request as draft May 11, 2021 08:36

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 11, 2021

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 11, 2021

inquisitivecrystal added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Aug 24, 2021

the8472 mentioned this pull request Feb 18, 2022

Let try_collect take advantage of try_fold overrides #94115

Merged

Dylan-DPC removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 19, 2022

SkiFire13 closed this Mar 30, 2022

the8472 mentioned this pull request Dec 30, 2022

internal iteration for &mut I #100173

Closed

the8472 mentioned this pull request May 4, 2023

Optimize Iterator implementation for &mut impl Iterator + Sized #111200

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specialize Iterator for &mut I where I: Sized #82185

Specialize Iterator for &mut I where I: Sized #82185

SkiFire13 commented Feb 16, 2021 •

edited

Loading

rust-highfive commented Feb 16, 2021

This comment has been minimized.

bluss commented Feb 16, 2021 •

edited

Loading

the8472 commented Feb 16, 2021

Mark-Simulacrum commented Feb 20, 2021

cuviper commented Feb 21, 2021

rust-timer commented Feb 21, 2021

bors commented Feb 21, 2021

bors commented Feb 21, 2021

rust-timer commented Feb 21, 2021

rust-timer commented Feb 21, 2021

cuviper commented Mar 3, 2021

SkiFire13 commented Mar 3, 2021

crlf0710 commented Mar 20, 2021

SkiFire13 commented Mar 21, 2021

the8472 commented Mar 21, 2021

crlf0710 commented Apr 9, 2021

SkiFire13 commented May 11, 2021 •

edited

Loading

Mark-Simulacrum commented May 11, 2021

rust-timer commented May 11, 2021

bors commented May 11, 2021

bors commented May 11, 2021

rust-timer commented May 11, 2021

rust-timer commented May 11, 2021

SkiFire13 commented May 11, 2021

the8472 commented May 11, 2021

Dylan-DPC commented Mar 30, 2022

SkiFire13 commented Mar 30, 2022

Specialize Iterator for &mut I where I: Sized #82185

Specialize Iterator for &mut I where I: Sized #82185

Conversation

SkiFire13 commented Feb 16, 2021 • edited Loading

rust-highfive commented Feb 16, 2021

This comment has been minimized.

bluss commented Feb 16, 2021 • edited Loading

the8472 commented Feb 16, 2021

Mark-Simulacrum commented Feb 20, 2021

cuviper commented Feb 21, 2021

rust-timer commented Feb 21, 2021

bors commented Feb 21, 2021

bors commented Feb 21, 2021

rust-timer commented Feb 21, 2021

rust-timer commented Feb 21, 2021

cuviper commented Mar 3, 2021

SkiFire13 commented Mar 3, 2021

crlf0710 commented Mar 20, 2021

SkiFire13 commented Mar 21, 2021

the8472 commented Mar 21, 2021

crlf0710 commented Apr 9, 2021

SkiFire13 commented May 11, 2021 • edited Loading

Mark-Simulacrum commented May 11, 2021

rust-timer commented May 11, 2021

bors commented May 11, 2021

bors commented May 11, 2021

rust-timer commented May 11, 2021

rust-timer commented May 11, 2021

SkiFire13 commented May 11, 2021

the8472 commented May 11, 2021

Dylan-DPC commented Mar 30, 2022

SkiFire13 commented Mar 30, 2022

SkiFire13 commented Feb 16, 2021 •

edited

Loading

bluss commented Feb 16, 2021 •

edited

Loading

SkiFire13 commented May 11, 2021 •

edited

Loading