Iterator Default Implementation for position() is slow #119551

Closed
marthadev opened this issue Jan 3, 2024 · 5 comments · Fixed by #119599
Labels
A-iterators: Area: Iterators
I-slow: Issue: Problems and improvements with respect to performance of generated code.
T-libs: Relevant to the library team, which will review and decide on the PR/issue.

Comments

@marthadev
Contributor

The compiler struggles to optimize the current default implementation of position() on Iterator.
Simplifying the implementation increased efficiency in various scenarios on my machine:

Old position():

    #[inline]
    fn check<T>(
        mut predicate: impl FnMut(T) -> bool,
    ) -> impl FnMut(usize, T) -> ControlFlow<usize, usize> {
        #[rustc_inherit_overflow_checks]
        move |i, x| {
            if predicate(x) { ControlFlow::Break(i) } else { ControlFlow::Continue(i + 1) }
        }
    }

    self.try_fold(0, check(predicate)).break_value()

Simplified version (several alternative implementations produce similar results as well):

    let mut position = 0;
    while let Some(x) = self.next() {
        if predicate(x) {
            return Some(position);
        }
        position += 1;
    }
    None

Bench results:

position            fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ arcstr_new       103.4 µs      │ 134.8 µs      │ 105.4 µs      │ 106.4 µs      │ 118     │ 944
├─ arcstr_old       103.8 µs      │ 143.7 µs      │ 105.6 µs      │ 107.1 µs      │ 117     │ 936
├─ big_range_new    33.33 ns      │ 25.79 µs      │ 34.95 ns      │ 36.06 ns      │ 642049  │ 642049
├─ big_range_old    805 µs        │ 893.4 µs      │ 828.3 µs      │ 830 µs        │ 121     │ 121
├─ small_range_new  547.3 ns      │ 60.17 µs      │ 622 ns        │ 659 ns        │ 128056  │ 128056
├─ small_range_old  3.174 µs      │ 54.47 µs      │ 3.195 µs      │ 3.298 µs      │ 28676   │ 28676
├─ u32_new          109.7 µs      │ 125.7 µs      │ 111.6 µs      │ 112.2 µs      │ 112     │ 896
├─ u32_old          245 µs        │ 261.4 µs      │ 247.6 µs      │ 248.5 µs      │ 100     │ 800
├─ u64_new          110.5 µs      │ 147.9 µs      │ 112.8 µs      │ 114.6 µs      │ 109     │ 872
├─ u64_old          255 µs        │ 378.5 µs      │ 258.4 µs      │ 264.1 µs      │ 100     │ 800
├─ u8_new           10.86 µs      │ 34.03 µs      │ 11.49 µs      │ 11.71 µs      │ 1064    │ 8512
╰─ u8_old           17.71 µs      │ 28.99 µs      │ 18.13 µs      │ 18.18 µs      │ 686     │ 5488
bench.rs
use std::sync::Arc;
use once_cell::sync::Lazy;

fn main() {
    divan::main();
}

static VECSTRING: Lazy<Vec<Arc<str>>> =
    Lazy::new(|| Vec::from_iter((0..1_000).map(|x| x.to_string().into())));
static VEC64: Lazy<Vec<u64>> = Lazy::new(|| Vec::from_iter(0..10_000));
static VEC32: Lazy<Vec<u32>> = Lazy::new(|| Vec::from_iter(0..10_000));
static VEC8: Lazy<Vec<u8>> = Lazy::new(|| Vec::from_iter(0..=255));

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u32_old() {
    for x in (0..10_000).step_by(100) {
        assert_eq!(
            divan::black_box(&VEC32)
                .iter()
                .copied()
                .position(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u32_new() {
    for x in (0..10_000).step_by(100) {
        assert_eq!(
            divan::black_box(&VEC32).iter().copied().locate(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u64_old() {
    for x in (0..10_000).step_by(100) {
        assert_eq!(
            divan::black_box(&VEC64)
                .iter()
                .copied()
                .position(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u64_new() {
    for x in (0..10_000).step_by(100) {
        assert_eq!(
            divan::black_box(&VEC64).iter().copied().locate(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u8_old() {
    for x in 0..=255 {
        assert_eq!(
            divan::black_box(&VEC8).iter().copied().position(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u8_new() {
    for x in 0..=255 {
        assert_eq!(
            divan::black_box(&VEC8).iter().copied().locate(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn arcstr_old() {
    for x in (0..1_000).step_by(10) {
        let xstr = x.to_string();
        assert_eq!(
            divan::black_box(&VECSTRING)
                .iter()
                .clone()
                .position(|s| **s == *xstr.as_str()),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn arcstr_new() {
    for x in (0..1_000).step_by(10) {
        let xstr = x.to_string();
        assert_eq!(
            divan::black_box(&VECSTRING)
                .iter()
                .clone()
                .locate(|s| **s == *xstr.as_str()),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 1)]
fn big_range_old() {
    for x in (0..1_000_000).step_by(100_000) {
        assert_eq!(
            divan::black_box(0..1_000_000).position(|s| s == x),
            Some(x as usize)
        )
    }
}

#[divan::bench(min_time = 0.1, sample_size = 1)]
fn big_range_new() {
    for x in (0..1_000_000).step_by(100_000) {
        assert_eq!(
            divan::black_box(0..1_000_000).locate(|s| s == x),
            Some(x as usize)
        )
    }
}

#[divan::bench(min_time = 0.1, sample_size = 1)]
fn small_range_old() {
    for x in 0..100 {
        assert_eq!(
            divan::black_box(0..100).position(|s| s == x),
            Some(x as usize)
        )
    }
}

#[divan::bench(min_time = 0.1, sample_size = 1)]
fn small_range_new() {
    for x in 0..100 {
        assert_eq!(
            divan::black_box(0..100).locate(|s| s == x),
            Some(x as usize)
        )
    }
}

trait Locate {
    type Item;
    fn locate<P>(&mut self, predicate: P) -> Option<usize>
    where
        P: FnMut(Self::Item) -> bool;
}

impl<T, U> Locate for U
where
    U: Iterator<Item = T>,
{
    type Item = T;

    #[inline]
    fn locate<P>(&mut self, mut predicate: P) -> Option<usize>
    where
        P: FnMut(Self::Item) -> bool,
    {
        let mut position = 0;
        while let Some(x) = self.next() {
            if predicate(x) {
                return Some(position);
            }
            position += 1;
        }
        None
    }
}
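
(A quick sanity check that the Locate helper above agrees with Iterator::position; this snippet is a small addition on top of the bench file:)

    #[test]
    fn locate_matches_position() {
        // locate() mirrors position(), so the two should always agree.
        assert_eq!((0..10).locate(|x| x == 3), Some(3));
        assert_eq!((0..10).position(|x| x == 3), Some(3));
        assert_eq!([1, 2, 3].iter().locate(|&x| x == 9), None);
    }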

Benchmarks are named after the iterator and were run using Divan.
Ranges benefit massively (big ones by multiple orders of magnitude), while for other types, speedups are usually between 0% and 100%.

A specialized implementation for ranges that computes the position directly would of course be even faster, but I don't think position() is called that often on ranges.
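
For an equality search like the ones in the benchmarks above, such a specialization wouldn't need to loop at all. A hypothetical sketch, just to illustrate the remark (nothing like this is being proposed here):

    use std::ops::Range;

    // Hypothetical direct computation for ranges: the element at index i of
    // `start..end` is simply `start + i`, so the position of `needle` is
    // `needle - start` whenever `needle` is in bounds.
    fn range_position_of(range: Range<usize>, needle: usize) -> Option<usize> {
        if range.contains(&needle) {
            Some(needle - range.start)
        } else {
            None
        }
    }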

The implementation is similar to the one for slice::iter::Iter, which was introduced to reduce compile times (#72166), so perhaps this would help in that regard as well.

However, the compiler occasionally appears to forget how to optimize the function after unrelated code is edited. This also happens with the current implementation, so I don't know how reproducible the speedups are (FWIW, ranges have never failed to improve, though).

I did not attempt to benchmark compilation speed, but according to Godbolt, memory usage is down from 11.8 MB to 8.3 MB, which looks promising.

I was motivated by a thread on Reddit, where some context and speculation can be found.

@rustbot added the needs-triage label Jan 3, 2024
@the8472
Member

the8472 commented Jan 4, 2024

The try_fold-based implementation is likely intended to speed up position-finding in Chain or Flatten iterators. So measuring range or slice iters doesn't exercise it well.

That said, they should be equally fast for slices. Going through try_fold should end up with a very similar while-let loop.

The call-tree is deeper, so this might affect optimizations. Have you tried compiling with 1 CGU or LTO to see if that makes a difference? Benchmarks are fickle.
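
To illustrate, here is a heavily simplified toy chain (not the real std::iter::Chain, which is more involved): with internal iteration, try_fold runs one tight loop per half, while external iteration through next() re-checks which half is active on every item.

    use std::ops::ControlFlow;

    // Toy chain for illustration only.
    struct ToyChain<A, B> {
        a: Option<A>,
        b: B,
    }

    impl<A, B> ToyChain<A, B>
    where
        A: Iterator,
        B: Iterator<Item = A::Item>,
    {
        // External iteration: every item pays a "which half is active?" check.
        fn next(&mut self) -> Option<A::Item> {
            if let Some(a) = &mut self.a {
                if let Some(x) = a.next() {
                    return Some(x);
                }
                self.a = None;
            }
            self.b.next()
        }

        // Internal iteration: one tight loop per half, no per-item check.
        fn try_fold<Acc, Brk>(
            &mut self,
            mut acc: Acc,
            mut f: impl FnMut(Acc, A::Item) -> ControlFlow<Brk, Acc>,
        ) -> ControlFlow<Brk, Acc> {
            if let Some(a) = &mut self.a {
                for x in a {
                    acc = f(acc, x)?;
                }
                self.a = None;
            }
            for x in &mut self.b {
                acc = f(acc, x)?;
            }
            ControlFlow::Continue(acc)
        }
    }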

@the8472 added the I-slow, A-iterators, and T-libs labels and removed the needs-triage label Jan 4, 2024
@the8472
Member

the8472 commented Jan 4, 2024

The Godbolt example does show that the try_fold version results in a few more instructions.

The LLVM IR looks like it's trying to do some branchless version of the accumulator increments here:

  %_30.i.i.i = getelementptr inbounds i32, ptr %_30.i12.i.i, i64 1, !dbg !71
  %.val.i.i = load i32, ptr %_30.i12.i.i, align 4, !dbg !84, !noalias !85, !noundef !11
  %_0.i.i.i.i.i = icmp ne i32 %.val.i.i, %0, !dbg !91
  %_8.0.i.i.i.i = zext i1 %_0.i.i.i.i.i to i64, !dbg !103
  %_0.sroa.3.0.i.i.i.i = add i64 %accum.0.i.i, %_8.0.i.i.i.i, !dbg !103

while

  %_30.i.i.i = getelementptr inbounds i32, ptr %_30.i.i56.i, i64 1, !dbg !81
  %2 = add nuw nsw i64 %accum.07.i, 1, !dbg !94
  %3 = icmp eq ptr %_30.i.i.i, %end_or_len, !dbg !36

is an unconditional increment sitting in its own block.

Maybe ControlFlow is not well-suited for carrying the value and the loop state at the same time? CC @scottmcm

@marthadev
Contributor Author

marthadev commented Jan 4, 2024

> The try_fold-based implementation is likely intended to speed up position-finding in Chain or Flatten iterators. So measuring range or slice iters doesn't exercise it well.

I did some quick measurements for a chain of vecs:

position      fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ chain_new  477 µs        │ 651.6 µs      │ 487.3 µs      │ 491.9 µs      │ 204     │ 204
╰─ chain_old  361.1 µs      │ 651.4 µs      │ 377 µs        │ 380.2 µs      │ 263     │ 263

So the current impl does indeed work better for chains.

> That said, they should be equally fast for slices. Going through try_fold should end up with a very similar while-let loop.

Yes, the difference is the ControlFlow abstraction.

> The call-tree is deeper, so this might affect optimizations. Have you tried compiling with 1 CGU or LTO to see if that makes a difference? Benchmarks are fickle.

Yes:

[profile.bench]
incremental = false
codegen-units = 1
lto = "fat"

No real difference, though; the results above were obtained with these settings.

> Maybe ControlFlow is not well-suited for carrying the value and the loop state at the same time? CC @scottmcm

This made me retest an alternative I used previously (reducing the ControlFlow type from ControlFlow<usize, usize> to ControlFlow<usize, ()>):

let mut accum = 0;
self.try_fold((), |_, x| if predicate(x) {
    ControlFlow::Break(accum)
} else {
    accum += 1;
    ControlFlow::Continue(())
}).break_value()

And this one actually beats the current one for the chain test:


position      fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ chain_alt  264.9 µs      │ 450.4 µs      │ 271.3 µs      │ 276.9 µs      │ 361     │ 361
├─ chain_new  314.5 µs      │ 413.3 µs      │ 322.3 µs      │ 326.4 µs      │ 307     │ 307
╰─ chain_old  361.6 µs      │ 526.2 µs      │ 371 µs        │ 373.4 µs      │ 268     │ 268

The chain_alt version is even faster; it's an attempt to keep the current style, updated to keep the counter outside the fold:

#[inline]
fn check<'a, T>(
    mut predicate: impl FnMut(T) -> bool + 'a,
    acc: &'a mut usize,
) -> impl FnMut((), T) -> ControlFlow<usize, ()> + 'a {
    move |_, x| {
        if predicate(x) {
            ControlFlow::Break(*acc)
        } else {
            *acc += 1;
            ControlFlow::Continue(())
        }
    }
}
let mut acc = 0;
self.try_fold((), check(predicate, &mut acc)).break_value()

It isn't pretty, but it works 😅
Godbolt for the alt version with chains
The assembly and IR are too long for me to really identify where the benefit comes from; curiously, the faster version has more instructions, but I guess with all the jumps everywhere that doesn't mean a lot.

I am interested in testing these more thoroughly; is there a nice resource with some real-world data to test them on?

Edit: I went digging through recent projects and found a solution for Advent of Code Day 6 using .rposition() and .position(), with the relevant part being:

    let mut hold_durations = (0..time).map(|hold| hold * (time - hold));
    hold_durations.rposition(|d| d > distance).unwrap()
        - hold_durations.position(|d| d > distance).unwrap()
        + 1

And changing to the alt version (with check(predicate, acc)) improved runtime from 8.8 ms to 6.2 ms, a 40% speedup! (I know there is an algebraic solution to the problem, but still)
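
(For reference, the algebraic version: hold * (time - hold) > distance is a quadratic in hold, so the answer is the count of integers strictly between its two roots. A sketch, assuming the usual Day 6 setup:)

    // Algebraic sketch: count integer holds h with h * (time - h) > distance,
    // i.e. the integers strictly between the roots of h^2 - time*h + distance = 0.
    // f64 precision is fine for typical puzzle inputs.
    fn count_winning_holds(time: f64, distance: f64) -> u64 {
        let disc = (time * time - 4.0 * distance).sqrt();
        let first = ((time - disc) / 2.0).floor() as u64 + 1; // smallest winning hold
        let last = ((time + disc) / 2.0).ceil() as u64 - 1; // largest winning hold
        last - first + 1
    }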

@the8472
Member

the8472 commented Jan 5, 2024

> I am interested in testing these more thoroughly; is there a nice resource with some real-world data to test them on?

You could make a PR and we can do a perf run. The perf suite doesn't have dedicated benchmarks that stress position, and I haven't checked whether any of the runtime benchmarks use position at all, but the compiler also uses it internally, so it might show up in check builds too.

marthadev added a commit to marthadev/rust that referenced this issue Jan 5, 2024
Storing the accumulating value outside of the fold in an attempt to improve code generation
bors added a commit to rust-lang-ci/rust that referenced this issue Jan 5, 2024
Rewrite Iterator::position default impl

Storing the accumulating value outside the fold in an attempt to improve code generation has shown speedups on various handwritten benchmarks, see discussion at rust-lang#119551.
@the8472 linked a pull request Jan 5, 2024 that will close this issue
@bors closed this as completed in 0e9c3d9 Jan 7, 2024
@scottmcm
Member

scottmcm commented Jan 7, 2024

@the8472 Yes, we learned in #76746 (comment) that if you don't need move access to the accumulator -- either because &mut is fine, as in that thread, or seemingly because you can Copy, as seen here -- then it's better to use try_for_each with a FnMut closure instead of needing LLVM to understand the threading through the try_fold+Fn.

(Maybe this'll be better once LLVM does better with 2-variant enums and stops losing information about them, but for now we're stuck with this.)
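
For reference, a minimal sketch of that try_for_each shape (a free function for illustration, not the merged std implementation verbatim):

    use std::ops::ControlFlow;

    // The counter is captured by the FnMut closure instead of being threaded
    // through the accumulator, so ControlFlow only carries the break value.
    fn position_via_try_for_each<I, P>(iter: &mut I, mut predicate: P) -> Option<usize>
    where
        I: Iterator,
        P: FnMut(I::Item) -> bool,
    {
        let mut i = 0usize;
        match iter.try_for_each(|x| {
            if predicate(x) {
                return ControlFlow::Break(i);
            }
            i += 1;
            ControlFlow::Continue(())
        }) {
            ControlFlow::Break(pos) => Some(pos),
            ControlFlow::Continue(()) => None,
        }
    }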
