slice::iter() does not preserve number of iterations information for optimizer causing unneeded bounds checks #75935

sdroege · 2020-08-26T06:34:34Z

Godbolt link to the code below: https://rust.godbolt.org/z/aKf3Wq

pub fn foo1(x: &[u32], y: &[u32]) -> u32 {
    let mut sum = 0;
    let chunk_size = y.len();
    for (c, y) in y.iter().enumerate() {
        for chunk in x.chunks_exact(chunk_size) {
            sum += chunk[c] + y;
        }
    }
    sum
}

This code has a bounds check for chunk[c] although c < chunk_size by construction.

The same code a bit more convoluted gets rid of the bounds check

pub fn foo2(x: &[u32], y: &[u32]) -> u32 {
    let mut sum = 0;
    let chunk_size = y.len();
    for c in 0..chunk_size {
        let y = y[c];
        for chunk in x.chunks_exact(chunk_size) {
            sum += chunk[c] + y;
        }
    }
    sum
}

It seems like the information that 0 <= c < y.len() gets lost for the optimizer when going via y.iter().enumerate(). So this is unrelated to chunks_exact() specifically but I can't come up with an equivalent example without it.

edit: As noticed in #75935 (comment), this can be worked around by defining a custom slice iterator that does counting of elements instead of working with an end pointer.

The problem is that the slice::iter() works with an end pointer to know when the iteration can stop and keeps no information around for the optimizer that it's actually going to iterate exactly N times. Unclear to me how this information can be preserved without changing how the iterator works, which will probably have other negative effects.

edit2:

As noticed in #77822 this happens with C++/C too and can also simplified a lot on the Rust side

pub fn foo(y: &[u32]) {
    let mut x = 0;
    for (c, _y) in y.iter().enumerate() {
        assert!(c < y.len());
        x = c;
    }
    assert!(x == y.len());
}

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#include <vector>

void foo1(const uint32_t *y, size_t y_len) {
  const uint32_t *y_end = y + y_len;
  size_t c = 0;
  for (const uint32_t *y_iter = y; y_iter != y_end; y_iter++, c++) {
    assert(c < y_len);
  }
  assert(c == y_len);
}

void foo2(const std::vector<uint32_t>& y) {
    size_t c = 0;
    for (auto y_iter: y) {
        assert(c < y.size());
        c++;
    }
    assert(c == y.size());
}

void foo3(const std::vector<uint32_t>& y) {
    size_t c = 0;
    for (auto y_iter = y.cbegin(); y_iter != y.cend(); y_iter++, c++) {
        assert(c < y.size());
    }
    assert(c == y.size());
}

edit3: This is now also reported to https://bugs.llvm.org/show_bug.cgi?id=48965

The text was updated successfully, but these errors were encountered:

sdroege · 2020-08-26T07:36:37Z

I should probably add that while this code is very contrived, it's based on real code that shows the same behaviour.

tesuji · 2020-08-26T08:23:21Z

Maybe a duplicate of #74938: An upgrade to LLVM 12 or so is needed to fix the issue.

sdroege · 2020-08-26T08:49:39Z

I'll try adding only that change to the LLVM version used by rustc. Let's see if that solves anything (or works at all :) ).

sdroege · 2020-08-26T09:39:20Z

No, does not work at all. Gives a segfault in exactly that function that is changed inside LLVM.

sdroege · 2020-08-26T10:05:12Z

Got it to work. It doesn't fix this issue here, but it fixes #75936 . This one here is still valid.

sdroege · 2020-08-26T10:12:24Z

It seems like the information that 0 <= c < y.len() somehow gets lost for the optimizer when going via y.iter().enumerate().

My guess is that this is because the slice iterators don't go via a counter but instead via an end pointer, so it's not obvious anymore that it's just iterating exactly self.len() times.

sdroege · 2020-08-26T10:53:14Z

Yes, going with a simple iterator that counts instead gets rid of the bounds check. Code:

pub fn foo1(x: &[u32], y: &[u32]) -> u32 {
    let mut sum = 0;
    let chunk_size = y.len();
    for (c, y) in Iter::new(y).enumerate() {
        for chunk in x.chunks_exact(chunk_size) {
            sum += chunk[c] + y;
        }
    }
    sum
}

struct Iter<'a> {
    ptr: *const u32,
    len: usize,
    phantom: std::marker::PhantomData<&'a [u32]>,
}

impl<'a> Iter<'a> {
    fn new(v: &'a [u32]) -> Iter<'a> {
        Iter {
            ptr: v.as_ptr(),
            len: v.len(),
            phantom: std::marker::PhantomData,
        }
    }
}

impl<'a> Iterator for Iter<'a> {
    type Item = &'a u32;

    fn next(&mut self) -> Option<&'a u32> {
        unsafe {
            if self.len == 0 {
                return None;
            }

            let item = &*self.ptr;
            self.ptr = self.ptr.add(1);
            self.len -= 1;
            Some(item)
        }
    }
}

tesuji · 2020-08-27T10:48:50Z

You could update the issue description for new information.

sdroege · 2020-08-27T11:02:11Z

You could update the issue description for new information.

Indeed, thanks. Done!

cynecx · 2020-08-27T11:20:33Z

Is there any particular reason why the std-implementation (slice::Iter) is doing iteration through end pointer equality compared to the counting variant?

sdroege · 2020-08-27T11:27:32Z

Is there any particular reason why the std-implementation (slice::Iter) is doing iteration through end pointer equality compared to the counting variant?

I don't know the history, but in theory one instruction less per iteration (one pointer addition vs. one pointer addition and one counter addition). And it might be taken advantage of in some specialized impls but I don't know.

Might be worth looking at what std::vector iterators in C++ are doing, I'd hope those are optimizing well with clang++.

tesuji · 2020-08-27T12:16:47Z

Is there any particular reason why the std-implementation (slice::Iter) is doing iteration through end pointer equality compared to the counting variant?

The comment in code says that it's because of an optimization for ZST:

rust/library/core/src/slice/mod.rs

Lines 4009 to 4014 in 118860a

    
           pub struct Iter<'a, T: 'a> { 
        
               ptr: NonNull<T>, 
        
               end: *const T, // If T is a ZST, this is actually ptr+len.  This encoding is picked so that 
        
               // ptr == end is a quick test for the Iterator being empty, that works 
        
               // for both ZST and non-ZST. 
        
               _marker: marker::PhantomData<&'a T>,

sdroege · 2020-08-27T13:05:56Z

The comment in code says that it's because of an optimization for ZST:

If you had a counter that would work basically the same way, you'd just check idx==len or remainder==0 or similar. For ZST the current encoding seems just like a way to make it possible to not worry about ZST special cases in most other places of the code.

sdroege · 2020-08-28T07:20:03Z

I can do an implementation of slice::iter() that does counting next week, but what would be the best way to check this doesn't cause any performance regressions elsewhere? How are such things usually checked (i.e. is there some extensive benchmark suite that I could run, ...)?

sdroege · 2020-10-11T13:11:01Z

And another variant with specialization of the Enumerate iterator, which should also catch various other cases (e.g. the Chunks and ChunksExact iterators on slices). This yields the most optimal code so far: no bounds checks, unrolled and auto-vectorized nicely.

Check assembly here, the new one is foo2().

The std::intrinsics::assume() in the specialized impl is the part that makes it work nicely.

I'll create a PR for this later.

pub fn foo2(x: &[u32], y: &[u32]) -> u32 {
    let mut sum = 0;
    let chunk_size = y.len();
    for (c, y) in Enumerate::new(y.iter()) {
        for chunk in x.chunks_exact(chunk_size) {
            sum += chunk[c] + y;
        }
    }
    sum
}

struct Enumerate<I> {
    iter: I,
    count: usize,
    len: usize,
}

impl<I: Iterator> Enumerate<I> {
    fn new(iter: I) -> Self {
        EnumerateImpl::new(iter)
    }
}

impl<I> Iterator for Enumerate<I>
where
    I: Iterator,
{
    type Item = (usize, <I as Iterator>::Item);

    #[inline]
    fn next(&mut self) -> Option<(usize, <I as Iterator>::Item)> {
        EnumerateImpl::next(self)
    }
}

// Enumerate specialization trait
#[doc(hidden)]
trait EnumerateImpl<I> {
    type Item;
    fn new(iter: I) -> Self;
    fn next(&mut self) -> Option<(usize, Self::Item)>;
}

impl<I> EnumerateImpl<I> for Enumerate<I>
where
    I: Iterator,
{
    type Item = I::Item;

    default fn new(iter: I) -> Self {
        Enumerate {
            iter,
            count: 0,
            len: 0, // unused
        }
    }

    #[inline]
    default fn next(&mut self) -> Option<(usize, I::Item)> {
        let a = self.iter.next()?;
        let i = self.count;
        // Possible undefined overflow.
        self.count += 1;
        Some((i, a))
    }
}

impl<I> EnumerateImpl<I> for Enumerate<I>
where
    // FIXME: Should probably be TrustedRandomAccess because otherwise size_hint() might be expensive?
    I: std::iter::TrustedLen + ExactSizeIterator + Iterator,
{
    fn new(iter: I) -> Self {
        let len = iter.size_hint().0;

        Enumerate {
            iter,
            count: 0,
            len,
        }
    }

    #[inline]
    fn next(&mut self) -> Option<(usize, I::Item)> {
        let a = self.iter.next()?;
        unsafe { std::intrinsics::assume(self.count < self.len); }
        let i = self.count;
        // Possible undefined overflow.
        self.count += 1;
        Some((i, a))
    }
}

sdroege · 2021-01-30T10:51:39Z

I've updated the issue description with a minimal testcase in Rust, C/C++ and as discussed in #77822 will report this to LLVM.

sdroege · 2021-01-30T12:58:24Z

Reported to https://bugs.llvm.org/show_bug.cgi?id=48965

This allows to hint at the compiler via `intrinsics::assume()` that the returned index is smaller than the length of the underlying iterator and allows the compiler to optimize code better based on that. Fixes rust-lang#75935

the8472 · 2021-02-03T22:27:17Z

I can do an implementation of slice::iter() that does counting next week, but what would be the best way to check this doesn't cause any performance regressions elsewhere? How are such things usually checked (i.e. is there some extensive benchmark suite that I could run, ...)?

Requesting a perf run on the PR would be a start, which runs an extensive performance test compiling (but not benchmarking) various crates. It might miss unusual number-crunching uses of slices though.

https://perf.rust-lang.org/

sdroege · 2021-02-04T07:47:26Z

Thanks, but considering that C++ iterators work the same way, it's probably safer to keep it with this pattern and wait for LLVM to solve that. I would assume that patterns (also) used by C++ are more likely to be optimized better.

scottmcm · 2024-06-14T17:50:27Z

Is there any particular reason why the std-implementation (slice::Iter) is doing iteration through end pointer equality compared to the counting variant?

Because in cases where the iterator doesn't get SRoA'd away, slice::Iter::next on pointers only needs to write one value to memory (the start pointer), whereas a counted implementation needs to update two (increase the pointer and decrease the count): https://rust.godbolt.org/z/9zdYnMd1E.

When computing the length of a slice iterator we use both sub nuw and udiv exact to tell LLVM that it doesn't need to worry about weird cases, so hypothetically it should have everything it needs here. Of course LLVM does tend to optimize away that information in the next it generates...

One thing we could to better is at least tell LLVM that the pointers are aligned even when not read, though. I wish we had a better core pointer type that could actually do that.

sdroege mentioned this issue Aug 26, 2020

Get rid of bounds check in slice::chunks_exact() and related function… #75936

Merged

sdroege changed the title ~~Doesn't elide bounds checks with slice::chunks_exact() in trivial cases~~ slice::iter() does not preserve number of iterations information for optimizer causing unneeded bounds checks Aug 26, 2020

camelid added A-iterators Area: Iterators A-slice Area: `[T]` labels Oct 7, 2020

sdroege mentioned this issue Oct 11, 2020

Implement a specialized version std::iter::Enumerate for TrustedLen #77822

Closed

Enselic added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slice::iter() does not preserve number of iterations information for optimizer causing unneeded bounds checks #75935

slice::iter() does not preserve number of iterations information for optimizer causing unneeded bounds checks #75935

sdroege commented Aug 26, 2020 •

edited

Loading

sdroege commented Aug 26, 2020

tesuji commented Aug 26, 2020 •

edited

Loading

sdroege commented Aug 26, 2020

sdroege commented Aug 26, 2020

sdroege commented Aug 26, 2020

sdroege commented Aug 26, 2020

sdroege commented Aug 26, 2020 •

edited

Loading

tesuji commented Aug 27, 2020

sdroege commented Aug 27, 2020

cynecx commented Aug 27, 2020

sdroege commented Aug 27, 2020

tesuji commented Aug 27, 2020

sdroege commented Aug 27, 2020

sdroege commented Aug 28, 2020 •

edited

Loading

sdroege commented Oct 11, 2020

sdroege commented Jan 30, 2021

sdroege commented Jan 30, 2021

the8472 commented Feb 3, 2021

sdroege commented Feb 4, 2021

scottmcm commented Jun 14, 2024

slice::iter() does not preserve number of iterations information for optimizer causing unneeded bounds checks #75935

slice::iter() does not preserve number of iterations information for optimizer causing unneeded bounds checks #75935

Comments

sdroege commented Aug 26, 2020 • edited Loading

sdroege commented Aug 26, 2020

tesuji commented Aug 26, 2020 • edited Loading

sdroege commented Aug 26, 2020

sdroege commented Aug 26, 2020

sdroege commented Aug 26, 2020

sdroege commented Aug 26, 2020

sdroege commented Aug 26, 2020 • edited Loading

tesuji commented Aug 27, 2020

sdroege commented Aug 27, 2020

cynecx commented Aug 27, 2020

sdroege commented Aug 27, 2020

tesuji commented Aug 27, 2020

sdroege commented Aug 27, 2020

sdroege commented Aug 28, 2020 • edited Loading

sdroege commented Oct 11, 2020

sdroege commented Jan 30, 2021

sdroege commented Jan 30, 2021

the8472 commented Feb 3, 2021

sdroege commented Feb 4, 2021

scottmcm commented Jun 14, 2024

sdroege commented Aug 26, 2020 •

edited

Loading

tesuji commented Aug 26, 2020 •

edited

Loading

sdroege commented Aug 26, 2020 •

edited

Loading

sdroege commented Aug 28, 2020 •

edited

Loading