Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLVM pointer range loop / autovectorization regression part two #37276

Open
bluss opened this issue Oct 19, 2016 · 2 comments
Open

LLVM pointer range loop / autovectorization regression part two #37276

bluss opened this issue Oct 19, 2016 · 2 comments
Labels
A-autovectorization Area: Autovectorization, which can impact perf or code size A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@bluss
Copy link
Member

bluss commented Oct 19, 2016

This is a follow up to #35662. The optimizer regression that resulted in that bug is still not fixed.

That bug created a simplified test case, and fixed it for that case. That's good. However, I have not been able to remove my workaround, so the issue still persists in the original code.

The issue appears for an 8x8 kernel and disappears if the kernel is shrunk to 4x4, so it's somehow related to the sheer size of the function, or the length it goes to in loop unrolling.

Preamble for the code that produces the desired codegen:

let mut ab: [[f32; 8]; 8];
ab = ::std::mem::uninitialized();
loop8!(i, loop8!(j, ab[i][j] = 0.));

What the loop8 macros do is that they expand the expression statically, so it corresponds to 64 assignments.

Initialization part for code which is not optimizing well:

let mut ab: [[f32; 8]; 8];
ab = [[0.; 8]; 8];

Another example which is not optimizing well:

let mut ab: [[f32; 8]; 8];
ab = ::std::mem::uninitialized();
for i in 0..8 {
    for j in 0..8 {
        ab[i][j] = 0.;
    }
}

Full reproducer in the next comment.

@bluss
Copy link
Member Author

bluss commented Oct 19, 2016

  • Tested using rustc 1.14.0-nightly (98a3502da 2016-10-15)
  • command line rustc -Copt-level=3 --test benchgemm.rs && ./benchgemm --bench
  • Assignments, Zeroed: 754 ns/iter (+/- 13)
  • ArrayLit, Loop: 2,771 ns/iter (+/- 51)
#![feature(test)]
extern crate test;
use test::Bencher;

pub type T = f32;

const MR: usize = 8;
const NR: usize = 8;

macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e;
        let $i = 1; $e;
        let $i = 2; $e;
        let $i = 3; $e;
    }}
}

macro_rules! loop8 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e;
        let $i = 1; $e;
        let $i = 2; $e;
        let $i = 3; $e;
        let $i = 4; $e;
        let $i = 5; $e;
        let $i = 6; $e;
        let $i = 7; $e;
    }}
}

/// 4x4 matrix multiplication kernel
///
/// This does the matrix multiplication:
///
/// C ← α A B
///
/// + k: length of data in a, b
/// + a, b are packed
/// + c has general strides
/// + rsc: row stride of c
/// + csc: col stride of c
pub unsafe fn kernel(k: usize, alpha: T, a: *const T, b: *const T,
                     c: *mut T, rsc: isize, csc: isize)
{
    let mut ab: [[T; NR]; MR];

    enum Method {
        ArrayLit,
        Loop,
        Assignments,
        Zeroed,
    }
    let method = Method::Assignments;

    match method {
        Method::ArrayLit => {
            ab = [[0.; NR]; MR];
        }
        Method::Loop => {
            ab = ::std::mem::uninitialized();
            for i in 0..MR {
                for j in 0..NR {
                    ab[i][j] = 0.;
                }
            }
        }
        Method::Assignments => {
            ab = ::std::mem::uninitialized();
            loop8!(i, loop8!(j, ab[i][j] = 0.));
        }
        Method::Zeroed => {
            ab = ::std::mem::zeroed();
        }
    }
    let mut a = a;
    let mut b = b;

    // Compute matrix multiplication into ab[i][j]
    for _ in 0..k {
        loop8!(i, loop8!(j, ab[i][j] += at(a, i) * at(b, j)));

        a = a.offset(MR as isize);
        b = b.offset(NR as isize);
    }

    macro_rules! c {
        ($i:expr, $j:expr) => (*c.offset(rsc * $i as isize + csc * $j as isize));
    }

    // set C = α A B
    loop8!(j, loop8!(i, c![i, j] = alpha * ab[i][j]));
}

#[inline(always)]
unsafe fn at(ptr: *const T, i: usize) -> T {
    *ptr.offset(i as isize)
}

#[bench]
fn bench_gemm(bench: &mut Bencher) {
    const K: usize = 128;
    let mut a = [1.; MR * K];
    let mut b = [0.; NR * K];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as T;
    }

    for i in 0..NR {
        b[i + i * K] = 1.;
    }
    let mut c = [0.; NR * MR];
    bench.iter(|| {
        unsafe {
            kernel(K, 1., &a[0], &b[0], &mut c[0], 1, MR as isize);
        }
        c
    });
}

@bluss
Copy link
Member Author

bluss commented Oct 19, 2016

(Removed and moved this information to the top)

@sanxiyn sanxiyn added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Dec 28, 2016
@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 26, 2017
@jonas-schievink jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 31, 2020
@workingjubilee workingjubilee added the A-autovectorization Area: Autovectorization, which can impact perf or code size label Jul 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-autovectorization Area: Autovectorization, which can impact perf or code size A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants