
optimize simd widening mul #1247

Merged
merged 1 commit into master on Aug 13, 2022
Conversation

TheIronBorn
Collaborator

stdsimd allows types larger than 512 bits, so we can avoid the slow __mulddi3 path. If/when Simd&lt;u128&gt; arrives, we can use it for 64-bit lanes as well.
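For intuition, here is a scalar sketch of the same idea in stable Rust (wmul_u8 is an illustrative helper, not an API from this PR): cast each lane up to the double-width type, multiply once, and split the product into high and low halves.

```rust
// Illustrative scalar analogue of the cast-based widening multiply
// (not code from this PR): widen u8 -> u16, multiply once, split halves.
fn wmul_u8(x: u8, y: u8) -> (u8, u8) {
    let wide = (x as u16) * (y as u16);
    ((wide >> 8) as u8, wide as u8) // (high, low)
}

fn main() {
    // 200 * 200 = 40000 = 156 * 256 + 64
    assert_eq!(wmul_u8(200, 200), (156, 64));
}
```

The SIMD version in this PR does the same thing lane-wise with a wider Simd type, letting the backend pick the instruction sequence.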

@dhardy
Member

dhardy commented Aug 11, 2022

I didn't see discussion on what Simd<u16, 64> etc. means given the lack of CPU support. Are we simply deferring to a portable-simd software implementation? Since LaneCount<LANES>: SupportedLaneCount is independent of T and the CPU feature set, I suppose this must be the case.

Did you run any benchmarks?

@TheIronBorn
Collaborator Author

Right, sorry. The compiler is smart enough to treat it like two u16x32 operations (or even four u16x16 if you only have 256-bit registers), i.e.:

vpmullw	%zmm1, %zmm4, %zmm1
vpmovzxbw	32(%r8), %zmm4
vpmullw	%zmm2, %zmm4, %zmm2
vpxor	%xmm4, %xmm4, %xmm4
vpunpckhbw	%zmm4, %zmm0, %zmm5
vpunpckhbw	%zmm4, %zmm3, %zmm16
vpmullw	%zmm5, %zmm16, %zmm5
vpsrlw	$8, %zmm5, %zmm5
vpunpcklbw	%zmm4, %zmm0, %zmm0
vpunpcklbw	%zmm4, %zmm3, %zmm3
vpmullw	%zmm0, %zmm3, %zmm0
vpsrlw	$8, %zmm0, %zmm0
vpackuswb	%zmm5, %zmm0, %zmm0
vpmovwb	%zmm1, %ymm1
vpmovwb	%zmm2, %ymm2
vinserti64x4	$1, %ymm2, %zmm1, %zmm1

There's a chance it won't work on other architectures, though those architectures might not even have 512-bit registers.

And the benchmarks:

test cast_wmul_u16x32    ... bench:     255,482 ns/iter (+/- 115,946) = 16417 MB/s
test cast_wmul_u32x16    ... bench:     287,282 ns/iter (+/- 10,553) = 14599 MB/s
test cast_wmul_u8x64     ... bench:     295,669 ns/iter (+/- 25,442) = 14185 MB/s
test mulddi3_wmul_u16x32 ... bench:     364,887 ns/iter (+/- 11,275) = 11494 MB/s
test mulddi3_wmul_u32x16 ... bench:     488,590 ns/iter (+/- 17,975) = 8584 MB/s
test mulddi3_wmul_u8x64  ... bench:     749,078 ns/iter (+/- 46,821) = 5599 MB/s

with this bench macro body:

// expanded per type by a macro; $fnn, $ty, $wmul_type are macro parameters
#[bench]
fn $fnn(b: &mut Bencher) {
    let x = <$ty>::splat(7);
    let y = <$ty>::splat(3);

    // count both output halves: size_of::<$ty>() bytes each for high and low
    b.bytes = size_of::<$ty>() as u64 * 2 * RAND_BENCH_N;
    b.iter(|| {
        let mut accum = <$ty>::default();
        for _ in 0..RAND_BENCH_N {
            // no unrolling, so it's similar to gen_range without the overhead
            let (h, l) = test::black_box(x).$wmul_type(test::black_box(y));
            accum += h;
            accum += l;
        }
        accum
    });
}

Performing two multiplications instead of four is easily going to be faster.
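For contrast, here is a scalar sketch of the four-multiplication decomposition being avoided (an assumed shape of the __mulddi3-style fallback, not the actual compiler-rt source): a 32x32 -> 64 widening multiply built from four 16x16 -> 32 products of the operand halves, plus carry propagation.

```rust
// Assumed shape of the split-into-halves fallback (illustrative, not the
// actual compiler-rt __mulddi3 source): four partial products plus carries.
fn wmul_u32_via_halves(x: u32, y: u32) -> (u32, u32) {
    let (xl, xh) = (x & 0xFFFF, x >> 16);
    let (yl, yh) = (y & 0xFFFF, y >> 16);
    // four 16x16 -> 32 partial products
    let ll = xl * yl;
    let lh = xl * yh;
    let hl = xh * yl;
    let hh = xh * yh;
    // propagate the cross terms into the high and low words
    let mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF);
    let high = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
    let low = (mid << 16) | (ll & 0xFFFF);
    (high, low) // (high, low) halves of the 64-bit product
}

fn main() {
    // u32::MAX squared = 0xFFFF_FFFE_0000_0001
    assert_eq!(wmul_u32_via_halves(0xFFFF_FFFF, 0xFFFF_FFFF), (0xFFFF_FFFE, 1));
    assert_eq!(wmul_u32_via_halves(7, 3), (0, 21));
}
```

The cast-based path replaces all of this with a single multiply in the double-width type, which is why halving the multiplication count shows up directly in the benchmarks above.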

@TheIronBorn TheIronBorn merged commit 9dd97b4 into master Aug 13, 2022
@newpavlov newpavlov deleted the TheIronBorn-patch-1 branch May 22, 2024 02:16