
optimize simd widening mul #1247

Merged
merged 1 commit into master on Aug 13, 2022
Conversation

TheIronBorn
Collaborator

stdsimd allows types larger than 512 bits, so we can avoid the slow __mulddi3 path. If/when Simd&lt;u128&gt; arrives, we can use it for 64-bit lanes as well.
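For intuition, here is a scalar sketch of the same idea in stable Rust (wmul_u8 is an illustrative helper, not an API from this PR): cast each lane up to the double-width type, multiply once, and split the product into high and low halves.

```rust
// Illustrative scalar analogue of the cast-based widening multiply
// (not code from this PR): widen u8 -> u16, multiply once, split halves.
fn wmul_u8(x: u8, y: u8) -> (u8, u8) {
    let wide = (x as u16) * (y as u16);
    ((wide >> 8) as u8, wide as u8) // (high, low)
}

fn main() {
    // 200 * 200 = 40000 = 156 * 256 + 64
    assert_eq!(wmul_u8(200, 200), (156, 64));
}
```

The SIMD version in this PR does the same thing lane-wise with a wider Simd type, letting the backend pick the instruction sequence.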

@dhardy
Member

dhardy commented Aug 11, 2022

I didn't see discussion on what Simd<u16, 64> etc. means given the lack of CPU support. Are we simply deferring to a portable-simd software implementation? Since LaneCount<LANES>: SupportedLaneCount is independent of T and the CPU feature set, I suppose this must be the case.

Did you run any benchmarks?

@TheIronBorn
Collaborator Author

Right, sorry. The compiler is smart enough to treat it like two u16x32 operations (or even four u16x16 if you only have 256-bit registers), i.e.:

vpmullw	%zmm1, %zmm4, %zmm1
vpmovzxbw	32(%r8), %zmm4
vpmullw	%zmm2, %zmm4, %zmm2
vpxor	%xmm4, %xmm4, %xmm4
vpunpckhbw	%zmm4, %zmm0, %zmm5
vpunpckhbw	%zmm4, %zmm3, %zmm16
vpmullw	%zmm5, %zmm16, %zmm5
vpsrlw	$8, %zmm5, %zmm5
vpunpcklbw	%zmm4, %zmm0, %zmm0
vpunpcklbw	%zmm4, %zmm3, %zmm3
vpmullw	%zmm0, %zmm3, %zmm0
vpsrlw	$8, %zmm0, %zmm0
vpackuswb	%zmm5, %zmm0, %zmm0
vpmovwb	%zmm1, %ymm1
vpmovwb	%zmm2, %ymm2
vinserti64x4	$1, %ymm2, %zmm1, %zmm1

There's a chance it won't work on other architectures, though those architectures might not even have 512-bit registers.

And the benchmarks:

test cast_wmul_u16x32    ... bench:     255,482 ns/iter (+/- 115,946) = 16417 MB/s
test cast_wmul_u32x16    ... bench:     287,282 ns/iter (+/- 10,553) = 14599 MB/s
test cast_wmul_u8x64     ... bench:     295,669 ns/iter (+/- 25,442) = 14185 MB/s
test mulddi3_wmul_u16x32 ... bench:     364,887 ns/iter (+/- 11,275) = 11494 MB/s
test mulddi3_wmul_u32x16 ... bench:     488,590 ns/iter (+/- 17,975) = 8584 MB/s
test mulddi3_wmul_u8x64  ... bench:     749,078 ns/iter (+/- 46,821) = 5599 MB/s

with this bench macro body:

// expanded per type by a macro; $fnn, $ty, $wmul_type are macro parameters
#[bench]
fn $fnn(b: &mut Bencher) {
    let x = <$ty>::splat(7);
    let y = <$ty>::splat(3);

    // count both output halves: size_of::<$ty>() bytes each for high and low
    b.bytes = size_of::<$ty>() as u64 * 2 * RAND_BENCH_N;
    b.iter(|| {
        let mut accum = <$ty>::default();
        for _ in 0..RAND_BENCH_N {
            // no unrolling, so it's similar to gen_range without the overhead
            let (h, l) = test::black_box(x).$wmul_type(test::black_box(y));
            accum += h;
            accum += l;
        }
        accum
    });
}

Performing two multiplications instead of four is easily going to be faster.
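For contrast, here is a scalar sketch of the four-multiplication decomposition being avoided (an assumed shape of the __mulddi3-style fallback, not the actual compiler-rt source): a 32x32 -> 64 widening multiply built from four 16x16 -> 32 products of the operand halves, plus carry propagation.

```rust
// Assumed shape of the split-into-halves fallback (illustrative, not the
// actual compiler-rt __mulddi3 source): four partial products plus carries.
fn wmul_u32_via_halves(x: u32, y: u32) -> (u32, u32) {
    let (xl, xh) = (x & 0xFFFF, x >> 16);
    let (yl, yh) = (y & 0xFFFF, y >> 16);
    // four 16x16 -> 32 partial products
    let ll = xl * yl;
    let lh = xl * yh;
    let hl = xh * yl;
    let hh = xh * yh;
    // propagate the cross terms into the high and low words
    let mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF);
    let high = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
    let low = (mid << 16) | (ll & 0xFFFF);
    (high, low) // (high, low) halves of the 64-bit product
}

fn main() {
    // u32::MAX squared = 0xFFFF_FFFE_0000_0001
    assert_eq!(wmul_u32_via_halves(0xFFFF_FFFF, 0xFFFF_FFFF), (0xFFFF_FFFE, 1));
    assert_eq!(wmul_u32_via_halves(7, 3), (0, 21));
}
```

The cast-based path replaces all of this with a single multiply in the double-width type, which is why halving the multiplication count shows up directly in the benchmarks above.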

@TheIronBorn TheIronBorn merged commit 9dd97b4 into master Aug 13, 2022
@newpavlov newpavlov deleted the TheIronBorn-patch-1 branch May 22, 2024 02:16