SSE/AVX byte multiplication could be improved #109775
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Simple unrolled benchmark on Raptor Lake. Code:

```csharp
private Vector128<byte>[] a = new Vector128<byte>[1000], b = new Vector128<byte>[1000];

[Benchmark]
public Vector128<byte> SimpleMultiply()
{
    Vector128<byte> result = default;
    for (int i = 0; i < a.Length; i += 4)
    {
        result ^= SimpleMultiply(a[i], b[i]);
        result ^= SimpleMultiply(a[i + 1], b[i + 1]);
        result ^= SimpleMultiply(a[i + 2], b[i + 2]);
        result ^= SimpleMultiply(a[i + 3], b[i + 3]);
    }
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> SimpleMultiply(Vector128<byte> left, Vector128<byte> right) => left * right;

[Benchmark]
public Vector128<byte> Stratergy1()
{
    Vector128<byte> result = default;
    for (int i = 0; i < a.Length; i += 4)
    {
        result ^= Stratergy1(a[i], b[i]);
        result ^= Stratergy1(a[i + 1], b[i + 1]);
        result ^= Stratergy1(a[i + 2], b[i + 2]);
        result ^= Stratergy1(a[i + 3], b[i + 3]);
    }
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> Stratergy1(Vector128<byte> left, Vector128<byte> right)
{
    Vector128<ushort> dst_even = left.AsUInt16() * right.AsUInt16();
    Vector128<ushort> dst_odd = (left.AsUInt16() >> 8) * (right.AsUInt16() >> 8);
    return ((dst_odd << 8) | (dst_even & Vector128.Create<ushort>(0xFF))).AsByte();
}

[Benchmark]
public Vector128<byte> Stratergy2()
{
    Vector128<byte> result = default;
    for (int i = 0; i < a.Length; i += 4)
    {
        result ^= Stratergy2(a[i], b[i]);
        result ^= Stratergy2(a[i + 1], b[i + 1]);
        result ^= Stratergy2(a[i + 2], b[i + 2]);
        result ^= Stratergy2(a[i + 3], b[i + 3]);
    }
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> Stratergy2(Vector128<byte> left, Vector128<byte> right)
{
    Vector128<short> dst_even = Ssse3.MultiplyAddAdjacent(left, (right.AsUInt16() & Vector128.Create<ushort>(0xFF)).AsSByte());
    Vector128<short> dst_odd = Ssse3.MultiplyAddAdjacent(left, Vector128.AndNot(Vector128.Create<ushort>(0xFF), right.AsUInt16()).AsSByte()) << 8;
    return ((dst_odd << 8) | (dst_even & Vector128.Create<short>(0xFF))).AsByte();
}
```
There are a lot of broader considerations than simply whatever is fastest in an isolated benchmark. These include factoring in a broad range of hardware, how the code impacts loops (alignment, micro-op cache size, etc.), simplicity, what other code exists around the code in question, and so on. There are also some notable differences here based on whether it's XMM or YMM based. For example, on Intel …

From the other strategies, … Given the above, the raw throughput potential (not factoring in memory accesses) is roughly: …

The cost of a load is then approximately 3-7 cycles in typical cases and can impact normal pipelining and other considerations, so strategy 2 gains 12-28 cycles, while the current strategy and strategy 1 gain 9-21 cycles (i.e. roughly four loads at 3-7 cycles each for strategy 2 versus three for the others). And despite pipelining being possible in some cases, the practical reality is that there are several longer dependency chains that will prevent the code from executing in parallel in practice. So while theoretically strategy 1 can get the two …

So what we're left with is that strategy 1 is theoretically the same speed as the current strategy for XMM, but potentially slower while taking up more uop cache, causing less dense code, etc. For YMM it's likely faster on older hardware and hits the same considerations as XMM on newer hardware. And while strategy 2 is theoretically the same speed as strategy 1, the practical result is that it will always end up worse due to the necessary memory accesses and more competing ports. In the end, I don't think it's worth making the current pattern, which is fairly idiomatic, more complex for a theoretical 20% perf increase that is unlikely to pan out in many real-world scenarios, especially on newer hardware.
As an example, here is the same benchmark @huoyaoyuan gave above running on some more hardware:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)

Will run on my AMD boxes when I get home tonight, and various other older hardware I have as well.
OK, those benchmark results are damning, but while I agree that this is context and hardware dependent, I cannot agree with the details of that analysis. "Throughput potential" is not a familiar measure to me: it's not the reciprocal throughput, because that would have been lower, and it's not latency, because that also would have been lower (and probably shouldn't be the main target anyway). I would suggest considering the execution port pressures on some µarchs. If we're considering a variety of hypothetical contexts, strategies 1 and 2 having µops that spread better over the execution ports (as opposed to piling up on p5, which is bad news for contexts in which p5 is already a problem, which is common) should be a good thing. The extra code size is of course a downside. Personally I consider an odd/even split more idiomatic than widening/narrowing (which has traditionally always been something to avoid), but that's more subjective.
The numbers given there are raw latency, not factoring in pipelining, decoding, or other considerations. For example, the current strategy (not factoring in memory loads) is … For strategy 1, you similarly have …

There is, however, the cost of the loads as well, and those are also impactful. In strategy 1, the first … While for the current strategy we have …

This then repros in real-world benchmarks between the two, when you factor in a broad range of hardware. On pre-Ice Lake hardware you're going to see strategy 1 being 5-20% faster, depending on several factors, while on Ice Lake and later hardware you're going to find almost the inverse, with the current strategy being 5-20% faster. Beyond the raw numbers, there is also the real impact to the instruction decoder, the micro-op cache, how it impacts code windows for loops or other tight bodies of code, and how it impacts surrounding code. While saturating the shuffle port isn't ideal (particularly on older hardware where only one exists), saturating the arithmetic ports (where there are typically two) is also bad. When you factor in all this information across a broad range of scenarios, the current strategy ends up winning out.

There isn't really piling up here. The bottleneck only comes when there is pipelining potential, which means you need two non-dependent shuffles that are within the same decode window and without any code that would block reordering between them. This is highly unlikely to occur in practice for any of the 3 strategies given. The more practical case for pipelining will come from the arithmetic ports, which the alternative strategies push toward saturation; the increased code size also reduces the chance that surrounding code will live in a window that allows it.

Shuffling is a fundamental part of SIMD, and it's why newer CPUs have been increasing the number of shuffle ports available. The even/odd split has slowly been falling out of favor due to how it works with masking, embedded loads, and other scenarios; correspondingly, many of the "built-in" even/odd and pairwise instructions were never included in the upgrade to EVEX and don't have ZMM support, so you have to emulate them (often with shuffling) in those cases.
More benchmark results:

Zen 5

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 435.8 ns | 0.17 ns | 0.16 ns |
| Stratergy1     | 425.1 ns | 0.36 ns | 0.30 ns |
| Stratergy2     | 457.1 ns | 0.60 ns | 0.47 ns |

Zen 4

BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
AMD Ryzen 7 PRO 8700GE w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 442.0 ns | 0.61 ns | 0.57 ns |
| Stratergy1     | 564.2 ns | 0.38 ns | 0.35 ns |
| Stratergy2     | 591.3 ns | 0.39 ns | 0.37 ns |

Meteor Lake

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
Intel Core Ultra 7 155H, 1 CPU, 22 logical and 16 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

| Method         | Mean       | Error    | StdDev   |
|--------------- |-----------:|---------:|---------:|
| SimpleMultiply | 1,022.6 ns | 19.96 ns | 20.49 ns |
| Stratergy1     |   829.8 ns | 13.59 ns | 12.71 ns |
| Stratergy2     |   846.9 ns | 16.74 ns | 17.19 ns |
Worth noting, the byte multiply codegen is different for AVX-512:

```asm
vpmovzxbw ymm1, xmmword ptr [r11+rdi+0x10]
vpmovzxbw ymm2, xmmword ptr [r10+rdi+0x10]
vpmullw   ymm1, ymm2, ymm1
vpmovwb   ymm1, ymm1
```

Similarly, the and+or combo in the alternate versions is combined into a `vpternlogd`.

Zen 5 results with only AVX2:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 447.6 ns | 0.46 ns | 0.41 ns |
| Stratergy1     | 447.4 ns | 0.86 ns | 0.81 ns |
| Stratergy2     | 496.3 ns | 1.73 ns | 1.54 ns |
More benchmark results: Sapphire Rapids (Golden Cove)
I suggest another alternative strategy.

Strategy 3

Based on Strategy 1, but removing some redundant instructions.

```asm
vpsrlw  xmm3, xmm0, 8
vpand   xmm2, xmm1, xmm15   ; take higher 8 bits of xmm1
vpmullw xmm0, xmm1, xmm0
vpmullw xmm2, xmm2, xmm3
vpandn  xmm0, xmm15, xmm0   ; ~xmm15 & xmm0
vpor    xmm0, xmm0, xmm2
```

Since …

Added code in benchmark:

```csharp
[Benchmark]
public Vector128<byte> Stratergy3()
{
    Vector128<byte> result = default;
    for (var i = 0; i < a.Length; i += 4)
    {
        result ^= Strategy3(a[i], b[i]);
        result ^= Strategy3(a[i + 1], b[i + 1]);
        result ^= Strategy3(a[i + 2], b[i + 2]);
        result ^= Strategy3(a[i + 3], b[i + 3]);
    }
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> Strategy3(Vector128<byte> left, Vector128<byte> right)
{
    var mask = Vector128.Create<ushort>((ushort)0xff00u);
    var dst_even = left.AsUInt16() * right.AsUInt16();
    var dst_odd = (left.AsUInt16() >> 8) * (right.AsUInt16() & mask);

    // Control byte 0xf4 selects A | (B & ~C): dst_odd | (dst_even & ~mask), matching the non-AVX-512 fallback below.
    return Avx512F.VL.IsSupported
        ? Avx512F.VL.TernaryLogic(dst_odd, dst_even, mask, 0xf4).AsByte()
        : (dst_odd | Vector128.AndNot(dst_even, mask)).AsByte();
}
```

The result:
Disassembly

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkPlayground.Vector128ByteMultiplicationFromMemoryBenchmarks.SimpleMultiply()

```asm
push rdi
push rsi
push rbx
sub rsp,20
vxorps xmm0,xmm0,xmm0
xor eax,eax
mov r8,[rcx+8]
cmp dword ptr [r8+8],0
jle near ptr M00_L01
M00_L00:
mov r10,r8
mov r9d,[r10+8]
cmp eax,r9d
jae near ptr M00_L02
mov r11,rax
shl r11,4
vpmovzxbw ymm1,xmmword ptr [r10+r11+10]
mov r10,[rcx+10]
mov rbx,r10
mov esi,[rbx+8]
cmp eax,esi
jae near ptr M00_L02
vpmovzxbw ymm2,xmmword ptr [rbx+r11+10]
vpmullw ymm1,ymm2,ymm1
vpmovwb xmm1,ymm1
vpxor xmm0,xmm1,xmm0
mov r11,r8
lea ebx,[rax+1]
cmp ebx,r9d
jae near ptr M00_L02
mov edi,ebx
shl rdi,4
vpmovzxbw ymm1,xmmword ptr [r11+rdi+10]
mov r11,r10
cmp ebx,esi
jae near ptr M00_L02
vpmovzxbw ymm2,xmmword ptr [r11+rdi+10]
vpmullw ymm1,ymm2,ymm1
vpmovwb xmm1,ymm1
vpxor xmm0,xmm0,xmm1
mov r11,r8
lea ebx,[rax+2]
cmp ebx,r9d
jae short M00_L02
mov edi,ebx
shl rdi,4
vpmovzxbw ymm1,xmmword ptr [r11+rdi+10]
mov r11,r10
cmp ebx,esi
jae short M00_L02
vpmovzxbw ymm2,xmmword ptr [r11+rdi+10]
vpmullw ymm1,ymm2,ymm1
vpmovwb xmm1,ymm1
vpxor xmm0,xmm0,xmm1
mov r11,r8
lea ebx,[rax+3]
cmp ebx,r9d
jae short M00_L02
mov r9d,ebx
shl r9,4
vpmovzxbw ymm1,xmmword ptr [r11+r9+10]
cmp ebx,esi
jae short M00_L02
vpmovzxbw ymm2,xmmword ptr [r10+r9+10]
vpmullw ymm1,ymm2,ymm1
vpmovwb xmm1,ymm1
vpxor xmm0,xmm0,xmm1
add eax,4
cmp [r8+8],eax
jg near ptr M00_L00
M00_L01:
vmovups [rdx],xmm0
mov rax,rdx
vzeroupper
add rsp,20
pop rbx
pop rsi
pop rdi
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 296
```

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkPlayground.Vector128ByteMultiplicationFromMemoryBenchmarks.Stratergy1()

```asm
push rdi
push rsi
push rbx
sub rsp,20
vxorps xmm0,xmm0,xmm0
xor eax,eax
mov r8,[rcx+8]
cmp dword ptr [r8+8],0
jle near ptr M00_L01
vmovups xmm1,[7FF7F76A8ED0]
M00_L00:
mov r10,r8
mov r9d,[r10+8]
cmp eax,r9d
jae near ptr M00_L02
mov r11,rax
shl r11,4
vmovups xmm2,[r10+r11+10]
mov r10,[rcx+10]
mov rbx,r10
mov esi,[rbx+8]
cmp eax,esi
jae near ptr M00_L02
vmovups xmm3,[rbx+r11+10]
vpsrlw xmm4,xmm2,8
vpsrlw xmm5,xmm3,8
vpmullw xmm4,xmm5,xmm4
vpsllw xmm4,xmm4,8
vpmullw xmm2,xmm2,xmm3
vpternlogd xmm2,xmm4,xmm1,0EC
vpxor xmm0,xmm2,xmm0
mov r11,r8
lea ebx,[rax+1]
cmp ebx,r9d
jae near ptr M00_L02
mov edi,ebx
shl rdi,4
vmovups xmm2,[r11+rdi+10]
mov r11,r10
cmp ebx,esi
jae near ptr M00_L02
vmovups xmm3,[r11+rdi+10]
vpsrlw xmm4,xmm2,8
vpsrlw xmm5,xmm3,8
vpmullw xmm4,xmm5,xmm4
vpsllw xmm4,xmm4,8
vpmullw xmm2,xmm2,xmm3
vpternlogd xmm2,xmm4,xmm1,0EC
vpxor xmm0,xmm2,xmm0
mov r11,r8
lea ebx,[rax+2]
cmp ebx,r9d
jae near ptr M00_L02
mov edi,ebx
shl rdi,4
vmovups xmm2,[r11+rdi+10]
mov r11,r10
cmp ebx,esi
jae near ptr M00_L02
vmovups xmm3,[r11+rdi+10]
vpsrlw xmm4,xmm2,8
vpsrlw xmm5,xmm3,8
vpmullw xmm4,xmm5,xmm4
vpsllw xmm4,xmm4,8
vpmullw xmm2,xmm2,xmm3
vpternlogd xmm2,xmm4,xmm1,0EC
vpxor xmm0,xmm2,xmm0
mov r11,r8
lea ebx,[rax+3]
cmp ebx,r9d
jae short M00_L02
mov r9d,ebx
shl r9,4
vmovups xmm2,[r11+r9+10]
cmp ebx,esi
jae short M00_L02
vmovups xmm3,[r10+r9+10]
vpsrlw xmm4,xmm2,8
vpsrlw xmm5,xmm3,8
vpmullw xmm4,xmm5,xmm4
vpsllw xmm4,xmm4,8
vpmullw xmm2,xmm2,xmm3
vpternlogd xmm2,xmm4,xmm1,0EC
vpxor xmm0,xmm2,xmm0
add eax,4
cmp [r8+8],eax
jg near ptr M00_L00
M00_L01:
vmovups [rdx],xmm0
mov rax,rdx
add rsp,20
pop rbx
pop rsi
pop rdi
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 389
```

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkPlayground.Vector128ByteMultiplicationFromMemoryBenchmarks.Stratergy2()

```asm
push rdi
push rsi
push rbx
sub rsp,20
vxorps xmm0,xmm0,xmm0
xor eax,eax
mov r8,[rcx+8]
cmp dword ptr [r8+8],0
jle near ptr M00_L01
vmovups xmm1,[7FF7F76A8F00]
M00_L00:
mov r10,r8
mov r9d,[r10+8]
cmp eax,r9d
jae near ptr M00_L02
mov r11,rax
shl r11,4
vmovups xmm2,[r10+r11+10]
mov r10,[rcx+10]
mov rbx,r10
mov esi,[rbx+8]
cmp eax,esi
jae near ptr M00_L02
vmovups xmm3,[rbx+r11+10]
vpandn xmm4,xmm3,xmm1
vpmaddubsw xmm4,xmm2,xmm4
vpsllw xmm4,xmm4,8
vpsllw xmm4,xmm4,8
vpand xmm3,xmm1,xmm3
vpmaddubsw xmm2,xmm2,xmm3
vpternlogd xmm2,xmm4,xmm1,0EC
vpxor xmm0,xmm2,xmm0
mov r11,r8
lea ebx,[rax+1]
cmp ebx,r9d
jae near ptr M00_L02
mov edi,ebx
shl rdi,4
vmovups xmm2,[r11+rdi+10]
mov r11,r10
cmp ebx,esi
jae near ptr M00_L02
vmovups xmm3,[r11+rdi+10]
vpandn xmm4,xmm3,xmm1
vpmaddubsw xmm4,xmm2,xmm4
vpsllw xmm4,xmm4,8
vpsllw xmm4,xmm4,8
vpand xmm3,xmm1,xmm3
vpmaddubsw xmm2,xmm2,xmm3
vpternlogd xmm2,xmm4,xmm1,0EC
vpxor xmm0,xmm0,xmm2
mov r11,r8
lea ebx,[rax+2]
cmp ebx,r9d
jae near ptr M00_L02
mov edi,ebx
shl rdi,4
vmovups xmm2,[r11+rdi+10]
mov r11,r10
cmp ebx,esi
jae near ptr M00_L02
vmovups xmm3,[r11+rdi+10]
vpandn xmm4,xmm3,xmm1
vpmaddubsw xmm4,xmm2,xmm4
vpsllw xmm4,xmm4,8
vpsllw xmm4,xmm4,8
vpand xmm3,xmm1,xmm3
vpmaddubsw xmm2,xmm2,xmm3
vpternlogd xmm2,xmm4,xmm1,0EC
vpxor xmm0,xmm0,xmm2
mov r11,r8
lea ebx,[rax+3]
cmp ebx,r9d
jae short M00_L02
mov r9d,ebx
shl r9,4
vmovups xmm2,[r11+r9+10]
cmp ebx,esi
jae short M00_L02
vmovups xmm3,[r10+r9+10]
vpandn xmm4,xmm3,xmm1
vpmaddubsw xmm4,xmm2,xmm4
vpsllw xmm4,xmm4,8
vpsllw xmm4,xmm4,8
vpand xmm3,xmm1,xmm3
vpmaddubsw xmm2,xmm2,xmm3
vpternlogd xmm2,xmm4,xmm1,0EC
vpxor xmm0,xmm0,xmm2
add eax,4
cmp [r8+8],eax
jg near ptr M00_L00
M00_L01:
vmovups [rdx],xmm0
mov rax,rdx
add rsp,20
pop rbx
pop rsi
pop rdi
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 409
```

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkPlayground.Vector128ByteMultiplicationFromMemoryBenchmarks.Stratergy3()

```asm
push rdi
push rsi
push rbx
sub rsp,20
vxorps xmm0,xmm0,xmm0
xor eax,eax
mov r8,[rcx+8]
cmp dword ptr [r8+8],0
jle near ptr M00_L01
vmovups xmm1,[7FF7F76A8F10]
M00_L00:
mov r10,r8
mov r9d,[r10+8]
cmp eax,r9d
jae near ptr M00_L02
mov r11,rax
shl r11,4
vmovups xmm2,[r10+r11+10]
mov r10,[rcx+10]
mov rbx,r10
mov esi,[rbx+8]
cmp eax,esi
jae near ptr M00_L02
vmovups xmm3,[rbx+r11+10]
vmovaps xmm4,xmm1
vpand xmm5,xmm3,xmm4
vpsrlw xmm16,xmm2,8
vpmullw xmm5,xmm5,xmm16
vpmullw xmm2,xmm2,xmm3
vpternlogd xmm4,xmm5,xmm2,0CE
vpxor xmm0,xmm4,xmm0
mov r11,r8
lea ebx,[rax+1]
cmp ebx,r9d
jae near ptr M00_L02
mov edi,ebx
shl rdi,4
vmovups xmm2,[r11+rdi+10]
mov r11,r10
cmp ebx,esi
jae near ptr M00_L02
vmovups xmm3,[r11+rdi+10]
vmovaps xmm4,xmm1
vpand xmm5,xmm3,xmm4
vpsrlw xmm16,xmm2,8
vpmullw xmm5,xmm5,xmm16
vpmullw xmm2,xmm2,xmm3
vpternlogd xmm4,xmm5,xmm2,0CE
vpxor xmm0,xmm4,xmm0
mov r11,r8
lea ebx,[rax+2]
cmp ebx,r9d
jae near ptr M00_L02
mov edi,ebx
shl rdi,4
vmovups xmm2,[r11+rdi+10]
mov r11,r10
cmp ebx,esi
jae near ptr M00_L02
vmovups xmm3,[r11+rdi+10]
vmovaps xmm4,xmm1
vpand xmm5,xmm3,xmm4
vpsrlw xmm16,xmm2,8
vpmullw xmm5,xmm5,xmm16
vpmullw xmm2,xmm2,xmm3
vpternlogd xmm4,xmm5,xmm2,0CE
vpxor xmm0,xmm4,xmm0
mov r11,r8
lea ebx,[rax+3]
cmp ebx,r9d
jae short M00_L02
mov r9d,ebx
shl r9,4
vmovups xmm2,[r11+r9+10]
cmp ebx,esi
jae short M00_L02
vmovups xmm3,[r10+r9+10]
vmovaps xmm4,xmm1
vpand xmm5,xmm3,xmm4
vpsrlw xmm16,xmm2,8
vpmullw xmm5,xmm5,xmm16
vpmullw xmm2,xmm2,xmm3
vpternlogd xmm4,xmm5,xmm2,0CE
vpxor xmm0,xmm4,xmm0
add eax,4
cmp [r8+8],eax
jg near ptr M00_L00
M00_L01:
vmovups [rdx],xmm0
mov rax,rdx
add rsp,20
pop rbx
pop rsi
pop rdi
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 397
```
Very clever, but still slower on AMD.

Zen 5

Zen 4

Meteor Lake
IMHO, these new strategies may be useful for largest-available vector registers, such as `Vector512<byte>`:

```asm
_____AAAAA.Multiply3(System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>)
    L0000: vzeroupper
    L0003: vmovups zmm0, [rdx]
    L0009: vpsrlw zmm1, zmm0, 8
    L0010: vmovups zmm2, [r8]
    L0016: vpandd zmm3, zmm2, [0x7ffa4c130500]
    L0020: vpmullw zmm1, zmm1, zmm3
    L0026: vpmullw zmm0, zmm0, zmm2
    L002c: vpternlogd zmm1, zmm0, [0x7ffa4c130500], 0xf4
    L0037: vmovups [rcx], zmm1
    L003d: mov rax, rcx
    L0040: vzeroupper
    L0043: ret
```
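The original C# behind `Multiply3` isn't captured in this copy of the thread; below is a sketch of what a Strategy-3-style `Vector512` helper could look like. The class and method names are illustrative, and it assumes the `ushort` overload of `Avx512F.TernaryLogic`, mirroring the `Avx512F.VL` call used in Strategy 3 above.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ByteMul512
{
    // Assumed reconstruction of a Strategy-3-style Vector512 byte multiply; not the original source.
    public static Vector512<byte> Multiply3(Vector512<byte> left, Vector512<byte> right)
    {
        Vector512<ushort> mask = Vector512.Create<ushort>((ushort)0xff00u);

        Vector512<ushort> even = left.AsUInt16() * right.AsUInt16();                  // low byte of each 16-bit product is correct
        Vector512<ushort> odd = (left.AsUInt16() >> 8) * (right.AsUInt16() & mask);   // odd-byte products land in the high byte

        // vpternlogd with control byte 0xF4 computes A | (B & ~C), i.e. odd | (even & 0x00FF) here.
        return Avx512F.TernaryLogic(odd, even, mask, 0xf4).AsByte();
    }
}
```

The control byte works out as follows: indexing the truth table by `(a << 2) | (b << 1) | c`, 0xF4 = 0b1111_0100 is 1 exactly when `a = 1` or when `b = 1` and `c = 0`, i.e. `A | (B & ~C)`, which is why the non-AVX-512 fallback in Strategy 3 uses `dst_odd | Vector128.AndNot(dst_even, mask)`.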
That creates a deviation in behavior, which is often undesirable and can lead to hidden bugs/quirks. Notably, such newer hardware also tends to have better shuffle ports and other improvements, so it's already typically not an issue and will allow the same parallelization. The potential gains vs the additional complexity/risks are often not worth it in such cases, and it's going to be tradeoffs all around, especially as newer instructions get added and allow the existing logic to be trivially simplified, as it's arguably the most idiomatic (from a conceptual perspective of how you'd do the operations). I definitely appreciate the discussion so far and am more than willing to take optimizations where the cost vs benefit is much more clearly defined across a range of hardware/scenarios; I just don't think this case in particular meets that bar.
SSE/AVX don't have a built-in multiplication of bytes, so some trick is needed. .NET 9 uses a strategy like this, based on widening a vector to twice as wide then using 16-bit multiplication and narrowing the result:
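The listing that followed isn't captured in this copy of the thread. As a rough sketch of the widen-multiply-narrow idea in C# (the JIT performs the equivalent internally for `left * right`, widening to a single 256-bit vector when AVX2 is available, as the AVX-512 codegen quoted earlier shows):

```csharp
using System.Runtime.Intrinsics;

static class ByteMulWidening
{
    // Conceptual sketch only: widen bytes to 16-bit lanes, multiply, truncate back to bytes.
    public static Vector128<byte> Multiply(Vector128<byte> left, Vector128<byte> right)
    {
        (Vector128<ushort> leftLo, Vector128<ushort> leftHi) = Vector128.Widen(left);     // pmovzxbw
        (Vector128<ushort> rightLo, Vector128<ushort> rightHi) = Vector128.Widen(right);  // pmovzxbw

        // 16-bit multiplies keep the low byte of each product exact; Narrow truncates back to bytes
        // (pand + packuswb, plus a vpermq when narrowing from a 256-bit vector).
        return Vector128.Narrow(leftLo * rightLo, leftHi * rightHi);
    }
}
```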
Which isn't bad, but we can do a little bit better. The central issue in the code above is that shuffles are expensive and there are 4 of them (vpmovzxbw * 2, vpackuswb, vpermq). On, for example, an Intel Skylake or Rocket Lake (or anything in between), those 4 shuffles limit the throughput to at best 1 of those multiplications per 4 cycles.
I suggest two alternative strategies.
Strategy 1
First, using an odd/even split and two vpmullw, based on (but modified from) this Q&A on Stack Overflow; in C++ it would look like this.
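The C++ listing isn't captured here; an equivalent rendering in C# (essentially the `Stratergy1` helper benchmarked earlier in the thread, with an illustrative class name) would be roughly:

```csharp
using System.Runtime.Intrinsics;

static class ByteMulOddEven
{
    // Sketch of the odd/even split: two 16-bit multiplies plus cheap shifts/ands, no byte shuffles.
    public static Vector128<byte> Multiply(Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<ushort> even = left.AsUInt16() * right.AsUInt16();                // low byte of each word = left[2i] * right[2i]
        Vector128<ushort> odd = (left.AsUInt16() >> 8) * (right.AsUInt16() >> 8);   // products of the high bytes of each word
        return ((odd << 8) | (even & Vector128.Create<ushort>(0x00FF))).AsByte();
    }
}
```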
(The original answer tries to avoid the `_mm_set1_epi16` if there is no AVX2 for `VPBROADCASTW`, but it would turn into a vector constant anyway so that is not important.)

By avoiding `pmovzxbw` this strategy is also applicable to SSE2 rather than requiring SSE4.1, and it's not a bad strategy for the higher x64 ISA levels either.
In a situation comparable to the cited "current implementation" (loading a and b from memory and ignoring for now where the result goes):

This is more instructions, but they're cheaper instructions. On Skylake or Rocket Lake, UICA estimates a throughput for the above (in a loop) of once per 2.5 cycles, with the bottleneck spread evenly across p0 and p1, compared to the original's once per 4 cycles (bottlenecked by p5, due to the shuffles).
Strategy 2
Secondly, a strategy based on pmaddubsw (requiring SSSE3). pmaddubsw multiplies unsigned by signed bytes, which is odd but turns out to be harmless (the lowest byte of the product is not affected, and the upper byte will be discarded). It also horizontally adds adjacent pairs, but that can be made harmless by making half of the pairs zero. The saturation does not come into play in that case either: neither 255 * -128 nor 255 * 127 would saturate by itself, and the other element in each pair will be zero.
Representative code (the andnot was compiled into an and with an inverted mask; for older processors it may be better to reduce the number of loads?):
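The representative listing isn't captured in this copy; a C# sketch of the `pmaddubsw` approach described above (the helper name and exact mask handling are illustrative):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ByteMulMaddubs
{
    // Sketch of the pmaddubsw strategy; not the exact code from the original post.
    public static Vector128<byte> Multiply(Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<ushort> lowMask = Vector128.Create<ushort>(0x00FF);

        // Even lanes: zero the odd bytes of `right`, so each pair-sum is just left[2i] * right[2i].
        Vector128<short> even = Ssse3.MultiplyAddAdjacent(left, (right.AsUInt16() & lowMask).AsSByte());

        // Odd lanes: zero the even bytes of `right`, so each pair-sum is left[2i+1] * right[2i+1].
        Vector128<short> odd = Ssse3.MultiplyAddAdjacent(left, Vector128.AndNot(right.AsUInt16(), lowMask).AsSByte());

        // Keep the low byte of each 16-bit result: odd products go to the high byte of each word,
        // even products keep the low byte. The sign quirk of pmaddubsw only affects the discarded high byte.
        return ((odd << 8) | (even & lowMask.AsInt16())).AsByte();
    }
}
```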
For this code (when put in a loop), UICA estimates a slightly worse throughput of once per 2.75 cycles on Skylake (due to not executing from the loop stream buffer but from the uop cache, I don't know why that happens here), but a slightly better throughput of once per 2.33 cycles on Rocket Lake.
This strategy is essentially what Clang 19 does when autovectorizing a byte multiplication, in earlier versions Clang did it differently (with more shuffles).
I could not evaluate either of these strategies (nor the original) on AMD Zen since I don't have access to one and UICA does not include it.