
SSE/AVX byte multiplication could be improved #109775

Closed
IJzerbaard opened this issue Nov 13, 2024 · 13 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)

Comments

@IJzerbaard commented Nov 13, 2024

SSE/AVX have no built-in byte multiplication, so some trick is needed. .NET 9 uses a strategy like this, based on widening the vector to twice its width, doing a 16-bit multiplication, and narrowing the result:

   vpmovzxbw ymm0, qword ptr [rdx]
   vpmovzxbw ymm1, qword ptr [r8]
   vpmullw  ymm0, ymm0, ymm1
   vpand    ymm0, ymm0, ymmword ptr [reloc @RWD00]
   vpackuswb ymm0, ymm0, ymm0
   vpermq   ymm0, ymm0, -40

Which isn't bad, but we can do a little better. The central issue in the code above is that shuffles are expensive and there are four of them (vpmovzxbw × 2, vpackuswb, vpermq). On, for example, Intel Skylake through Rocket Lake, those four shuffles all compete for the same port and limit throughput to at best one such multiplication per 4 cycles.
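
For reference, the source that produces this pattern is just the built-in byte-element vector multiply; a minimal sketch (essentially the same operation as the SimpleMultiply helper benchmarked further down; the class and method names here are illustrative):

    using System.Runtime.Intrinsics;

    static class ByteMulRepro
    {
        // The built-in operator on byte vectors is what currently lowers to the
        // widen / vpmullw / mask / narrow sequence shown above.
        public static Vector128<byte> MultiplyBytes(Vector128<byte> left, Vector128<byte> right)
            => left * right;
    }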

I suggest two alternative strategies.

Strategy 1

The first uses an odd/even split and two vpmullw, based on (with modifications) this Q&A on Stack Overflow; in C++ it would look like this:

    __m128i mullo_epi8(__m128i a, __m128i b)
    {
        // multiply the even-index and odd-index bytes separately
        __m128i dst_even = _mm_mullo_epi16(a, b);
        __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
        // repack: odd products into the high bytes, even products masked into the low bytes
        return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
    }

(The original answer tries to avoid the _mm_set1_epi16 when AVX2's VPBROADCASTW is not available, but it would turn into a vector constant anyway, so that is not important here.)

By avoiding pmovzxbw this strategy is also applicable to SSE2 rather than requiring SSE4.1, and it's not a bad strategy for the higher x64 ISA levels either.
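
To make the SSE2-only point concrete, here is a rough C# rendering of the same idea using only Sse2 intrinsics (a sketch; the helper name is illustrative, and the portable operator-based form shown in the benchmark comments below is equivalent):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class ByteMulSse2
    {
        // Sketch of strategy 1 restricted to SSE2 instructions (pmullw, psrlw, psllw, pand, por).
        public static Vector128<byte> MulLoEpi8(Vector128<byte> a, Vector128<byte> b)
        {
            Vector128<ushort> a16 = a.AsUInt16();
            Vector128<ushort> b16 = b.AsUInt16();

            // even-index bytes: multiply in place; only the low byte of each 16-bit product is kept
            Vector128<ushort> even = Sse2.MultiplyLow(a16, b16);
            // odd-index bytes: shift them down, multiply, then shift the products back up
            Vector128<ushort> odd = Sse2.MultiplyLow(Sse2.ShiftRightLogical(a16, 8),
                                                     Sse2.ShiftRightLogical(b16, 8));

            Vector128<ushort> lowByteMask = Vector128.Create((ushort)0x00FF);
            return Sse2.Or(Sse2.ShiftLeftLogical(odd, 8), Sse2.And(even, lowByteMask)).AsByte();
        }
    }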

In a situation comparable to the cited "current implementation" (loading a and b from memory, and ignoring for now where the result goes), the code becomes:

    vmovdqa xmm0, xmmword ptr [rdi]
    vmovdqa xmm1, xmmword ptr [rsi]
    vpmullw xmm2, xmm1, xmm0
    vpsrlw  xmm0, xmm0, 8
    vpsrlw  xmm1, xmm1, 8
    vpmullw xmm0, xmm1, xmm0
    vpsllw  xmm0, xmm0, 8
    vpand   xmm1, xmm2, xmmword ptr [rip + .LCPI1_0]
    vpor    xmm0, xmm1, xmm0

This is more instructions, but they're cheaper instructions. On Skylake or Rocket Lake, UICA estimates a throughput of once per 2.5 cycles for the above sequence run in a loop (with the bottleneck spread evenly across p0 and p1), compared to once per 4 cycles for the original (bottlenecked by p5, due to the shuffles).

Strategy 2

The second strategy is based on pmaddubsw (requiring SSSE3). pmaddubsw multiplies unsigned by signed bytes, which is odd but turns out to be harmless here (the lowest byte of the product is not affected, and the upper byte is discarded). It also horizontally adds adjacent pairs, but that can be made harmless by zeroing one element of each pair. Saturation does not come into play in that case either: neither 255 * -128 nor 255 * 127 saturates by itself, and the other element of each pair is zero.

    __m128i mullo_epi8(__m128i a, __m128i b)
    {
        // even-index products: the odd bytes of b are zeroed, so the horizontal add is a no-op
        __m128i dst_even = _mm_maddubs_epi16(a, _mm_and_si128(b, _mm_set1_epi16(0xFF)));
        // odd-index products: the even bytes of b are zeroed, then shift the products into the high bytes
        __m128i dst_odd = _mm_slli_epi16(_mm_maddubs_epi16(a, _mm_andnot_si128(_mm_set1_epi16(0xFF), b)), 8);
        // recombine: even products in the low bytes, odd products in the high bytes
        return _mm_or_si128(_mm_and_si128(dst_even, _mm_set1_epi16(0xFF)), dst_odd);
    }

Representative code (the andnot was compiled into an and with an inverted mask; for older processors it may be better to reduce the number of loads):

    vmovdqa xmm0, xmmword ptr [rdi]
    vmovdqa xmm1, xmmword ptr [rsi]
    vmovdqa xmm2, xmmword ptr [rip + .LCPI2_0]
    vpand   xmm3, xmm1, xmm2
    vpmaddubsw      xmm3, xmm0, xmm3
    vpand   xmm1, xmm1, xmmword ptr [rip + .LCPI2_1]
    vpmaddubsw      xmm0, xmm0, xmm1
    vpsllw  xmm0, xmm0, 8
    vpand   xmm1, xmm3, xmm2
    vpor    xmm0, xmm1, xmm0

For this code (when put in a loop), UICA estimates a slightly worse throughput of once per 2.75 cycles on Skylake (because it executes from the uop cache rather than the loop stream buffer; I don't know why that happens here), but a slightly better throughput of once per 2.33 cycles on Rocket Lake.

This strategy is essentially what Clang 19 does when autovectorizing a byte multiplication; earlier versions of Clang did it differently (with more shuffles).

I could not evaluate either of these strategies (nor the original) on AMD Zen since I don't have access to one and UICA does not include it.

@IJzerbaard added the tenet-performance label Nov 13, 2024
@dotnet-issue-labeler bot added the needs-area-label label Nov 13, 2024
@dotnet-policy-service bot added the untriaged label Nov 13, 2024
@huoyaoyuan added the area-CodeGen-coreclr label and removed the needs-area-label label Nov 13, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@huoyaoyuan (Member)

Simple unrolled benchmark on Raptor Lake:

| Method         | Mean     | Error   | StdDev  | Code Size |
|--------------- |---------:|--------:|--------:|----------:|
| SimpleMultiply | 739.7 ns | 8.74 ns | 8.18 ns | 340 B |
| Stratergy1     | 577.5 ns | 2.70 ns | 2.40 ns | 393 B |
| Stratergy2     | 584.8 ns | 3.84 ns | 3.59 ns | 413 B |

Code:

        private Vector128<byte>[] a = new Vector128<byte>[1000], b = new Vector128<byte>[1000];

        [Benchmark]
        public Vector128<byte> SimpleMultiply()
        {
            Vector128<byte> result = default;
            for (int i = 0; i < a.Length; i += 4)
            {
                result ^= SimpleMultiply(a[i], b[i]);
                result ^= SimpleMultiply(a[i + 1], b[i + 1]);
                result ^= SimpleMultiply(a[i + 2], b[i + 2]);
                result ^= SimpleMultiply(a[i + 3], b[i + 3]);
            }
            return result;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static Vector128<byte> SimpleMultiply(Vector128<byte> left, Vector128<byte> right) => left * right;

        [Benchmark]
        public Vector128<byte> Stratergy1()
        {
            Vector128<byte> result = default;
            for (int i = 0; i < a.Length; i += 4)
            {
                result ^= Stratergy1(a[i], b[i]);
                result ^= Stratergy1(a[i + 1], b[i + 1]);
                result ^= Stratergy1(a[i + 2], b[i + 2]);
                result ^= Stratergy1(a[i + 3], b[i + 3]);
            }
            return result;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static Vector128<byte> Stratergy1(Vector128<byte> left, Vector128<byte> right)
        {
            Vector128<ushort> dst_even = left.AsUInt16() * right.AsUInt16();
            Vector128<ushort> dst_odd = (left.AsUInt16() >> 8) * (right.AsUInt16() >> 8);
            return ((dst_odd << 8) | (dst_even & Vector128.Create<ushort>(0xFF))).AsByte();
        }

        [Benchmark]
        public Vector128<byte> Stratergy2()
        {
            Vector128<byte> result = default;
            for (int i = 0; i < a.Length; i += 4)
            {
                result ^= Stratergy2(a[i], b[i]);
                result ^= Stratergy2(a[i + 1], b[i + 1]);
                result ^= Stratergy2(a[i + 2], b[i + 2]);
                result ^= Stratergy2(a[i + 3], b[i + 3]);
            }
            return result;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static Vector128<byte> Stratergy2(Vector128<byte> left, Vector128<byte> right)
        {
            Vector128<short> dst_even = Ssse3.MultiplyAddAdjacent(left, (right.AsUInt16() & Vector128.Create<ushort>(0xFF)).AsSByte());
            Vector128<short> dst_odd = Ssse3.MultiplyAddAdjacent(left, Vector128.AndNot(Vector128.Create<ushort>(0xFF), right.AsUInt16()).AsSByte()) << 8;
            return ((dst_odd << 8) | (dst_even & Vector128.Create<short>(0xFF))).AsByte();
        }

@tannergooding (Member)

There are a lot of broader considerations than simply whatever is fastest in an isolated benchmark.

Other considerations include factoring in a broad range of hardware, how the code impacts the loops (alignment, micro-op cache size, etc), simplicity, what other code exists around the code in question, etc.

There are also some notable differences here depending on whether the code is XMM- or YMM-based.


For example, on Intel vpmovzxbw takes 1 cycle for XMM but 3 for YMM, and while only port 5 is available pre-IceLake, on IceLake or later XMM has port 1 or 5 available. Alder Lake-E similarly reduced the cost for YMM to 1 cycle. AMD Zen2 and later likewise has 2 ports available (1 or 2) and takes 1 cycle for both XMM and YMM (Zen1 is similar to IceLake).

vpmullw takes 5 cycles and has 2 ports available (0 or 1). It improved to 3-4 cycles on Alder Lake-E. On AMD Zen it is 3 cycles and has a single port available (0), but this increased to 2 ports on Zen3.

vpand takes 1 cycle and has 3 ports (0, 1, or 5) on Intel. On Zen it similarly takes 1 cycle and has 4 ports available (0, 1, 2, 3).

vpackuswb is 1 cycle on Skylake but slowed to 3 cycles on IceLake+, only ever having port 5 available. On Zen it takes 1 cycle and has 2 ports available (1 or 2).

vpermq takes 3 cycles and is only available on port 5. On Zen it is 4 cycles and has 2 ports (1 or 2).

From the other strategies, vpor has the same throughput as vpand, while vpsllw and vpsrlw take 1 cycle and have 2 ports available (0 or 1 on Intel, 1 or 2 on AMD). Finally, vpmaddubsw matches the cost of vpmullw.


Given the above, the raw throughput potential (not factoring in memory accesses) is roughly:

| Strategy | AMD | Intel                    |
|--------- |-----|--------------------------|
| Current  | 11  | 11-12 (XMM), 14-18 (YMM) |
| 1        | 15  | 15                       |
| 2        | 15  | 15                       |

The cost of a load is then approximately 3-7 cycles in typical cases and can impact normal pipelining and other considerations, so strategy 2 picks up an extra 12-28 cycles of load cost, while the current strategy and strategy 1 pick up 9-21 cycles.

And despite pipelining being possible in some cases, the practical reality is that there are several longer dependency chains that will prevent the code from executing in parallel. So while theoretically strategy 1 can get the two vpmullw dispatched with a 1-cycle delay from each other (causing them to take 6 cycles, not 10), this isn't a guarantee and may be infeasible in many scenarios.

So what we're left with is that strategy 1 is theoretically the same speed as the current strategy for XMM, but potentially slower while taking up more uop cache, producing less dense code, etc. For YMM it's likely faster on older hardware and hits the same considerations as XMM on newer hardware.

And while strategy 2 is theoretically the same speed as strategy 1, in practice it will always end up worse due to the necessary memory accesses and more competition for ports.

In the end, I don't think it's worth making the current pattern, which is fairly idiomatic, more complex for a theoretical 20% perf increase that is unlikely to pan out in many real-world scenarios, especially on newer hardware.

@tannergooding (Member)

As an example, here is the same benchmark @huoyaoyuan gave above, running on some more hardware:


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error     | StdDev    |
|--------------- |---------:|----------:|----------:|
| SimpleMultiply | 1.006 us | 0.0067 us | 0.0052 us |
| Stratergy1     | 1.142 us | 0.0159 us | 0.0149 us |
| Stratergy2     | 1.215 us | 0.0074 us | 0.0062 us |

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
11th Gen Intel Core i9-11900H 2.50GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error    | StdDev   |
|--------------- |---------:|---------:|---------:|
| SimpleMultiply | 792.1 ns | 15.77 ns | 19.94 ns |
| Stratergy1     | 859.9 ns | 17.22 ns | 20.50 ns |
| Stratergy2     | 922.2 ns | 17.64 ns | 18.11 ns |

Will run this on my AMD boxes when I get home tonight, and on various other older hardware I have as well.

@IJzerbaard (Author)

OK, those benchmark results are damning, but while I agree that this is context- and hardware-dependent, I cannot agree with the details of that analysis. "Throughput potential" is not a familiar measure to me: it's not the reciprocal throughput, because that would have been lower, and it's not latency, because that also would have been lower (and probably shouldn't be the main target anyway). I would suggest considering the execution port pressures on some µarchs.

If we're considering a variety of hypothetical contexts, the fact that strategies 1 and 2 have µops that spread better over the execution ports (as opposed to piling up on p5, which is bad news in contexts where p5 is already a problem, a common situation) should be a good thing. The extra code size is of course a downside.

Personally I consider an odd/even split more idiomatic than widening/narrowing (which has traditionally been something to avoid), but that's more subjective.

@tannergooding (Member)

> it's not latency, because that also would have been lower

The numbers given there are raw latency, not factoring in pipelining, decoding, or other considerations. For example, the current strategy (not factoring in memory loads) is 1+1+5+1+1+3 on Skylake (12 cycles) for XMM and 3+3+5+1+1+3 (15) for YMM. On IceLake YMM is 3+3+5+1+3+3 (18).

For strategy 1, you similarly have 5+1+1+5+1+1+1 (15). With pipelining, you can theoretically dispatch vpmullw on port 0 and vpsrlw on port 1 at the same time, then wait a cycle and reuse port 1 for the second vpsrlw, then wait a cycle and dispatch the second vpmullw. You then need to wait 5 cycles before the vpsllw (the third shift) can execute (by which point the first vpmullw is done). This gives you potentially 1+1+5+1+1+1, getting you to 10 cycles in the best case.

There is, however, the cost of the loads as well and those are also impactful. In strategy 1, the first vpmullw is dependent on 2 loads being completed before any work can start. These two loads are likely pipelined but still have a 3-7 cycle latency, so we have something like 5+1+1+5+1+5+1 (19) practical cycles.

While for the current strategy we have 5+5+5+1+1+3 (20) practical cycles, with newer hardware able to pipeline the two vpmovzxbw with embedded loads, giving us 15 practical cycles instead.

This then reproduces in real-world benchmarks between the two when you factor in a broad range of hardware. On pre-IceLake hardware you're going to see strategy 1 being 5-20% faster, depending on several factors, while on IceLake and later hardware you're going to find almost the inverse, with the current strategy being 5-20% faster.

Beyond the raw numbers, there is also the real impact on the instruction decoder and micro-op cache, how it affects code windows for loops or other tight bodies of code, and how it impacts surrounding code. While saturating the shuffle port isn't ideal (particularly on older hardware where only one exists), saturating the arithmetic ports (where there are typically two) is also bad.

When you factor in all this information across a broad range of scenarios, the current strategy ends up winning out.

> (as opposed to piling up on p5, which is bad news in contexts where p5 is already a problem, a common situation) should be a good thing. The extra code size is of course a downside.

There isn't really piling up here. The bottleneck only comes when there is pipelining potential, which means you need two non-dependent shuffles that are within the same decode window and without any code that would block reordering between them. This is highly unlikely to occur in practice for any of the 3 strategies given. The more practical case for pipelining will come from the arithmetic ports, which you'd instead be saturating with the alternative strategies, and the increased code size reduces the chance that surrounding code will live in a window that allows such pipelining.

> Personally I consider an odd/even split more idiomatic than widening/narrowing (which has traditionally been something to avoid), but that's more subjective.

Shuffling is a fundamental part of SIMD, and it's why newer CPUs have been increasing the number of shuffle ports available. The even/odd split has slowly been falling out of favor due to how it works with masking, embedded loads, and other scenarios; correspondingly, many of the "built-in" even/odd and pairwise instructions were never included in the upgrade to EVEX and don't have ZMM support, so you have to emulate them (often with shuffling) in those cases.

@saucecontrol (Member)

More benchmark results:

Zen 5

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 435.8 ns | 0.17 ns | 0.16 ns |
| Stratergy1     | 425.1 ns | 0.36 ns | 0.30 ns |
| Stratergy2     | 457.1 ns | 0.60 ns | 0.47 ns |

Zen 4

BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
AMD Ryzen 7 PRO 8700GE w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 442.0 ns | 0.61 ns | 0.57 ns |
| Stratergy1     | 564.2 ns | 0.38 ns | 0.35 ns |
| Stratergy2     | 591.3 ns | 0.39 ns | 0.37 ns |

Meteor Lake

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
Intel Core Ultra 7 155H, 1 CPU, 22 logical and 16 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2


| Method         | Mean       | Error    | StdDev   |
|--------------- |-----------:|---------:|---------:|
| SimpleMultiply | 1,022.6 ns | 19.96 ns | 20.49 ns |
| Stratergy1     |   829.8 ns | 13.59 ns | 12.71 ns |
| Stratergy2     |   846.9 ns | 16.74 ns | 17.19 ns |

@saucecontrol (Member)

Worth noting, the byte multiply codegen is different for AVX-512:

vpmovzxbw ymm1, xmmword ptr [r11+rdi+0x10]
vpmovzxbw ymm2, xmmword ptr [r10+rdi+0x10]
vpmullw  ymm1, ymm2, ymm1
vpmovwb  ymm1, ymm1

Similarly, the and+or combo in the alternate versions is combined into a vpternlog on AVX-512.

Zen 5 results with only AVX2:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2


| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 447.6 ns | 0.46 ns | 0.41 ns |
| Stratergy1     | 447.4 ns | 0.86 ns | 0.81 ns |
| Stratergy2     | 496.3 ns | 1.73 ns | 1.54 ns |

@MineCake147E (Contributor)

More benchmark results:

Sapphire Rapids (Golden Cove)


BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
Intel Xeon w5-2455X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method         | Mean     | Error    | StdDev   | Code Size |
|--------------- |---------:|---------:|---------:|----------:|
| SimpleMultiply | 777.5 ns | 11.41 ns | 10.67 ns | 296 B |
| Stratergy1     | 805.2 ns | 6.97 ns  | 6.52 ns  | 389 B |
| Stratergy2     | 836.1 ns | 6.29 ns  | 5.89 ns  | 409 B |

@MineCake147E (Contributor)

I suggest another alternative strategy.

Strategy 3

Based on Strategy 1, but removing some redundant instructions.

vpsrlw xmm3, xmm0, 8
vpand xmm2, xmm1, xmm15   ; take higher 8 bits of xmm1
vpmullw xmm0, xmm1, xmm0
vpmullw xmm2, xmm2, xmm3
vpandn xmm0, xmm15, xmm0  ; ~xmm15 & xmm0
vpor xmm0, xmm0, xmm2

Since vpand xmm2, xmm1, xmm15 (where xmm15 holds the 0xff00 mask) has cleared the lower 8 bits of each 16-bit element, vpmullw xmm2, xmm2, xmm3 also produces results whose lower 8 bits are clear.
Because xmm2 isn't shifted right, we can treat that 16-bit value as already shifted 8 bits left: the results of the odd 8-bit multiplications land in the upper half of each 16-bit element, eliminating the need for an additional shift.

Added code in benchmark:

[Benchmark]
public Vector128<byte> Stratergy3()
{
    Vector128<byte> result = default;
    for (var i = 0; i < a.Length; i += 4)
    {
        result ^= Strategy3(a[i], b[i]);
        result ^= Strategy3(a[i + 1], b[i + 1]);
        result ^= Strategy3(a[i + 2], b[i + 2]);
        result ^= Strategy3(a[i + 3], b[i + 3]);
    }
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> Strategy3(Vector128<byte> left, Vector128<byte> right)
{
    var mask = Vector128.Create<ushort>((ushort)0xff00u);
    var dst_even = left.AsUInt16() * right.AsUInt16();
    var dst_odd = (left.AsUInt16() >> 8) * (right.AsUInt16() & mask);
    return Avx512F.VL.IsSupported ? Avx512F.VL.TernaryLogic(dst_odd, dst_even, mask, 0xf4).AsByte() : (dst_odd | Vector128.AndNot(dst_even, mask)).AsByte();
}

The result:


BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
Intel Xeon w5-2455X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method         | Mean     | Error   | StdDev  | Code Size |
|--------------- |---------:|--------:|--------:|----------:|
| SimpleMultiply | 777.3 ns | 9.71 ns | 9.09 ns | 296 B |
| Stratergy1     | 808.5 ns | 7.37 ns | 6.89 ns | 389 B |
| Stratergy2     | 835.2 ns | 4.63 ns | 3.87 ns | 409 B |
| Stratergy3     | 729.2 ns | 6.53 ns | 6.11 ns | 397 B |
Disassembly

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.Vector128ByteMultiplicationFromMemoryBenchmarks.SimpleMultiply()
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       vxorps    xmm0,xmm0,xmm0
       xor       eax,eax
       mov       r8,[rcx+8]
       cmp       dword ptr [r8+8],0
       jle       near ptr M00_L01
M00_L00:
       mov       r10,r8
       mov       r9d,[r10+8]
       cmp       eax,r9d
       jae       near ptr M00_L02
       mov       r11,rax
       shl       r11,4
       vpmovzxbw ymm1,xmmword ptr [r10+r11+10]
       mov       r10,[rcx+10]
       mov       rbx,r10
       mov       esi,[rbx+8]
       cmp       eax,esi
       jae       near ptr M00_L02
       vpmovzxbw ymm2,xmmword ptr [rbx+r11+10]
       vpmullw   ymm1,ymm2,ymm1
       vpmovwb   xmm1,ymm1
       vpxor     xmm0,xmm1,xmm0
       mov       r11,r8
       lea       ebx,[rax+1]
       cmp       ebx,r9d
       jae       near ptr M00_L02
       mov       edi,ebx
       shl       rdi,4
       vpmovzxbw ymm1,xmmword ptr [r11+rdi+10]
       mov       r11,r10
       cmp       ebx,esi
       jae       near ptr M00_L02
       vpmovzxbw ymm2,xmmword ptr [r11+rdi+10]
       vpmullw   ymm1,ymm2,ymm1
       vpmovwb   xmm1,ymm1
       vpxor     xmm0,xmm0,xmm1
       mov       r11,r8
       lea       ebx,[rax+2]
       cmp       ebx,r9d
       jae       short M00_L02
       mov       edi,ebx
       shl       rdi,4
       vpmovzxbw ymm1,xmmword ptr [r11+rdi+10]
       mov       r11,r10
       cmp       ebx,esi
       jae       short M00_L02
       vpmovzxbw ymm2,xmmword ptr [r11+rdi+10]
       vpmullw   ymm1,ymm2,ymm1
       vpmovwb   xmm1,ymm1
       vpxor     xmm0,xmm0,xmm1
       mov       r11,r8
       lea       ebx,[rax+3]
       cmp       ebx,r9d
       jae       short M00_L02
       mov       r9d,ebx
       shl       r9,4
       vpmovzxbw ymm1,xmmword ptr [r11+r9+10]
       cmp       ebx,esi
       jae       short M00_L02
       vpmovzxbw ymm2,xmmword ptr [r10+r9+10]
       vpmullw   ymm1,ymm2,ymm1
       vpmovwb   xmm1,ymm1
       vpxor     xmm0,xmm0,xmm1
       add       eax,4
       cmp       [r8+8],eax
       jg        near ptr M00_L00
M00_L01:
       vmovups   [rdx],xmm0
       mov       rax,rdx
       vzeroupper
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 296

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.Vector128ByteMultiplicationFromMemoryBenchmarks.Stratergy1()
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       vxorps    xmm0,xmm0,xmm0
       xor       eax,eax
       mov       r8,[rcx+8]
       cmp       dword ptr [r8+8],0
       jle       near ptr M00_L01
       vmovups   xmm1,[7FF7F76A8ED0]
M00_L00:
       mov       r10,r8
       mov       r9d,[r10+8]
       cmp       eax,r9d
       jae       near ptr M00_L02
       mov       r11,rax
       shl       r11,4
       vmovups   xmm2,[r10+r11+10]
       mov       r10,[rcx+10]
       mov       rbx,r10
       mov       esi,[rbx+8]
       cmp       eax,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[rbx+r11+10]
       vpsrlw    xmm4,xmm2,8
       vpsrlw    xmm5,xmm3,8
       vpmullw   xmm4,xmm5,xmm4
       vpsllw    xmm4,xmm4,8
       vpmullw   xmm2,xmm2,xmm3
       vpternlogd xmm2,xmm4,xmm1,0EC
       vpxor     xmm0,xmm2,xmm0
       mov       r11,r8
       lea       ebx,[rax+1]
       cmp       ebx,r9d
       jae       near ptr M00_L02
       mov       edi,ebx
       shl       rdi,4
       vmovups   xmm2,[r11+rdi+10]
       mov       r11,r10
       cmp       ebx,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[r11+rdi+10]
       vpsrlw    xmm4,xmm2,8
       vpsrlw    xmm5,xmm3,8
       vpmullw   xmm4,xmm5,xmm4
       vpsllw    xmm4,xmm4,8
       vpmullw   xmm2,xmm2,xmm3
       vpternlogd xmm2,xmm4,xmm1,0EC
       vpxor     xmm0,xmm2,xmm0
       mov       r11,r8
       lea       ebx,[rax+2]
       cmp       ebx,r9d
       jae       near ptr M00_L02
       mov       edi,ebx
       shl       rdi,4
       vmovups   xmm2,[r11+rdi+10]
       mov       r11,r10
       cmp       ebx,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[r11+rdi+10]
       vpsrlw    xmm4,xmm2,8
       vpsrlw    xmm5,xmm3,8
       vpmullw   xmm4,xmm5,xmm4
       vpsllw    xmm4,xmm4,8
       vpmullw   xmm2,xmm2,xmm3
       vpternlogd xmm2,xmm4,xmm1,0EC
       vpxor     xmm0,xmm2,xmm0
       mov       r11,r8
       lea       ebx,[rax+3]
       cmp       ebx,r9d
       jae       short M00_L02
       mov       r9d,ebx
       shl       r9,4
       vmovups   xmm2,[r11+r9+10]
       cmp       ebx,esi
       jae       short M00_L02
       vmovups   xmm3,[r10+r9+10]
       vpsrlw    xmm4,xmm2,8
       vpsrlw    xmm5,xmm3,8
       vpmullw   xmm4,xmm5,xmm4
       vpsllw    xmm4,xmm4,8
       vpmullw   xmm2,xmm2,xmm3
       vpternlogd xmm2,xmm4,xmm1,0EC
       vpxor     xmm0,xmm2,xmm0
       add       eax,4
       cmp       [r8+8],eax
       jg        near ptr M00_L00
M00_L01:
       vmovups   [rdx],xmm0
       mov       rax,rdx
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 389

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.Vector128ByteMultiplicationFromMemoryBenchmarks.Stratergy2()
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       vxorps    xmm0,xmm0,xmm0
       xor       eax,eax
       mov       r8,[rcx+8]
       cmp       dword ptr [r8+8],0
       jle       near ptr M00_L01
       vmovups   xmm1,[7FF7F76A8F00]
M00_L00:
       mov       r10,r8
       mov       r9d,[r10+8]
       cmp       eax,r9d
       jae       near ptr M00_L02
       mov       r11,rax
       shl       r11,4
       vmovups   xmm2,[r10+r11+10]
       mov       r10,[rcx+10]
       mov       rbx,r10
       mov       esi,[rbx+8]
       cmp       eax,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[rbx+r11+10]
       vpandn    xmm4,xmm3,xmm1
       vpmaddubsw xmm4,xmm2,xmm4
       vpsllw    xmm4,xmm4,8
       vpsllw    xmm4,xmm4,8
       vpand     xmm3,xmm1,xmm3
       vpmaddubsw xmm2,xmm2,xmm3
       vpternlogd xmm2,xmm4,xmm1,0EC
       vpxor     xmm0,xmm2,xmm0
       mov       r11,r8
       lea       ebx,[rax+1]
       cmp       ebx,r9d
       jae       near ptr M00_L02
       mov       edi,ebx
       shl       rdi,4
       vmovups   xmm2,[r11+rdi+10]
       mov       r11,r10
       cmp       ebx,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[r11+rdi+10]
       vpandn    xmm4,xmm3,xmm1
       vpmaddubsw xmm4,xmm2,xmm4
       vpsllw    xmm4,xmm4,8
       vpsllw    xmm4,xmm4,8
       vpand     xmm3,xmm1,xmm3
       vpmaddubsw xmm2,xmm2,xmm3
       vpternlogd xmm2,xmm4,xmm1,0EC
       vpxor     xmm0,xmm0,xmm2
       mov       r11,r8
       lea       ebx,[rax+2]
       cmp       ebx,r9d
       jae       near ptr M00_L02
       mov       edi,ebx
       shl       rdi,4
       vmovups   xmm2,[r11+rdi+10]
       mov       r11,r10
       cmp       ebx,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[r11+rdi+10]
       vpandn    xmm4,xmm3,xmm1
       vpmaddubsw xmm4,xmm2,xmm4
       vpsllw    xmm4,xmm4,8
       vpsllw    xmm4,xmm4,8
       vpand     xmm3,xmm1,xmm3
       vpmaddubsw xmm2,xmm2,xmm3
       vpternlogd xmm2,xmm4,xmm1,0EC
       vpxor     xmm0,xmm0,xmm2
       mov       r11,r8
       lea       ebx,[rax+3]
       cmp       ebx,r9d
       jae       short M00_L02
       mov       r9d,ebx
       shl       r9,4
       vmovups   xmm2,[r11+r9+10]
       cmp       ebx,esi
       jae       short M00_L02
       vmovups   xmm3,[r10+r9+10]
       vpandn    xmm4,xmm3,xmm1
       vpmaddubsw xmm4,xmm2,xmm4
       vpsllw    xmm4,xmm4,8
       vpsllw    xmm4,xmm4,8
       vpand     xmm3,xmm1,xmm3
       vpmaddubsw xmm2,xmm2,xmm3
       vpternlogd xmm2,xmm4,xmm1,0EC
       vpxor     xmm0,xmm0,xmm2
       add       eax,4
       cmp       [r8+8],eax
       jg        near ptr M00_L00
M00_L01:
       vmovups   [rdx],xmm0
       mov       rax,rdx
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 409

.NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.Vector128ByteMultiplicationFromMemoryBenchmarks.Stratergy3()
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       vxorps    xmm0,xmm0,xmm0
       xor       eax,eax
       mov       r8,[rcx+8]
       cmp       dword ptr [r8+8],0
       jle       near ptr M00_L01
       vmovups   xmm1,[7FF7F76A8F10]
M00_L00:
       mov       r10,r8
       mov       r9d,[r10+8]
       cmp       eax,r9d
       jae       near ptr M00_L02
       mov       r11,rax
       shl       r11,4
       vmovups   xmm2,[r10+r11+10]
       mov       r10,[rcx+10]
       mov       rbx,r10
       mov       esi,[rbx+8]
       cmp       eax,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[rbx+r11+10]
       vmovaps   xmm4,xmm1
       vpand     xmm5,xmm3,xmm4
       vpsrlw    xmm16,xmm2,8
       vpmullw   xmm5,xmm5,xmm16
       vpmullw   xmm2,xmm2,xmm3
       vpternlogd xmm4,xmm5,xmm2,0CE
       vpxor     xmm0,xmm4,xmm0
       mov       r11,r8
       lea       ebx,[rax+1]
       cmp       ebx,r9d
       jae       near ptr M00_L02
       mov       edi,ebx
       shl       rdi,4
       vmovups   xmm2,[r11+rdi+10]
       mov       r11,r10
       cmp       ebx,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[r11+rdi+10]
       vmovaps   xmm4,xmm1
       vpand     xmm5,xmm3,xmm4
       vpsrlw    xmm16,xmm2,8
       vpmullw   xmm5,xmm5,xmm16
       vpmullw   xmm2,xmm2,xmm3
       vpternlogd xmm4,xmm5,xmm2,0CE
       vpxor     xmm0,xmm4,xmm0
       mov       r11,r8
       lea       ebx,[rax+2]
       cmp       ebx,r9d
       jae       near ptr M00_L02
       mov       edi,ebx
       shl       rdi,4
       vmovups   xmm2,[r11+rdi+10]
       mov       r11,r10
       cmp       ebx,esi
       jae       near ptr M00_L02
       vmovups   xmm3,[r11+rdi+10]
       vmovaps   xmm4,xmm1
       vpand     xmm5,xmm3,xmm4
       vpsrlw    xmm16,xmm2,8
       vpmullw   xmm5,xmm5,xmm16
       vpmullw   xmm2,xmm2,xmm3
       vpternlogd xmm4,xmm5,xmm2,0CE
       vpxor     xmm0,xmm4,xmm0
       mov       r11,r8
       lea       ebx,[rax+3]
       cmp       ebx,r9d
       jae       short M00_L02
       mov       r9d,ebx
       shl       r9,4
       vmovups   xmm2,[r11+r9+10]
       cmp       ebx,esi
       jae       short M00_L02
       vmovups   xmm3,[r10+r9+10]
       vmovaps   xmm4,xmm1
       vpand     xmm5,xmm3,xmm4
       vpsrlw    xmm16,xmm2,8
       vpmullw   xmm5,xmm5,xmm16
       vpmullw   xmm2,xmm2,xmm3
       vpternlogd xmm4,xmm5,xmm2,0CE
       vpxor     xmm0,xmm4,xmm0
       add       eax,4
       cmp       [r8+8],eax
       jg        near ptr M00_L00
M00_L01:
       vmovups   [rdx],xmm0
       mov       rax,rdx
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 397

@saucecontrol (Member)

Very clever, but still slower on AMD

Zen 5

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Unknown processor
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 437.8 ns | 0.38 ns | 0.30 ns |
| Stratergy1     | 430.9 ns | 0.62 ns | 0.55 ns |
| Stratergy2     | 464.0 ns | 1.08 ns | 1.01 ns |
| Stratergy3     | 450.8 ns | 0.59 ns | 0.49 ns |

Zen 4

BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
AMD Ryzen 7 PRO 8700GE w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method         | Mean     | Error   | StdDev  |
|--------------- |---------:|--------:|--------:|
| SimpleMultiply | 442.2 ns | 0.63 ns | 0.59 ns |
| Stratergy1     | 564.9 ns | 0.52 ns | 0.49 ns |
| Stratergy2     | 590.2 ns | 0.50 ns | 0.47 ns |
| Stratergy3     | 563.0 ns | 0.48 ns | 0.45 ns |

Meteor Lake

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
Intel Core Ultra 7 155H, 1 CPU, 22 logical and 16 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

| Method         | Mean       | Error   | StdDev  |
|--------------- |-----------:|--------:|--------:|
| SimpleMultiply | 1,011.5 ns | 7.67 ns | 7.17 ns |
| Stratergy1     | 836.5 ns   | 6.84 ns | 6.40 ns |
| Stratergy2     | 834.3 ns   | 6.35 ns | 5.94 ns |
| Stratergy3     | 782.4 ns   | 5.83 ns | 5.45 ns |

@MineCake147E (Contributor) commented Nov 18, 2024

IMHO, these new strategies may be useful for the largest available vector registers, such as the zmm registers on CPUs with AVX-512BW, since there's no room to fit all 64 zero-extended ushort values in a single vector register and two multiplications will be needed anyway.
SharpLab:

_____AAAAA.Multiply3(System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>)
    L0000: vzeroupper
    L0003: vmovups zmm0, [rdx]
    L0009: vpsrlw zmm1, zmm0, 8
    L0010: vmovups zmm2, [r8]
    L0016: vpandd zmm3, zmm2, [0x7ffa4c130500]
    L0020: vpmullw zmm1, zmm1, zmm3
    L0026: vpmullw zmm0, zmm0, zmm2
    L002c: vpternlogd zmm1, zmm0, [0x7ffa4c130500], 0xf4
    L0037: vmovups [rcx], zmm1
    L003d: mov rax, rcx
    L0040: vzeroupper
    L0043: ret
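
For reference, the SharpLab listing above corresponds to roughly the following Vector512 form of Strategy 3 (a sketch under that assumption; the or/and-not is expected to fold into the single vpternlogd seen above):

    using System.Runtime.Intrinsics;

    static class ByteMulZmm
    {
        // Sketch: Strategy 3 applied to Vector512<byte> (accelerated when AVX-512BW is available).
        public static Vector512<byte> Multiply3(Vector512<byte> left, Vector512<byte> right)
        {
            Vector512<ushort> mask = Vector512.Create((ushort)0xFF00);
            // even products: the wanted result is in the low byte of each 16-bit lane
            Vector512<ushort> dstEven = left.AsUInt16() * right.AsUInt16();
            // odd products: they land directly in the high byte of each 16-bit lane, no extra shift needed
            Vector512<ushort> dstOdd = (left.AsUInt16() >> 8) * (right.AsUInt16() & mask);
            // dstOdd | (dstEven & ~mask); the JIT combines this into a single vpternlogd
            return (dstOdd | Vector512.AndNot(dstEven, mask)).AsByte();
        }
    }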

@tannergooding (Member)

That creates a deviation in behavior, which is often undesirable and can lead to hidden bugs/quirks.

Notably, such newer hardware also tends to have better shuffle ports and other improvements, so it's already typically not an issue and will allow the same parallelization.

The potential gains versus the additional complexity/risks are often not worth it in such cases, and it's going to be tradeoffs all around, especially as newer instructions get added and allow the existing logic to be trivially simplified, since it's arguably the most idiomatic approach (from a conceptual perspective of how you'd do the operations).

I definitely appreciate the discussion so far and am more than willing to take optimizations where the cost vs. benefit is much more clearly defined across a range of hardware/scenarios; I just don't think this case in particular meets that bar.

@tannergooding closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 19, 2024
@dotnet-policy-service bot removed the untriaged label Nov 19, 2024
@github-actions bot locked and limited conversation to collaborators Dec 20, 2024