Revise how constant SIMD vectors are defined in BCL #44115

EgorBo · 2020-11-01T00:40:17Z

There are 3 patterns we currently use across the BCL for const vectors:

    //
    // Case 1: Plain Vector.Create
    public Vector128<byte> Case1(Vector128<byte> vec)
    {
        Vector128<byte> mask = Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);

        return Ssse3.Shuffle(vec, mask);
    }


    //
    // Case 1.1: Plain Vector.Create as argument of some SIMD instruction directly
    // Should be the same codegen as for Case1 ^ (spoiler: it's not. Forward Substitution? see #4655)
    public Vector128<byte> Case1_1(Vector128<byte> vec)
    {
        return Ssse3.Shuffle(vec,
                    Vector128.Create( // used without "mask" local as in Case1 ^
                        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
                        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF));
    }


    //
    // Case 2: static readonly Vector
    private static readonly Vector128<byte> s_mask = Vector128.Create(
        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);
    public Vector128<byte> Case2(Vector128<byte> vec)
    {
        // we also often save it to a local first (e.g. before loops)
        return Ssse3.Shuffle(vec, s_mask);
    }


    //
    // Case 3: Roslyn's hack
    private static ReadOnlySpan<byte> Mask => new byte[] {
        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF };
    public Vector128<byte> Case3(Vector128<byte> vec)
    {
        return Ssse3.Shuffle(vec, Unsafe.ReadUnaligned<Vector128<byte>>(
            ref MemoryMarshal.GetReference(Mask)));
    }

Here is the current codegen for these cases:

; Method Case1
G_M46269_IG01:
       vzeroupper 
G_M46269_IG02:
       vmovupd  xmm0, xmmword ptr [reloc @RWD00]        ; loaded from the data section, OK
       vmovupd  xmm1, xmmword ptr [r8]
       vpshufb  xmm0, xmm1, xmm0
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M46269_IG03:
       ret      
RWD00  	dq	FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh    ; <-- it's here
; Total bytes of code: 29



; Method Case1_1
G_M7091_IG01:
       vzeroupper 
G_M7091_IG02:
       vmovupd  xmm0, xmmword ptr [r8]
       vpshufb  xmm0, xmm0, xmmword ptr [reloc @RWD00]  ; loaded as part of vpshufb without 
                                                        ; additional registers from the data section - PERFECT!!
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M7091_IG03:
       ret      
RWD00  	dq	FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh    ; <-- it's here
; Total bytes of code: 25



; Method Case2
G_M31870_IG01:
       push     rsi
       sub      rsp, 48
       vzeroupper 
       mov      rsi, rdx
G_M31870_IG02:
       vmovupd  xmm0, xmmword ptr [r8]
       vmovupd  xmmword ptr [rsp+20H], xmm0
       mov      rcx, 0xD1FFAB1E
       mov      edx, 2
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE  ; static initialization (in some cases can be eliminated by jit but still...)
       mov      rax, 0xD1FFAB1E                          ; <- additional mov
       mov      rax, gword ptr [rax]
       vmovupd  xmm0, xmmword ptr [rsp+20H]
       vpshufb  xmm0, xmm0, xmmword ptr [rax+8]
       vmovupd  xmmword ptr [rsi], xmm0
       mov      rax, rsi
G_M31870_IG03:
       add      rsp, 48
       pop      rsi
       ret      
; Total bytes of code: 80



; Method Case3
G_M23615_IG01:
       vzeroupper 
G_M23615_IG02:
       mov      rax, 0xD1FFAB1E                          ; <- additional mov
                                                         ; kind of makes sense if the same mask is used from different methods
                                                         ; but the C# code looks a bit ugly
       vmovupd  xmm0, xmmword ptr [rax]
       vmovupd  xmm1, xmmword ptr [r8]
       vpshufb  xmm0, xmm1, xmm0
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M23615_IG03:
       ret      
; Total bytes of code: 35

The first case used to be avoided because of codegen issues, but those appear to be resolved: the JIT now stores such vectors in the data section, does value numbering for SIMD values (including constant vectors), does CSE, etc. (#31834?). As a result, we have a lot of static readonly fields that we can now revise and convert to the Case 1 / Case 1.1 style where possible (maybe even if that means duplicating them in different methods).

Places to revise:

Base64Encoder.cs
Base64Decoder.cs
Sse2Helper.cs
Ssse3Helper.cs
BitArray.cs
Maybe there are more
Scan other repos: aspnetcore, ML.NET, etc

Known limitations for Case1:

See Case1.1 comment
JIT doesn't hoist Vector.Create from loops' bodies yet (would be nice to have)
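Until the JIT hoists the constant itself, the second limitation can be worked around by creating the vector once in a local before the loop. A minimal sketch of that workaround, assuming SSSE3 is available; the type and method names are hypothetical and not taken from the BCL:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HoistedMaskSketch // hypothetical type, for illustration only
{
    // Applies the shuffle mask from Case1 to every full 16-byte block of `data`, in place.
    // Trailing bytes that do not fill a whole block are left untouched.
    public static void ShuffleBlocks(Span<byte> data)
    {
        if (!Ssse3.IsSupported)
            throw new PlatformNotSupportedException("SSSE3 is required for this sketch.");

        // Created once, outside the loop, because the JIT does not hoist it (see above).
        Vector128<byte> mask = Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);

        // Reinterpret the byte span as a span of 16-byte vectors.
        Span<Vector128<byte>> blocks = MemoryMarshal.Cast<byte, Vector128<byte>>(data);

        for (int i = 0; i < blocks.Length; i++)
        {
            blocks[i] = Ssse3.Shuffle(blocks[i], mask);
        }
    }
}
```

Keeping the mask in a local trades the folded memory operand seen in Case1_1 for one register load before the loop, which is usually what you want when the body runs many times.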

/cc @stephentoub @GrabYourPitchforks @benaadams @tannergooding


Dotnet-GitSync-Bot · 2020-11-01T00:40:27Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

benaadams · 2020-11-01T15:51:13Z

RWD00  	dq	FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh    ; <-- it's here

Is this aligned?

tannergooding · 2020-11-01T16:05:23Z

Both the 16 and 32-byte SIMD constants will be properly aligned if we aren't emitting "small" code:
https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lowerxarch.cpp#L716-L717

tannergooding · 2020-11-01T16:06:20Z

~~I don't believe we currently have the logic to allow those being folded into the corresponding operand (which is always safe for AVX+ and safe for non-small code for SSE+).~~

Never mind, the example above is already folding 👍

EgorBo · 2020-11-01T16:12:33Z

@tannergooding there is a minor issue with alignment:

    public static void Test(ref Vector128<double> a, ref Vector256<double> b)
    {
        a = Vector128.Create(1.0, 2.0);
        b = Vector256.Create(11.0, 12.0, 13.0, 14.0);
    }

; Method Egor:Test(byref,byref)
G_M24708_IG01:
       vzeroupper 
G_M24708_IG02:
       vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       vmovupd  xmmword ptr [rcx], xmm0
       vmovupd  ymm0, ymmword ptr[reloc @RWD32]
       vmovupd  ymmword ptr[rdx], ymm0
G_M24708_IG03:
       vzeroupper 
       ret      
RWD00  	dq	3FF0000000000000h, 4000000000000000h
RWD16  	dd	00000000h, 00000000h, 00000000h, 00000000h ;; <--- padding
RWD32  	dq	4026000000000000h, 4028000000000000h, 402A000000000000h, 402C000000000000h
; Total bytes of code: 31

It could be:

RWD00  	dq	4026000000000000h, 4028000000000000h, 402A000000000000h, 402C000000000000h
RWD32  	dq	3FF0000000000000h, 4000000000000000h

slightly more compact
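
The saving can be checked with a bit of arithmetic. The sketch below is only a model of the layout, not the JIT's emitter: it assumes each constant is aligned to its own size (16 or 32 bytes), and `TotalSize` is a hypothetical helper.

```csharp
using System;

public static class DataSectionLayoutSketch // hypothetical type, for illustration only
{
    // Returns the number of bytes needed to lay out constants of the given sizes
    // in order, assuming each one is aligned to its own (power-of-two) size.
    public static int TotalSize(params int[] sizes)
    {
        int offset = 0;
        foreach (int size in sizes)
        {
            offset = (offset + size - 1) & ~(size - 1); // round up to the alignment
            offset += size;
        }
        return offset;
    }

    public static void Main()
    {
        // Emission order from the listing: 16-byte constant, then 32-byte constant.
        Console.WriteLine(TotalSize(16, 32)); // prints 64 (16 bytes of padding)

        // Largest first: no padding is needed.
        Console.WriteLine(TotalSize(32, 16)); // prints 48
    }
}
```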

tannergooding · 2020-11-01T16:22:15Z

Right, we don't currently do any sorting of values and I don't know how amenable the existing data structures are to having that happen.

ghost · 2020-11-02T15:06:54Z

Tagging subscribers to this area: @tannergooding, @jeffhandley
See info in area-owners.md if you want to be subscribed.

gfoidl · 2020-11-02T16:09:52Z

We need to take into account that using Vector128.Create may cause a regression on older runtimes (the JIT was improved for this only recently). E.g. System.Memory (for Base64) could regress when the package gets updated for a .NET Core 3.1 target.

tannergooding · 2021-06-17T23:23:57Z

Marked as easy and up-for-grabs.

gfoidl · 2021-06-18T13:50:10Z

Do we need to care about the older runtimes or just use the current best pattern?

tannergooding · 2021-06-18T13:57:12Z

This mostly applies to hardware intrinsics which are .NET Core 3.1+ only, so I don't believe we have much to be concerned about here (I don't believe we are compiling for netcoreapp3.1 and net6.0 simultaneously).

If we do target downlevel TFMs, then we should consider the perf implications as .NET Core 3.1 will have continued support through Dec 2022: https://dotnet.microsoft.com/platform/support/policy/dotnet-core
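
If a project did compile for both a downlevel and a current TFM, one option would be to select the pattern per target with conditional compilation. This is only a sketch: the `NET5_0_OR_GREATER` cutoff is an assumption (the thread does not say which runtime first has the improved codegen), and the type and method names are hypothetical.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ShuffleSketch // hypothetical type, for illustration only
{
#if !NET5_0_OR_GREATER
    // Downlevel targets: keep the static readonly field (Case 2),
    // assuming inline Vector128.Create would be slower on these runtimes.
    private static readonly Vector128<byte> s_mask = Vector128.Create(
        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);
#endif

    // Caller is expected to have checked Ssse3.IsSupported.
    public static Vector128<byte> Shuffle(Vector128<byte> vec)
    {
#if NET5_0_OR_GREATER
        // Newer targets: inline constant (Case 1.1), assumed to be emitted
        // into the data section by the JIT.
        return Ssse3.Shuffle(vec, Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF));
#else
        return Ssse3.Shuffle(vec, s_mask);
#endif
    }
}
```

The cost is that the constant is written twice, so this is probably only worth it on hot paths that really ship for both targets.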

gfoidl · 2021-06-18T14:14:41Z

I can create a PR (next week or so), then we can discuss more over there?!

gfoidl · 2021-06-28T13:44:11Z

Scan other repos: ... ML.NET, etc

ML.NET's highest .NET version (for the relevant project) is .NET Core 3.1, so current code over there seems to be the best option. Using Vector{128|256}.Create would be a de-optimization.

Places:
https://github.com/dotnet/machinelearning/blob/1b3cb77b9752fe4279376039ee20fc42822e4845/src/Microsoft.ML.CpuMath/AvxIntrinsics.cs#L48
https://github.com/dotnet/machinelearning/blob/1b3cb77b9752fe4279376039ee20fc42822e4845/src/Microsoft.ML.CpuMath/FactorizationMachine/AvxIntrinsics.cs#L15

ASP.NET Core uses Vector{128|256}.Create already, so there is no need to change anything over there.

EgorBo added the tenet-performance Performance related issue label Nov 1, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Nov 1, 2020

EgorBo mentioned this issue Nov 1, 2020

Vectorize HexConverter.EncodeToUtf16 using SSSE3 #44111

Merged

jeffschwMSFT added the area-System.Runtime.Intrinsics label Nov 2, 2020

tannergooding added good first issue Issue should be easy to implement, good for first-time contributors help wanted [up-for-grabs] Good issue for external contributors and removed untriaged New issue has not been triaged by the area owner labels Jun 17, 2021

tannergooding added this to the Future milestone Jun 17, 2021

gfoidl mentioned this issue Jun 28, 2021

Use inline Vector{128|256}.Create for constants #54827

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 28, 2021

gfoidl mentioned this issue Jun 30, 2021

Use inline Vector128.Create for constants dotnet/aspnetcore#33969

Merged

ghost closed this as completed in #54827 Jul 12, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 12, 2021

ghost locked as resolved and limited conversation to collaborators Aug 11, 2021

This issue was closed.
