Optimize Vector128 and Vector256.Create methods #35857

Merged
merged 14 commits into from
May 8, 2020

tannergooding
Member

This resolves #11965 and #10033 by updating the Vector128/256.Create methods to be intrinsic on x86.

They are handled entirely in lowering where they will be replaced with the corresponding constant (as was done for GT_SIMD nodes) or where they are lowered to the correct sequence of HWIntrinsics (which allows containment and other checks to "just work"). This should make it rather trivial to support "partial constants" as well (that is, a vector where say 50% of the inputs are constant and the other half are not).
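The two lowering outcomes described above can be sketched with a toy model. This is not the JIT's code: `Operand`, `LoweringKind`, `LoweringPlan`, and `planCreate` are hypothetical names, and the model only covers byte operands. It assumes that a `Create` call whose operands are all constants is folded into constant data, and that anything else falls back to an insert sequence.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of one Create operand: a value when it is a compile-time
// constant, std::nullopt when it is only known at run time.
using Operand = std::optional<std::uint8_t>;

enum class LoweringKind { ConstantData, InsertSequence };

struct LoweringPlan {
    LoweringKind kind;
    std::vector<std::uint8_t> constantBytes; // filled only for ConstantData
};

// Decide how a Create(...) call would be lowered in this toy model.
inline LoweringPlan planCreate(const std::vector<Operand>& operands) {
    assert(!operands.empty());
    LoweringPlan plan{LoweringKind::ConstantData, {}};
    for (const Operand& op : operands) {
        if (!op.has_value()) {
            // At least one operand is not constant: use an insert sequence instead.
            return LoweringPlan{LoweringKind::InsertSequence, {}};
        }
        plan.constantBytes.push_back(*op);
    }
    return plan;
}
```

A "partial constant" scheme would sit between the two branches: the constant lanes could be emitted as data and only the non-constant lanes inserted afterwards. The sketch does not attempt that.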

It might be beneficial to eventually create a proper GenTreeVecCns node and to also try and handle this earlier (which would allow constants to be deduplicated and other features), but that is a more involved change.

jit-analyze reports (AVX2):

Found 271 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -680 (-0.001% of base)
    diff is an improvement.

Top file improvements (bytes):
        -336 : diff\System.Text.Encodings.Web.dasm (-1.001% of base)
        -216 : diff\System.Memory.dasm (-0.087% of base)
        -102 : diff\System.Private.CoreLib.dasm (-0.002% of base)
         -26 : diff\System.Collections.dasm (-0.005% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 262 unchanged.

Top method improvements (bytes):
        -277 (-71.762% of base) : diff\System.Text.Encodings.Web.dasm - Ssse3Helper:.cctor()
         -72 (-5.660% of base) : diff\System.Memory.dasm - Base64:EncodeToUtf8(ReadOnlySpan`1,Span`1,byref,byref,bool):int
         -36 (-2.302% of base) : diff\System.Memory.dasm - Base64:DecodeFromUtf8(ReadOnlySpan`1,Span`1,byref,byref,bool):int
         -36 (-11.921% of base) : diff\System.Memory.dasm - Base64:Avx2Encode(byref,byref,long,int,int,long,long)
         -36 (-13.585% of base) : diff\System.Memory.dasm - Base64:Ssse3Encode(byref,byref,long,int,int,long,long)
         -33 (-4.867% of base) : diff\System.Text.Encodings.Web.dasm - DefaultJavaScriptEncoderBasicLatin:FindFirstCharacterToEncode(long,int):int:this
         -30 (-21.429% of base) : diff\System.Private.CoreLib.dasm - Vector128`1:get_AllBitsSet():Vector128`1 (6 methods)
         -30 (-19.355% of base) : diff\System.Private.CoreLib.dasm - Vector256`1:get_AllBitsSet():Vector256`1 (6 methods)
         -26 (-5.817% of base) : diff\System.Text.Encodings.Web.dasm - Sse2Helper:.cctor()
         -18 (-6.000% of base) : diff\System.Memory.dasm - Base64:Avx2Decode(byref,byref,long,int,int,long,long)
         -18 (-6.122% of base) : diff\System.Memory.dasm - Base64:Ssse3Decode(byref,byref,long,int,int,long,long)
         -18 (-4.557% of base) : diff\System.Private.CoreLib.dasm - Utf16Utility:GetPointerToFirstInvalidChar(long,int,byref,byref):long
         -14 (-6.731% of base) : diff\System.Collections.dasm - BitArray:.cctor()
         -12 (-3.093% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long
          -6 (-1.583% of base) : diff\System.Collections.dasm - BitArray:Not():BitArray:this
          -6 (-0.391% of base) : diff\System.Collections.dasm - BitArray:CopyTo(Array,int):this
          -6 (-0.437% of base) : diff\System.Private.CoreLib.dasm - Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
          -6 (-3.846% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long

Top method improvements (percentages):
        -277 (-71.762% of base) : diff\System.Text.Encodings.Web.dasm - Ssse3Helper:.cctor()
         -30 (-21.429% of base) : diff\System.Private.CoreLib.dasm - Vector128`1:get_AllBitsSet():Vector128`1 (6 methods)
         -30 (-19.355% of base) : diff\System.Private.CoreLib.dasm - Vector256`1:get_AllBitsSet():Vector256`1 (6 methods)
         -36 (-13.585% of base) : diff\System.Memory.dasm - Base64:Ssse3Encode(byref,byref,long,int,int,long,long)
         -36 (-11.921% of base) : diff\System.Memory.dasm - Base64:Avx2Encode(byref,byref,long,int,int,long,long)
         -14 (-6.731% of base) : diff\System.Collections.dasm - BitArray:.cctor()
         -18 (-6.122% of base) : diff\System.Memory.dasm - Base64:Ssse3Decode(byref,byref,long,int,int,long,long)
         -18 (-6.000% of base) : diff\System.Memory.dasm - Base64:Avx2Decode(byref,byref,long,int,int,long,long)
         -26 (-5.817% of base) : diff\System.Text.Encodings.Web.dasm - Sse2Helper:.cctor()
         -72 (-5.660% of base) : diff\System.Memory.dasm - Base64:EncodeToUtf8(ReadOnlySpan`1,Span`1,byref,byref,bool):int
         -33 (-4.867% of base) : diff\System.Text.Encodings.Web.dasm - DefaultJavaScriptEncoderBasicLatin:FindFirstCharacterToEncode(long,int):int:this
         -18 (-4.557% of base) : diff\System.Private.CoreLib.dasm - Utf16Utility:GetPointerToFirstInvalidChar(long,int,byref,byref):long
          -6 (-3.846% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long
         -12 (-3.093% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long
         -36 (-2.302% of base) : diff\System.Memory.dasm - Base64:DecodeFromUtf8(ReadOnlySpan`1,Span`1,byref,byref,bool):int
          -6 (-1.583% of base) : diff\System.Collections.dasm - BitArray:Not():BitArray:this
          -6 (-0.437% of base) : diff\System.Private.CoreLib.dasm - Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
          -6 (-0.391% of base) : diff\System.Collections.dasm - BitArray:CopyTo(Array,int):this

18 total methods with Code Size differences (18 improved, 0 regressed), 244800 unchanged.

1 files had text diffs but no metric diffs.
diff\System.Drawing.Primitives.dasm had 2 diff

jit-analyze reports (SSE2):

Found 270 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -708 (-0.002% of base)
    diff is an improvement.

Top file regressions (bytes):
          11 : diff\System.Collections.dasm (0.002% of base)

Top file improvements (bytes):
        -370 : diff\System.Text.Encodings.Web.dasm (-1.079% of base)
        -200 : diff\System.Private.CoreLib.dasm (-0.005% of base)
        -149 : diff\System.Memory.dasm (-0.059% of base)

4 total files with Code Size differences (3 improved, 1 regressed), 262 unchanged.

Top method regressions (bytes):
          18 (9.231% of base) : diff\System.Collections.dasm - BitArray:.cctor()

Top method improvements (bytes):
        -279 (-74.005% of base) : diff\System.Text.Encodings.Web.dasm - Ssse3Helper:.cctor()
         -99 (-18.574% of base) : diff\System.Memory.dasm - Base64:Ssse3Encode(byref,byref,long,int,int,long,long)
         -56 (-12.670% of base) : diff\System.Text.Encodings.Web.dasm - Sse2Helper:.cctor()
         -55 (-10.358% of base) : diff\System.Private.CoreLib.dasm - Utf16Utility:GetPointerToFirstInvalidChar(long,int,byref,byref):long
         -50 (-7.257% of base) : diff\System.Memory.dasm - Base64:Ssse3Decode(byref,byref,long,int,int,long,long)
         -35 (-29.167% of base) : diff\System.Private.CoreLib.dasm - Vector128`1:get_AllBitsSet():Vector128`1 (6 methods)
         -35 (-3.747% of base) : diff\System.Text.Encodings.Web.dasm - DefaultJavaScriptEncoderBasicLatin:FindFirstCharacterToEncode(long,int):int:this
         -18 (-1.098% of base) : diff\System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ubyte,ubyte,ubyte,int):int (2 methods)
         -17 (-4.038% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long
         -17 (-7.623% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long
         -16 (-1.309% of base) : diff\System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ubyte,ubyte,int):int (2 methods)
          -7 (-1.877% of base) : diff\System.Collections.dasm - BitArray:Not():BitArray:this
          -6 (-2.000% of base) : diff\System.Private.CoreLib.dasm - SpanHelpers:IndexOf(byref,ushort,int):int
          -6 (-0.713% of base) : diff\System.Private.CoreLib.dasm - SpanHelpers:IndexOf(byref,ubyte,int):int (2 methods)
          -6 (-15.385% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(ubyte):Vector128`1
          -6 (-15.000% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(byte):Vector128`1
          -6 (-2.449% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):Vector128`1
          -6 (-2.317% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte):Vector128`1
          -3 (-10.000% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(short):Vector128`1
          -3 (-10.345% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(ushort):Vector128`1

Top method regressions (percentages):
          18 (9.231% of base) : diff\System.Collections.dasm - BitArray:.cctor()

Top method improvements (percentages):
        -279 (-74.005% of base) : diff\System.Text.Encodings.Web.dasm - Ssse3Helper:.cctor()
         -35 (-29.167% of base) : diff\System.Private.CoreLib.dasm - Vector128`1:get_AllBitsSet():Vector128`1 (6 methods)
         -99 (-18.574% of base) : diff\System.Memory.dasm - Base64:Ssse3Encode(byref,byref,long,int,int,long,long)
          -6 (-15.385% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(ubyte):Vector128`1
          -6 (-15.000% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(byte):Vector128`1
         -56 (-12.670% of base) : diff\System.Text.Encodings.Web.dasm - Sse2Helper:.cctor()
         -55 (-10.358% of base) : diff\System.Private.CoreLib.dasm - Utf16Utility:GetPointerToFirstInvalidChar(long,int,byref,byref):long
          -3 (-10.345% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(ushort):Vector128`1
          -3 (-10.000% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(short):Vector128`1
         -17 (-7.623% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long
         -50 (-7.257% of base) : diff\System.Memory.dasm - Base64:Ssse3Decode(byref,byref,long,int,int,long,long)
         -17 (-4.038% of base) : diff\System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long
         -35 (-3.747% of base) : diff\System.Text.Encodings.Web.dasm - DefaultJavaScriptEncoderBasicLatin:FindFirstCharacterToEncode(long,int):int:this
          -6 (-2.449% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):Vector128`1
          -6 (-2.317% of base) : diff\System.Private.CoreLib.dasm - Vector128:Create(byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte,byte):Vector128`1
          -6 (-2.000% of base) : diff\System.Private.CoreLib.dasm - SpanHelpers:IndexOf(byref,ushort,int):int
          -7 (-1.877% of base) : diff\System.Collections.dasm - BitArray:Not():BitArray:this
         -16 (-1.309% of base) : diff\System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ubyte,ubyte,int):int (2 methods)
         -18 (-1.098% of base) : diff\System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ubyte,ubyte,ubyte,int):int (2 methods)
          -6 (-0.713% of base) : diff\System.Private.CoreLib.dasm - SpanHelpers:IndexOf(byref,ubyte,int):int (2 methods)

21 total methods with Code Size differences (20 improved, 1 regressed), 244797 unchanged.

The regression for SSE2 is because we are now inlining a Vector128.Create(long) call where we were not previously.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 5, 2020
@tannergooding
Member Author

Diffs are similar to:

-       mov      ecx, 128
-       vmovd    xmm0, ecx
-       vpbroadcastw xmm0, xmm0
+       vmovupd  xmm0, xmmword ptr [reloc @RWD00]
        ...
+ RWD00  db	080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h, 080h, 000h
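To see why the emitted data is `080h, 000h` repeated eight times, here is a small sketch that builds the byte image of a 128-bit constant holding the same 16-bit value in every lane. The function name `broadcast16ToBytes` is made up for illustration. The bytes are written low byte first, which is the in-memory order on x86.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Build the 16 bytes of a 128-bit constant that holds `value` in each of its
// eight 16-bit lanes, written low byte first (the little-endian layout x86 uses).
inline std::array<std::uint8_t, 16> broadcast16ToBytes(std::uint16_t value) {
    std::array<std::uint8_t, 16> bytes{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        bytes[2 * lane]     = static_cast<std::uint8_t>(value & 0xFF);
        bytes[2 * lane + 1] = static_cast<std::uint8_t>(value >> 8);
    }
    return bytes;
}
```

Calling `broadcast16ToBytes(128)` gives the same pattern as the `RWD00` line in the diff: `0x80` in every even byte and `0x00` in every odd byte.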

@tannergooding
Member Author

CC. @CarolEidt, @echesakovMSFT

@tannergooding
Member Author

As a separate issue, I noticed for the various intrinsics that take a byte, sbyte, short, or ushort, we will often get nodes like:

               [000038] -----+------                 +--*  CAST      int <- ubyte <- int
               [000000] -----+------                 |  \--*  LCL_VAR   int    V01 arg0

In the case where the instruction can take a mem8 or mem16 operand (such as pinsrb xmm, r32/m8, imm8 and pinsrw), we could contain the operand and support reading directly from the memory location.

@CarolEidt
Contributor

I think the cast should be handled as contained along with its operand.

Contributor

@CarolEidt CarolEidt left a comment

Overall LGTM, but it's a lot of code and could use some more textual comments describing what's being done. It would also be helpful to give meaningful names to the temps (the t*s) in the pseudo-IR to make it easier to follow.

// TODO-XARCH-CQ: It may be beneficial to emit the movq
// instruction, which takes a 64-bit memory address and
// works on 32-bit x86 systems.
break;
Contributor

Is there a reason you chose not to do this?

Member Author

It's not something we are handling today anywhere else and we have an identical TODO in the other places.

I believe this would just require us to create a GT_IND node, since x86 requires it to be a long* or ulong*, but I'm not positive that is the ideal/correct way to handle it.

Member Author

Just noting, this is also not something we are handling in the managed implementation today.

Contributor

Right, I gathered that, but was just curious, given the comment.

Member Author

@tannergooding tannergooding May 6, 2020

I'm not super familiar with the differences between the different addressing forms in the JIT and when each is appropriate to use.

Do we have a helper function that will create a T* or ref T from an arbitrary T (in this case a ulong* from a ulong) when the given operand on the stack could be a constant, local, field, byref, or other indirection?

Contributor

IIUC you really want a GT_ADDR, to get the address of an operand. In the JIT, addresses are not strongly typed; they are always TYP_BYREF, so it would just be something like:

GenTree* addr = gtNewOperNode(GT_ADDR, TYP_BYREF, op);

Contributor

Though I'd note that I'm not really pressing for you to do this now - I was just curious why you added the comment rather than doing it.

Member Author

I'll attempt this in a follow-up PR; hopefully it is that straightforward 😄

src/coreclr/src/jit/lowerxarch.cpp
// /--* t? simd32
// +--* t? simd16
// +--* t? int
// t0 = * HWINTRINSIC simd32 T InsertVector128
Contributor

This comment doesn't match what's generated below (e.g. you're creating a constant of 0x01). Also, it would be easier to follow if you referred to 'op1', 'op2' and 'idx' here (and maybe add 'v' for the target of NI_Vector128_Create and 'result' for the final value, instead of using 't2' and 't?').
I would also do something similar for the comments below.

Member Author

@tannergooding tannergooding May 6, 2020

Looks like I forgot to fill in the generated LIR for this one. I'm updating it and the other comments to be closer to the following:

            // We will be constructing the following parts:
            //          /--*  op1  T
            //   tmp  = *  HWINTRINSIC simd16 T Create
            //          /--*  tmp  simd16
            //          *  STORE_LCL_VAR simd16
            //   tmp  =    LCL_VAR simd16
            //   dup  =    LCL_VAR simd16
            //          /--*  dup  simd16
            //   dup  = *  HWINTRINSIC simd16 T ToVector256Unsafe
            //   idx  =    CNS_INT     int    1
            //          /--*  dup  simd32
            //          +--*  tmp  simd16
            //          +--*  idx  int
            //   node = *  HWINTRINSIC simd32 T InsertVector128

            // This is roughly the following managed code:
            //   var tmp = Vector128.Create(op1);
            //   var dup = tmp.ToVector256Unsafe();
            //   return Avx.InsertVector128(dup, tmp, 1);
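The managed pseudo-code above can be modeled without any SIMD hardware to check that inserting at index 1 really produces a full broadcast. The sketch below is an illustration only: `Vec128`, `Vec256`, and the lowercase helper functions are stand-ins for the managed APIs, lanes are fixed at 32 bits, and the sentinel in the upper half is an assumption used to represent the "unspecified" bits left by `ToVector256Unsafe`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec128 = std::array<std::uint32_t, 4>;
using Vec256 = std::array<std::uint32_t, 8>;

// Model of Vector128.Create(value): every lane holds the same value.
inline Vec128 create128(std::uint32_t value) {
    Vec128 v;
    v.fill(value);
    return v;
}

// Model of ToVector256Unsafe: the lower half is copied, the upper half is
// unspecified. A sentinel marks "unspecified" so that mistakes are visible.
inline Vec256 toVector256Unsafe(const Vec128& lower) {
    Vec256 v;
    v.fill(0xDEADBEEFu);
    for (std::size_t i = 0; i < 4; ++i) v[i] = lower[i];
    return v;
}

// Model of Avx.InsertVector128: index 0 replaces the lower half, 1 the upper half.
inline Vec256 insertVector128(Vec256 target, const Vec128& data, int index) {
    assert(index == 0 || index == 1);
    const std::size_t base = 4 * static_cast<std::size_t>(index);
    for (std::size_t i = 0; i < 4; ++i) target[base + i] = data[i];
    return target;
}

// The three-step sequence: create the 128-bit half, widen it, fill the upper half.
inline Vec256 create256(std::uint32_t value) {
    const Vec128 tmp = create128(value);
    const Vec256 dup = toVector256Unsafe(tmp);
    return insertVector128(dup, tmp, 1);
}
```

In this model, inserting at index 0 instead would leave the sentinel in the upper four lanes, which is why the index constant matters.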

src/coreclr/src/jit/lowerxarch.cpp
}
else
{
tmp = op2->AsArgList()->Current();
Contributor

I'm quite confused by this, and I suspect I'm missing something. You've handled the 4-argument case above, and this handles 2. Where do you handle 8 arguments?

Member Author

This one actually handles everything that is > 4 and I can add an assert + comment to clarify.

For Vector256 (which this code path is restricted to) we will have 4, 8, 16, or 32 operands, so in all cases we have a GT_LIST.
Both paths construct the upper and lower Vector128 portions and then combine them. The for loop here ensures that op1 points to the first operand and op2 points to the 3rd, 5th, 9th, or 17th operand.

The first path needs to exist since it will be 2 operands per half, and so we need to track them in gtOp1/gtOp2 rather than in a GT_LIST.
The second path can just use the original list, split into two halves, with the 2nd, 4th, 8th, or 16th operand no longer having a gtOp2 (which normally points to the next operand in the list).
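The split described above can be sketched on a plain operand vector. This is only an analogy: the JIT works on `GT_LIST` nodes in place, while the hypothetical `splitHalves` helper below copies into two vectors. It assumes the operand counts listed above for Vector256 (4, 8, 16, or 32).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split the operands of a (modeled) Vector256 Create into the operands for the
// lower and upper Vector128 halves. Each half can then be lowered on its own.
template <typename T>
std::pair<std::vector<T>, std::vector<T>> splitHalves(const std::vector<T>& operands) {
    const std::size_t count = operands.size();
    assert(count == 4 || count == 8 || count == 16 || count == 32);
    const auto half = static_cast<std::ptrdiff_t>(count / 2);
    std::vector<T> lo(operands.begin(), operands.begin() + half);
    std::vector<T> hi(operands.begin() + half, operands.end());
    return {lo, hi};
}
```

With 8 operands, `lo` ends at the 4th operand and `hi` starts at the 5th, matching the "3rd, 5th, 9th, or 17th operand" positions mentioned for the upper half.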

Contributor

Ah yes, I see now that op1 and op2 are still lists, and the two NI_Vector128_Create intrinsics are then recursively lowered. I might ask that you create a new GenTree* variable for those rather than reusing op1. I know that the JIT reuses variables all over the place, but I don't think it's a great practice.

Member Author

Added a comment explaining the split and what values are expected where, using new variables named lo and hi to track the lower and upper halves of the Vector256 being created.

@tannergooding
Member Author

I believe I addressed all the feedback given, and everything passes for the various ISAs locally, so I've kicked off the jitstress-isas-x86 job.

@tannergooding
Member Author

Resolved an issue that showed up with ARM where the Create overloads that took more than 1 argument were being recognized as intrinsic and causing asserts later.
HWIntrinsicInfo::lookupId now takes the number of arguments into account when matching to avoid this issue.
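A minimal sketch of matching on both name and argument count follows. The table layout, the `IntrinsicEntry` struct, and `lookupIdSketch` are hypothetical and are not the real `HWIntrinsicInfo` data structures. The sketch assumes a target where only the single-argument `Create` overload is intrinsic.

```cpp
#include <cassert>
#include <cstring>

enum class IntrinsicId { Illegal, Vector128_Create };

struct IntrinsicEntry {
    const char* methodName;
    int         numArgs; // -1 means "matches any number of arguments"
    IntrinsicId id;
};

// Hypothetical table: only Create with exactly one argument is intrinsic.
static const IntrinsicEntry kTable[] = {
    {"Create", 1, IntrinsicId::Vector128_Create},
};

// Look up an intrinsic by method name AND argument count, so that an overload
// with a different arity is not mistaken for the intrinsic one.
inline IntrinsicId lookupIdSketch(const char* methodName, int numArgs) {
    assert(methodName != nullptr && numArgs >= 0);
    for (const IntrinsicEntry& entry : kTable) {
        if (std::strcmp(entry.methodName, methodName) != 0) continue;
        if (entry.numArgs != -1 && entry.numArgs != numArgs) continue;
        return entry.id;
    }
    return IntrinsicId::Illegal;
}
```

Matching on the name alone would return the intrinsic ID for `Create` with four arguments as well, which is the kind of mismatch that led to the later asserts.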

@tannergooding
Member Author

tannergooding commented May 7, 2020

ARM64 failure is because mustExpand doesn't appear to take into account that the recursion is in a dead code path.

Edit: To clarify, we don't take recursion into account when some overload of the method is intrinsic.

So if you have Vector128.Create(T), which is intrinsic, and Vector128.Create(T, ..., T), which is not, then you'll get NI_Illegal rather than NI_Throw_PlatformNotSupportedException.
For cases like Vector64.Create on x86/x64 or Vector256.Create on ARM64 where no overloads are intrinsic, this isn't an issue and is handled.

@tannergooding
Member Author

tannergooding commented May 7, 2020

I've updated the calls to not be recursive on ARM64 for right now. I'm working on getting the same support done for ARM64, but it will be a separate PR to help keep the size of this one smaller.

Edit: We also need to expose a couple of additional intrinsics before I can implement the Create(T) versions. We have INS_dup but no corresponding intrinsic that it can be lowered to yet.

Contributor

@CarolEidt CarolEidt left a comment

LGTM - thanks for the additional comments; they may seem verbose, but they really help with understanding what's going on!

@tannergooding
Member Author

Looks like there is some superpmicollect failure for x86, investigating:

ERROR: Exception thrown: DebugBreak or AV Exception 123
ERROR: main method 4 of size 1086 failed to load and compile correctly.
ERROR: Exception thrown: DebugBreak or AV Exception 123
ERROR: main method 4 of size 1086 failed to load and compile correctly.
ERROR: replay of final file is not error free

@tannergooding
Member Author

It seems superpmi ignores all CorJitAllocMemFlag values, so asserting the alignment fails. We've just been getting lucky on x86 because of the limited number of vector constants we had in Corelib so far, which were likely off the paths the superpmicollect test was covering.

@CarolEidt, I've updated MyICJI::allocMem to start respecting a subset of the flags, could you give the latest commit a glance?
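What "respecting an alignment flag" means for an allocator can be sketched as follows. This is not `MyICJI::allocMem`: the `AllocFlags` values and the `BumpArena` class are hypothetical, and the sketch assumes a single flag that requests 16-byte alignment for read-only data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flags, standing in for the real allocation flags.
enum AllocFlags : unsigned { kDefaultAlign = 0, kRoData16ByteAlign = 1u << 0 };

// A tiny bump allocator that honors the alignment flag instead of ignoring it.
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity) : storage_(capacity + 16), used_(0) {}

    void* alloc(std::size_t size, unsigned flags) {
        const std::uintptr_t alignment = (flags & kRoData16ByteAlign) ? 16 : 1;
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(storage_.data());
        const std::uintptr_t cur = base + used_;
        // Round the current address up to the requested power-of-two alignment.
        const std::uintptr_t aligned = (cur + alignment - 1) & ~(alignment - 1);
        const std::size_t newUsed = static_cast<std::size_t>(aligned - base) + size;
        assert(newUsed <= storage_.size());
        used_ = newUsed;
        return reinterpret_cast<void*>(aligned);
    }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_;
};
```

An allocator that drops the flag would return whatever address comes next, so a 16-byte vector constant placed after an odd-sized allocation could land misaligned and trip an alignment assert.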

Contributor

@CarolEidt CarolEidt left a comment

LGTM - thanks
