
Vector.Narrow performance #9766

Closed
benaadams opened this issue Feb 21, 2018 · 23 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Cost:S Work that requires one engineer up to 1 week help wanted [up-for-grabs] Good issue for external contributors JitUntriaged CLR JIT issues needing additional triage tenet-performance Performance related issue

@benaadams
Member

I was trying to work out why I wasn't getting the performance expected from Vector.Narrow

using System;
using System.Linq;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Text;

class Program
{
    static unsafe void Main(string[] args)
    {
        var charArray = Enumerable.Repeat('a', Vector<ushort>.Count * 2).ToArray();
        var byteArray = new byte[Vector<byte>.Count];

        fixed (char* pChar = charArray)
        fixed (byte* pByte = byteArray)
        {
            Narrow(pChar, pByte);
        }

        Console.WriteLine(Encoding.ASCII.GetString(byteArray));
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static unsafe void Narrow(char* input, byte* output)
    {
        var bytes = Vector.Narrow(
            Unsafe.AsRef<Vector<ushort>>(input), 
            Unsafe.AsRef<Vector<ushort>>(input + Vector<ushort>.Count));
        Unsafe.AsRef<Vector<byte>>(output) = bytes;
    }
}

Generates

; Assembly listing for method Program:Narrow(long,long)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )    long  ->  rcx        
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  rdx        
;  V02 loc0         [V02,T07] (  2,  2   )  simd32  ->  mm0        
;  V03 tmp0         [V03,T02] (  2,  4   )  simd32  ->  mm0        
;* V04 tmp1         [V04,T08] (  0,  0   )   byref  ->  zero-ref   
;* V05 tmp2         [V05,T09] (  0,  0   )   byref  ->  zero-ref   
;  V06 tmp3         [V06,T03] (  2,  2   )   byref  ->  rax        
;* V07 tmp4         [V07    ] (  0,  0   )    long  ->  zero-ref   
;  V08 tmp5         [V08,T04] (  2,  2   )   byref  ->  rax        
;  V09 tmp6         [V09,T05] (  2,  2   )   byref  ->  rax        
;  V10 tmp7         [V10,T06] (  2,  2   )   byref  ->  rax        
;# V11 OutArgs      [V11    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;
; Lcl frame size = 0

G_M2586_IG01:
       C5F877               vzeroupper 

G_M2586_IG02:
       C4E17D1001           vmovupd  ymm0, ymmword ptr[rcx]
       488D4120             lea      rax, [rcx+32]
       C4E17D1008           vmovupd  ymm1, ymmword ptr[rax]
       C4E37D46D920         vperm2i128 ymm3, ymm0, ymm1, 32
       C4E37D46D131         vperm2i128 ymm2, ymm0, ymm1, 49
       C4E16571F308         vpsllw   ymm3, 8
       C4E16571D308         vpsrlw   ymm3, 8
       C4E16D71F208         vpsllw   ymm2, 8
       C4E16D71D208         vpsrlw   ymm2, 8
       C4E16567C2           vpackuswb ymm0, ymm3, ymm2
       488BC2               mov      rax, rdx
       C4E17D1100           vmovupd  ymmword ptr[rax], ymm0

G_M2586_IG03:
       C5F877               vzeroupper 
       C3                   ret      

; Total bytes of code 70, prolog size 3 for method Program:Narrow(long,long)

Is this correct (performance wise)?

/cc @mikedn @CarolEidt

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

@benaadams
Member Author

What I'm comparing the performance to does 128 bytes at a time, as follows; so maybe it's just close:

const int Shift16Shift24 = (1 << 16) | (1 << 24);
const int Shift8Identity = (1 <<  8) | (1);
        
int ulongDoubleCount = (length - i) & ~0x7;
for (; i < ulongDoubleCount; i += 8)
{
    ulong inputUlong0 = *(ulong*)(input + i);
    ulong inputUlong1 = *(ulong*)(input + i + 4);
    // Pack 16 ASCII chars into 16 bytes
    *(uint*)(output + i) =
        ((uint)((inputUlong0 * Shift16Shift24) >> 24) & 0xffff) |
        ((uint)((inputUlong0 * Shift8Identity) >> 24) & 0xffff0000);
    *(uint*)(output + i + 4) =
        ((uint)((inputUlong1 * Shift16Shift24) >> 24) & 0xffff) |
        ((uint)((inputUlong1 * Shift8Identity) >> 24) & 0xffff0000);
}
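
For reference, here's a minimal self-contained sketch of how that loop might be wrapped (the NarrowScalar name, the length parameter, and the scalar tail are my assumptions for illustration; the snippet above comes from a larger method):

// Hedged sketch: scalar ASCII narrowing using the multiply/shift packing trick above.
// Assumes the input is ASCII (high byte of each char is zero); otherwise the packed
// bytes are garbage, just as in the original snippet.
static unsafe void NarrowScalar(char* input, byte* output, int length)
{
    const int Shift16Shift24 = (1 << 16) | (1 << 24);
    const int Shift8Identity = (1 << 8) | (1);

    int i = 0;
    int ulongDoubleCount = (length - i) & ~0x7;
    for (; i < ulongDoubleCount; i += 8)
    {
        ulong inputUlong0 = *(ulong*)(input + i);
        ulong inputUlong1 = *(ulong*)(input + i + 4);
        // Pack 8 ASCII chars (two ulongs of UTF-16 data) into 8 bytes per iteration
        *(uint*)(output + i) =
            ((uint)((inputUlong0 * Shift16Shift24) >> 24) & 0xffff) |
            ((uint)((inputUlong0 * Shift8Identity) >> 24) & 0xffff0000);
        *(uint*)(output + i + 4) =
            ((uint)((inputUlong1 * Shift16Shift24) >> 24) & 0xffff) |
            ((uint)((inputUlong1 * Shift8Identity) >> 24) & 0xffff0000);
    }

    // Assumed tail: narrow any remaining chars one at a time
    for (; i < length; i++)
    {
        output[i] = (byte)input[i];
    }
}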

@glenn-slayden

glenn-slayden commented Jul 26, 2018

I do not repro your emitted asm. For this code...

[MethodImpl(MethodImplOptions.NoInlining)]
static void narrow_simd(char* src, byte* dst, int c)
{
    Unsafe.AsRef<Vector<byte>>(dst) = 
        Vector.Narrow(Unsafe.AsRef<Vector<ushort>>(src + 0),
                      Unsafe.AsRef<Vector<ushort>>(src + 8));
}

...I currently get:

mov         rax,rdx
movups      xmm0,xmmword ptr [rcx]
lea         rdx,[rcx+10h]
movups      xmm1,xmmword ptr [rdx]
movaps      xmm0,xmm0
movaps      xmm2,xmm1
psllw       xmm0,8
psrlw       xmm0,8
psllw       xmm2,8
psrlw       xmm2,8
packuswb    xmm0,xmm2
movups      xmmword ptr [rax],xmm0
ret 

...which lacks the two vperm2i128 instructions you reported.

Could your result be an artifact of your test or debug apparatus? Or perhaps fallback code generated to work around missing CPU capabilities? For example, your comment indicates that it is displaying the "blended code for AVX." I don't know what that means, but fwiw my output is directly from the vs2017 debugger on "Intel(R) Core(TM)2 Extreme CPU X9770." Otherwise, you may wish to re-test with the latest bits.
[edit: emitted based on different CPU capabilities; see below]

Having said that, there remains the PSRLW 8/PSLLW 8 sequence which will apparently be emitted for each-and-every vector that is used as an input to the PACKUSWB instruction. I was wondering what the deal was with this so I looked into it.

My conclusion on that front, unfortunately, is that the steps are indeed mandatory here, and are the result of a catastrophically misguided design choice regarding the PACKUSWB/PACKSSWB instructions. For details on this, read on...

On the design of PACKUSWB/PACKSSWB

The effect of the PSRLW 8/PSLLW 8 instruction sequence is to zero the high byte in each of a vector's eight char values. The requirement for this is mentioned here:

https://github.com/dotnet/coreclr/blob/eddc1a621b66d63ac9feda289f08fc0e9f67f850/src/jit/simdcodegenxarch.cpp#L507-L509
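
In C# hardware-intrinsics terms, a hedged sketch of what that zeroing-plus-pack sequence amounts to (the method and parameter names are mine, not the JIT's actual expansion):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Clear the high byte of each 16-bit lane (psllw 8 / psrlw 8) so that packuswb's
// saturation can never fire, then pack the two vectors into one vector of bytes.
static Vector128<byte> NarrowWithShifts(Vector128<ushort> lo, Vector128<ushort> hi)
{
    lo = Sse2.ShiftRightLogical(Sse2.ShiftLeftLogical(lo, 8), 8);
    hi = Sse2.ShiftRightLogical(Sse2.ShiftLeftLogical(hi, 8), 8);
    return Sse2.PackUnsignedSaturate(lo.AsInt16(), hi.AsInt16());
}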

This code comment tersely summarizes a philosophical conundrum that arises with lossy instructions such as SSE PACKUSWB "Pack with unsigned saturation" and PACKSSWB "Pack with signed saturation." So what is this so-called "saturation," anyway? At issue here is what to do if there happen to be non-zero bits in any of the high bytes which are to be discarded:

  1. One view would be that it is the caller's responsibility to know that narrowing operations are inherently lossy and to use such instructions accordingly. To this thinking, it would be presumptuous for these instructions to deem the presence of non-zero upper bytes an error condition. It would be pointless, too, since, absent further information about the caller's intention, it is unclear what might be done; so, lest it do further damage, the best action is to do nothing. In this school of thought, the upper bytes could simply be ignored altogether in the first place.

  2. The alternative idea is that it is the processor's responsibility--given its vast, omniscient view--to help as much as possible at all times. As such, the presence of non-zero bits in a discardable upper byte must most certainly represent an unintentional--and likely fatal--loss of data which must be signaled at all costs.

I tried to dial down the sarcasm, but I fear I may have betrayed my opinion. Unfortunately, the design of the PACKUSWB/PACKSSWB opcodes hews to the latter ideology (2.), so apparently the meaning of "saturate" in the opcode descriptions is, "Be super-helpful by detecting an arbitrary condition that the user did, in fact, willfully submit, and then signal that condition by overwriting the data result in order to contaminate the mainline codepath!" Having worked in both industries, I suspect that the typically different worldviews of CPU firmware vs. software engineers might be contributing to the dissonance here: as you get closer to EE, every signal starts to look continuous. Seriously though, here are some excerpts from the (corrected) canonical descriptions:

PACKSSWB - Pack with Signed Saturation
If the signed doubleword [16-bit word integer] value is beyond the range of an unsigned word [signed byte integer] (i.e. greater than 007Fh or less than 0080h [negative]), the saturated signed byte integer value of 7Fh or 80h, respectively, is stored in the destination.

PACKUSWB - Pack with Unsigned Saturation
If a signed [16-bit] word integer value is beyond the range of an unsigned byte integer (that is, greater than 00FFh or less than 0000h [negative]), the saturated unsigned byte integer value of FFh or 00h, respectively, is stored in the destination.

Note that both instructions require that the source data values be under signed word interpretation. This, in turn, entails that those 16-bit values represent scalar--as opposed to discrete--data. Now, as far as I can tell, there are no unsigned-source equivalents for these instructions, so right off the bat we can note that discrete data uses, such as truncating enumeration, bitmap/flag, or--most importantly, and thus, as we'll see, catastrophically--Unicode character data, are not a priority.

Scalar data values

So anyway, fine. If we assume a "scalar mentality," the most charitable view of saturation is as scalar clamping. This may indeed seem like a useful feature. Let's examine things a bit more closely. Because these narrowing instructions don't perform any actual scaling, they're essentially asserting that their 16-bit source vectors be already under byte interpretation. And the problem is, this directly contradicts the original assumption (previous paragraph) that the source values be interpreted as signed word values.

It's not hard to see how the SSE designers got here. While it's true that rudely truncating a signed 16-bit scalar to 8-bits is basically always nonsensical, that fact is slightly less true for the unsigned case. Whereas the low-order bits are garbage in the former case, in the latter they remain interpretable as the (e.g.) high-frequency components of the original time-series signal. So, one might conclude, "Great! We don't need to worry about unsigned variants for PACKUSWB/PACKSSWB!" Cross them off the list.

Well, perhaps. But the flaw in this analysis came earlier, when we assumed the "scalar mentality." Though that's understandably the default worldview for engineers steeped in the microcode of a CPU's vector unit, narrowing is a special case, since it ultimately effects an operation that's inherently non-scalar. There's nothing about bitwise summary-truncation that might remotely be considered an operation on continuous data. If the user wants clamping, let them explicitly apply it with full knowledge of the compression scenario.

Whether permanently attenuating a 16-bit source to 8-bit resolution by explicit (e.g.) linear scaling, or round-tripping an 8-bit source to 16-bit to minimize overflow losses during a close-ended sequence of 8-bit edits (example of a non-scaling use on scalar data), any/all logical re-scaling can really only meaningfully be applied prior to invoking PACKUSWB or PACKSSWB. By the time you issue PACKUSWB/PACKSSWB, it's too late to deal with scalar clamping, because you must already be dealing with (logical) 8-bit values--regardless of signed/unsigned--that happen to be stored as (physical) 16-bit words.

Piling on now, obviously you shouldn't need to clamp a scalar to 8-bits if it's already logically-8-bit, but in fact, you couldn't even if you wanted to, because as an "already-logically-8-bit" value, the overflow bits that would be needed to properly do so must be considered already lost. Despite the wishful designs of PACKUSWB/PACKSSWB, any bits in the upper-byte of the happenstance physical representation of a logically-8-bit value should never have been considered interpretable.

In short, any clamping must be explicitly coordinated with whatever preapplied logical scaling is applied, because once scaled, clamping a logically-same-sized value is (obviously) vacuous. It follows that the design of PACKUSWB/PACKSSWB did NOT need to assume that its source words were scalar. Rippling back now, since clamping can only be correctly performed in cahoots with some actual scaling operation (without which word-sized input values would be literal garbage), the (ridiculous) special clamping behavior (a.k.a. "saturation") is therefore quite entirely useless. If useless, then any/all sign-meddling it inflicts on the result values is wholly unnecessary, or at least quite unhelpful. Add in the fact that the very notion of clamping is inherently antagonistic to non-scalar usage, plus that the behavior can't be turned off, and one can only conclude that the clamping behavior of the PACKUSWB/PACKSSWB is nothing short of detrimental.

Discrete data values

Enough of that; since the "scalar mentality" was a bust, let's put on our discrete hat now.

Suddenly, the behaviors quoted in the bulleted list above seem downright strange. From a non-scalar point of view, each instruction (PACKUSWB and PACKSSWB) has the audacity to pluck two values ({ 0x7F, 0x80 } and { 0xFF, 0x00 }, respectively), right out of our in-band data codomain to serve as special control/sentinel values. To us, because we have our discrete hat on, each of the four appropriated values, though surely fixed and well-known, are arbitrary. Obviously, each may conflate with--and thus may occlude--one of our perfectly legitimate runtime values.

Worse, with discrete data we're guaranteed by definition that the conditions signaled by these sentinels will be semantically meaningless, which further implies that such signals, if allowed to fire, will be random and thus indistinguishable from the true (discrete) values they conflate with, a scenario sometimes referred to as unmitigated data corruption. Specifically for us here, the only way to prevent the signalling of meaningless conditions from corrupting our output--which thus becomes mandatory practice--is to take explicit steps to preclude the meddlesome conditions from manifesting in the input values we submit.

The fact that discrete data apps are required to ensure that each and every input vector issued to PACKUSWB has its discard-destined bytes zeroed out is a cruel irony at the least, and a persistent runtime penalty at worst. The annoying burden is made all the worse by the knowledge that, as I tried to establish in some detail above, there's really no realistic scenario--scalar or otherwise--that might tangibly benefit from the bankrupt design choice (a.k.a. clamping or "saturation") that necessitates the chore. Though cheerfully well-intentioned, this particular help is not just unhelpful; it's a scourge.

@mikedn
Contributor

mikedn commented Jul 26, 2018

For example your comment indicates that it is displaying the "blended code for AVX." I don't know what that means, but fwiw my output is directly from the vs2017 debugger metal on "Intel(R) Core(TM)2 Extreme CPU X9770."

That processor does not support AVX so there's no way you'll see vperm2i128 being generated.

@glenn-slayden

@mikedn Oh, I'm starting to see that part. You're suggesting that maybe...

vmovupd  ymm0, ymmword ptr[rcx]
lea      rax, [rcx+32]
vmovupd  ymm1, ymmword ptr[rax]
vperm2i128 ymm3, ymm0, ymm1, 32
vperm2i128 ymm2, ymm0, ymm1, 49

is functionally the same as, but in some ways perhaps an improvement over...

movups      xmm0,xmmword ptr [rcx]
lea         rdx,[rcx+10h]
movups      xmm1,xmmword ptr [rdx]
movaps      xmm0,xmm0
movaps      xmm2,xmm1

I remain a bit suspicious since, while these exhibit the same number of fetches, the instruction stream for the latter, presumably "legacy," CPU (this wasn't news to me) likely has a tighter byte count (i.e., better for keeping loops in-cache)?

@mikedn
Contributor

mikedn commented Jul 26, 2018

You're suggesting that maybe … is functionally the same as, but in some ways perhaps an improvement over …

I'm not suggesting anything, I'm simply saying that you're seeing different code because you're using a CPU with different capabilities.

I don't have time to investigate this issue in detail. The use of permute is very likely required due to the per-lane operation of the AVX version of packuswb, not an "improvement".

Vector's Narrow is a high-level operation that may or may not work satisfactorily depending on what instructions the target architecture has. If that's not enough, then the low-level HW intrinsics should be used instead. In this particular case, packuswb could likely be used without shifts, as the code assumes that the text contains only ASCII characters.
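
A hedged sketch of what that might look like with HW intrinsics (names assumed; assumes the caller has already validated the input as ASCII, so every ushort lane is <= 0x7F and packuswb's saturation cannot trigger):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Narrow two vectors of ASCII chars without the psllw/psrlw pair.
static Vector128<byte> NarrowAscii(Vector128<ushort> lo, Vector128<ushort> hi)
{
    return Sse2.PackUnsignedSaturate(lo.AsInt16(), hi.AsInt16());
}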

@glenn-slayden

Ok, thanks; I see that I'm in way over my head.

@glenn-slayden

...if that's not enough then the low level HW intrinsics should be used...

@mikedn Is it actually possible to issue individual "low-level HW intrinsics" from managed IL somehow?

@tannergooding
Member

@glenn-slayden, it should be in netcoreapp3.0 (provided the feature gets completed/ships 😄).

You can browse most of the APIs here: https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Sse.cs,a190e303dd574c72
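
As a rough illustration of the programming model (a hedged sketch; the pointer names are placeholders and a scalar fallback is omitted):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Guarded use of a single SSE intrinsic: add four floats at a time.
static unsafe void AddFourFloats(float* left, float* right, float* result)
{
    if (Sse.IsSupported)
    {
        Vector128<float> sum = Sse.Add(Sse.LoadVector128(left), Sse.LoadVector128(right));
        Sse.Store(result, sum);
    }
}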

@glenn-slayden

Spectacular. Keep up the great work.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@saucecontrol
Member

I was just looking at this codegen a couple of weeks ago, in the context of SixLabors/ImageSharp#1143

The AVX2 implementation could be improved slightly by swapping out the two vperm2i128 at the beginning for a single vpermq at the end.

And it might be a little more efficient in the SSE2 implementation to construct a mask (mov + pshufd) and vpand to zero the upper bytes rather than the left and right shifts. However, that would likely not be better with AVX2 because a broadcast or permute would be needed to construct the mask.

In the end, it will always be inefficient because it attempts to preserve the truncating behavior of a scalar short->byte conversion, while the SIMD instructions narrow with saturation. (That is, BTW, not a design flaw in the SIMD instructions -- that behavior is extremely desirable in signal-processing applications).
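
A hedged sketch of the single-vpermq idea (method name assumed): because packuswb operates per 128-bit lane, the packed bytes come out in [lo0, hi0, lo1, hi1] order, and one Permute4x64 at the end restores [lo0, lo1, hi0, hi1] instead of pre-permuting both inputs.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector256<byte> NarrowAvx2(Vector256<ushort> lo, Vector256<ushort> hi)
{
    // Clear high bytes so saturation can't fire (kept here to match the current behavior)
    lo = Avx2.ShiftRightLogical(Avx2.ShiftLeftLogical(lo, 8), 8);
    hi = Avx2.ShiftRightLogical(Avx2.ShiftLeftLogical(hi, 8), 8);
    Vector256<byte> packed = Avx2.PackUnsignedSaturate(lo.AsInt16(), hi.AsInt16());
    // vpermq with control 0xD8 = [0, 2, 1, 3] reorders the 64-bit blocks across lanes
    return Avx2.Permute4x64(packed.AsInt64(), 0b11_01_10_00).AsByte();
}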

Is this worth keeping open now that HWIntrinsics have landed?

@tannergooding
Member

And it might be a little more efficient in the SSE2 implementation to construct a mask (mov + pshufd) and vpand to zero the upper bytes rather than the left and right shifts. However, that would likely not be better with AVX2 because a broadcast or permute would be needed to construct the mask.

The mask is constant, so I don't think that would be a problem. I imagine that and + shuffle + combine would be more efficient and would break some dependencies that look to exist, but the only way to validate that would be to profile and test.
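
A hedged sketch of the mask variant for SSE2 (names assumed); the constant mask replaces the two shifts per input with a single pand:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<byte> NarrowWithMask(Vector128<ushort> lo, Vector128<ushort> hi)
{
    // 0x00FF in every lane keeps only the byte that survives the narrow
    Vector128<ushort> mask = Vector128.Create((ushort)0x00FF);
    lo = Sse2.And(lo, mask);
    hi = Sse2.And(hi, mask);
    return Sse2.PackUnsignedSaturate(lo.AsInt16(), hi.AsInt16());
}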

Is this worth keeping open now that HWIntrinsics have landed?

Yes. HWIntrinsics provide machine specific optimizations while Vector<T> provides platform independent optimizations. We still want to continue improving both where applicable/as-possible.

@saucecontrol
Member

Actually had a look at all the codegen for narrow, and some of it is quite poor. For example:

GetEmitter()->emitIns_R_R_I(INS_vextracti128, EA_32BYTE, tmpReg, op1Reg, 0x01);
GetEmitter()->emitIns_R_R_I(INS_vextracti128, EA_32BYTE, tmpReg2, op2Reg, 0x01);
GetEmitter()->emitIns_R_R_I(INS_vinserti128, EA_32BYTE, tmpReg, tmpReg2, 0x01);
inst_RV_RV(ins_Copy(simdType), tmpReg2, op1Reg, simdType, emitSize);
GetEmitter()->emitIns_R_R_I(INS_vinserti128, EA_32BYTE, tmpReg2, op2Reg, 0x01);
GetEmitter()->emitIns_R_R_I(INS_pshufd, emitSize, tmpReg, tmpReg, (int8_t)SHUFFLE_XXZX);
GetEmitter()->emitIns_R_R_I(INS_pshufd, emitSize, targetReg, tmpReg2, (int8_t)SHUFFLE_XXZX);
inst_RV_RV_RV(INS_punpcklqdq, targetReg, targetReg, tmpReg, emitSize);

That entire sequence of vextracti128 … vinserti128 could also be replaced with a single vpermq after the unpack. This seems like a prime case where we could get both a performance and a maintenance-cost win by re-implementing in managed code with HWIntrinsics. Are there still some roadblocks to starting that effort?

@tannergooding
Member

Are there still some roadblocks to starting that effort?

This is tracked by #956 and is still blocked pending us improving inlining so the generated trees aren't as complex.
We'd still need sign-off from the three I tagged before moving forward or for the underlying blocking issue to be resolved.

@saucecontrol
Member

Seems we're deadlocked on that one. Is there something concrete and actionable to be done on the inlining side?

I'd be happy to do the Vector<T> widen/narrow/conversion as a POC if it helps demonstrate the value or helps define the scope of the required inlining improvements.

I can see quite a few easy perf/complexity wins in the codegen for these, and it's easy to see the JIT impact since all that code was added in one commit here: dotnet/coreclr@965d5ee

@CarolEidt @jkotas @AndyAyersMS any thoughts on starting the #956 effort with these?

@jkotas
Member

jkotas commented Mar 27, 2020

That entire sequence of vextracti128 … vinserti128 could also be replaced with a single vpermq after the unpack.

How complex is to fix this in the JIT?

any thoughts on starting the #956 effort with these?

If this results in measurably better code and it is too complex to achieve the equivalent improvement by fixing the current JIT implementation, I think it is a fine idea to start the effort with these.

@saucecontrol
Member

Some of the fixes (like reducing the number of permutes by delaying until after narrow) would be easy to do in the JIT. There are other potential optimizations (for example using an SSSE3 pshufb to do truncation and narrowing in a single step) that would be much more complicated to evaluate and implement on the JIT side.
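
For illustration, a hedged sketch of the SSSE3 idea (names assumed): pshufb selects the low byte of each ushort lane directly, so truncation and narrowing happen in one instruction per input.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<byte> NarrowSsse3(Vector128<ushort> lo, Vector128<ushort> hi)
{
    // Indices 0,2,...,14 pick the low byte of each lane; 0x80 zeroes the upper half
    Vector128<byte> selectLowBytes = Vector128.Create(
        (byte)0, 2, 4, 6, 8, 10, 12, 14,
        0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
    Vector128<byte> packedLo = Ssse3.Shuffle(lo.AsByte(), selectLowBytes);
    Vector128<byte> packedHi = Ssse3.Shuffle(hi.AsByte(), selectLowBytes);
    // Combine the two 8-byte halves into a single 16-byte result
    return Sse2.UnpackLow(packedLo.AsUInt64(), packedHi.AsUInt64()).AsByte();
}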

@CarolEidt
Contributor

I would definitely be supportive of a POC, especially to the extent that we can leverage that to quantify and potentially consider how to address the increase in complexity of the IR that's produced. I don't know if anyone has considered the alternative of doing the transformation in the importer - that is, having SIMD intrinsic importation generate HW intrinsic nodes where possible.

@tannergooding
Member

How complex is to fix this in the JIT?

I'd actually say it isn't that difficult. We already have infrastructure in place for doing SSE2 vs SSE41 vs AVX2 versions of the code (for the System.Numerics.Vector types), and so it's just a bit more work to make that happen.
We can't make it as "fine-grained" as we could with direct HWIntrinsics in managed code. That is, we currently have to target SSE41 and can't light up on SSSE3, but given the age of SSSE3 and SSE41 hardware, I'm not sure that is of particular concern.

The bigger work item is that the System.Numerics.Vector types don't support many of the same codegen improvements that HWIntrinsics have. For example, while they can use the VEX encoding, they are not necessarily VEX-aware. Likewise, they do not support containment. This means we often have additional moves (register-to-register and memory-to-register) inserted over the equivalent HWIntrinsic code. Updating it to support this would be non-trivial.

I don't know if anyone has considered the alternative of doing the transformation in the importer - that is, having SIMD intrinsic importation generate HW intrinsic nodes where possible.

There have been a few discussions and this is likely an ok intermediate solution. It would provide us with the better codegen without needing to update the System.Numerics.Vector intrinsics to be VEX aware or support containment themselves.
A downside is this still doesn't play "great" with AOT scenarios (both existing and the ones being worked on).

@tannergooding
Member

the alternative of doing the transformation in the importer - that is, having SIMD intrinsic importation generate HW intrinsic nodes where possible.

I started a prototype for this: https://github.com/tannergooding/runtime/tree/proto-simd-as-hwintrinsic. It's a very rough prototype and only covers a few methods right now, but it does show some positive results:

Top file improvements (bytes):
       -1079 : System.Private.CoreLib.dasm (-0.02% of base)
          -8 : System.Text.Json.dasm (-0.00% of base)
          -4 : System.Net.WebSockets.dasm (-0.01% of base)
          -4 : System.Net.WebSockets.WebSocketProtocol.dasm (-0.01% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 260 unchanged.

Top method regressions (bytes):
           4 ( 8.00% of base) : System.Private.CoreLib.dasm - Vector3:Distance(Vector3,Vector3):float
           4 ( 8.70% of base) : System.Private.CoreLib.dasm - Vector3:DistanceSquared(Vector3,Vector3):float

Top method improvements (bytes):
         -40 (-6.37% of base) : System.Private.CoreLib.dasm - Vector`1:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (6 methods)
         -40 (-4.87% of base) : System.Private.CoreLib.dasm - Vector:AndNot(Vector`1,Vector`1):Vector`1 (6 methods)
         -38 (-15.02% of base) : System.Private.CoreLib.dasm - Vector:Equals(Vector`1,Vector`1):Vector`1 (10 methods)
         -36 (-3.00% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
         -36 (-5.45% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
         -36 (-5.45% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
         -34 (-12.36% of base) : System.Private.CoreLib.dasm - Vector:GreaterThan(Vector`1,Vector`1):Vector`1 (10 methods)
         -32 (-11.64% of base) : System.Private.CoreLib.dasm - Vector:LessThan(Vector`1,Vector`1):Vector`1 (10 methods)
         -30 (-1.80% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool
         -22 (-6.41% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Default(long,long):long
         -20 (-3.03% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
         -20 (-3.79% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateWorld(Vector3,Vector3,Vector3):Matrix4x4
         -20 (-9.76% of base) : System.Private.CoreLib.dasm - Vector`1:op_UnaryNegation(Vector`1):Vector`1 (6 methods)
         -20 (-4.07% of base) : System.Private.CoreLib.dasm - Vector`1:op_OnesComplement(Vector`1):Vector`1 (6 methods)
         -20 (-14.29% of base) : System.Private.CoreLib.dasm - Vector:BitwiseAnd(Vector`1,Vector`1):Vector`1 (6 methods)
         -20 (-14.29% of base) : System.Private.CoreLib.dasm - Vector:BitwiseOr(Vector`1,Vector`1):Vector`1 (6 methods)
         -20 (-2.88% of base) : System.Private.CoreLib.dasm - Vector:OnesComplement(Vector`1):Vector`1 (6 methods)
         -20 (-14.29% of base) : System.Private.CoreLib.dasm - Vector:Xor(Vector`1,Vector`1):Vector`1 (6 methods)
         -20 (-7.58% of base) : System.Private.CoreLib.dasm - Vector:EqualsAny(Vector`1,Vector`1):bool (6 methods)
         -20 (-2.54% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)

All of the gains are either from the codegen becoming VEX-aware (where we can do ins dst, op1, op2 rather than mov dst, op1; ins dst, op2) or from supporting containment (where we can do ins dst, op1, [mem] rather than mov tmp, [mem]; ins dst, op1, op2).
The regressions are because I haven't updated DotProduct to be included with this yet, so an extra movaps was inserted; it would (most likely) be removed once everything is updated.

If we wanted to pursue this further, we would need to account for the ISA levels (AVX2 vs SSE4.2 vs SSE2) as I'm only assuming latest in the prototype and would likely need a decent sized refactoring of some of the code and eventual removal of the GT_SIMD node type.

@CarolEidt, do you think this is something worth pursuing further?

@saucecontrol
Member

I've done a quick prototype of the Vector.Narrow methods and was able to improve on all but the double->float narrow, which is identical.

Benchmark results
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  Job-BAYMLV : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-YQTAVT : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Narrow_double | \corerun-master\CoreRun.exe | 820.8 ns | 8.10 ns | 7.57 ns | 818.1 ns | 812.5 ns | 838.0 ns | 1.00 | - | - | - | - |
| Narrow_double | \corerun-vector-convert\CoreRun.exe | 819.5 ns | 4.86 ns | 4.55 ns | 819.6 ns | 814.1 ns | 825.9 ns | 1.00 | - | - | - | - |
| Narrow_short | \corerun-master\CoreRun.exe | 959.7 ns | 10.97 ns | 10.26 ns | 958.5 ns | 939.8 ns | 975.9 ns | 1.00 | - | - | - | - |
| Narrow_short | \corerun-vector-convert\CoreRun.exe | 844.4 ns | 8.40 ns | 7.86 ns | 841.6 ns | 834.2 ns | 858.5 ns | 0.88 | - | - | - | - |
| Narrow_ushort | \corerun-master\CoreRun.exe | 1,106.3 ns | 9.51 ns | 8.89 ns | 1,104.6 ns | 1,097.0 ns | 1,122.1 ns | 1.00 | - | - | - | - |
| Narrow_ushort | \corerun-vector-convert\CoreRun.exe | 977.5 ns | 6.46 ns | 6.04 ns | 977.2 ns | 969.7 ns | 990.6 ns | 0.88 | - | - | - | - |
| Narrow_int | \corerun-master\CoreRun.exe | 1,103.9 ns | 5.58 ns | 5.22 ns | 1,103.2 ns | 1,096.7 ns | 1,114.6 ns | 1.00 | - | - | - | - |
| Narrow_int | \corerun-vector-convert\CoreRun.exe | 968.9 ns | 5.37 ns | 4.76 ns | 968.6 ns | 960.6 ns | 978.8 ns | 0.88 | - | - | - | - |
| Narrow_uint | \corerun-master\CoreRun.exe | 1,229.0 ns | 12.44 ns | 11.63 ns | 1,224.7 ns | 1,215.6 ns | 1,258.0 ns | 1.00 | - | - | - | - |
| Narrow_uint | \corerun-vector-convert\CoreRun.exe | 975.2 ns | 9.24 ns | 8.64 ns | 973.4 ns | 964.6 ns | 996.9 ns | 0.79 | - | - | - | - |
| Narrow_long | \corerun-master\CoreRun.exe | 1,705.8 ns | 8.42 ns | 7.88 ns | 1,704.4 ns | 1,695.0 ns | 1,719.8 ns | 1.00 | - | - | - | - |
| Narrow_long | \corerun-vector-convert\CoreRun.exe | 985.3 ns | 9.25 ns | 8.65 ns | 982.2 ns | 973.1 ns | 999.4 ns | 0.58 | - | - | - | - |
| Narrow_ulong | \corerun-master\CoreRun.exe | 1,705.2 ns | 7.06 ns | 6.60 ns | 1,702.0 ns | 1,698.3 ns | 1,715.8 ns | 1.00 | - | - | - | - |
| Narrow_ulong | \corerun-vector-convert\CoreRun.exe | 978.8 ns | 6.37 ns | 5.64 ns | 977.1 ns | 971.9 ns | 992.7 ns | 0.57 | - | - | - | - |

COMPlus_EnableAVX2=0

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Narrow_double | \corerun-master\CoreRun.exe | 736.8 ns | 3.47 ns | 3.25 ns | 736.2 ns | 732.2 ns | 744.6 ns | 1.00 | - | - | - | - |
| Narrow_double | \corerun-vector-convert\CoreRun.exe | 734.6 ns | 3.45 ns | 3.23 ns | 733.7 ns | 730.5 ns | 740.4 ns | 1.00 | - | - | - | - |
| Narrow_short | \corerun-master\CoreRun.exe | 798.4 ns | 5.85 ns | 5.47 ns | 796.3 ns | 792.4 ns | 812.0 ns | 1.00 | - | - | - | - |
| Narrow_short | \corerun-vector-convert\CoreRun.exe | 678.2 ns | 4.42 ns | 4.13 ns | 676.9 ns | 673.1 ns | 688.5 ns | 0.85 | - | - | - | - |
| Narrow_ushort | \corerun-master\CoreRun.exe | 799.1 ns | 4.81 ns | 4.27 ns | 798.0 ns | 794.2 ns | 809.6 ns | 1.00 | - | - | - | - |
| Narrow_ushort | \corerun-vector-convert\CoreRun.exe | 678.0 ns | 3.67 ns | 3.26 ns | 677.5 ns | 673.3 ns | 686.0 ns | 0.85 | - | - | - | - |
| Narrow_int | \corerun-master\CoreRun.exe | 797.9 ns | 5.72 ns | 5.35 ns | 795.0 ns | 791.6 ns | 810.0 ns | 1.00 | - | - | - | - |
| Narrow_int | \corerun-vector-convert\CoreRun.exe | 682.3 ns | 6.19 ns | 5.79 ns | 680.0 ns | 676.3 ns | 695.4 ns | 0.86 | - | - | - | - |
| Narrow_uint | \corerun-master\CoreRun.exe | 799.7 ns | 5.03 ns | 4.70 ns | 798.4 ns | 793.9 ns | 807.4 ns | 1.00 | - | - | - | - |
| Narrow_uint | \corerun-vector-convert\CoreRun.exe | 678.4 ns | 4.22 ns | 3.94 ns | 677.4 ns | 672.7 ns | 685.0 ns | 0.85 | - | - | - | - |
| Narrow_long | \corerun-master\CoreRun.exe | 977.3 ns | 6.25 ns | 5.84 ns | 975.6 ns | 969.6 ns | 985.8 ns | 1.00 | - | - | - | - |
| Narrow_long | \corerun-vector-convert\CoreRun.exe | 733.5 ns | 3.95 ns | 3.50 ns | 732.3 ns | 729.0 ns | 741.2 ns | 0.75 | - | - | - | - |
| Narrow_ulong | \corerun-master\CoreRun.exe | 977.7 ns | 5.86 ns | 5.20 ns | 976.7 ns | 970.3 ns | 988.5 ns | 1.00 | - | - | - | - |
| Narrow_ulong | \corerun-vector-convert\CoreRun.exe | 734.4 ns | 4.68 ns | 4.38 ns | 734.4 ns | 728.5 ns | 742.7 ns | 0.75 | - | - | - | - |

For most of the AVX2 implementations, the gains come from delaying the permute step until the end as discussed above. The long->int Narrow implementations in the JIT are nonsensical, so I was able to improve on them significantly.

Current AVX2 long->int JIT implementation

vmovupd      ymm1, ymmword ptr[rdx]
vmovupd      ymm2, ymmword ptr[rdx+20h]
vextracti128 ymm3, ymm1, 1h
vextracti128 ymm4, ymm2, 1h
vinserti128  ymm3, ymm4, 1h
vmovaps      ymm4, ymm1
vinserti128  ymm4, ymm2, 1h
vpshufd      ymm3, ymm3, 8h
vpshufd      ymm1, ymm4, 8h
vpunpcklqdq  ymm1, ymm1, ymm3

Improved version:

vmovupd      ymm1, ymmword ptr[rdx]
vmovupd      ymm2, ymmword ptr[rdx+20h]
vpshufd      ymm2, ymm2, 8h
vpshufd      ymm1, ymm1, 8h
vpunpcklqdq  ymm1, ymm1, ymm2
vpermq       ymm1, ymm1, d8h
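
A hedged C# equivalent of that improved sequence (method name assumed), for comparison with the asm above:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector256<int> NarrowInt64Avx2(Vector256<long> lo, Vector256<long> hi)
{
    // vpshufd 0x08: gather the low dword of each qword into the low half of each lane
    Vector256<int> a = Avx2.Shuffle(lo.AsInt32(), 0b00_00_10_00);
    Vector256<int> b = Avx2.Shuffle(hi.AsInt32(), 0b00_00_10_00);
    // vpunpcklqdq: interleave the lane-local halves
    Vector256<long> interleaved = Avx2.UnpackLow(a.AsInt64(), b.AsInt64());
    // vpermq 0xD8: a single cross-lane fix-up at the end
    return Avx2.Permute4x64(interleaved, 0b11_01_10_00).AsInt32();
}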

There were also some surprise gains in the SSE2 benchmarks. I used the same implementation for all but long->int on the managed side, but we got slightly better codegen with HWIntrinsics.

Current SSE2 short->byte JIT implementation

vmovupd      xmm1, xmmword ptr[rdx]
vmovupd      xmm2, xmmword ptr[rdx+10h]
vmovaps      xmm1, xmm1
vmovaps      xmm3, xmm2
vpsllw       xmm1, 8h
vpsrlw       xmm1, 8h
vpsllw       xmm3, 8h
vpsrlw       xmm3, 8h
vpackuswb    xmm1, xmm3

And the HWIntrinsics version, which omits the two redundant movaps

vmovupd      xmm1, xmmword ptr[rdx]
vmovupd      xmm2, xmmword ptr[rdx+10h]
vpsllw       xmm2, xmm2, 8h
vpsrlw       xmm2, xmm2, 8h
vpsllw       xmm1, xmm1, 8h
vpsrlw       xmm1, xmm1, 8h
vpackuswb    xmm1, xmm1, xmm2

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@tannergooding tannergooding added Cost:S Work that requires one engineer up to 1 week help wanted [up-for-grabs] Good issue for external contributors labels Jan 15, 2021
@tannergooding
Member

tannergooding commented Jan 15, 2021

This is likely a fairly small change in terms of cost, so I've marked it up-for-grabs. Much of the Vector<T> code has already been updated to use HWIntrinsics internally.

The process for making the changes should be something akin to:

  1. Add new entries to https://github.com/dotnet/runtime/blob/master/src/coreclr/jit/simdashwintrinsiclistxarch.h and https://github.com/dotnet/runtime/blob/master/src/coreclr/jit/simdashwintrinsiclistarm64.h (the tables for VectorT are at the bottom and are split into VectorT128 and VectorT256).
  2. Update https://github.com/dotnet/runtime/blob/master/src/coreclr/jit/simdashwintrinsic.cpp to handle the intrinsic with the IR replacement if it is non-trivial

@saucecontrol
Member

saucecontrol commented Jan 7, 2022

I believe #60094 addressed this fully

@tannergooding
Member

Yeah, I believe so as well. Feel free to re-open if there are any additional improvements that could happen here.

@ghost ghost locked as resolved and limited conversation to collaborators Feb 7, 2022