Vector.Narrow performance #9766
What I'm comparing the performance against does 128 bits at a time, as follows, so maybe it's just close:

```csharp
const int Shift16Shift24 = (1 << 16) | (1 << 24);
const int Shift8Identity = (1 << 8) | (1);

int ulongDoubleCount = (length - i) & ~0x7;
for (; i < ulongDoubleCount; i += 8)
{
    ulong inputUlong0 = *(ulong*)(input + i);
    ulong inputUlong1 = *(ulong*)(input + i + 4);
    // Pack 8 ASCII chars (16 bytes of UTF-16 data) into 8 bytes
    *(uint*)(output + i) =
        ((uint)((inputUlong0 * Shift16Shift24) >> 24) & 0xffff) |
        ((uint)((inputUlong0 * Shift8Identity) >> 24) & 0xffff0000);
    *(uint*)(output + i + 4) =
        ((uint)((inputUlong1 * Shift16Shift24) >> 24) & 0xffff) |
        ((uint)((inputUlong1 * Shift8Identity) >> 24) & 0xffff0000);
}
```
|
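For context, the Vector.Narrow path being measured against it looks roughly like this (a sketch under assumed names: `input` as `char*`, `output` as `byte*`, known-ASCII data, with `System.Numerics` and `System.Runtime.CompilerServices` in scope; not the exact benchmark code):

```csharp
// Hypothetical Vector.Narrow counterpart of the scalar loop above.
int charsPerIteration = Vector<ushort>.Count * 2;
int vectorCount = i + (((length - i) / charsPerIteration) * charsPerIteration);
for (; i < vectorCount; i += charsPerIteration)
{
    // Load two vectors of 16-bit chars...
    var lo = Unsafe.ReadUnaligned<Vector<ushort>>(input + i);
    var hi = Unsafe.ReadUnaligned<Vector<ushort>>(input + i + Vector<ushort>.Count);
    // ...and narrow them into a single vector of bytes.
    Unsafe.WriteUnaligned(output + i, Vector.Narrow(lo, hi));
}
```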
I do not repro your emitted asm. For this code...

...I currently get:

...which lacks the two AVX instructions.
Having said that, there remains the question of the extra masking steps themselves. My conclusion on that front, unfortunately, is that the steps are indeed mandatory here, and are the result of a catastrophically misguided design choice regarding the PACKUSWB/PACKSSWB instructions.

**On the design of PACKUSWB/PACKSSWB**

The effect of these instructions, in brief:

- PACKSSWB narrows each signed 16-bit word to a byte, saturating out-of-range values to -128 or 127 (signed saturation);
- PACKUSWB narrows each signed 16-bit word to a byte, saturating out-of-range values to 0 or 255 (unsigned saturation).

This saturating behavior, tersely summarized above, gives rise to a philosophical conundrum with lossy instructions such as these in SSE.
I tried to dial down the sarcasm, but I fear I may have betrayed my opinion. Unfortunately, the design of the PACKUSWB/PACKSSWB instructions does not hold up under either a scalar or a discrete reading, as follows.
Note that both instructions require that the source data values be under signed word interpretation. This, in turn, entails that those 16-bit values represent scalar--as opposed to discrete--data. Now as far as I can tell, there are no unsigned-source equivalents for these instructions, so right off the bat we can note that discrete data uses, such as truncating enumeration, bitmap/flag, or most importantly--and thus, as we'll see, catastrophically--Unicode character data, are not a priority.

**Scalar data values**

So anyway, fine. If we assume a "scalar mentality," the most charitable view of saturation is as scalar clamping. This may indeed seem like a useful feature. Let's examine things a bit more closely. Because these narrowing instructions don't perform any actual scaling, they're essentially asserting that their 16-bit source vectors be already under byte interpretation. And the problem is, this directly contradicts the original assumption (previous paragraph) that the source values be interpreted as signed word values.

It's not hard to see how the SSE designers got here. While it's true that rudely truncating a signed 16-bit scalar to 8 bits is basically always nonsensical, that fact is slightly less true for the unsigned case. Whereas the low-order bits are garbage in the former case, in the latter they remain interpretable as the (e.g.) high-frequency components of the original time-series signal. So, one might conclude, "Great! We don't need to worry about unsigned variants for these narrowing instructions."

Well, perhaps. But the flaw in this analysis came earlier, when we assumed the "scalar mentality." Though that's understandably the default worldview for engineers steeped in the microcode of a CPU's vector unit, narrowing is a special case, since it ultimately effects an operation that's inherently non-scalar. There's nothing about bitwise summary-truncation that might remotely be considered an operation on continuous data. If the user wants clamping, let them explicitly apply it with full knowledge of the compression scenario. Whether permanently attenuating a 16-bit source to 8-bit resolution by explicit (e.g.) linear scaling, or round-tripping an 8-bit source to 16-bit to minimize overflow losses during a closed-ended sequence of 8-bit edits (an example of a non-scaling use on scalar data), any and all logical re-scaling can really only meaningfully be applied prior to invoking the narrowing instruction.

Piling on now: obviously you shouldn't need to clamp a scalar to 8 bits if it's already logically 8-bit, but in fact you couldn't even if you wanted to, because as an "already-logically-8-bit" value, the overflow bits that would be needed to properly do so must be considered already lost. Despite the wishful designs of PACKSSWB/PACKUSWB, that lost information cannot be recovered at narrowing time. In short, any clamping must be explicitly coordinated with whatever logical scaling is pre-applied, because once scaled, clamping a logically-same-sized value is (obviously) vacuous. It follows that the design of these instructions cannot be justified even on scalar grounds.

**Discrete data values**

Enough of that; since the "scalar mentality" was a bust, let's put on our discrete hat now. Suddenly, the behaviors quoted in the bulleted list above seem downright strange. From a non-scalar point of view, each instruction (PACKSSWB and PACKUSWB alike) reserves certain output values as in-band sentinels that signal an out-of-range source. Worse, with discrete data we're guaranteed by definition that the conditions signaled by these sentinels will be semantically meaningless, which further implies that such signals, if allowed to fire, will be random and thus indistinguishable from the true (discrete) values they conflate with, a scenario sometimes referred to as unmitigated data corruption.
Specifically for us here, the only way to prevent the signalling of meaningless conditions from corrupting our output--which thus becomes mandatory practice--is to take explicit steps to preclude the meddlesome conditions from manifesting in the input values we submit. The fact that discrete data apps are required to ensure that each-and-every input vector issued to these instructions is pre-masked into the representable range speaks for itself. |
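To make the saturation problem concrete, here is a small self-contained demo (my own example, assuming SSE2 hardware intrinsics are available; not code from this thread) of why the pre-masking is mandatory when truncating semantics are wanted:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class SaturationDemo
{
    static void Main()
    {
        // 0x0180 has low byte 0x80 but is out of unsigned-byte range.
        Vector128<short> v = Vector128.Create((short)0x0180);

        // PACKUSWB saturates: every element becomes 0xFF, not the truncated 0x80.
        Vector128<byte> saturated = Sse2.PackUnsignedSaturate(v, v);

        // Pre-masking away the high byte is what makes truncating narrowing work.
        Vector128<short> masked = Sse2.And(v, Vector128.Create((short)0x00FF));
        Vector128<byte> truncated = Sse2.PackUnsignedSaturate(masked, masked);

        Console.WriteLine(saturated.GetElement(0)); // 255
        Console.WriteLine(truncated.GetElement(0)); // 128
    }
}
```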
That processor does not support AVX, so there's no way you'll see the AVX2 codegen discussed above. |
@mikedn Oh, I'm starting to see that part. You're suggesting that maybe...
is functionally the same as, but in some ways perhaps an improvement over...
I remain a bit suspicious, since while these exhibit the same number of fetches, the instruction stream for the latter, presumably "legacy," CPU (this wasn't news to me) likely has a tighter byte count (i.e., better for keeping loops in-cache)? |
I'm not suggesting anything, I'm simply saying that you're seeing different code because you're using a CPU with different capabilities. I don't have time to investigate this issue in detail. The use of permute is very likely required due to the per-lane operation of the AVX version of Vector's Narrow. |
Ok, thanks; I see that I'm in way over my head. |
@mikedn Is it actually possible to issue individual "low-level HW intrinsics" from managed IL somehow? |
@glenn-slayden, it should be in netcoreapp3.0 (provided the feature gets completed/ships 😄). You can browse most of the APIs here: https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Sse.cs,a190e303dd574c72 |
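For what it's worth, here is a minimal usage sketch (my own example, not from the thread) of what calling such an intrinsic from managed code looks like once those APIs ship, guarded by the usual IsSupported check:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class NarrowDirect
{
    // Issues PACKUSWB directly from managed code when SSE2 is available.
    static Vector128<byte> Pack(Vector128<short> lo, Vector128<short> hi)
        => Sse2.IsSupported
            ? Sse2.PackUnsignedSaturate(lo, hi)
            : throw new PlatformNotSupportedException();
}
```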
Spectacular. Keep up the great work. |
I was just looking at this codegen a couple of weeks ago, in the context of SixLabors/ImageSharp#1143

The AVX2 implementation could be improved slightly by swapping out the two vextracti128/vinserti128 cross-lane steps for a single vpermq after the pack. And it might be a little more efficient in the SSE2 implementation to construct a mask (all elements 0x00FF) and pand the inputs instead of using the psllw/psrlw pairs, although that means materializing a constant.

In the end, it will always be inefficient because it attempts to preserve the truncating behavior of a scalar short->byte conversion, while the SIMD instructions narrow with saturation. (That is, BTW, not a design flaw in the SIMD instructions -- that behavior is extremely desirable in signal-processing applications.)

Is this worth keeping open now that HWIntrinsics have landed? |
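A hedged sketch of what that AVX2 shape might look like with HWIntrinsics (names are my own; assumes a truncating short->byte narrow): mask, pack per 128-bit lane, then a single vpermq at the end to restore element order.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Avx2Narrow
{
    // Truncating short->byte narrow: mask away the high bytes, pack per lane,
    // then fix the lane interleaving with one cross-lane permute.
    static Vector256<byte> Narrow(Vector256<short> lo, Vector256<short> hi)
    {
        Vector256<short> mask = Vector256.Create((short)0x00FF);
        Vector256<byte> packed =
            Avx2.PackUnsignedSaturate(Avx2.And(lo, mask), Avx2.And(hi, mask));
        // vpackuswb works per 128-bit lane; vpermq with 0xD8 restores element order.
        return Avx2.Permute4x64(packed.AsUInt64(), 0xD8).AsByte();
    }
}
```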
The mask is constant, so I don't think that would be a problem. I imagine that it would just be loaded once outside the loop.

> Is this worth keeping open now that HWIntrinsics have landed?
Yes. HWIntrinsics provide machine specific optimizations, while Vector<T> remains the portable abstraction, so its codegen still matters. |
Actually had a look at all the codegen for narrow, and some of it is quite poor. For example: runtime/src/coreclr/src/jit/simdcodegenxarch.cpp, lines 1696 to 1703 (at 7d67d17).
That entire sequence of cross-lane shuffles could be avoided by narrowing within lanes and doing a single permute at the end. |
This is tracked by #956 and is still blocked pending us improving inlining so the generated trees aren't as complex. |
Seems we're deadlocked on that one. Is there something concrete and actionable to be done on the inlining side? I'd be happy to do the codegen work.

I can see quite a few easy perf/complexity wins in the codegen for these, and it's easy to see the JIT impact since all that code was added in one commit here: dotnet/coreclr@965d5ee

@CarolEidt @jkotas @AndyAyersMS any thoughts on starting the #956 effort with these? |
How complex is it to fix this in the JIT?
If this results in measurably better code and it is too complex to achieve the equivalent improvement by fixing the current JIT implementation, I think it is a fine idea to start the effort with these. |
Some of the fixes (like reducing the number of permutes by delaying until after narrow) would be easy to do in the JIT. There are other potential optimizations (for example using an SSSE3 pshufb) that would be more difficult to express in the JIT's SIMD codegen. |
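As a rough illustration of that idea (my own sketch, assuming SSSE3 is available; not code from the JIT or the prototype), a pshufb-based short->byte narrow that needs neither the shift pair nor the saturation fix-up, because it selects the low byte of each element directly:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Ssse3Narrow
{
    // Truncating short->byte narrow using PSHUFB instead of shift + PACKUSWB.
    static Vector128<byte> Narrow(Vector128<short> lo, Vector128<short> hi)
    {
        // Selects bytes 0,2,4,...,14 (the low byte of each short); 0x80 zeroes the rest.
        Vector128<byte> evenBytes = Vector128.Create(
            (byte)0, (byte)2, (byte)4, (byte)6, (byte)8, (byte)10, (byte)12, (byte)14,
            (byte)0x80, (byte)0x80, (byte)0x80, (byte)0x80,
            (byte)0x80, (byte)0x80, (byte)0x80, (byte)0x80);

        Vector128<byte> a = Ssse3.Shuffle(lo.AsByte(), evenBytes);
        Vector128<byte> b = Ssse3.Shuffle(hi.AsByte(), evenBytes);

        // Merge the two 8-byte halves with a single punpcklqdq.
        return Sse2.UnpackLow(a.AsUInt64(), b.AsUInt64()).AsByte();
    }
}
```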
I would definitely be supportive of a POC, especially to the extent that we can leverage that to quantify and potentially consider how to address the increase in complexity of the IR that's produced. I don't know if anyone has considered the alternative of doing the transformation in the importer - that is, having SIMD intrinsic importation generate HW intrinsic nodes where possible. |
I'd actually say it isn't that difficult. We already have infrastructure in place for doing SSE2 vs SSE41 vs AVX2 versions of the code (for the System.Numerics.Vector types), and so it's just a bit more work to make that happen.

The bigger work item is that the System.Numerics.Vector types don't support many of the same codegen improvements that HWIntrinsics have. For example, while they can use the VEX encoding, they are not necessarily VEX aware. Likewise, they do not support containment. This means we often have additional moves (register to register and memory to register) inserted over the equivalent HWIntrinsic code. Updating it to support this would be non-trivial.
There have been a few discussions and this is likely an ok intermediate solution. It would provide us with the better codegen without needing to update the System.Numerics.Vector intrinsics to be VEX aware or support containment themselves. |
I started a prototype for this: https://github.com/tannergooding/runtime/tree/proto-simd-as-hwintrinsic. It's a very rough prototype and only covers a few methods right now, but it does show some positive results:
All of the gains are either from the codegen becoming VEX aware (where we can do three-operand ops and drop extra register moves) or from containment of memory operands.

If we wanted to pursue this further, we would need to account for the ISA levels (AVX2 vs SSE4.2 vs SSE2), as I'm only assuming latest in the prototype, and would likely need a decent sized refactoring of some of the code and eventual removal of the old SIMD codegen paths.

@CarolEidt, do you think this is something worth pursuing further? |
I've done a quick prototype of the equivalent HWIntrinsics implementations to compare against the current Vector.Narrow codegen.

**Benchmark results**

```
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  Job-BAYMLV : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-YQTAVT : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1
COMPlus_EnableAVX2=0
```
For most of the AVX2 implementations, the gains come from delaying the permute step until the end, as discussed above. The current AVX2 long->int JIT implementation:

```asm
vmovupd      ymm1, ymmword ptr[rdx]
vmovupd      ymm2, ymmword ptr[rdx+20h]
vextracti128 ymm3, ymm1, 1h
vextracti128 ymm4, ymm2, 1h
vinserti128  ymm3, ymm4, 1h
vmovaps      ymm4, ymm1
vinserti128  ymm4, ymm2, 1h
vpshufd      ymm3, ymm3, 8h
vpshufd      ymm1, ymm4, 8h
vpunpcklqdq  ymm1, ymm1, ymm3
```

Improved version:

```asm
vmovupd     ymm1, ymmword ptr[rdx]
vmovupd     ymm2, ymmword ptr[rdx+20h]
vpshufd     ymm2, ymm2, 8h
vpshufd     ymm1, ymm1, 8h
vpunpcklqdq ymm1, ymm1, ymm2
vpermq      ymm1, ymm1, d8h
```

There were also some surprise gains in the SSE2 benchmarks. I used the same implementation for all but long->int on the managed side, but we got slightly better codegen with HWIntrinsics. Current SSE2 short->byte JIT implementation:

```asm
vmovupd   xmm1, xmmword ptr[rdx]
vmovupd   xmm2, xmmword ptr[rdx+10h]
vmovaps   xmm1, xmm1
vmovaps   xmm3, xmm2
vpsllw    xmm1, 8h
vpsrlw    xmm1, 8h
vpsllw    xmm3, 8h
vpsrlw    xmm3, 8h
vpackuswb xmm1, xmm3
```

And the HWIntrinsics version, which omits the two redundant vmovaps register moves:

```asm
vmovupd   xmm1, xmmword ptr[rdx]
vmovupd   xmm2, xmmword ptr[rdx+10h]
vpsllw    xmm2, xmm2, 8h
vpsrlw    xmm2, xmm2, 8h
vpsllw    xmm1, xmm1, 8h
vpsrlw    xmm1, xmm1, 8h
vpackuswb xmm1, xmm1, xmm2
```
|
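For reference, the improved long->int sequence above corresponds roughly to the following HWIntrinsics code (a sketch with my own naming, not taken from the prototype): per-lane dword shuffle, one interleave, then a single cross-lane permute at the end.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Avx2NarrowLong
{
    // long->int narrow matching the improved listing above.
    static Vector256<int> Narrow(Vector256<long> lo, Vector256<long> hi)
    {
        // vpshufd ymm, ymm, 8h: move the low dword of each qword to the lane's low half.
        Vector256<int> a = Avx2.Shuffle(lo.AsInt32(), 0x08);
        Vector256<int> b = Avx2.Shuffle(hi.AsInt32(), 0x08);

        // vpunpcklqdq: interleave the packed halves per lane.
        Vector256<long> interleaved = Avx2.UnpackLow(a.AsInt64(), b.AsInt64());

        // vpermq ymm, ymm, d8h: restore overall element order.
        return Avx2.Permute4x64(interleaved, 0xD8).AsInt32();
    }
}
```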
This is likely a fairly small change, cost-wise, and so I've marked it accordingly. The process for making the changes should be something akin to:
|
I believe #60094 addressed this fully |
Yeah, I believe so as well. Feel free to re-open if there are any additional improvements that could happen here. |
I was trying to work out why I wasn't getting the performance expected from Vector.Narrow. This code...

...generates:

Is this correct (performance wise)?
/cc @mikedn @CarolEidt
category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium