Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unrolling for LocateLastFoundChar and LocateLastFoundByte #46977

Merged
1 commit merged into from
Jan 20, 2021

Conversation

SingleAccretion
Copy link
Contributor

@SingleAccretion SingleAccretion commented Jan 14, 2021

Despite what the comments say, the pattern employed by these methods is not recognized by the Jit, because the loop will not have a flag LPFLG_SIMD_LIMIT set on it, which only happens for loops with limits that have GTF_ICON_SIMD_COUNT set on them, which in turn is only set for Vector_.Count nodes during importation.

This can be confirmed by looking at the assembly:

; Method LocateLastFoundChar(System.Numerics.Vector`1[UInt16]):int
G_M37433_IG01:
       sub      rsp, 56
       vzeroupper 
G_M37433_IG02:
       vmovupd  xmm0, xmmword ptr [rcx]
       mov      eax, 1
G_M37433_IG03:
       cmp      eax, 2
       jae      SHORT G_M37433_IG06
       vmovupd  xmmword ptr [rsp+28H], xmm0
       mov      rdx, qword ptr [rsp+8*rax+28H]
       test     rdx, rdx
       jne      SHORT G_M37433_IG04
       dec      eax
       test     eax, eax
       jge      SHORT G_M37433_IG03
G_M37433_IG04:
       or       rdx, 1
       bsr      rdx, rdx
       sar      edx, 4
       lea      eax, [rdx+4*rax]
G_M37433_IG05:
       add      rsp, 56
       ret      
G_M37433_IG06:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code: 68

This PR fixes the issue by introducing a new loop counter that satisfies the Jit's requirements. I have also collected a small benchmark (source can be found here: https://gist.github.com/SingleAccretion/d17ef1b0e885b50fffed47c82afdc29e). This is only for illustrative/sanity checking purposes, as:

  1. It seems to be sensitive to alignment.
  2. It was collected on a machine without AVX2.
  3. It is a microbenchmark of a method that will be inlined into another method and never be used on its own.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4820K CPU 3.70GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
[Host] : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT

Method match Mean Error StdDev Median Ratio RatioSD Code Size
FindLastCharOld <0, 0(...)5535> [28] 0.7789 ns 0.0521 ns 0.0512 ns 0.7866 ns 1.24 0.15 68 B
FindLastCharNew <0, 0(...)5535> [28] 0.6026 ns 0.0501 ns 0.0702 ns 0.6168 ns 1.00 0.00 63 B
FindLastCharOld <6553(...)0, 0> [28] 1.6021 ns 0.0624 ns 0.0584 ns 1.6234 ns 2.05 0.17 68 B
FindLastCharNew <6553(...)0, 0> [28] 0.7893 ns 0.0534 ns 0.0572 ns 0.7439 ns 1.00 0.00 63 B

MultimodalDistribution
LocateLastFoundCharBenchmarks.FindLastCharOld: Default -> It seems that the distribution is bimodal (mValue = 3.25)
LocateLastFoundCharBenchmarks.FindLastCharNew: Default -> It seems that the distribution is bimodal (mValue = 3.29)
LocateLastFoundCharBenchmarks.FindLastCharNew: Default -> It seems that the distribution can have several modes (mValue = 3.09)

Notes:

  • Placing i-- at the end of the loop body is deliberate: for (int j = 0; j < Vector<ulong>.Count; j++, i--) breaks the recognition of the pattern (for (int j = 0; j < Vector<ulong>.Count; i--, j++) works).
  • This change should result in machine code size regression on AVX2-enabled machines. Sharplab suggests it will not be significant.
  • There is some duplication of code between the Byte and Char versions which only differ in their last line. The methods could be folded into one without performance loss. However, this would disrupt the nice "Char/Byte" pattern that the source files have. My personal opinion here would be that it's not worth it.
  • It could be argued that fixing the source here is wrong, and we should really fix the Jit. I think that's a fair opinion, at the same time, fixing the Jit here seems to be non-trivial, and a cursory look on source.dot.net for places where Vector<T>.Count is used suggests that this problematic pattern is not widespread (only the fixed methods show up). The unrolling today is laser-focused on the for (int i = 0; i < Vector<T>.Count; i++) case, and I think it makes sense to leave it that way until more significant investment in the area is determined to be needed.

@stephentoub
Copy link
Member

@ghost
Copy link

ghost commented Jan 20, 2021

Hello @tannergooding!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit ac19cae into dotnet:master Jan 20, 2021
@stephentoub
Copy link
Member

It's great that we could workaround this, but is this something that should have worked via the JIT? i.e. is there an issue somewhere tracking making the original pattern "just work"?

@tannergooding
Copy link
Member

@stephentoub, we have a few issues tracking loop unrolling support (see under Loop Unrolling): #43549
Today, it's basically limited to a single pattern involving 0 to *.Count where * is Vector<T>, Vector64<T>, Vector128<T>, or Vector256<T>.

@SingleAccretion SingleAccretion deleted the Fix-unrolling-in-SpanHelpers branch February 10, 2021 20:12
@ghost ghost locked as resolved and limited conversation to collaborators Mar 12, 2021
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants