[Arm64] Treat methods of non-generic Vector64 and Vector128 classes as intrinsics #40441


@echesakov echesakov commented Aug 6, 2020

I noticed that the condition I added in #39753 for computing fTreatAsRegularMethodCall in zapinfo.cpp based on the class name was too strict: it caused all methods of the non-generic Vector64 and Vector128 classes to be treated as regular methods.

Unfortunately, I didn't notice this back then because the same PR stopped generating bodies of hardware intrinsic methods on Arm64 and, as a consequence, the total size of the System.Private.CoreLib native image still went down.

I believe this change resolves the issue.

System.Private.CoreLib.ni.dll - 12,078,080 bytes before
System.Private.CoreLib.ni.dll - 12,073,472 bytes after

As an example of what this change does:

Before:

; Assembly listing for method System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  3,  3   )   byref  ->  x20
;  V01 arg1         [V01,T00] (  4,  4   )    long  ->  x19
;* V02 loc0         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)
;* V03 loc1         [V03    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)
;  V04 loc2         [V04,T10] (  2,  2   )  simd16  ->   d0         HFA(simd16)
;  V05 loc3         [V05,T11] (  2,  2   )   simd8  ->   d0         HFA(simd8)
;# V06 OutArgs      [V06    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V07 tmp1         [V07,T08] (  2,  4   )  simd16  ->   d0         HFA(simd16)  "struct address for call/obj"
;  V08 tmp2         [V08,T09] (  2,  4   )   simd8  ->   d0         HFA(simd8)  "struct address for call/obj"
;  V09 tmp3         [V09,T02] (  2,  4   )    bool  ->  x11         "Inlining Arg"
;  V10 tmp4         [V10,T05] (  2,  2.50)     ref  ->  x22         class-hnd "Inlining Arg"
;  V11 tmp5         [V11,T06] (  2,  2.50)     ref  ->  x21         class-hnd "Inlining Arg"
;  V12 tmp6         [V12,T03] (  2,  4   )     int  ->   x0         "Inlining Arg"
;  V13 tmp7         [V13,T07] (  3,  1.50)     ref  ->   x0         "argument with side effect"
;  V14 cse0         [V14,T04] (  3,  3   )     ref  ->  x21         "CSE - aggressive"
;
; Lcl frame size = 0

G_M42641_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        A90153F3          stp     x19, x20, [sp,#16]
        A9025BF5          stp     x21, x22, [sp,#32]
        910003FD          mov     fp, sp
        AA0003F4          mov     x20, x0
        AA0103F3          mov     x19, x1
                                                ;; bbWeight=1    PerfScore 4.50
G_M42641_IG02:
        F209A27F          tst     x19, #0xff80ff80ff80ff80
        9A9F17EB          cset    x11, eq
        90000000          adrp    x0, [HIGH RELOC #0x1d078024d20]      // [String handle]
        91000000          add     x0, x0, [LOW RELOC #0x1d078024d20]
        F9400000          ldr     x0, [x0]
        F9400015          ldr     x21, [x0]
        AA1503F6          mov     x22, x21
        350001AB          cbnz    w11, G_M42641_IG04
                                                ;; bbWeight=1    PerfScore 9.50
G_M42641_IG03:
        9000000B          adrp    x11, [HIGH RELOC #0x1d078024ea0]      // [CORINFO_HELP_READYTORUN_STATIC_BASE]
        9100016B          add     x11, x11, [LOW RELOC #0x1d078024ea0]
        F9400160          ldr     x0, [x11]
        D63F0000          blr     x0
        F94B2C00          ldr     x0, [x0,#0x1658]
        D50339BF          dmb     ishld
        AA1603E1          mov     x1, x22
        AA1503E2          mov     x2, x21
        F9400003          ldr     x3, [x0]
        F9402463          ldr     x3, [x3,#72]
        F9401063          ldr     x3, [x3,#32]
        D63F0060          blr     x3
                                                ;; bbWeight=0.25 PerfScore 7.25
G_M42641_IG04:
        AA1303E0          mov     x0, x19
        9000000B          adrp    x11, [HIGH RELOC #0x1d078b809d8]      // [System.Runtime.Intrinsics.Vector128:CreateScalarUnsafe(long):System.Runtime.Intrinsics.Vector128`1[UInt64]]
        9100016B          add     x11, x11, [LOW RELOC #0x1d078b809d8]
        F9400161          ldr     x1, [x11]
        D63F0020          blr     x1
        9000000B          adrp    x11, [HIGH RELOC #0x1d078b80a70]      // [System.Runtime.Intrinsics.Vector128:AsInt16(System.Runtime.Intrinsics.Vector128`1[UInt64]):System.Runtime.Intrinsics.Vector128`1[Int16]]
        9100016B          add     x11, x11, [LOW RELOC #0x1d078b80a70]
        F9400160          ldr     x0, [x11]
        D63F0000          blr     x0
        2E212800          sqxtun  v0.8b, v0.8h
        9000000B          adrp    x11, [HIGH RELOC #0x1d078b80ba0]      // [System.Runtime.Intrinsics.Vector64:AsUInt32(System.Runtime.Intrinsics.Vector64`1[Byte]):System.Runtime.Intrinsics.Vector64`1[UInt32]]
        9100016B          add     x11, x11, [LOW RELOC #0x1d078b80ba0]
        F9400160          ldr     x0, [x11]
        D63F0000          blr     x0
        9000000B          adrp    x11, [HIGH RELOC #0x1d078b80c38]      // [System.Runtime.Intrinsics.Vector64:ToScalar(System.Runtime.Intrinsics.Vector64`1[UInt32]):int]
        9100016B          add     x11, x11, [LOW RELOC #0x1d078b80c38]
        F9400160          ldr     x0, [x11]
        D63F0000          blr     x0
        B9000280          str     w0, [x20]
                                                ;; bbWeight=1    PerfScore 24.50
G_M42641_IG05:
        A9425BF5          ldp     x21, x22, [sp,#32]
        A94153F3          ldp     x19, x20, [sp,#16]
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 4.00

; Total bytes of code 196, prolog size 16, PerfScore 69.35, (MethodHash=7c3e596e) for method System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
; ============================================================

After:

; Assembly listing for method System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  3,  3   )   byref  ->  x20        
;  V01 arg1         [V01,T00] (  4,  4   )    long  ->  x19        
;* V02 loc0         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16) 
;* V03 loc1         [V03    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16) 
;  V04 loc2         [V04,T07] (  2,  2   )  simd16  ->  d16         HFA(simd16) 
;  V05 loc3         [V05,T08] (  2,  2   )   simd8  ->  d16         HFA(simd8) 
;# V06 OutArgs      [V06    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V07 tmp1         [V07,T02] (  2,  4   )    bool  ->  x11         "Inlining Arg"
;  V08 tmp2         [V08,T04] (  2,  2.50)     ref  ->  x22         class-hnd "Inlining Arg"
;  V09 tmp3         [V09,T05] (  2,  2.50)     ref  ->  x21         class-hnd "Inlining Arg"
;* V10 tmp4         [V10    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
;  V11 tmp5         [V11,T06] (  3,  1.50)     ref  ->   x0         "argument with side effect"
;  V12 cse0         [V12,T03] (  3,  3   )     ref  ->  x21         "CSE - aggressive"
;
; Lcl frame size = 0

G_M42641_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        A90153F3          stp     x19, x20, [sp,#16]
        A9025BF5          stp     x21, x22, [sp,#32]
        910003FD          mov     fp, sp
        AA0003F4          mov     x20, x0
        AA0103F3          mov     x19, x1
						;; bbWeight=1    PerfScore 4.50
G_M42641_IG02:
        F209A27F          tst     x19, #0xff80ff80ff80ff80
        9A9F17EB          cset    x11, eq
        90000000          adrp    x0, [HIGH RELOC #0x2549c404d20]      // [String handle]
        91000000          add     x0, x0, [LOW RELOC #0x2549c404d20]
        F9400000          ldr     x0, [x0]
        F9400015          ldr     x21, [x0]
        AA1503F6          mov     x22, x21
        350001AB          cbnz    w11, G_M42641_IG04
						;; bbWeight=1    PerfScore 9.50
G_M42641_IG03:
        9000000B          adrp    x11, [HIGH RELOC #0x2549c404ea0]      // [CORINFO_HELP_READYTORUN_STATIC_BASE]
        9100016B          add     x11, x11, [LOW RELOC #0x2549c404ea0]
        F9400160          ldr     x0, [x11]
        D63F0000          blr     x0
        F94B2C00          ldr     x0, [x0,#0x1658]
        D50339BF          dmb     ishld
        AA1603E1          mov     x1, x22
        AA1503E2          mov     x2, x21
        F9400003          ldr     x3, [x0]
        F9402463          ldr     x3, [x3,#72]
        F9401063          ldr     x3, [x3,#32]
        D63F0060          blr     x3
						;; bbWeight=0.25 PerfScore 7.25
G_M42641_IG04:
        4E081E70          ins     v16.d[0], x19
        2E212A10          sqxtun  v16.8b, v16.8h
        0E043E00          umov    w0, v16.s[0]
        B9000280          str     w0, [x20]
						;; bbWeight=1    PerfScore 6.00
G_M42641_IG05:
        A9425BF5          ldp     x21, x22, [sp,#32]
        A94153F3          ldp     x19, x20, [sp,#16]
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 4.00

; Total bytes of code 136, prolog size 16, PerfScore 44.85, (MethodHash=7c3e596e) for method System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
; ============================================================

@echesakov echesakov changed the title [Arm64] Treat class methods of Vector64 and Vector128 as intrinsics [Arm64] Treat non-generic class methods of Vector64 and Vector128 as intrinsics Aug 6, 2020
@echesakov echesakov force-pushed the Arm64-Treat-Vector64-Vector128-Class-Methods-As-Intrinsics branch from b35f272 to 56af646 Compare August 6, 2020 04:21
@echesakov echesakov changed the title [Arm64] Treat non-generic class methods of Vector64 and Vector128 as intrinsics [Arm64] Treat methods of non-generic Vector64 and Vector128 classes as intrinsics Aug 6, 2020
// On Arm64 AdvSimd ISA is required by CoreCLR, so we can expand Vector64<T> and Vector128<T> generic methods (e.g. Vector64<byte>.get_Zero)
// as well as Vector64 and Vector128 methods (e.g. Vector128.CreateScalarUnsafe).
fTreatAsRegularMethodCall |= !fIsPlatformHWIntrinsic && fIsHWIntrinsic
&& (strstr(className, "Vector64") != className) && (strstr(className, "Vector128") != className);
Contributor Author

This condition could be checked more accurately by

(strcmp(className, "Vector64") != 0) && (strcmp(className, "Vector64`1") != 0) && (strcmp(className, "Vector128") != 0) && (strcmp(className, "Vector128`1") != 0)
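
To illustrate how the prefix test in the diff differs from the exact comparison suggested here, the following is a minimal sketch. The helper names are hypothetical and are not taken from zapinfo.cpp; they only mirror the two expressions quoted in this thread.

```cpp
#include <cstring>

// Hypothetical helper: true when className begins with prefix.
// This is the same test as `strstr(className, prefix) == className` in the diff.
static bool StartsWithViaStrstr(const char* className, const char* prefix)
{
    return std::strstr(className, prefix) == className;
}

// Hypothetical helper: true only for the four exact class names
// listed in the suggested strcmp-based condition.
static bool IsExactVectorClassName(const char* className)
{
    return std::strcmp(className, "Vector64") == 0
        || std::strcmp(className, "Vector64`1") == 0
        || std::strcmp(className, "Vector128") == 0
        || std::strcmp(className, "Vector128`1") == 0;
}
```

With a made-up class name such as "Vector64Extensions" (an assumption used only for illustration), the prefix test accepts the name while the exact comparison rejects it, which is the sense in which the strcmp form is more accurate.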

Member

why strstr rather than strncmp?

Contributor Author

> why strstr rather than strncmp?

@tannergooding What do you mean?

Member

strstr scans the entire string and is more complex: if you have strstr(className, "Vector64") and className happened to be "SomeTextVector64", it wouldn't exit on S and would instead keep searching and match Vector64 at the end, returning a pointer to that match.

strncmp allows you to check for "Vector64" or "Vector128" from the start of the string and stop after a given character count. This means you know the name starts with exactly that text, and if desired you can easily check whether the character right after the prefix is '\0' or whether the remainder is "`1".
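
The strncmp approach described above can be sketched as follows. This is an illustration under assumptions, not the code that was merged: the helper name IsVectorClass and its signature are hypothetical.

```cpp
#include <cstring>

// Hypothetical helper: true when className is exactly `prefix`
// or `prefix` followed by the generic-arity suffix "`1".
static bool IsVectorClass(const char* className, const char* prefix)
{
    const std::size_t prefixLength = std::strlen(prefix);

    // Compare only the first prefixLength characters, starting at the
    // beginning of className; a mismatch ends the comparison immediately.
    if (std::strncmp(className, prefix, prefixLength) != 0)
        return false;

    // Inspect what follows the prefix: nothing (non-generic class)
    // or "`1" (generic class with one type parameter).
    const char* rest = className + prefixLength;
    return (*rest == '\0') || (std::strcmp(rest, "`1") == 0);
}
```

Calling IsVectorClass("SomeTextVector64", "Vector64") returns false because the comparison fails on the first character, whereas IsVectorClass("Vector128`1", "Vector128") returns true.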

Contributor Author

Makes sense

Contributor Author

@tannergooding I did as you suggested - replaced strstr with strncmp. PTAL

@echesakov
Contributor Author

@dotnet/jit-contrib @tannergooding @davidwrighton Can you please take a look?

@echesakov
Contributor Author

@davidwrighton Just to clarify: these changes are not related to our discussion in #32714. These methods were enabled to be treated as intrinsics on Arm64 a while ago (in #38060) under the assumption that AdvSimd is a required baseline.

Member

@kunalspathak kunalspathak left a comment

Thanks for catching it.

Contributor

@CarolEidt CarolEidt left a comment

LGTM - thanks!

…sName startswith "Vector128"` in zapinfo.cpp
@echesakov echesakov merged commit 1db5808 into dotnet:master Aug 8, 2020
@echesakov echesakov deleted the Arm64-Treat-Vector64-Vector128-Class-Methods-As-Intrinsics branch August 8, 2020 00:51
Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
@karelz karelz modified the milestones: 6.0.0, 5.0.0 Aug 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020