Couple optimization to MultiRegStoreLoc #64857

Merged: echesakov merged 2 commits into dotnet:main from StoreLoc-MultiReg-PreferSrcReg on Feb 15, 2022

@echesakov echesakov commented Feb 5, 2022

The diffs are net positive, with some regressions. I looked at a couple of them and do not see how they can be mitigated; they are due to registers being reshuffled in LSRA.

Here is a description of the change:

  1. Prefer allocating each destination of a multi-register GT_STORE_LCL_VAR to the register of its source when the source is a last-use local or is not a local at all (i.e. a multi-reg call or hardware intrinsic). A similar strategy is already used for single-register GT_STORE_LCL_VAR.
  2. Properly check for the last use of a field of a multi-reg local in LinearScan::BuildStoreLocDef().

The second issue can be demonstrated by examples.
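
As an aside on item 1, the preferencing rule can be sketched as a small standalone model. This is an illustration only, not the code in lsrabuild.cpp: `SrcKind`, `MultiRegSource`, `kNoPreference`, and `PreferredDefRegs` are hypothetical names, and registers are modelled as plain integers. The idea is that when the source registers are about to become free (the source is a call, a hardware intrinsic, or a local at its last use), each destination field asks for the register its value already sits in, so that no register-to-register move is needed if the allocator can honor the request.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: these types are hypothetical, not the JIT's data structures.
enum class SrcKind { Local, Call, HWIntrinsic };

struct MultiRegSource
{
    SrcKind          kind;
    bool             isLastUse; // meaningful only when kind == SrcKind::Local
    std::vector<int> regs;      // register number holding each slot of the source
};

const int kNoPreference = -1; // hypothetical sentinel: no register preference

// Returns, for each destination field, the register it would prefer,
// or kNoPreference when the source registers stay live after the store.
std::vector<int> PreferredDefRegs(const MultiRegSource& src, std::size_t fieldCount)
{
    assert(src.regs.size() == fieldCount);

    std::vector<int> prefs(fieldCount, kNoPreference);

    // The source registers free up if the source is not a local,
    // or if it is a local that dies at this store.
    const bool srcRegsFreeUp = (src.kind != SrcKind::Local) || src.isLastUse;
    if (srcRegsFreeUp)
    {
        for (std::size_t i = 0; i < fieldCount; ++i)
        {
            prefs[i] = src.regs[i];
        }
    }
    return prefs;
}
```

For a two-slot call result returned in registers 0 and 1, this model yields the preferences {0, 1}; for a local that is still live after the store, it yields no preference for either field.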

Consider the following program:

using System;
using System.Runtime.CompilerServices;

namespace Runtime_64857
{
    class Program
    {
        struct int64x2
        {
            public ulong _0;
            public ulong _1;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static int64x2 Def()
        {
            return default(int64x2);
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static void Use(ulong x0, ulong x1)
        {
        }

        static void Main(string[] args)
        {
            var val = Def();
            Use(val._0, val._0);
        }
    }
}

The code diffs for Main (on win-arm64):

diff --git a/base.txt b/diff.txt
index d080ead..b5176ec 100644
--- a/base.txt
+++ b/diff.txt
@@ -11,34 +11,31 @@
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SBR] multireg-ret "Return value temp for multireg return"
 ;  V04 tmp2         [V04,T00] (  3,  3   )    long  ->   x0         V01._0(offs=0x00) P-INDEP "field V01._0 (fldOffset=0x0)"
-;  V05 tmp3         [V05,T01] (  1,  1   )    long  ->  x19         V01._1(offs=0x08) P-INDEP "field V01._1 (fldOffset=0x8)"
+;  V05 tmp3         [V05,T01] (  1,  1   )    long  ->   x1         V01._1(offs=0x08) P-INDEP "field V01._1 (fldOffset=0x8)"
 ;* V06 tmp4         [V06    ] (  0,  0   )    long  ->  zero-ref    V03._0(offs=0x00) P-DEP "field V03._0 (fldOffset=0x0)"
 ;* V07 tmp5         [V07    ] (  0,  0   )    long  ->  zero-ref    V03._1(offs=0x08) P-DEP "field V03._1 (fldOffset=0x8)"
 ;
-; Lcl frame size = 8
+; Lcl frame size = 0
 
 G_M3731_IG01:              ;; offset=0000H
-        A9BE7BFD          stp     fp, lr, [sp,#-32]!
-        F9000FF3          str     x19, [sp,#24]
+        A9BF7BFD          stp     fp, lr, [sp,#-16]!
         910003FD          mov     fp, sp
-						;; bbWeight=1    PerfScore 2.50
-G_M3731_IG02:              ;; offset=000CH
+						;; bbWeight=1    PerfScore 1.50
+G_M3731_IG02:              ;; offset=0008H
         94000000          bl      Runtime_64857.Program:Def():int64x2
-        AA0103F3          mov     x19, x1
         AA0003E1          mov     x1, x0
         94000000          bl      Runtime_64857.Program:Use(long,long)
-						;; bbWeight=1    PerfScore 3.00
-G_M3731_IG03:              ;; offset=001CH
-        F9400FF3          ldr     x19, [sp,#24]
-        A8C27BFD          ldp     fp, lr, [sp],#32
+						;; bbWeight=1    PerfScore 2.50
+G_M3731_IG03:              ;; offset=0014H
+        A8C17BFD          ldp     fp, lr, [sp],#16
         D65F03C0          ret     lr
-						;; bbWeight=1    PerfScore 4.00
+						;; bbWeight=1    PerfScore 2.00
 
-; Total bytes of code 40, prolog size 12, PerfScore 13.50, instruction count 10, allocated bytes for code 40 (MethodHash=8297f16c) for method Runtime_64857.Program:Main(System.String[])
+; Total bytes of code 28, prolog size 8, PerfScore 8.80, instruction count 7, allocated bytes for code 28 (MethodHash=8297f16c) for method Runtime_64857.Program:Main(System.String[])

Note that in the base case the variable that corresponds to the promoted field val._1 was (unnecessarily) added to currentLiveVars and, as a consequence, x19 was used to hold the dead value of that variable.

Now consider another program:

using System;
using System.Runtime.CompilerServices;

namespace Runtime_64857
{
    class Program
    {
        struct int64x2
        {
            public ulong _0;
            public ulong _1;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static int64x2 Def()
        {
            return default(int64x2);
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static void TrashRegs()
        {
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static void Use(ulong x0, ulong x1)
        {
        }

        static void Main(string[] args)
        {
            var val = Def();
            TrashRegs();
            Use(val._1, val._1);
        }
    }
}

Here the diffs go the opposite way, in some sense:

diff --git a/base.txt b/diff.txt
index c3b5ab8..181e57b 100644
--- a/base.txt
+++ b/diff.txt
@@ -11,34 +11,36 @@
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SBR] multireg-ret "Return value temp for multireg return"
 ;  V04 tmp2         [V04,T01] (  1,  1   )    long  ->   x0         V01._0(offs=0x00) P-INDEP "field V01._0 (fldOffset=0x0)"
-;  V05 tmp3         [V05,T00] (  3,  3   )    long  ->  [fp+18H]   V01._1(offs=0x08) P-INDEP "field V01._1 (fldOffset=0x8)"
+;  V05 tmp3         [V05,T00] (  3,  3   )    long  ->  x19         V01._1(offs=0x08) P-INDEP "field V01._1 (fldOffset=0x8)"
 ;* V06 tmp4         [V06    ] (  0,  0   )    long  ->  zero-ref    V03._0(offs=0x00) P-DEP "field V03._0 (fldOffset=0x0)"
 ;* V07 tmp5         [V07    ] (  0,  0   )    long  ->  zero-ref    V03._1(offs=0x08) P-DEP "field V03._1 (fldOffset=0x8)"
 ;
-; Lcl frame size = 16
+; Lcl frame size = 8
 
 G_M3731_IG01:              ;; offset=0000H
         A9BE7BFD          stp     fp, lr, [sp,#-32]!
+        F9000FF3          str     x19, [sp,#24]
         910003FD          mov     fp, sp
-						;; bbWeight=1    PerfScore 1.50
-G_M3731_IG02:              ;; offset=0008H
+						;; bbWeight=1    PerfScore 2.50
+G_M3731_IG02:              ;; offset=000CH
         94000000          bl      Runtime_64857.Program:Def():int64x2
-        F9000FA1          str     x1, [fp,#24]
+        AA0103F3          mov     x19, x1
         94000000          bl      Runtime_64857.Program:TrashRegs()
-        F9400FA0          ldr     x0, [fp,#24]	// [V05 tmp3]
-        AA0003E1          mov     x1, x0
+        AA1303E0          mov     x0, x19
+        AA1303E1          mov     x1, x19
         94000000          bl      Runtime_64857.Program:Use(long,long)
-						;; bbWeight=1    PerfScore 6.50
-G_M3731_IG03:              ;; offset=0020H
+						;; bbWeight=1    PerfScore 4.50
+G_M3731_IG03:              ;; offset=0024H
+        F9400FF3          ldr     x19, [sp,#24]
         A8C27BFD          ldp     fp, lr, [sp],#32
         D65F03C0          ret     lr
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 4.00
 
-; Total bytes of code 40, prolog size 8, PerfScore 14.00, instruction count 10, allocated bytes for code 40 (MethodHash=8297f16c) for method Runtime_64857.Program:Main(System.String[])
+; Total bytes of code 48, prolog size 12, PerfScore 15.80, instruction count 12, allocated bytes for code 48 (MethodHash=8297f16c) for method Runtime_64857.Program:Main(System.String[])

In the base case val._1 was NOT added to currentLiveVars, so instead of being assigned a callee-saved register (which would let it survive the call to TrashRegs()), the value had to be spilled to the stack.
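
The two programs above can be reproduced in a toy liveness model. The sketch below is not the JIT's code: `FieldDef`, `LivePerField`, and `LiveSingleFlag` are hypothetical, and the "single flag" variant is only an assumption about one kind of check that would produce both base-case symptoms, not a statement of what the base code did. Each field of the multi-reg local carries its own last-use flag; the per-field version consults that flag, while the faulty version lets field 0's flag decide for every field.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model only: one entry per promoted field defined by the store.
struct FieldDef
{
    int  varIndex;  // stands in for the field's tracked variable (e.g. 4 for V04)
    bool isLastUse; // true if the field is dead after this store
};

// Per-field check: a field becomes live only if this store is not its last use.
std::set<int> LivePerField(const std::vector<FieldDef>& fields)
{
    std::set<int> live;
    for (const FieldDef& f : fields)
    {
        if (!f.isLastUse)
        {
            live.insert(f.varIndex);
        }
    }
    return live;
}

// Hypothetical faulty variant: field 0's flag is applied to every field.
std::set<int> LiveSingleFlag(const std::vector<FieldDef>& fields)
{
    assert(!fields.empty());
    std::set<int> live;
    const bool firstFieldDies = fields[0].isLastUse;
    for (const FieldDef& f : fields)
    {
        if (!firstFieldDies)
        {
            live.insert(f.varIndex);
        }
    }
    return live;
}
```

With the first program (val._0 used later, val._1 dead) the faulty variant marks both fields live, so the dead val._1 competes for a register. With the second program (val._0 dead, val._1 used later) it marks neither live, so val._1 does not appear live across the call to TrashRegs(). The per-field version marks exactly the field that is used later in both cases.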

@echesakov echesakov added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 5, 2022
@echesakov echesakov closed this Feb 6, 2022
@echesakov echesakov reopened this Feb 6, 2022
@echesakov
Contributor Author

/azp run runtime-coreclr outerloop

@ghost ghost assigned echesakov Feb 6, 2022
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@echesakov echesakov closed this Feb 7, 2022
@echesakov echesakov reopened this Feb 7, 2022
@echesakov
Contributor Author

/azp run runtime-coreclr outerloop

…ters as src in src/coreclr/jit/lsrabuild.cpp
…nearScan::BuildStoreLocDef() in src/coreclr/jit/lsrabuild.cpp
@echesakov echesakov force-pushed the StoreLoc-MultiReg-PreferSrcReg branch from 79d3bb6 to d94f2b8 Compare February 11, 2022 19:07
@echesakov echesakov closed this Feb 14, 2022
@echesakov echesakov reopened this Feb 14, 2022
@dotnet dotnet deleted a comment from azure-pipelines bot Feb 14, 2022
@dotnet dotnet deleted a comment from azure-pipelines bot Feb 14, 2022
@dotnet dotnet deleted a comment from azure-pipelines bot Feb 14, 2022
@echesakov echesakov changed the title For MultiRegStoreLoc prefer allocating a local def to the same register as src Couple optimization to MultiRegStoreLoc Feb 15, 2022
@echesakov echesakov marked this pull request as ready for review February 15, 2022 01:31
@echesakov
Contributor Author

@dotnet/jit-contrib PTAL

@dotnet dotnet deleted a comment from azure-pipelines bot Feb 15, 2022
@kunalspathak
Member

/azp run runtime-coreclr jitstressregs

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kunalspathak kunalspathak left a comment

LGTM as long as jitstressregs is clean.

@echesakov
Contributor Author

LGTM as long as jitstressregs is clean.

Taking a look at the linux-arm failures.

@echesakov
Contributor Author

The failures on linux-arm are a known issue (#65395), and the macos-x64 ones look like an infrastructure hiccup.
I could try to re-trigger them.

@echesakov echesakov merged commit 19fc306 into dotnet:main Feb 15, 2022
@echesakov echesakov deleted the StoreLoc-MultiReg-PreferSrcReg branch February 15, 2022 23:33
@EgorBo
Member

EgorBo commented Feb 17, 2022

Improvements on arm64: dotnet/perf-autofiling-issues#3565

BruceForstall added a commit to BruceForstall/runtime that referenced this pull request Feb 26, 2022
Change dotnet#64857 exposed an existing problem: when generating code
for a multi-reg GT_STORE_LCL_VAR, if the first register slot was not
enregistered but the second or subsequent slots were, and those non-first
slots contained GC pointers, we wouldn't properly add those GC pointers
to the GC tracking sets. This led to cases where the register lifetimes
would be killed in the GC info before the actual lifetime was complete.

The primary fix is to make `gtHasReg()` handle the `IsMultiRegLclVar()`
case. As a side-effect, this fixes some LSRA dumps that weren't displaying
multiple registers properly.
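
The difference between asking "does the first slot have a register?" and "does any slot have a register?" can be shown with a small standalone sketch. It is an assumption-laden illustration, not the body of `gtHasReg()`: `kNoReg`, `HasRegFirstSlotOnly`, and `HasRegAnySlot` are hypothetical names, and a node's slots are modelled as a vector of register numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

const int kNoReg = -1; // hypothetical sentinel: this slot is not enregistered

// Looks only at slot 0, so a node whose first slot lives on the stack
// is reported as having no register at all.
bool HasRegFirstSlotOnly(const std::vector<int>& slotRegs)
{
    assert(!slotRegs.empty());
    return slotRegs[0] != kNoReg;
}

// Multi-reg aware: the node has a register if any of its slots does.
bool HasRegAnySlot(const std::vector<int>& slotRegs)
{
    return std::any_of(slotRegs.begin(), slotRegs.end(),
                       [](int reg) { return reg != kNoReg; });
}
```

For slots {kNoReg, 20}, the first function answers false and the second answers true. In this model, a caller that gates per-slot bookkeeping on the first answer would skip slot 1 entirely, which mirrors the situation the commit message describes.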

There are about 50 SPMI asm diffs on win-arm64 where register lifetimes
get extended, fixing GC holes.

I also made `GetMultiRegCount()` handle the `IsMultiRegLclVar()` case.

I made a number of cleanup changes along the way:
1. Fixed two cases of calling `gcInfo.gcMarkRegSetNpt` with regNumber, not regMaskTP
2. Marked some functions `const`
3. Improved some comments
4. Changed "ith" to "i'th" in comments which still doesn't read great,
but at least I'm not left trying to parse "ith" as an English word.
5. Use `OperIsScalarLocal()` more broadly
6. Renamed `gtDispRegCount` to `gtDispMultiRegCount` to make it clear
it only applies to the multi-reg case.

Fixes dotnet#65476.
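
Cleanup item 1 above concerns passing a register number where a register mask is expected. The following sketch shows why that kind of mix-up is easy to miss when both are integer-like types; `RegNum`, `RegMask`, `MaskOf`, and `MaskContains` are hypothetical stand-ins, not the JIT's `regNumber` and `regMaskTP` definitions.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-ins: a register number is an index, a register mask has one bit per register.
typedef unsigned      RegNum;
typedef std::uint64_t RegMask;

// Converts a register number into the single-bit mask that represents it.
inline RegMask MaskOf(RegNum reg)
{
    assert(reg < 64);
    return static_cast<RegMask>(1) << reg;
}

// Tests whether a mask includes the given register.
inline bool MaskContains(RegMask mask, RegNum reg)
{
    return (mask & MaskOf(reg)) != 0;
}
```

In this model, register 19 corresponds to the mask 0x80000. If the number 19 is instead used directly as a mask, its bit pattern (binary 10011) selects registers 0, 1, and 4 and does not include register 19 at all, so an operation meant for one register silently applies to three unrelated ones.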
BruceForstall added a commit that referenced this pull request Mar 1, 2022
* Fix GC hole with multi-reg local var stores

* Update src/coreclr/jit/gentree.cpp

Co-authored-by: Kunal Pathak <Kunal.Pathak@microsoft.com>
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022