Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of CSE for GT_CNS_INT benefits ARM64 #39096

Merged
merged 8 commits into from
Jul 24, 2020

Conversation

briansull
Copy link
Contributor

@briansull briansull commented Jul 10, 2020

Implementation of code size optimization, CSE of constant values for ARM64
We will share a single CSE for constants that differ only in their low 12 bits on ARM64

Number of shared constant low bits set in target.h  CSE_CONST_SHARED_LOW_BITS
we use 12 bits on Arm platforms and 16 bits on XArch platforms

Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32
as it hits  Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723

Config variable: COMPlus_JitConstCSE
// Default 0, enable the CSE of Constants, including nearby offsets. (only for ARM64)
// If 1, disable all the CSE of Constants
// If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64)
// If 3, enable the CSE of Constants including nearby offsets. (all platforms)
// If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms)

Don't auto retry when using the AltJit as that results in two copies of every method in the output file
Added additional Priority 0 test coverage for Floating Point optimizations

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 10, 2020
@briansull
Copy link
Contributor Author

briansull commented Jul 10, 2020

ARM64 codesize diffs:

Crossgen CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for x64 protononjit.dll
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: -521348 (-1.01% of base)
    diff is an improvement.
Top file improvements (bytes):
     -292448 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-5.97% of base)
      -62256 : Microsoft.CodeAnalysis.VisualBasic.dasm (-1.02% of base)
      -57392 : Microsoft.CodeAnalysis.CSharp.dasm (-1.02% of base)
      -13672 : System.Private.Xml.dasm (-0.31% of base)
      -11656 : System.Private.CoreLib.dasm (-0.22% of base)
      -10368 : System.Private.DataContractSerialization.dasm (-0.98% of base)
       -8976 : System.Management.dasm (-1.85% of base)
       -7956 : System.Data.Common.dasm (-0.50% of base)
       -7616 : System.DirectoryServices.dasm (-1.35% of base)
       -6796 : Microsoft.CodeAnalysis.dasm (-0.35% of base)
       -6660 : System.Linq.Parallel.dasm (-0.93% of base)
       -5980 : System.DirectoryServices.AccountManagement.dasm (-1.60% of base)
       -4904 : System.Net.Http.dasm (-0.61% of base)
       -4696 : System.Data.OleDb.dasm (-1.19% of base)
       -3392 : System.Data.Odbc.dasm (-1.27% of base)
       -3048 : System.Text.RegularExpressions.dasm (-1.15% of base)
       -2392 : System.Net.Security.dasm (-0.77% of base)
       -2340 : System.Drawing.Common.dasm (-0.54% of base)
       -2100 : System.ComponentModel.TypeConverter.dasm (-0.61% of base)
       -2076 : System.Reflection.Metadata.dasm (-0.44% of base)

Top file regressions (bytes):
        3648 : System.Linq.dasm (1.06% of base)
        3628 : System.CodeDom.dasm (1.48% of base)
        3508 : System.Configuration.ConfigurationManager.dasm (0.78% of base)
        2948 : System.ComponentModel.Composition.dasm (0.88% of base)
        2276 : System.Security.Cryptography.Xml.dasm (1.10% of base)
        1448 : xunit.runner.utility.netcoreapp10.dasm (0.67% of base)
        1116 : System.Diagnostics.TraceSource.dasm (2.01% of base)
Top method improvements (bytes):
       -6860 (-16.23% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.ClrTraceEventParser:EnumerateTemplates(System.Func`3[[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Microsoft.Diagnostics.Tracing.EventFilterResponse, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]],System.Action`1[[Microsoft.Diagnostics.Tracing.TraceEvent, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]):this
       -6652 (-14.95% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.ClrPrivateTraceEventParser:EnumerateTemplates(System.Func`3[[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Microsoft.Diagnostics.Tracing.EventFilterResponse, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]],System.Action`1[[Microsoft.Diagnostics.Tracing.TraceEvent, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]):this
       -6172 (-10.48% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.KernelTraceEventParser:EnumerateTemplates(System.Func`3[[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Microsoft.Diagnostics.Tracing.EventFilterResponse, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]],System.Action`1[[Microsoft.Diagnostics.Tracing.TraceEvent, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]):this
       -5956 (-21.35% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.AspNet.AspNetTraceEventParser:EnumerateTemplates(System.Func`3[[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Microsoft.Diagnostics.Tracing.EventFilterResponse, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]],System.Action`1[[Microsoft.Diagnostics.Tracing.TraceEvent, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]):this
       -2988 (-12.64% of base) : System.Private.CoreLib.dasm - HtmlEntities:.cctor()
       -2156 (-3.49% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.CtfTraceEventSource:InitEventMap():System.Collections.Generic.Dictionary`2[[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Microsoft.Diagnostics.Tracing.ETWMapping, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]
       -2080 (-3.10% of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.DesktopAssemblyIdentityComparer:.cctor() (2 methods)
       -2012 (-20.91% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.Clr.ClrRundownTraceEventParser:EnumerateTemplates(System.Func`3[[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Microsoft.Diagnostics.Tracing.EventFilterResponse, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]],System.Action`1[[Microsoft.Diagnostics.Tracing.TraceEvent, Microsoft.Diagnostics.Tracing.TraceEvent, Version=2.0.49.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]):this
       -1784 (-19.70% of base) : System.Private.DataContractSerialization.dasm - System.Xml.XmlBinaryReader:ReadNode():bool:this
       -1664 (-6.44% of base) : System.Private.Xml.dasm - System.Xml.Xsl.IlGen.XmlILMethods:.cctor()
       -1640 (-10.85% of base) : System.DirectoryServices.AccountManagement.dasm - System.DirectoryServices.AccountManagement.ADStoreCtx:.cctor()

@briansull
Copy link
Contributor Author

@sandreenko PTAL
@dotnet/jit-contrib

@briansull
Copy link
Contributor Author

This feature is ready for check in

@sandreenko PTAL
@dotnet/jit-contrib

Copy link
Contributor

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good in general, a few questions.
Also, do you have top method improvements by percentage available?

@@ -2707,6 +2707,11 @@ void Compiler::fgInitArgInfo(GenTreeCall* call)
indirectCellAddress->AsIntCon()->gtTargetHandle = (size_t)call->gtCallMethHnd;
#endif
indirectCellAddress->SetRegNum(REG_R2R_INDIRECT_PARAM);
#ifdef TARGET_ARM
// Don't attempt to CSE this constant
// This hits an assert: Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please open an issue with a repro steps about that and place its number instead of line that could change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I will

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constant has a specific register requirement, and I think that the same constant is used in another context with a different register requirement. So presumably the CSE value then has a conflict because it needs to be copied to both registers. I would probably restate this as:
"This constant has specific register requirements, and LSRA doesn't currently correctly handle them when the value is in a CSE'd local." or something to that effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that that describe the issue, It only happens on Arm32, with physical register r4, we have a similar thing on Arm64 with a physical register r11 and that one works fortunately.

src/coreclr/src/jit/optcse.cpp Outdated Show resolved Hide resolved
@@ -502,50 +502,61 @@ void ZapInfo::CompileMethod()
ULONG cCode;

#ifdef ALLOW_SXS_JIT_NGEN
bool expectedAltJitFailure = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this fallback after other changes in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever I generate AltJit jit-diffs, we always get two copies of every method.
This fixes that issue. I could move this into it's own checkin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please. I have checked that I see that behavior (crossgen only), but I will need more time to understand your fix, so it will be nice not to block the PR because of this change.

@erozenfeld
Copy link
Member

Nice size improvement! Can you update your diffs with the number of improved/regressed methods? Also, can you include an analysis of regressions?

@briansull
Copy link
Contributor Author

briansull commented Jul 13, 2020

@erozenfeld A typical regression is where we see a code size increase due to the hoisting of the method addresses from in a loop:

image

image

image

@briansull
Copy link
Contributor Author

briansull commented Jul 13, 2020

Crossgen CodeSize Diffs for System.Private.CoreLib.dll for x64 protononjit.dll
Total bytes of diff: -11656 (-0.22% of base)    diff is an improvement.
1457 total methods with Code Size differences (976 improved, 481 regressed), 25663 unchanged.

@briansull
Copy link
Contributor Author

Crossgen CodeSize Diffs for framework assemblies for x64 protononjit.dll
Total bytes of diff: -521348 (-1.01% of base)    diff is an improvement.
22083 total methods with Code Size differences (13660 improved, 8423 regressed), 176762 unchanged.

@TamarChristinaArm
Copy link
Contributor

@briansull out of curiosity, is the address of HIGH RELOC different in these cases? I assume so?
Do you also happen to know where the extra mov comes from? it seems to want to use x11, Also did it run out of temp registers before it assigned the callee saved one? That snippet seems to indicate it had plenty of temp registers left? But I guess that function is larger than what's shown?

Copy link
Contributor

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few questions/comments.

src/coreclr/src/jit/optcse.cpp Outdated Show resolved Hide resolved
@@ -2,7 +2,7 @@
<PropertyGroup>
<OutputType>Exe</OutputType>
<RestorePackages>true</RestorePackages>
<CLRTestPriority>1</CLRTestPriority>
<CLRTestPriority>0</CLRTestPriority>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you going to revert test changes before merge or want to keep them in pri0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we need this extra test coverage

src/coreclr/src/jit/optcse.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/optcse.cpp Show resolved Hide resolved
src/coreclr/src/jit/optcse.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/optcse.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/optcse.cpp Outdated Show resolved Hide resolved

BasicBlock* blk = lst->tslBlock;
const BasicBlock::weight_t curWeight = blk->getBBWeight(m_pCompiler);
if (isConstCSE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it would be nice to shrink this function into smaller in the future, it was already 500+ lines, not it is 800+,

Copy link
Contributor

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, size diffs are great. Please run jit-stress pipile before merge + a few nits (open an issue about lsra assert instead of referencing a line number, replace magic const 8, see if you can avoid reimplementing sort).

Thanks for the answers and extracting other changes out of this PR.

Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though it would be great to add constant definitions for the option values, and an updated comment for the LSRA issue.

@@ -2707,6 +2707,11 @@ void Compiler::fgInitArgInfo(GenTreeCall* call)
indirectCellAddress->AsIntCon()->gtTargetHandle = (size_t)call->gtCallMethHnd;
#endif
indirectCellAddress->SetRegNum(REG_R2R_INDIRECT_PARAM);
#ifdef TARGET_ARM
// Don't attempt to CSE this constant
// This hits an assert: Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constant has a specific register requirement, and I think that the same constant is used in another context with a different register requirement. So presumably the CSE value then has a conflict because it needs to be copied to both registers. I would probably restate this as:
"This constant has specific register requirements, and LSRA doesn't currently correctly handle them when the value is in a CSE'd local." or something to that effect.

src/coreclr/src/jit/optcse.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/optcse.cpp Outdated Show resolved Hide resolved
@briansull briansull closed this Jul 15, 2020
@briansull briansull reopened this Jul 15, 2020
@briansull briansull force-pushed the cse-const-shared-part2 branch 2 times, most recently from db65611 to 3b6c32b Compare July 16, 2020 00:02
@briansull briansull closed this Jul 16, 2020
@briansull briansull reopened this Jul 16, 2020
@briansull
Copy link
Contributor Author

Manually Ran the JitStress job again

@briansull
Copy link
Contributor Author

My JitStress run is failing due to
#38847

…ARM64

Implementation of code size optimization, CSE of constant values for ARM64
We will share a single CSE for constants that differ only in their low 12 bits on ARM64

Number of shared constant low bits set in target.h  CSE_CONST_SHARED_LOW_BITS
we use 12 bits on Arm platforms and 16 bits on XArch platforms

Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32
as it hits  Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723

Config variable: COMPlus_JitConstCSE
// Default 0: enable the CSE of Constants, including nearby offsets. (only for ARM64)
// If 1, disable all the CSE of Constants
// If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64)
// If 3, enable the CSE of Constants including nearby offsets. (all platforms)
// If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms)
//
@briansull
Copy link
Contributor Author

briansull commented Jul 23, 2020

@BruceForstall It looks like a couple of known errors cause these two failures in my JitStress run.
Can you confirm?

runtime-coreclr jitstress (CoreCLR Pri1 Runtime Tests Run OSX x64 checked)
tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh Timed Out (timeout in milliseconds: 1800000
Issue #38801

runtime-coreclr jitstress (CoreCLR Pri1 Runtime Tests Run Windows_NT x86 checked)
tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.cmd Timed Out (timeout in milliseconds: 1800000
Issue #38847

@BruceForstall
Copy link
Member

@briansull Both of those are known.

@briansull briansull merged commit 90619d5 into dotnet:master Jul 24, 2020
briansull added a commit to briansull/runtime that referenced this pull request Jul 24, 2020
* Change the type of csdHashKey to size_t

* Update gtCostSz and gtCostEx for constant nodes

* Implementation of code size optimization, CSE of constant values for ARM64
Implementation of code size optimization, CSE of constant values for ARM64
We will share a single CSE for constants that differ only in their low 12 bits on ARM64

Number of shared constant low bits set in target.h  CSE_CONST_SHARED_LOW_BITS
we use 12 bits on Arm platforms and 16 bits on XArch platforms

Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32
as it hits  Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723

Config variable: COMPlus_JitConstCSE
// Default 0: enable the CSE of Constants, including nearby offsets. (only for ARM64)
// If 1, disable all the CSE of Constants
// If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64)
// If 3, enable the CSE of Constants including nearby offsets. (all platforms)
// If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms)
//

* Added additional Priority 0 test coverage for Floating Point optimizations

* Fix for COMPLUS_JitConstCSE=4

* Renamed config variable from COMPlus_JitDisableConstCSE to COMPlus_JitConstCSE

* Updated with Codereview feedback, removed sort from Const CSE phase

* Fix for assertionProp issue in the refTypesdynamic test
briansull added a commit that referenced this pull request Jul 24, 2020
* Change the type of csdHashKey to size_t

* Update gtCostSz and gtCostEx for constant nodes

* Implementation of code size optimization, CSE of constant values for ARM64
Implementation of code size optimization, CSE of constant values for ARM64
We will share a single CSE for constants that differ only in their low 12 bits on ARM64

Number of shared constant low bits set in target.h  CSE_CONST_SHARED_LOW_BITS
we use 12 bits on Arm platforms and 16 bits on XArch platforms

Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32
as it hits  Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723

Config variable: COMPlus_JitConstCSE
// Default 0: enable the CSE of Constants, including nearby offsets. (only for ARM64)
// If 1, disable all the CSE of Constants
// If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64)
// If 3, enable the CSE of Constants including nearby offsets. (all platforms)
// If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms)
//

* Added additional Priority 0 test coverage for Floating Point optimizations

* Fix for COMPLUS_JitConstCSE=4

* Renamed config variable from COMPlus_JitDisableConstCSE to COMPlus_JitConstCSE

* Updated with Codereview feedback, removed sort from Const CSE phase

* Fix for assertionProp issue in the refTypesdynamic test
Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
* Change the type of csdHashKey to size_t

* Update gtCostSz and gtCostEx for constant nodes

* Implementation of code size optimization, CSE of constant values for ARM64
Implementation of code size optimization, CSE of constant values for ARM64
We will share a single CSE for constants that differ only in their low 12 bits on ARM64

Number of shared constant low bits set in target.h  CSE_CONST_SHARED_LOW_BITS
we use 12 bits on Arm platforms and 16 bits on XArch platforms

Disable the CSE of the REG_R2R_INDIRECT_PARAM on Arm32
as it hits  Assertion failed 'candidates != candidateBit' in lsra.cpp Line: 3723

Config variable: COMPlus_JitConstCSE
// Default 0: enable the CSE of Constants, including nearby offsets. (only for ARM64)
// If 1, disable all the CSE of Constants
// If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64)
// If 3, enable the CSE of Constants including nearby offsets. (all platforms)
// If 4, enable the CSE of Constants but don't combine with nearby offsets. (all platforms)
//

* Added additional Priority 0 test coverage for Floating Point optimizations

* Fix for COMPLUS_JitConstCSE=4

* Renamed config variable from COMPlus_JitDisableConstCSE to COMPlus_JitConstCSE

* Updated with Codereview feedback, removed sort from Const CSE phase

* Fix for assertionProp issue in the refTypesdynamic test
@kunalspathak
Copy link
Member

@briansull , I just noticed that we don't CSE the offset calculation, do we?

public static uint Compute(uint a)
{
  result += (a * 2981231) + 2981235;
  a += 2981235;
  result += (a * 2981231) + 2981235;
  return result;
}

Before your CSE change, we generated this:

G_M45768_IG02:
        528FADE1          movz    w1, #0x7d6f
        72A005A1          movk    w1, #45 LSL #16
        1B017C01          mul     w1, w0, w1
        528FAE62          movz    w2, #0x7d73
        72A005A2          movk    w2, #45 LSL #16
        0B020021          add     w1, w1, w2
        0B020000          add     w0, w0, w2
        528FADE2          movz    w2, #0x7d6f
        72A005A2          movk    w2, #45 LSL #16
        1B027C00          mul     w0, w0, w2
        0B000020          add     w0, w1, w0
        528FAE61          movz    w1, #0x7d73
        72A005A1          movk    w1, #45 LSL #16
        0B010000          add     w0, w0, w1

After this PR, we started generating this which is better but still has potential to optimize away add w3, w1, #4.

G_M45768_IG02:
        528FADE1          movz    w1, #0x7d6f
        72A005A1          movk    w1, #45 LSL #16
        1B017C02          mul     w2, w0, w1
        11001023          add     w3, w1, #4
        0B030042          add     w2, w2, w3
        11001023          add     w3, w1, #4
        0B030000          add     w0, w0, w3
        1B017C00          mul     w0, w0, w1
        0B020000          add     w0, w0, w2
        11001021          add     w1, w1, #4
        0B010000          add     w0, w0, w1

@briansull
Copy link
Contributor Author

briansull commented Aug 11, 2020

@kunalspathak in your example we create one shared constant CSE that covers the five references
Since we only have one CSE, we only have one register to cover all of these constants.
and we use a single add offset to create two of the constants.

If your example used different constants that differed by more than 2^12 we would have generated two CSE candidates.

public static uint Compute(uint a)
{
  result += (a * 2981231) + 2803511;
  a += 2803511;
  result += (a * 2981231) + 2803511;
  return result;
}

@kunalspathak
Copy link
Member

kunalspathak commented Aug 11, 2020

and we use a single add offset to create two of the constants.

Correct. But since the offset calculation is repeated several times, I was expecting that even that calculation gets CSE.

Since we only have one CSE, we only have one register to cover all of these constants.

I think it depends on what you choose as the shared constant. Is it on first-seen in the code basis? E.g. in below code, constant 2981231 appears first, but 2981235 is the one that is used more frequently.

public static uint[] Compute()
{
    uint[] result  = new uint[5];
    result[0] = 2981231;
    result[1] = 2981235;
    result[2] = 2981235;
    result[3] = 2981235;
    result[4] = 2981235;
    return result;
}

Before this PR, we would have generated this:

G_M30289_IG02:
        D2927400          movz    x0, #0x93a0
        F2BF4700          movk    x0, #0xfa38 LSL #16
        F2CFFF60          movk    x0, #0x7ffb LSL #32
        D28000A1          mov     x1, #5
        97FF3CFA          bl      CORINFO_HELP_NEWARR_1_VC
        528FADE1          movz    w1, #0x7d6f
        72A005A1          movk    w1, #45 LSL #16
        B9001001          str     w1, [x0,#16]
        528FAE61          movz    w1, #0x7d73
        72A005A1          movk    w1, #45 LSL #16
        B9001401          str     w1, [x0,#20]
        B9001801          str     w1, [x0,#24]
        B9001C01          str     w1, [x0,#28]
        B9002001          str     w1, [x0,#32]

However, we now generate this which is longer because of every time we create 2nd constant using add w2, w1, #4. So, perhaps there is a potential of CSE them and/or choose the base constant depending on the usage frequency.

G_M30289_IG02:
        D2927400          movz    x0, #0x93a0
        F2BF4720          movk    x0, #0xfa39 LSL #16
        F2CFFF60          movk    x0, #0x7ffb LSL #32
        D28000A1          mov     x1, #5
        97FF3D00          bl      CORINFO_HELP_NEWARR_1_VC
        528FADE1          movz    w1, #0x7d6f
        72A005A1          movk    w1, #45 LSL #16
        B9001001          str     w1, [x0,#16]
        11001022          add     w2, w1, #4
        B9001402          str     w2, [x0,#20]
        11001022          add     w2, w1, #4
        B9001802          str     w2, [x0,#24]
        11001022          add     w2, w1, #4
        B9001C02          str     w2, [x0,#28]
        11001021          add     w1, w1, #4
        B9002001          str     w1, [x0,#32]

A typical use-case I am imagining is where we load 2 or more addresses (each differ from one another such that it falls under the CSE threshold of offset calculation), 1st address is created once, but the other addresses are used more frequently.

@AndyAyersMS
Copy link
Member

Wonder if there's some residual DONT_CSE on those adds in anticipation of forming address modes on xarch; if so, we should revisit those to see which might need to be relaxed for arm64.

@briansull
Copy link
Contributor Author

briansull commented Aug 12, 2020

@kunalspathak The shared const CSE logic selects either the first constant seen or the lowest constant of the shared constants, if the lower constant would need to use a SUB (subtract) instruction.
The vast majority of cases that I saw in both Jitted and Crossgen code were constant handles that required three instruction to materialize the constant.
In most of the cases I saw we could fold the offset that we were adding directly into the ld or st operation at zero cost.

@karelz karelz added this to the 5.0.0 milestone Aug 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants