Reduce number of jump-stubs on ARM64 via smaller preserved space #63842

Closed

Conversation

@EgorBo (Member) commented Jan 16, 2022:

In #62302 (comment) I realized that all internal calls go through jump-stubs (basically double calls) because managed code can't reach them within ARM64's 128MB branch range.
[image]
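For background (an editorial note, not part of the original thread): ARM64's bl instruction encodes a signed 26-bit word offset, so a direct call can only reach targets within ±128MB of the call site; anything farther has to go through a jump stub that loads the full 64-bit address into a register and branches indirectly. A minimal sketch of the reachability rule, with illustrative names rather than CoreCLR's actual helpers:

#include <cstdint>

// ARM64 bl/b: signed 26-bit immediate scaled by the 4-byte instruction size,
// giving a reach of +/-2^27 bytes = +/-128MB from the call site.
static bool WithinArm64BranchRange(uintptr_t callSite, uintptr_t target)
{
    intptr_t delta = (intptr_t)target - (intptr_t)callSite;
    return delta >= -(intptr_t)0x8000000 && delta < (intptr_t)0x8000000;
}

// An out-of-range call is routed through a jump stub of roughly this shape
// (register choice illustrative):
//     ldr x16, pc+8    ; load the 64-bit target stored after the stub
//     br  x16
//     .quad <target>   ; hence the "double call": bl to the stub, then br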

Even an empty console app emits 35 jump stubs. During the investigation, @jakobbotsch suggested changing these limits, and it helped: many micro- and macro-benchmarks improved significantly.

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    [Arguments(3.14)]
    public double Test(double d) => Math.Cos(d) * Math.Sin(d) * Math.Tan(d); // 3 InternalCalls
}

Results on Apple M1 arm64:

|  Method |        Job |               Toolchain |    d |      Mean |
|-------- |----------- |------------------------ |----- |----------:|
|    Test | Job-UWEEFQ |   /Core_Root_PR/corerun | 3.14 |  9.884 ns |
|    Test | Job-HATVTO | /Core_Root_base/corerun | 3.14 | 28.235 ns |

(~3x faster with the PR)

Techempower (linux-arm64):
[image]

cc @jkotas @jakobbotsch

@jakobbotsch (Member) commented:
There is a similar calculation for Windows here:

void ExecutableAllocator::InitLazyPreferredRange(size_t base, size_t size, int randomPageOffset)
{
#if USE_LAZY_PREFERRED_RANGE

#ifdef _DEBUG
    // If GetForceRelocs is enabled we don't constrain the pMinAddr
    if (PEDecoder::GetForceRelocs())
        return;
#endif

    //
    // If we are using USE_LAZY_PREFERRED_RANGE then we try to allocate memory close
    // to coreclr.dll. This avoids having to create jump stubs for calls to
    // helpers and R2R images loaded close to coreclr.dll.
    //
    SIZE_T reach = 0x7FFF0000u;

    // We will choose the preferred code region based on the address of coreclr.dll. The JIT helpers
    // in coreclr.dll are the most heavily called functions.
    g_preferredRangeMin = (base + size > reach) ? (BYTE *)(base + size - reach) : (BYTE *)0;
    g_preferredRangeMax = (base + reach > base) ? (BYTE *)(base + reach) : (BYTE *)-1;

    BYTE * pStart;

    if (base > UINT32_MAX)
    {
        // Try to occupy the space as far as possible to minimize collisions with other ASLR assigned
        // addresses. Do not start at g_codeMinAddr exactly so that we can also reach common native images
        // that can be placed at higher addresses than coreclr.dll.
        pStart = g_preferredRangeMin + (g_preferredRangeMax - g_preferredRangeMin) / 8;
    }
    else
    {
        // clr.dll missed the base address?
        // Try to occupy the space right after it.
        pStart = (BYTE *)(base + size);
    }

    // Randomize the address space
    pStart += GetOsPageSize() * randomPageOffset;

    g_lazyPreferredRangeStart = pStart;
    g_lazyPreferredRangeHint = pStart;
#endif
}
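To make the arithmetic above concrete, here is a standalone sketch using a hypothetical coreclr.dll base address and the x64 rel32 reach from the code (all values and names illustrative):

#include <cstdint>
#include <cstdio>

int main()
{
    const uintptr_t base  = 0x00007FFA12340000; // hypothetical coreclr.dll base
    const uintptr_t size  = 0x00800000;         // hypothetical image size (8MB)
    const uintptr_t reach = 0x7FFF0000;         // ~2GB, as in the code above

    // Anything in [rangeMin, rangeMax) can reach coreclr.dll with a rel32 call.
    uintptr_t rangeMin = (base + size > reach) ? base + size - reach : 0;
    uintptr_t rangeMax = base + reach; // overflow handling omitted in this sketch

    // Start 1/8 of the way into the range: far from common ASLR-assigned bases,
    // yet still able to reach images loaded above coreclr.dll.
    uintptr_t start = rangeMin + (rangeMax - rangeMin) / 8;

    printf("preferred range [%p, %p), start hint %p\n",
           (void *)rangeMin, (void *)rangeMax, (void *)start);
    return 0;
}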

(diff context for the review comments below:)

#else
// Smaller values for ARM64 where relative calls/jumps only work within 128MB
static const int32_t CoreClrLibrarySize = 32 * 1024 * 1024;
static const int32_t MaxExecutableMemorySize = 0x7FF0000; // 128MB - 64KB
Member:

How much of this range gets consumed in a real-world ASP.NET app?

Note that we map all sorts of stuff into this range, including R2R images. I would expect that 128MB gets exhausted fairly quickly given how things work today.

I agree that this fix works well for micro-benchmarks that are very unlikely to exhaust the 128MB range.

@EgorBo (Member Author) Jan 16, 2022:

I'll try to investigate. I guess the idea is that hot code (tier 1) is better off close to the VM's FCalls by default?
For large apps, I guess we still want to pursue your suggestion of emitting direct addresses without jump-stubs in tier 1 (#62302, I have a draft).

For R2R-only, could it be improved by PGO + method sorting?

Member:

I do not think R2R code today benefits from being close to coreclr, as all 'external' calls/branches have to go through indirection cells anyway. This may have been different in the days of fragile NGen?
Please correct me if I'm wrong, @jkotas.

Member:

> For large apps, I guess we still want to pursue your suggestion of emitting direct addresses without jump-stubs

I do not think it is just for large apps. With TC enabled, all managed->managed method calls go through a precode that has the exact same instructions as a jump stub, so it introduces a bottleneck similar to the one you have identified.

> For R2R-only, could it be improved by PGO + method sorting?

R2R images are generally smaller than 128MB, and you can only sort within the image, so sorting won't help with jump stubs. (Sorting within the image is still good for locality.)

Also, once we get this all fixed, we may want to look at retuning the inliner. My feeling is that the inliner expands the code too much these days; some of that may just be compensating for the extra method-call overhead that we are paying today.

Member:

> I do not think R2R code today benefits from being close to coreclr, as all 'external' calls/branches have to go through indirection cells anyway.

Calls from runtime-generated stubs and JITed code to R2R code still benefit from the two being close.

Member:

> JITed code to R2R code

Do these not go through an indirection when tiering is enabled?

Member:

> Do these not go through an indirection when tiering is enabled?

Yes when tiering is enabled; no when tiering is disabled.

@EgorBo (Member Author):

> With TC enabled, all managed->managed method calls go through a precode that has the exact same instructions as a jump stub, so it introduces a bottleneck similar to the one you have identified.

I'll start with your suggestion to emit direct calls when a tier-1 caller calls a tier-1 callee (not as part of this PR).
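A rough sketch of that idea (hypothetical helper, not the actual JIT/VM change): once the callee has reached its final tier, its entry point is stable, so the call could be emitted as a direct bl whenever the callee is within branch range, bypassing the fixup precode:

#include <cstdint>

// Hypothetical decision rule for "tier-1 caller calls tier-1 callee";
// the real logic would live in the JIT/VM call-site resolution.
static bool CanEmitDirectCall(uintptr_t callSite, uintptr_t calleeEntry,
                              bool calleeIsFinalTier)
{
    // The callee's entry point must no longer move (no further tiering),
    // and it must be reachable by a direct ARM64 bl (+/-128MB).
    intptr_t delta = (intptr_t)calleeEntry - (intptr_t)callSite;
    bool inRange = delta >= -(intptr_t)0x8000000 && delta < (intptr_t)0x8000000;
    return calleeIsFinalTier && inRange;
}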

@EgorBo (Member Author) commented Jan 16, 2022:

Allocations also seem a bit faster:

[Benchmark]
public object Alloc() => new object();

| Method |        Job |               Toolchain |     Mean |
|------- |----------- |------------------------ |---------:|
|  Alloc | Job-HBDERT |      /Core_Root/corerun | 2.732 ns |
|  Alloc | Job-UQOGQF | /Core_Root_base/corerun | 3.115 ns |

@EgorBo (Member Author) commented Jan 16, 2022:

Techempower:
[image]

PS: DOTNET_GCgen0size=1E00000 was used for all runs, for both base and PR.

@EgorBo (Member Author) commented Jan 16, 2022:

Hm, even with this change I still see jump-stubs in Plaintext-Plaintext-default (precode fixups are selected just to highlight the other problem):
[image]

Plaintext-MVC:
[image]

but it looks like it's not the actual jump-stubs, rather the function that creates them 😐

(diff context for the review comments below:)

    SIZE_T reach = 0x7FFF0000u;
#else
    // Smaller size for ARM64 where relative calls/jumps only work within 128MB
    SIZE_T reach = 0x7FF0000u;
Member:

This may collide with other ASLR-assigned addresses and lead to non-trivial private working set hits for native .dlls or R2R images that are loaded after coreclr is initialized.

We try to start as far as possible from the coreclr base address to avoid that situation on x64. Look for the comment in the code above: "Try to occupy the space as far as possible to minimize collisions with other ASLR assigned addresses."

@EgorBo (Member Author) Jan 17, 2022:

Are you saying that win-arm64 is fine as is?

I am still trying to understand the problem. E.g. we reserve 2GB via, I assume, VirtualAlloc; how come we end up with a large distance between coreclr code and jitted code in memory, even for a hello-world?
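(Editorial note: the mechanism being asked about is an address-hinted reservation. A minimal Windows sketch, with a hypothetical ReserveNear helper; CoreCLR's ExecutableAllocator is considerably more involved:)

#include <windows.h>

// Try to reserve address space at a specific hint near a module.
// VirtualAlloc with an explicit address fails if that range is already
// occupied, so a real allocator probes several candidates and eventually
// falls back to letting the OS choose - which can land far away from
// coreclr, forcing jump stubs for calls back into it.
static void *ReserveNear(void *hint, SIZE_T size)
{
    return VirtualAlloc(hint, size, MEM_RESERVE, PAGE_NOACCESS);
}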

@EgorBo (Member Author):

Just checked: win-arm64 emits fewer jump-stubs for a completely empty app, only 4 (two inside StelemRef and two inside IndexOf).

@jkotas (Member) Jan 17, 2022:

> Are you saying that win-arm64 is fine as is?

I am saying that this change potentially replaces one performance problem with a different performance problem on win-arm64.

All modern OSes have address space layout randomization, and our attempts to allocate near the coreclr library go against that. So we have to be careful not to be on a collision course with what the OSes are trying to do.

> win-arm64 emits fewer jump-stubs for a completely empty app

Before this change, or only with this change?

@EgorBo (Member Author):

> Before this change, or only with this change?

Before (on main).

Member:

Do you understand why that is the case? The executable space should typically be allocated more than 128MB away from coreclr.dll on win-arm64, if I am reading this code correctly.

@EgorBo (Member Author) Jan 17, 2022:

@jkotas oops, never mind: I forgot that on M1 I used DOTNET_ReadyToRun=0. I've just tried it with R2R=0 on win-arm64 and got exactly the same list of methods requesting jump stubs as in #62302 (comment).

@EgorBo (Member Author) commented Jan 23, 2022:

@jkotas so I don't see how this could land then, since it's going to hurt pretty much any app with more than 128MB of code, due to the next code heap ending up randomly far from the first one. Should I at least make this constant configurable, or just close this PR?

@jkotas (Member) commented Jan 23, 2022:

Agree. I think it is unlikely that a simple one-line fix like the one proposed here will work well. It was a good discussion; we can continue in the other PR that you have opened.

@EgorBo closed this Jan 23, 2022
@ghost locked as resolved and limited conversation to collaborators Feb 23, 2022