Reduce number of jump-stubs on ARM64 via smaller preserved space #63842

Closed

Conversation

@EgorBo (Member) commented Jan 16, 2022:

In #62302 (comment) I realized that all internal calls go through jump-stubs (basically double calls) because managed code can't reach them within ARM64's 128MB branch range.
[image]
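For background (an editorial note, not part of the original thread): ARM64's bl instruction encodes a signed 26-bit word offset, so a direct call can only reach targets within ±128MB of the call site; anything farther has to go through a jump stub that loads the full 64-bit address into a register and branches indirectly. A minimal sketch of the reachability rule, with illustrative names rather than CoreCLR's actual helpers:

#include <cstdint>

// ARM64 bl/b: signed 26-bit immediate scaled by the 4-byte instruction size,
// giving a reach of +/-2^27 bytes = +/-128MB from the call site.
static bool WithinArm64BranchRange(uintptr_t callSite, uintptr_t target)
{
    intptr_t delta = (intptr_t)target - (intptr_t)callSite;
    return delta >= -(intptr_t)0x8000000 && delta < (intptr_t)0x8000000;
}

// An out-of-range call is routed through a jump stub of roughly this shape
// (register choice illustrative):
//     ldr x16, pc+8    ; load the 64-bit target stored after the stub
//     br  x16
//     .quad <target>   ; hence the "double call": bl to the stub, then br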

Even an empty console app emits 35 jump stubs. During the investigation, @jakobbotsch suggested changing these limits, and it helped: many micro- and macro-benchmarks improved significantly.

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    [Arguments(3.14)]
    public double Test(double d) => Math.Cos(d) * Math.Sin(d) * Math.Tan(d); // 3 InternalCalls
}

Results on Apple M1 arm64:

|  Method |        Job |               Toolchain |    d |      Mean |
|-------- |----------- |------------------------ |----- |----------:|
|    Test | Job-UWEEFQ |   /Core_Root_PR/corerun | 3.14 |  9.884 ns |
|    Test | Job-HATVTO | /Core_Root_base/corerun | 3.14 | 28.235 ns |

(~3x faster with the PR)

Techempower (linux-arm64):
[image]

cc @jkotas @jakobbotsch

@jakobbotsch (Member) commented:
There is a similar calculation for Windows here:

void ExecutableAllocator::InitLazyPreferredRange(size_t base, size_t size, int randomPageOffset)
{
#if USE_LAZY_PREFERRED_RANGE

#ifdef _DEBUG
    // If GetForceRelocs is enabled we don't constrain the pMinAddr
    if (PEDecoder::GetForceRelocs())
        return;
#endif

    //
    // If we are using USE_LAZY_PREFERRED_RANGE then we try to allocate memory close
    // to coreclr.dll. This avoids having to create jump stubs for calls to
    // helpers and R2R images loaded close to coreclr.dll.
    //
    SIZE_T reach = 0x7FFF0000u;

    // We will choose the preferred code region based on the address of coreclr.dll. The JIT helpers
    // in coreclr.dll are the most heavily called functions.
    g_preferredRangeMin = (base + size > reach) ? (BYTE *)(base + size - reach) : (BYTE *)0;
    g_preferredRangeMax = (base + reach > base) ? (BYTE *)(base + reach) : (BYTE *)-1;

    BYTE * pStart;

    if (base > UINT32_MAX)
    {
        // Try to occupy the space as far as possible to minimize collisions with other ASLR assigned
        // addresses. Do not start at g_codeMinAddr exactly so that we can also reach common native images
        // that can be placed at higher addresses than coreclr.dll.
        pStart = g_preferredRangeMin + (g_preferredRangeMax - g_preferredRangeMin) / 8;
    }
    else
    {
        // clr.dll missed the base address?
        // Try to occupy the space right after it.
        pStart = (BYTE *)(base + size);
    }

    // Randomize the address space
    pStart += GetOsPageSize() * randomPageOffset;

    g_lazyPreferredRangeStart = pStart;
    g_lazyPreferredRangeHint = pStart;
#endif
}
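To make the arithmetic above concrete, here is a standalone sketch using a hypothetical coreclr.dll base address and the x64 rel32 reach from the code (all values and names illustrative):

#include <cstdint>
#include <cstdio>

int main()
{
    const uintptr_t base  = 0x00007FFA12340000; // hypothetical coreclr.dll base
    const uintptr_t size  = 0x00800000;         // hypothetical image size (8MB)
    const uintptr_t reach = 0x7FFF0000;         // ~2GB, as in the code above

    // Anything in [rangeMin, rangeMax) can reach coreclr.dll with a rel32 call.
    uintptr_t rangeMin = (base + size > reach) ? base + size - reach : 0;
    uintptr_t rangeMax = base + reach; // overflow handling omitted in this sketch

    // Start 1/8 of the way into the range: far from common ASLR-assigned bases,
    // yet still able to reach images loaded above coreclr.dll.
    uintptr_t start = rangeMin + (rangeMax - rangeMin) / 8;

    printf("preferred range [%p, %p), start hint %p\n",
           (void *)rangeMin, (void *)rangeMax, (void *)start);
    return 0;
}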

(diff context for the review comments below:)

#else
// Smaller values for ARM64 where relative calls/jumps only work within 128MB
static const int32_t CoreClrLibrarySize = 32 * 1024 * 1024;
static const int32_t MaxExecutableMemorySize = 0x7FF0000; // 128MB - 64KB
Member:

How much of this range gets consumed in a real-world ASP.NET app?

Note that we map all sorts of stuff into this range, including R2R images. I would expect that 128MB gets exhausted fairly quickly given how things work today.

I agree that this fix works well for micro-benchmarks that are very unlikely to exhaust the 128MB range.

@EgorBo (Member Author) Jan 16, 2022:

I'll try to investigate. I guess the idea is that hot code (tier 1) is better off close to the VM's FCalls by default?
For large apps, I guess we still want to pursue your suggestion of emitting direct addresses without jump-stubs in tier 1 (#62302, I have a draft).

For R2R-only, could it be improved by PGO + method sorting?

Member:

I do not think R2R code today benefits from being close to coreclr, as all 'external' calls/branches have to go through indirection cells anyway. This may have been different in the days of fragile NGen?
Please correct me if I'm wrong, @jkotas.

Member:

> For large apps, I guess we still want to pursue your suggestion of emitting direct addresses without jump-stubs

I do not think it is just for large apps. With TC enabled, all managed->managed method calls go through a precode that has the exact same instructions as a jump stub, so it introduces a bottleneck similar to the one you have identified.

> For R2R-only, could it be improved by PGO + method sorting?

R2R images are generally smaller than 128MB, and you can only sort within the image, so sorting won't help with jump stubs. (Sorting within the image is still good for locality.)

Also, once we get this all fixed, we may want to look at retuning the inliner. My feeling is that the inliner expands the code too much these days; some of that may just be compensating for the extra method-call overhead that we are paying today.

Member:

> I do not think R2R code today benefits from being close to coreclr, as all 'external' calls/branches have to go through indirection cells anyway.

Calls from runtime-generated stubs and JITed code to R2R code still benefit from the two being close.

Member:

> JITed code to R2R code

Do these not go through an indirection when tiering is enabled?

Member:

> Do these not go through an indirection when tiering is enabled?

Yes when tiering is enabled; no when tiering is disabled.

@EgorBo (Member Author):

> With TC enabled, all managed->managed method calls go through a precode that has the exact same instructions as a jump stub, so it introduces a bottleneck similar to the one you have identified.

I'll start with your suggestion to emit direct calls when a tier-1 caller calls a tier-1 callee (not as part of this PR).
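A rough sketch of that idea (hypothetical helper, not the actual JIT/VM change): once the callee has reached its final tier, its entry point is stable, so the call could be emitted as a direct bl whenever the callee is within branch range, bypassing the fixup precode:

#include <cstdint>

// Hypothetical decision rule for "tier-1 caller calls tier-1 callee";
// the real logic would live in the JIT/VM call-site resolution.
static bool CanEmitDirectCall(uintptr_t callSite, uintptr_t calleeEntry,
                              bool calleeIsFinalTier)
{
    // The callee's entry point must no longer move (no further tiering),
    // and it must be reachable by a direct ARM64 bl (+/-128MB).
    intptr_t delta = (intptr_t)calleeEntry - (intptr_t)callSite;
    bool inRange = delta >= -(intptr_t)0x8000000 && delta < (intptr_t)0x8000000;
    return calleeIsFinalTier && inRange;
}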

@EgorBo (Member Author) commented Jan 16, 2022:

Allocations also seem a bit faster:

[Benchmark]
public object Alloc() => new object();

| Method |        Job |               Toolchain |     Mean |
|------- |----------- |------------------------ |---------:|
|  Alloc | Job-HBDERT |      /Core_Root/corerun | 2.732 ns |
|  Alloc | Job-UQOGQF | /Core_Root_base/corerun | 3.115 ns |

@EgorBo (Member Author) commented Jan 16, 2022:

Techempower:
[image]

PS: DOTNET_GCgen0size=1E00000 was used for all runs, for both base and PR.

@EgorBo (Member Author) commented Jan 16, 2022:

Hm, even with this change I still see jump-stubs in Plaintext-Plaintext-default (precode fixups are selected just to highlight the other problem):
[image]

Plaintext-MVC:
[image]

but it looks like it's not the actual jump-stubs, rather the function that creates them 😐

(diff context for the review comments below:)

    SIZE_T reach = 0x7FFF0000u;
#else
    // Smaller size for ARM64 where relative calls/jumps only work within 128MB
    SIZE_T reach = 0x7FF0000u;
Member:

This may collide with other ASLR-assigned addresses and lead to non-trivial private working set hits for native .dlls or R2R images that are loaded after coreclr is initialized.

We try to start as far as possible from the coreclr base address to avoid that situation on x64. Look for the comment in the code above: "Try to occupy the space as far as possible to minimize collisions with other ASLR assigned addresses."

@EgorBo (Member Author) Jan 17, 2022:

Are you saying that win-arm64 is fine as is?

I am still trying to understand the problem. E.g. we reserve 2GB via, I assume, VirtualAlloc; how come we end up with a large distance between coreclr code and jitted code in memory, even for a hello-world?
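(Editorial note: the mechanism being asked about is an address-hinted reservation. A minimal Windows sketch, with a hypothetical ReserveNear helper; CoreCLR's ExecutableAllocator is considerably more involved:)

#include <windows.h>

// Try to reserve address space at a specific hint near a module.
// VirtualAlloc with an explicit address fails if that range is already
// occupied, so a real allocator probes several candidates and eventually
// falls back to letting the OS choose - which can land far away from
// coreclr, forcing jump stubs for calls back into it.
static void *ReserveNear(void *hint, SIZE_T size)
{
    return VirtualAlloc(hint, size, MEM_RESERVE, PAGE_NOACCESS);
}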

@EgorBo (Member Author):

Just checked: win-arm64 emits fewer jump-stubs for a completely empty app, only 4 (two inside StelemRef and two inside IndexOf).

@jkotas (Member) Jan 17, 2022:

> Are you saying that win-arm64 is fine as is?

I am saying that this change potentially replaces one performance problem with a different performance problem on win-arm64.

All modern OSes have address space layout randomization, and our attempts to allocate near the coreclr library go against that. So we have to be careful not to be on a collision course with what the OSes are trying to do.

> win-arm64 emits fewer jump-stubs for a completely empty app

Before this change, or only with this change?

@EgorBo (Member Author):

> Before this change, or only with this change?

Before (on main).

Member:

Do you understand why that is the case? The executable space should typically be allocated more than 128MB away from coreclr.dll on win-arm64, if I am reading this code correctly.

@EgorBo (Member Author) Jan 17, 2022:

@jkotas oops, never mind: I forgot that on M1 I used DOTNET_ReadyToRun=0. I've just tried it with R2R=0 on win-arm64 and got exactly the same list of methods requesting jump stubs as in #62302 (comment).

@EgorBo (Member Author) commented Jan 23, 2022:

@jkotas so I don't see how this could land then, since it's going to hurt pretty much any app with more than 128MB of code, due to the next code heap ending up randomly far from the first one. Should I at least make this constant configurable, or just close this PR?

@jkotas (Member) commented Jan 23, 2022:

Agree. I think it is unlikely that a simple one-line fix like the one proposed here will work well. It was a good discussion; we can continue in the other PR that you have opened.

@EgorBo closed this Jan 23, 2022
@ghost locked as resolved and limited conversation to collaborators Feb 23, 2022