
[NativeAOT] Faster lookup for method infos on Windows. #96283

Closed
wants to merge 6 commits

Conversation

VSadov (Member) commented Dec 22, 2023

Binary search in an array of tens of thousands of records jumps all over the place and is not cache friendly. It is a noticeable expense when we stack walk and need to query for lots of function infos.

A B+ tree-like 16-ary index, where each node fits in a cache line, could reduce the number of cache misses by about 3X.
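
To make the idea concrete, here is a minimal, self-contained sketch of a B+ tree-like 16-ary index over a sorted int32 array (e.g. method start offsets). It is only an illustration of the approach under those assumptions, not the PR's actual code, and the names are hypothetical:

```cpp
// Sketch only: each index node is 16 int32 keys == 64 bytes == one cache line,
// so a lookup touches one node per level instead of log2(n) scattered probes.
#include <cstddef>
#include <cstdint>
#include <vector>

struct SixteenAryIndex
{
    // m_levels[0] is the top (smallest) level; the last level samples the full array.
    std::vector<std::vector<int32_t>> m_levels;

    void Build(const std::vector<int32_t>& keys)
    {
        std::vector<std::vector<int32_t>> levels;
        levels.emplace_back();
        for (size_t i = 0; i < keys.size(); i += 16)        // every 16th key
            levels.back().push_back(keys[i]);

        while (levels.back().size() > 16)                   // summarize until one node remains
        {
            const std::vector<int32_t>& prev = levels.back();
            std::vector<int32_t> next;
            for (size_t i = 0; i < prev.size(); i += 16)
                next.push_back(prev[i]);
            levels.push_back(std::move(next));
        }
        m_levels.assign(levels.rbegin(), levels.rend());     // store top level first
    }

    // Index of the last key <= value; assumes value >= keys[0].
    size_t Lookup(const std::vector<int32_t>& keys, int32_t value) const
    {
        size_t slot = 0;
        for (const std::vector<int32_t>& level : m_levels)
        {
            // The previous level's slot expands to at most 16 slots in this level.
            size_t begin = slot * 16;
            size_t end = begin + 16 < level.size() ? begin + 16 : level.size();
            slot = begin;
            while (slot + 1 < end && level[slot + 1] <= value)
                slot++;
        }
        // The lowest index level samples every 16th key of the backing array.
        size_t i = slot * 16;
        while (i + 1 < keys.size() && keys[i + 1] <= value)
            i++;
        return i;
    }
};
```

For 16^4 = 65536 keys this touches three index nodes plus one final 16-entry run of the backing array - four cache lines or so, versus ~16 scattered probes for a plain binary search.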

ghost commented Dec 22, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Binary search in an array of tens of thousands of records jumps all over the place and is not cache friendly. It is a noticeable expense when we stack walk and need to query for function infos using locations in the code.

A B-tree-like 16-ary index, where each node fits in a cache line, could reduce the number of cache misses by about 3X.

Author: VSadov
Assignees: -
Labels: os-windows, tenet-performance, area-NativeAOT-coreclr
Milestone: -

@ghost ghost assigned VSadov Dec 22, 2023
VSadov (Member Author) commented Dec 22, 2023

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

VSadov (Member Author) commented Dec 22, 2023

To measure the impact I use the following microbenchmark

The numbers are averaged GC Gen0 pauses in milliseconds. Lower is better.
On x64, Windows 10, AMD 5950X, 16 cores, 32 logical processors.

I see ~ 10% improvement.

==== Before the change:

0.1448154296875006
0.14455371093750016
0.14472656250000057
0.14519238281250038
0.1447460937500003
. . .

=== After the change:

0.13505078125000033
0.13351171875000004
0.13270898437499998
0.13221289062500033
0.13344238281250007
. . .

@VSadov VSadov marked this pull request as ready for review December 22, 2023 21:40
jkotas (Member) commented Dec 22, 2023

What does this do to startup time and startup working set?

jkotas (Member) commented Dec 22, 2023

It would be best to create the secondary index at build time somehow.

VSadov (Member Author) commented Dec 22, 2023

What does this do to startup time and startup working set?

I think it will be fast, but it is definitely worth measuring.
Looking through the table is a concern in stack crawling, since we do it many times, but for startup it is just a one-time cost.

It is also pay-for-play, in the sense that a program with a few managed methods would not see much impact. The index is roughly ~2 bytes per method body, so the program would need to be pretty large. The referenced benchmark has ~4 thousand methods and the size of the lowest index level is just ~1.5 KB; the other levels are much smaller (log16 drops fast).

I will check what happens with large(ish) test cases like Concurrent.Collections. I think that has tens of thousands of methods.

VSadov (Member Author) commented Dec 23, 2023

Other questions to consider:

  • Can this be applied to Unix?
    Any search in a contiguous sorted array can be indexed this way, but I am not sure whether what we have for Unix fits this model and, if it does, at what level the index could be added (CodeManager? LLVM?).
    Windows is more straightforward, so I'd do Windows first and then see what can be done for Unix.

  • It feels like searching within the nodes may be vectorizable (given an aligned array of 16 int32 values, find the first that is >= a given X); see the sketch after this list.
    I am just not expert enough in those things to tell whether there are instructions for this or whether it would be useful. We expect just 3-4 levels in the tree, so involving SIMD for 3-4 compares may not pay for itself.
    I think it might run into diminishing returns since the memory access pattern would be the same.
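
For what it's worth, the per-node search described in the second bullet can be expressed with AVX2 roughly as follows. This is only a sketch of the idea (assuming a 64-byte-aligned node of 16 sorted int32 keys, padded with INT32_MAX), not code from this PR:

```cpp
#include <cstdint>
#include <immintrin.h>

// Sketch only (assumes AVX2, e.g. compiled with -mavx2 or /arch:AVX2):
// given a node of 16 sorted int32 keys, return the index of the first key
// that is >= the search value (16 means "none").
static int FirstIndexAtLeast(const int32_t* node, int32_t value)
{
    __m256i v  = _mm256_set1_epi32(value);
    __m256i lo = _mm256_load_si256(reinterpret_cast<const __m256i*>(node));
    __m256i hi = _mm256_load_si256(reinterpret_cast<const __m256i*>(node + 8));

    // For signed ints, node[i] < value  <=>  value > node[i].
    unsigned ltLo = static_cast<unsigned>(
        _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpgt_epi32(v, lo))));
    unsigned ltHi = static_cast<unsigned>(
        _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpgt_epi32(v, hi))));

    // Keys are sorted, so the count of keys < value is exactly the index
    // of the first key >= value.
    return static_cast<int>(_mm_popcnt_u32(ltLo | (ltHi << 8)));
}
```

Whether this beats a plain scalar scan of ~8 compares per node is exactly the open question in the bullet above.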

VSadov (Member Author) commented Dec 23, 2023

=== Some data for System.Collections.Concurrent.Tests.exe (17.5 MB binary)

m_nRuntimeFunctionTable == 64859 (not all methods are managed, but for our purposes this is close enough to the method count)

m_indexCount == 4 (we have the last level index + 3 sub-indices)

The last-level index size is 8108 elements - roughly 32 KB.

By the time we get to allocating the index we have already allocated 1.2 MB of native memory, before any managed code has run.
This is with the WKS GC; SVR will have more.
I think an index allocation of this size is relatively insignificant.

VSadov (Member Author) commented Dec 23, 2023

==== Stats for TodosApi.exe (19 MB binary)

m_nRuntimeFunctionTable 65239

The outer index size is 8155 elements, roughly the same 32 KB as with the ConcurrentCollections test.

By the time we get to allocating the index we have already allocated 8 MB (this is SVR).
Commit size is 21 MB; I assume that is with just one heap, or it would be hundreds of megabytes committed.
I do not think a 32 KB allocation matters much next to this.

jkotas (Member) commented Dec 23, 2023

m_nRuntimeFunctionTable 65239

Does this mean that we will page in 65239 * 8 = ~510kB to build this index? That is a lot.

Can this be applied to Unix?

IIRC, Linux ELF has some sort of secondary index already.

Native AOT has a strong bias towards great startup time. It is the key characteristic that makes it different from regular CoreCLR with a JIT. I do not think we want to regress native AOT startup to improve GC microbenchmarks.

VSadov (Member Author) commented Dec 23, 2023

I completely understand the concerns.
So far I have measured the impact on startup size, and I think it is very low. Even if the index could somehow be optimized to allocate zero bytes, it would give us only a 0.001 (~0.1%) reduction in startup size.

I did not measure the time though. It may tell a different story.

VSadov (Member Author) commented Dec 23, 2023

Also - it is possible to defer creating the index until we need it. In startup terms it is an eternity until the first GC. The first suspending thread could pay for building the index.

We have several options for moving costs out of the startup path, but let's see if we need that.
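
For illustration, deferring the build could look roughly like the following. This is a minimal sketch with hypothetical names (FuncTableIndex, BuildFuncTableIndex), not the actual runtime code:

```cpp
#include <atomic>

struct FuncTableIndex { /* 16-ary index levels, elided */ };

static FuncTableIndex* BuildFuncTableIndex()   // assumed O(n) builder
{
    return new FuncTableIndex();
}

static std::atomic<FuncTableIndex*> s_index{ nullptr };

// Called from the stackwalk path (first GC / first exception), not from startup.
const FuncTableIndex* GetOrCreateFuncTableIndex()
{
    FuncTableIndex* idx = s_index.load(std::memory_order_acquire);
    if (idx != nullptr)
        return idx;

    FuncTableIndex* fresh = BuildFuncTableIndex();
    FuncTableIndex* expected = nullptr;
    if (s_index.compare_exchange_strong(expected, fresh,
                                        std::memory_order_release,
                                        std::memory_order_acquire))
        return fresh;

    delete fresh;       // another thread won the race; use its index
    return expected;
}
```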

VSadov (Member Author) commented Dec 24, 2023

Some preliminary report on timings (very rough estimates)

=== Based on System.Collections.Concurrent.Tests.exe (17.5 MB binary).

InitializeRuntime takes 6487 microseconds on my machine
InitFuncTableIndex is a part of the above and takes 171 microseconds, which is roughly 2%.

InitializeRuntime is what runs before calling __managed__Main.
It does not include any managed initialization that is later done by the app itself - that is really up to the app anyway.
It does not include any time the OS loader spends loading/relocating the binary. I have not yet figured out how to get a useful measure of that. Besides, the OS can "cheat" in a number of ways, so maybe we can write that time off as something not in our direct control.

VSadov (Member Author) commented Dec 25, 2023

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

VSadov (Member Author) commented Dec 26, 2023

171 microseconds and 2% of startup time seems relatively cheap overall. It is measurable though, and we do not need the index until the first exception is thrown or the first GC happens.
I have made the initialization happen lazily - to get it out of the way of startup and InitializeRuntime.

VSadov (Member Author) commented Dec 27, 2023

It would be best to create the secondary index at build time somehow.

I was thinking about this and it looks like it could get complicated. We would need to teach the IL compiler about this optimization, find a way to get the offsets of all managed methods at compile time, build the index, have some place in the file format to store the index, and then fetch it at module initialization.
On the other hand, if the index can be built on demand and relatively cheaply (relative to the operations that will cause it to materialize), then it could stay an internal affair of CoffNativeCodeManager.

Another potential problem is that a table constructed this way would cover only managed methods (which by itself is sufficient), but the actual runtime function table contains infos for unmanaged methods too, so we would need to "re-bias" the indices to account for the index of the first managed method. To do such re-biasing we would need to figure out the index of the first managed method, possibly via binary search, which raises the same question - would we want to pay the cost of searching and then re-biasing at startup?
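
To illustrate the re-biasing: a build-time index expressed in managed-method positions would need one translation step at runtime, roughly along these lines. This is a sketch with hypothetical names, and RUNTIME_FUNCTION is reduced to its BeginAddress field:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// RUNTIME_FUNCTION reduced to the only field needed here.
struct RUNTIME_FUNCTION_LIKE { uint32_t BeginAddress; };

// Index of the first runtime function entry that starts at or after the
// managed code range; one binary search at startup (or on first use).
size_t FindManagedBias(const RUNTIME_FUNCTION_LIKE* table, size_t count,
                       uint32_t managedRangeStartRva)
{
    const RUNTIME_FUNCTION_LIKE* first = std::lower_bound(
        table, table + count, managedRangeStartRva,
        [](const RUNTIME_FUNCTION_LIKE& f, uint32_t rva) { return f.BeginAddress < rva; });
    return static_cast<size_t>(first - table);
}

// A precomputed (build-time) index entry i, expressed relative to managed
// methods only, would then map to runtime function table entry i + bias.
```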

AustinWise (Contributor) commented

It would be best to create the secondary index at build time somehow.

I was thinking about this and it looks like it could get complicated.

Instead of generating the index in ILC.exe, could it be done after link.exe creates the program executable? A post-processing step could create a new section in the executable and put the index there.

The obvious downside I see with this approach is that it would complicate the MSBuild targets files for Native AOT.

jkotas (Member) commented Dec 30, 2023

To measure the impact I use the following microbenchmark

It is very rare to see a deep stack composed of tiny methods. Have you done any perf measurements for typical stacks found in real-world workloads? If we were to assume more typical methods (typical method code size is in the hundreds of bytes), would the algorithm design be different?

It may be better to allocate a single int32_t[cbManagedCodeRange / 4096 + 1] array and then use (pc - pvManagedCodeStartRange) / 4096 to look up in this array where to start the final search in the RUNTIME_FUNCTION array. I think it would consume about as much extra memory as your current algorithm, and it should be faster for more typical stacks composed of average-sized methods.
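
A minimal sketch of what this could look like (hypothetical names, RUNTIME_FUNCTION reduced to its BeginAddress field; an illustration of the suggestion, not the actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// RUNTIME_FUNCTION reduced to the only field the lookup needs.
struct RUNTIME_FUNCTION_LIKE { uint32_t BeginAddress; };

struct FuncTablePageIndex
{
    std::vector<int32_t> m_pageIndex;   // one entry per 4 KB page of managed code

    void Build(const RUNTIME_FUNCTION_LIKE* table, size_t count, uint32_t cbManagedCodeRange)
    {
        m_pageIndex.assign(cbManagedCodeRange / 4096 + 1, 0);
        size_t cur = 0;
        for (size_t page = 0; page < m_pageIndex.size(); page++)
        {
            // First entry whose BeginAddress falls on or after this page.
            while (cur < count && table[cur].BeginAddress < page * 4096)
                cur++;
            m_pageIndex[page] = static_cast<int32_t>(cur);
        }
    }

    // pcOffset = pc - pvManagedCodeStartRange. Returns the index where the final
    // (binary or linear) search should start; the enclosing method may begin on
    // an earlier page, hence the "- 1".
    size_t LookupStart(uint32_t pcOffset) const
    {
        size_t start = static_cast<size_t>(m_pageIndex[pcOffset / 4096]);
        return start > 0 ? start - 1 : 0;
    }
};
```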

jkotas (Member) commented Dec 30, 2023

Instead of generating the index in ILC.exe, could it be done after link.exe creates the program executable? A post-processing step could create a new section in the executable and put the index there.

Yes, this would be the way to build this lookup map at build time. On Windows, the new section can be a resource - it is easy to attach a new resource to a Windows executable.

Having said this, creating the index as lazily as possible at runtime would work too.

VSadov (Member Author) commented Dec 30, 2023

It is very rare to see a deep stack composed of tiny methods. Have you done any perf measurements for typical stacks found in real-world workloads? If we were to assume more typical methods (typical method code size is 100's bytes), would the algorithm design be different?

What makes that benchmark somewhat unnatural is that in regular programs the stack walks are often repetitive, since typically only a portion of the program's code is active, while in this benchmark nearly all method infos are needed on every GC. I think the point was to defeat any attempt to profit from temporal locality (i.e. by caching recent lookups) by demonstrating that a program with poor method locality can easily be constructed.

I do not think the average size of methods in the benchmark makes a difference though. Ultimately we are simply doing a search in a sorted array of ints. We could actually derive the average distance between ints (method size) from the first/last int in the array and their count, if knowing it could help.

It may be better to allocate a single int32_t[cbManagedCodeRange / 4096 + 1] array and then use (pc - pvManagedCodeStartRange) / 4096 to lookup in this array to tell us where to start the final search in RUNTIME_FUNCTION array. I think it would consume about as much extra memory as your current algorithm and it should be faster for more typical stacks composed of average-sized methods.

If I understand the idea correctly, it would be something like interpolation search - given an IP, we try to guess the approximate entry via linear interpolation and then search in the vicinity of that entry. I think, just like with interpolation search, this requires the distribution of sizes to be fairly uniform - otherwise we end up with a slightly better best case while the average/worst case is much worse than divide-and-conquer.
On the other hand, dividing by 16 narrows the search very quickly and there is no sensitivity to method sizes - average or distribution.

I.e., roughly speaking, with 16^4 == 65536 methods we will see 4 cache misses in both the average and the worst case. I think this is close to the minimum possible and will be hard to improve by a lot.
The number of comparisons is small too: 4 * 8 == 32 on average, although I think cache misses will cost more than compares. We can assume that none of this stayed in the cache since the last GC, so every cache line we touch misses at least in L1.

jkotas (Member) commented Dec 31, 2023

What makes that benchmark somewhat unnatural is that in regular programs the stack walks are often repetitive as often only a portion of program's code is active, while in this benchmark nearly all method infos will be needed on every GC.

Right, the microbenchmark is somewhat unnatural and it demonstrates a ~8% improvement. An 8% improvement for a microbenchmark is not that much. I would like to understand the average improvement from this change in real-world workloads so that we can reason about whether the change is worth the added complexity and memory, and also whether the chosen algorithm produces the best average improvement in real-world workloads.

VSadov (Member Author) commented Dec 31, 2023

This change stacks very well with another improvement in stackwalking (#96150).

When measured with both changes, GC pauses are close to 2X those of conservative stackwalking (on this benchmark). We started at about 3X on the ratio of precise to conservative stackwalks. I see it as an improvement, one way or another.

I am not sure anything we consider "real world" will notice 8 percent in a specific area, but as I remember some ASP.NET benchmarks had noticeably faster throughput when run with a conservative GC. There could be many explanations why, but foremost it indicates an opportunity. My thought is that improving the performance of stackwalks incrementally should not make things worse, and with enough micro improvements we may see macro results.

I am looking at the constituent parts of a typical stackwalk that can be improved. This PR is about searching for the method info, as it is a significant part of the overall cost according to the profiler. Other parts are unwinding and GC decoding. Touching all of that at once would be an unwieldy change.

This benchmark was chosen because it is sensitive to the area that is changed and is useful as a repeatable measure of progress.
It may be less sensitive to GC decoding, for example, so changes in that area will need an appropriately modified benchmark. It would be hard, though, to gate every change on improvements in "real world workloads".

jkotas (Member) commented Dec 31, 2023

I am not sure if anything that we consider "real world" will notice 8 percent in a specific area, but as I remember some ASP.Net benchmarks had noticeably faster throughput when run with conservative GC. There could be many explanations why, but foremost it indicates an opportunity.

Can we measure how much faster these ASP.NET benchmarks are with this change, then?

jkotas (Member) commented Dec 31, 2023

This change stacks very well with another improvement in stackwalking (#96150).

I like #96150. It was a pure code optimization, no downsides.

This change is a tradeoff. It allocates some extra memory to get some extra throughput in an artificial benchmark. My experience is that changes like that have a high probability of being ineffective in practice, so I am looking for proof that that is not the case.

davidwrighton (Member) commented Jan 9, 2024

I'm curious whether this B-tree approach is the best approach. Remember that RUNTIME_FUNCTION lookup is a fairly simple sorted mapping. Another approach I've thought of is to build a table of runtime function indices where each entry is the lowest possible index for a given region of memory. Then we could layer that on the existing RUNTIME_FUNCTION table and get O(1) lookup to some level of granularity, followed by a very short binary walk to find the exact target. For instance, size the index table so that it spans the .TEXT region at a granularity such that the average space between indices is 8 or 16 RUNTIME_FUNCTIONs. Then the index table becomes a single lookup, followed by a normal scan in a much smaller region of the RUNTIME_FUNCTION table. I agree with @jkotas that we need a real-world scenario before doing too much allocation and such for all of this, though.

Ahh, I see that @jkotas already suggested my idea, although he proposed a larger amount of memory per entry in the interpolation table.

What I've seen is that while managed methods can be highly variable in size, they are uniform enough in practice for an interpolation-style approach to be quite effective. In fact, my guess is that it's likely more efficient than a B-tree. If I were benchmarking this, I would try the interpolation approach + a linear scan of RUNTIME_FUNCTION entries in the possible region. My guess is that you will find through experimentation that it isn't cache misses you need to avoid, but branch mispredicts, and an interpolation table + a single linear scan will minimize those (see the sketch below).
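
For concreteness, the "single linear scan" step could look roughly like this. It is a sketch under the same assumptions as the page-table sketch earlier in the thread (hypothetical names, RUNTIME_FUNCTION reduced to BeginAddress):

```cpp
#include <cstddef>
#include <cstdint>

struct RUNTIME_FUNCTION_LIKE { uint32_t BeginAddress; };

// Starting from the index produced by a page/interpolation table, walk forward
// to the last entry that begins at or before pcOffset. With ~8-16 entries per
// bucket the loop is short and the branch is highly predictable.
size_t FindEnclosingEntry(const RUNTIME_FUNCTION_LIKE* table, size_t count,
                          size_t start, uint32_t pcOffset)
{
    size_t i = start;
    while (i + 1 < count && table[i + 1].BeginAddress <= pcOffset)
        i++;
    return i;
}
```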

VSadov (Member Author) commented Jun 28, 2024

This is getting old. I do not have any new ideas here.

@VSadov VSadov closed this Jun 28, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 29, 2024