
[NativeAOT] Faster lookup for method infos on Windows. #96283

Closed
wants to merge 6 commits

Conversation

VSadov (Member) commented Dec 22, 2023

Binary search in an array of tens of thousands of records jumps all over the place and is not cache friendly. It is a noticeable expense when we stack walk and need to query for lots of function infos.

A B+ tree-like 16-ary index, where each node fits in a cache line, could reduce the number of cache misses by about 3X.
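
To make the idea concrete, here is a minimal, self-contained sketch of a B+ tree-like 16-ary index over a sorted int32 array (e.g. method start offsets). It is only an illustration of the approach under those assumptions, not the PR's actual code, and the names are hypothetical:

```cpp
// Sketch only: each index node is 16 int32 keys == 64 bytes == one cache line,
// so a lookup touches one node per level instead of log2(n) scattered probes.
#include <cstddef>
#include <cstdint>
#include <vector>

struct SixteenAryIndex
{
    // m_levels[0] is the top (smallest) level; the last level samples the full array.
    std::vector<std::vector<int32_t>> m_levels;

    void Build(const std::vector<int32_t>& keys)
    {
        std::vector<std::vector<int32_t>> levels;
        levels.emplace_back();
        for (size_t i = 0; i < keys.size(); i += 16)        // every 16th key
            levels.back().push_back(keys[i]);

        while (levels.back().size() > 16)                   // summarize until one node remains
        {
            const std::vector<int32_t>& prev = levels.back();
            std::vector<int32_t> next;
            for (size_t i = 0; i < prev.size(); i += 16)
                next.push_back(prev[i]);
            levels.push_back(std::move(next));
        }
        m_levels.assign(levels.rbegin(), levels.rend());     // store top level first
    }

    // Index of the last key <= value; assumes value >= keys[0].
    size_t Lookup(const std::vector<int32_t>& keys, int32_t value) const
    {
        size_t slot = 0;
        for (const std::vector<int32_t>& level : m_levels)
        {
            // The previous level's slot expands to at most 16 slots in this level.
            size_t begin = slot * 16;
            size_t end = begin + 16 < level.size() ? begin + 16 : level.size();
            slot = begin;
            while (slot + 1 < end && level[slot + 1] <= value)
                slot++;
        }
        // The lowest index level samples every 16th key of the backing array.
        size_t i = slot * 16;
        while (i + 1 < keys.size() && keys[i + 1] <= value)
            i++;
        return i;
    }
};
```

For 16^4 = 65536 keys this touches three index nodes plus one final 16-entry run of the backing array - four cache lines or so, versus ~16 scattered probes for a plain binary search.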

ghost commented Dec 22, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Binary search in an array of tens of thousands of records jumps all over the place and is not cache friendly. It is a noticeable expense when we stack walk and need to query for function infos using locations in the code.

A B-tree-like 16-ary index, where each node fits in a cache line, could reduce the number of cache misses by about 3X.

Author: VSadov
Assignees: -
Labels: os-windows, tenet-performance, area-NativeAOT-coreclr
Milestone: -

@ghost ghost assigned VSadov Dec 22, 2023
VSadov (Member Author) commented Dec 22, 2023

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

VSadov (Member Author) commented Dec 22, 2023

To measure the impact I use the following microbenchmark

The numbers are averaged GC Gen0 pauses in milliseconds. Lower is better.
On x64, Windows 10, AMD 5950X, 16 cores, 32 logical processors.

I see ~ 10% improvement.

==== Before the change:

0.1448154296875006
0.14455371093750016
0.14472656250000057
0.14519238281250038
0.1447460937500003
. . .

=== After the change:

0.13505078125000033
0.13351171875000004
0.13270898437499998
0.13221289062500033
0.13344238281250007
. . .

@VSadov VSadov marked this pull request as ready for review December 22, 2023 21:40
jkotas (Member) commented Dec 22, 2023

What does this do to startup time and startup working set?

jkotas (Member) commented Dec 22, 2023

It would be best to create the secondary index at build time somehow.

VSadov (Member Author) commented Dec 22, 2023

What does this do to startup time and startup working set?

I think it will be fast, but it is definitely worth measuring.
Looking through the table is a concern in stack crawling, since we do it many times, but for startup it is just a one-time cost.

It is also pay-for-play, in the sense that a program with a few managed methods would not see much impact. The index is roughly ~2 bytes per method body, so the program would need to be pretty large. The referenced benchmark has ~4 thousand methods and the size of the lowest index level is just ~1.5 KB; the other levels are much smaller (log16 drops fast).

I will check what happens with large(ish) test cases like Concurrent.Collections. I think that has tens of thousands of methods.

VSadov (Member Author) commented Dec 23, 2023

Other questions to consider:

  • Can this be applied to Unix?
    Any search in a contiguous sorted array can be indexed this way, but I am not sure whether what we have for Unix fits this model and, if it does, at what level the index could be added (CodeManager? LLVM?).
    Windows is more straightforward, so I'd do Windows first and then see what can be done for Unix.

  • It feels like searching within the nodes may be vectorizable (given an aligned array of 16 int32 values, find the first that is >= a given X); see the sketch after this list.
    I am just not expert enough in those things to tell whether there are instructions for this or whether it would be useful. We expect just 3-4 levels in the tree, so involving SIMD for 3-4 compares may not pay for itself.
    I think it might run into diminishing returns since the memory access pattern would be the same.
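
For what it's worth, the per-node search described in the second bullet can be expressed with AVX2 roughly as follows. This is only a sketch of the idea (assuming a 64-byte-aligned node of 16 sorted int32 keys, padded with INT32_MAX), not code from this PR:

```cpp
#include <cstdint>
#include <immintrin.h>

// Sketch only (assumes AVX2, e.g. compiled with -mavx2 or /arch:AVX2):
// given a node of 16 sorted int32 keys, return the index of the first key
// that is >= the search value (16 means "none").
static int FirstIndexAtLeast(const int32_t* node, int32_t value)
{
    __m256i v  = _mm256_set1_epi32(value);
    __m256i lo = _mm256_load_si256(reinterpret_cast<const __m256i*>(node));
    __m256i hi = _mm256_load_si256(reinterpret_cast<const __m256i*>(node + 8));

    // For signed ints, node[i] < value  <=>  value > node[i].
    unsigned ltLo = static_cast<unsigned>(
        _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpgt_epi32(v, lo))));
    unsigned ltHi = static_cast<unsigned>(
        _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpgt_epi32(v, hi))));

    // Keys are sorted, so the count of keys < value is exactly the index
    // of the first key >= value.
    return static_cast<int>(_mm_popcnt_u32(ltLo | (ltHi << 8)));
}
```

Whether this beats a plain scalar scan of ~8 compares per node is exactly the open question in the bullet above.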

VSadov (Member Author) commented Dec 23, 2023

=== Some data for System.Collections.Concurrent.Tests.exe (17.5 MB binary)

m_nRuntimeFunctionTable == 64859 (not all methods are managed, but for our purposes this is close enough to the method count)

m_indexCount == 4 (we have the last level index + 3 sub-indices)

The last-level index size is 8108 elements - roughly 32 KB.

By the time we get to allocating the index we have already allocated 1.2 MB of native memory, before any managed code has run.
This is with the WKS GC; SVR will have more.
I think an index allocation of this size is relatively insignificant.

VSadov (Member Author) commented Dec 23, 2023

==== Stats for TodosApi.exe (19 MB binary)

m_nRuntimeFunctionTable 65239

The outer index size is 8155 elements, roughly the same 32 KB as with the ConcurrentCollections test.

By the time we get to allocating the index we have already allocated 8 MB (this is SVR).
Commit size is 21 MB; I assume that is with just one heap, or it would be hundreds of megabytes committed.
I do not think a 32 KB allocation matters much next to this.

jkotas (Member) commented Dec 23, 2023

m_nRuntimeFunctionTable 65239

Does this mean that we will page in 65239 * 8 = ~510kB to build this index? That is a lot.

Can this be applied to Unix?

IIRC, Linux ELF has some sort of secondary index already.

Native AOT has a strong bias towards great startup time. It is the key characteristic that makes it different from regular CoreCLR with a JIT. I do not think we want to regress native AOT startup to improve GC microbenchmarks.

VSadov (Member Author) commented Dec 23, 2023

I completely understand the concerns.
So far I have measured the impact on startup size, and I think it is very low. Even if the index could somehow be optimized to allocate zero bytes, it would give us only a 0.001 (~0.1%) reduction in startup size.

I did not measure the time though. It may tell a different story.

VSadov (Member Author) commented Dec 23, 2023

Also - it is possible to defer creating the index until we need it. In startup terms it is an eternity until the first GC. The first suspending thread could pay for building the index.

We have several options for moving costs out of the startup path, but let's see if we need that.
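
For illustration, deferring the build could look roughly like the following. This is a minimal sketch with hypothetical names (FuncTableIndex, BuildFuncTableIndex), not the actual runtime code:

```cpp
#include <atomic>

struct FuncTableIndex { /* 16-ary index levels, elided */ };

static FuncTableIndex* BuildFuncTableIndex()   // assumed O(n) builder
{
    return new FuncTableIndex();
}

static std::atomic<FuncTableIndex*> s_index{ nullptr };

// Called from the stackwalk path (first GC / first exception), not from startup.
const FuncTableIndex* GetOrCreateFuncTableIndex()
{
    FuncTableIndex* idx = s_index.load(std::memory_order_acquire);
    if (idx != nullptr)
        return idx;

    FuncTableIndex* fresh = BuildFuncTableIndex();
    FuncTableIndex* expected = nullptr;
    if (s_index.compare_exchange_strong(expected, fresh,
                                        std::memory_order_release,
                                        std::memory_order_acquire))
        return fresh;

    delete fresh;       // another thread won the race; use its index
    return expected;
}
```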

VSadov (Member Author) commented Dec 24, 2023

Some preliminary report on timings (very rough estimates)

=== Based on System.Collections.Concurrent.Tests.exe (17.5 MB binary).

InitializeRuntime takes 6487 microseconds on my machine
InitFuncTableIndex is a part of the above and takes 171 microseconds, which is roughly 2%.

InitializeRuntime is what runs before calling __managed__Main.
It does not include any managed initialization that is later done by the app itself - that is really up to the app anyway.
It does not include any time the OS loader spends loading/relocating the binary. I have not yet figured out how to get a useful measure of that. Besides, the OS can "cheat" in a number of ways, so maybe we can write that time off as something not in our direct control.

VSadov (Member Author) commented Dec 25, 2023

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

VSadov (Member Author) commented Dec 26, 2023

171 microseconds and 2% of startup time seems relatively cheap overall. It is measurable though, and we do not need the index until the first exception is thrown or the first GC happens.
I have made the initialization happen lazily - to get it out of the way of startup and InitializeRuntime.

VSadov (Member Author) commented Dec 27, 2023

It would be best to create the secondary index at build time somehow.

I was thinking about this and it looks like it could get complicated. We would need to teach the IL compiler about this optimization, find a way to get the offsets of all managed methods at compile time, build the index, have some place in the file format to store the index, and then fetch it at module initialization.
On the other hand, if the index can be built on demand and relatively cheaply (relative to the operations that will cause it to materialize), then it could stay an internal affair of CoffNativeCodeManager.

Another potential problem is that a table constructed this way would cover only managed methods (which by itself is sufficient), but the actual runtime function table contains infos for unmanaged methods too, so we would need to "re-bias" the indices to account for the index of the first managed method. To do such re-biasing we would need to figure out the index of the first managed method, possibly via binary search, which raises the same question - would we want to pay the cost of searching and then re-biasing at startup?
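
To illustrate the re-biasing: a build-time index expressed in managed-method positions would need one translation step at runtime, roughly along these lines. This is a sketch with hypothetical names, and RUNTIME_FUNCTION is reduced to its BeginAddress field:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// RUNTIME_FUNCTION reduced to the only field needed here.
struct RUNTIME_FUNCTION_LIKE { uint32_t BeginAddress; };

// Index of the first runtime function entry that starts at or after the
// managed code range; one binary search at startup (or on first use).
size_t FindManagedBias(const RUNTIME_FUNCTION_LIKE* table, size_t count,
                       uint32_t managedRangeStartRva)
{
    const RUNTIME_FUNCTION_LIKE* first = std::lower_bound(
        table, table + count, managedRangeStartRva,
        [](const RUNTIME_FUNCTION_LIKE& f, uint32_t rva) { return f.BeginAddress < rva; });
    return static_cast<size_t>(first - table);
}

// A precomputed (build-time) index entry i, expressed relative to managed
// methods only, would then map to runtime function table entry i + bias.
```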

AustinWise (Contributor) commented

It would be best to create the secondary index at build time somehow.

I was thinking about this and it looks like it could get complicated.

Instead of generating the index in ILC.exe, could it be done after link.exe creates the program executable? A post-processing step could create a new section in the executable and put the index there.

The obvious downside I see with this approach is that it would complicate the MSBuild targets files for Native AOT.

jkotas (Member) commented Dec 30, 2023

To measure the impact I use the following microbenchmark

It is very rare to see a deep stack composed of tiny methods. Have you done any perf measurements for typical stacks found in real-world workloads? If we were to assume more typical methods (typical method code size is in the hundreds of bytes), would the algorithm design be different?

It may be better to allocate a single int32_t[cbManagedCodeRange / 4096 + 1] array and then use (pc - pvManagedCodeStartRange) / 4096 to look up in this array where to start the final search in the RUNTIME_FUNCTION array. I think it would consume about as much extra memory as your current algorithm, and it should be faster for more typical stacks composed of average-sized methods.
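
A minimal sketch of what this could look like (hypothetical names, RUNTIME_FUNCTION reduced to its BeginAddress field; an illustration of the suggestion, not the actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// RUNTIME_FUNCTION reduced to the only field the lookup needs.
struct RUNTIME_FUNCTION_LIKE { uint32_t BeginAddress; };

struct FuncTablePageIndex
{
    std::vector<int32_t> m_pageIndex;   // one entry per 4 KB page of managed code

    void Build(const RUNTIME_FUNCTION_LIKE* table, size_t count, uint32_t cbManagedCodeRange)
    {
        m_pageIndex.assign(cbManagedCodeRange / 4096 + 1, 0);
        size_t cur = 0;
        for (size_t page = 0; page < m_pageIndex.size(); page++)
        {
            // First entry whose BeginAddress falls on or after this page.
            while (cur < count && table[cur].BeginAddress < page * 4096)
                cur++;
            m_pageIndex[page] = static_cast<int32_t>(cur);
        }
    }

    // pcOffset = pc - pvManagedCodeStartRange. Returns the index where the final
    // (binary or linear) search should start; the enclosing method may begin on
    // an earlier page, hence the "- 1".
    size_t LookupStart(uint32_t pcOffset) const
    {
        size_t start = static_cast<size_t>(m_pageIndex[pcOffset / 4096]);
        return start > 0 ? start - 1 : 0;
    }
};
```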

jkotas (Member) commented Dec 30, 2023

Instead of generating the index in ILC.exe, could it be done after link.exe creates the program executable? A post-processing step could create a new section in the executable and put the index there.

Yes, this would be the way to build this lookup map at build time. On Windows, the new section can be a resource - it is easy to attach a new resource to a Windows executable.

Having said this, creating the index as lazily as possible at runtime would work too.

VSadov (Member Author) commented Dec 30, 2023

It is very rare to see a deep stack composed of tiny methods. Have you done any perf measurements for typical stacks found in real-world workloads? If we were to assume more typical methods (typical method code size is 100's bytes), would the algorithm design be different?

What makes that benchmark somewhat unnatural is that in regular programs the stack walks are often repetitive, since typically only a portion of the program's code is active, while in this benchmark nearly all method infos are needed on every GC. I think the point was to defeat any attempt to profit from temporal locality (i.e. by caching recent lookups) by demonstrating that a program with poor method locality can easily be constructed.

I do not think the average size of methods in the benchmark makes a difference though. Ultimately we are simply doing a search in a sorted array of ints. We could actually derive the average distance between ints (method size) from the first/last int in the array and their count, if knowing it could help.

It may be better to allocate a single int32_t[cbManagedCodeRange / 4096 + 1] array and then use (pc - pvManagedCodeStartRange) / 4096 to lookup in this array to tell us where to start the final search in RUNTIME_FUNCTION array. I think it would consume about as much extra memory as your current algorithm and it should be faster for more typical stacks composed of average-sized methods.

If I understand the idea correctly, it would be something like interpolation search - given an IP, we try to guess the approximate entry via linear interpolation and then search in the vicinity of that entry. I think, just like with interpolation search, this requires the distribution of sizes to be fairly uniform - otherwise we end up with a slightly better best case while the average/worst case is much worse than divide-and-conquer.
On the other hand, dividing by 16 narrows the search very quickly and there is no sensitivity to method sizes - average or distribution.

I.e., roughly speaking, with 16^4 == 65536 methods we will see 4 cache misses in both the average and the worst case. I think this is close to the minimum possible and will be hard to improve by a lot.
The number of comparisons is small too: 4 * 8 == 32 on average, although I think cache misses will cost more than compares. We can assume that none of this stayed in the cache since the last GC, so every cache line we touch misses at least in L1.

jkotas (Member) commented Dec 31, 2023

What makes that benchmark somewhat unnatural is that in regular programs the stack walks are often repetitive as often only a portion of program's code is active, while in this benchmark nearly all method infos will be needed on every GC.

Right, the microbenchmark is somewhat unnatural and it demonstrates a ~8% improvement. An 8% improvement for a microbenchmark is not that much. I would like to understand the average improvement from this change in real-world workloads so that we can reason about whether the change is worth the added complexity and memory, and also whether the chosen algorithm produces the best average improvement in real-world workloads.

VSadov (Member Author) commented Dec 31, 2023

This change stacks very well with another improvement in stackwalking (#96150).

When measured with both changes, GC pauses are close to 2X those of conservative stackwalking (on this benchmark). We started at about 3X on the ratio of precise to conservative stackwalks. I see it as an improvement, one way or another.

I am not sure anything we consider "real world" will notice 8 percent in a specific area, but as I remember some ASP.NET benchmarks had noticeably faster throughput when run with a conservative GC. There could be many explanations why, but foremost it indicates an opportunity. My thought is that improving the performance of stackwalks incrementally should not make things worse, and with enough micro improvements we may see macro results.

I am looking at the constituent parts of a typical stackwalk that can be improved. This PR is about searching for the method info, as it is a significant part of the overall cost according to the profiler. Other parts are unwinding and GC decoding. Touching all of that at once would be an unwieldy change.

This benchmark was chosen because it is sensitive to the area that is changed and is useful as a repeatable measure of progress.
It may be less sensitive to GC decoding, for example, so changes in that area will need an appropriately modified benchmark. It would be hard, though, to gate every change on improvements in "real world workloads".

jkotas (Member) commented Dec 31, 2023

I am not sure if anything that we consider "real world" will notice 8 percent in a specific area, but as I remember some ASP.Net benchmarks had noticeably faster throughput when run with conservative GC. There could be many explanations why, but foremost it indicates an opportunity.

Can we measure how much faster these ASP.NET benchmarks are with this change, then?

jkotas (Member) commented Dec 31, 2023

This change stacks very well with another improvement in stackwalking (#96150).

I like #96150. It was a pure code optimization, no downsides.

This change is a tradeoff. It allocates some extra memory to get some extra throughput in an artificial benchmark. My experience is that changes like that have a high probability of being ineffective in practice, so I am looking for proof that that is not the case.

davidwrighton (Member) commented Jan 9, 2024

I'm curious whether this B-tree approach is the best approach. Remember that RUNTIME_FUNCTION lookup is a fairly simple sorted mapping. Another approach I've thought of is to build a table of runtime function indices where each entry is the lowest possible index for a given region of memory. Then we could layer that on the existing RUNTIME_FUNCTION table and get O(1) lookup to some level of granularity, followed by a very short binary walk to find the exact target. For instance, size the index table so that it spans the .TEXT region at a granularity such that the average space between indices is 8 or 16 RUNTIME_FUNCTIONs. Then the index table becomes a single lookup, followed by a normal scan in a much smaller region of the RUNTIME_FUNCTION table. I agree with @jkotas that we need a real-world scenario before doing too much allocation and such for all of this, though.

Ahh, I see that @jkotas already suggested my idea, although he proposed a larger amount of memory per entry in the interpolation table.

What I've seen is that while managed methods can be highly variable in size, they are uniform enough in practice for an interpolation-style approach to be quite effective. In fact, my guess is that it's likely more efficient than a B-tree. If I were benchmarking this, I would try the interpolation approach + a linear scan of RUNTIME_FUNCTION entries in the possible region. My guess is that you will find through experimentation that it isn't cache misses you need to avoid, but branch mispredicts, and an interpolation table + a single linear scan will minimize those (see the sketch below).
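
For concreteness, the "single linear scan" step could look roughly like this. It is a sketch under the same assumptions as the page-table sketch earlier in the thread (hypothetical names, RUNTIME_FUNCTION reduced to BeginAddress):

```cpp
#include <cstddef>
#include <cstdint>

struct RUNTIME_FUNCTION_LIKE { uint32_t BeginAddress; };

// Starting from the index produced by a page/interpolation table, walk forward
// to the last entry that begins at or before pcOffset. With ~8-16 entries per
// bucket the loop is short and the branch is highly predictable.
size_t FindEnclosingEntry(const RUNTIME_FUNCTION_LIKE* table, size_t count,
                          size_t start, uint32_t pcOffset)
{
    size_t i = start;
    while (i + 1 < count && table[i + 1].BeginAddress <= pcOffset)
        i++;
    return i;
}
```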

VSadov (Member Author) commented Jun 28, 2024

This is getting old. I do not have any new ideas here.

@VSadov VSadov closed this Jun 28, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 29, 2024