cmd/compile: improve inlining cost model #17566
Comments
I'm going to toss in a few more ideas to consider.

An unexported function that is only called once is often cheaper to inline.

Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself; it is also determined by the nature of the call.

Functions that only make function calls in error cases, which is fairly common, can be cheaper to handle as a mix of inlining and outlining: you inline the main control flow but leave the error handling in a separate function. This may be particularly worth considering when inlining across packages, as the export data only needs to include the main control flow. (Error cases are detectable as the control flow blocks that return a non-nil value for a single result parameter of type error.)

One of the most important optimizations for large programs is feedback-directed optimization, aka profile-guided optimization. One of the most important lessons to learn from feedback/profiling is which functions are worth inlining, both on a per-call basis and on a "most calls pass X as argument N" basis. Therefore, while we have no FDO/PGO framework at present, any work on inlining should consider how to incorporate information gleaned from such a framework when it exists.

Pareto optimal is a nice goal, but I suspect it is somewhat unrealistic. It's almost always possible to find a horrible decision made by any specific algorithm, but the algorithm can still be better on realistic benchmarks.
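The inline-plus-outline split described above can be sketched as follows. This is a hypothetical illustration, not code from the thread: `lookup` and `lookupError` are invented names, and the point is only the shape, in which the hot path makes no calls while the error construction lives in a separate function that could stay out of line.

```go
package main

import (
	"errors"
	"fmt"
)

var cache = map[string]int{"a": 1, "b": 2}

// lookup's main control flow is a map probe with no calls, so it is
// cheap to inline; the only call sits on the error path.
func lookup(key string) (int, error) {
	if v, ok := cache[key]; ok {
		return v, nil // hot path: no calls
	}
	return 0, lookupError(key) // cold path: could stay out of line
}

// lookupError carries the error construction, which a compiler could
// leave un-inlined ("outlined") without hurting the common case.
func lookupError(key string) error {
	return errors.New("lookup: no entry for " + key)
}

func main() {
	v, err := lookup("a")
	fmt.Println(v, err)
	_, err = lookup("zzz")
	fmt.Println(err)
}
```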
A common case where this would apply is when calling marshallers/unmarshallers that use
Along the lines of what iant@ said, it's common for C++ compilers to take into account whether a callsite appears in a loop (and thus might be "hotter"). This can help for toolchains that don't support FDO/PGO, or for applications in which FDO/PGO are not being used.
No pragmas that mandate inlining, please.
I already expressed dislike for //go:noinline, and I will firmly object to any
proposal for //go:mustinline or something like that, even if it's limited
to the runtime.
If we can't find a good heuristic for the runtime package, I don't think
it will handle real-world cases well.
Also, we need to somehow fix the traceback for inlined non-leaf functions
first.
Another idea for the inlining decision is how much simpler the function
body could become if inlined. Especially for reflect-using functions that have fast paths:
if the input type matches the fast path, even though the function might be
very complicated, the inlined version might be really simple.
Couldn't we obtain a minor improvement in the cost model by measuring the size of the generated assembly language? It would require preserving a copy of the tree until after compilation, and doing compilation bottom-up (the same way inlining is scheduled), but that would give you a more accurate measure. There's a moderate chance of being able to determine the goodness of constant parameters at the SSA level, too. Note that this would require rearranging all of these transformations (inlining, escape analysis, closure conversion, compilation) to run them a function/recursive-function-nest at a time, so that the results from compiling the bottom-most functions all the way to assembly language would be available to inform inlining at the next level up.
I have also considered this. There'd be a lot of high-risk work rearranging the rest of the compiler to work this way. It could also hurt our chances of getting a big boost out of concurrent compilation; you want to start on the biggest, slowest functions ASAP, but those are the most likely to depend on many other functions.
It doesn't look that high-risk to me; it's just another iteration order. SSA also gives us a slightly more tractable place to compute things like "constant parameter values that shrink code size", even if it is only as crude as looking for blocks directly conditional on comparisons with parameter values.
I think we could test the inlining benefits of the bottom-up compilation pretty easily. One way is to do it just for inter-package compilation (as suggested above); another is to hack cmd/compile to dump the function asm size somewhere and then hack cmd/go to compile all packages twice, using the dumped sizes for the second round.
Out of curiosity, why "often"? Off the top of my head, I can't think of a case in which the contrary is true. Also, just to understand, in
It is not true when the code looks like
Because in the normal case where you don't need to call
In package main, yes.
Oh I see, that makes sense. It would be nice (also in other cases) if setting up the stack frame could be sunk into the if, but it likely wouldn't be worth the extra effort.
The tyranny of unit-at-a-time :D
Functions that start with a run of
@RalphCorderoy I've been thinking about the same kind of function body "chunking" for early returns. Especially interesting for quick paths, where the slow path is too big to inline. Unless the compiler chunks, it's up to the developer to split the function in two, I presume.
Hi @mvdan, split the function in two with the intention that the compiler then inlines the non-leaf first one?
Yes, for example, here
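The manual fast-path/slow-path split discussed above can be sketched like this. The function and its shape are invented for illustration (they are not the example the commenter linked): the small wrapper handles the common case and stays under the inline budget, while the general loop lives in a separate, non-inlined helper.

```go
package main

import "fmt"

// index returns the offset of c in s, like a simplified IndexByte.
// The trivial checks form the inlinable "quick path".
func index(s []byte, c byte) int {
	if len(s) == 0 {
		return -1 // fast path: empty input
	}
	if s[0] == c {
		return 0 // fast path: first-byte hit
	}
	return indexSlow(s, c) // slow path kept in a separate function
}

// indexSlow carries the general loop, too big to be worth inlining.
func indexSlow(s []byte, c byte) int {
	for i := 1; i < len(s); i++ {
		if s[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(index([]byte("hello"), 'l'))
	fmt.Println(index(nil, 'x'))
}
```

If the compiler "chunked" automatically, this split would not need to appear in the source at all.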
Too late for 1.9.
Change https://golang.org/cl/57410 mentions this issue:
The intent is to allow more aggressive refactoring in the runtime without silent performance changes. The test would be useful for many functions. I've seeded it with the runtime functions tophash and add; it will grow organically (or wither!) from here.

Updates #21536 and #17566

Change-Id: Ib26d9cfd395e7a8844150224da0856add7bedc42
Reviewed-on: https://go-review.googlesource.com/57410
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Another example of the current inlining heuristic punishing more readable code:
is more "expensive" than
This is based on real code from the regexp package (see https://go-review.googlesource.com/c/go/+/65491).
Here’s a compromise idea that I imagine will please no one. :) Allow call site annotations like

This grants complete control to the user who really needs it. It is enough of a pain to discourage casual use. It is scoped to exactly one Go version to avoid accidental rot. It provides raw data about optimal inlining decisions to compiler authors.
I'm not specifically talking about your suggestion. Just the general idea of inline annotations, both callsite and declaration site.
With callsite annotations every place
This will happen with both callsite or declaration annotations, just at a slightly different level.
If the library containing
Quick observation: right now, the manual inlining generally only happens within a module or even a single package, not crossing module boundaries. Perhaps some form of directive could have a similar restriction, to prevent public APIs from being littered with dubious performance directives. Or perhaps we could even restrict it to unexported functions.
I was seconds away from posting a similar suggestion about only allowing inlining annotations on unexported functions. That would cover every case I've ever run into and avoid all of the terrible manual inlining I have done for substantial gains over the literal years. edit: The main pain with not having control comes from exploratory programming. It is a ton of work to manually inline, find out it's not a great idea, and then undo. Then do that a ton over many refactorings that could change the assumptions. I would personally be happy manually inlining still if it wasn't so painful to discover when it was a good idea for a specific case. So even if the annotations were ignored unless compiled with a special flag, I'd still be happy.
I was just pointing out that the arguments you made don't really apply to callsite annotations as much as they apply to function annotations, though. Conflating the two doesn't seem to help the discussion.
Each place where the caller cares enough, and the inliner doesn't do it automatically, yes. That is pretty much exactly why we want to avoid the copy-paste scenario. Why would that be an argument against callsite annotations? We all agree it would be ideal if the inliner were better so that we didn't have to; the assumption is that it likely won't become so much better that there will be no need for manual control.
That is pretty much what is already happening today, with the difference that right now the inlining has to be done manually. Callsite annotations wouldn't make this worse; on the contrary, not only would they avoid having to duplicate code, they would potentially even avoid having to expose additional APIs (like the one linked above).
My point was that it is already happening today, with the difference that right now people have to manually inline code. That point still applies. Furthermore, I think we agree it would happen less in the case of callsite annotations.
This is definitely one valid argument. I kinda already offered a solution for that though ("even though it's pretty much orthogonal to the proposal itself, when the inliner gets better and starts to actually make decisions based on callsites instead of functions, it may actually transitively propagate it as a hint that the specific callpath is expected to be frequently executed"). Regardless, I wouldn't discount a practical solution just because it doesn't cover 100% of the possible use-cases (as by that standard Go itself likely wouldn't exist).
One note though: I still think there's value in allowing callsite annotations to request cross-module inlining (I gave one example just above, and can probably dig a few others up). Restricting function annotations to unexported symbols OTOH sounds like a pretty reasonable compromise.
Or maybe, only callsite annotations in the root module (the module that contains the main package) are enabled by default?
I think the danger there is that people could add notes to their function definition godoc like "for maximum performance, annotate calls with...". We can strongly warn against that kind of thing, but it will happen anyway. That's why I think starting with unexported funcs would give a large part of the benefit at little risk, so it could be the first step for an annotation.
Just hit an annoyance with the inliner in the past few days, reading and writing individual bits. The Writer.WriteBit() function is inlineable, but only by making two concessions/hacks: the x argument is a uint64, and x is reused for the carry result. However, ReadBit(), which uses a similar strategy, is "function too complex: cost 88 exceeds budget 80".
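The shape of the inlining-friendly writer described above might look roughly like this. The type, field, and method names are illustrative guesses, not the commenter's actual code: the bit is passed as a uint64 (avoiding a bool conversion in the body), and one accumulator word does double duty, keeping the body small.

```go
package main

import "fmt"

// BitWriter accumulates bits MSB-first and flushes whole bytes.
type BitWriter struct {
	acc uint64 // bit accumulator (also serves as the carry word)
	n   uint   // number of bits currently buffered
	out []byte
}

// WriteBit appends the low bit of x. Taking x as uint64 rather than
// bool is the kind of concession that keeps the body cheap enough
// to inline.
func (w *BitWriter) WriteBit(x uint64) {
	w.acc = w.acc<<1 | (x & 1)
	w.n++
	if w.n == 8 {
		w.out = append(w.out, byte(w.acc))
		w.acc, w.n = 0, 0
	}
}

func main() {
	var w BitWriter
	for _, b := range []uint64{1, 0, 1, 0, 1, 0, 1, 0} {
		w.WriteBit(b)
	}
	fmt.Printf("%08b\n", w.out[0])
}
```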
I'm having a bunch of problems with bad inlining when implementing hash functions. I also can't use the useful:

    func memset(s []byte, c byte) {
        for i := range s {
            s[i] = c
        }
    }

Because it doesn't inline range loops, if I call
I don't think the existing range-clear optimizations will trigger even if the inlining is changed. Better to write that directly for now:
If inlining is improved, it can take care of the call overhead here.
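"Write that directly" presumably means using the recognized range-clear loop shape at each call site rather than going through a helper. A sketch of the two spellings (the function names here are mine):

```go
package main

import "fmt"

// zeroBytes uses the exact loop shape that cmd/compile recognizes as
// a "range clear" and lowers to a memclr, so no memset helper is
// needed when the fill value is zero.
func zeroBytes(s []byte) {
	for i := range s {
		s[i] = 0
	}
}

// fill handles the non-zero case; there is no special lowering for
// it, so the plain loop is the straightforward spelling.
func fill(s []byte, c byte) {
	for i := range s {
		s[i] = c
	}
}

func main() {
	buf := []byte("abcdef")
	fill(buf, 'x')
	fmt.Printf("%s\n", buf)
	zeroBytes(buf)
	fmt.Println(buf[0], buf[5])
}
```

The range-clear lowering only fires on the loop as written in the enclosing function, which is why hiding it behind a non-inlined memset call defeats it.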
My colleague wrote a tool that folks following this thread might find useful (it allows for adding comments that are "assertions" on the compiled code for checking whether a function (or a call-site) is inlined). We at cockroachdb began introducing these "assertions" into the codebase and are verifying them during the linter test runs.
I just came out of a fight with the inliner and figured I would report back. I was trying to outline an allocation for a function with multiple return values, and the inliner had a lot of non-obvious opinions. Here's the version that finally worked, with cost 72.
Both the more idiomatic ways to write this function have costs higher than 84. Avoiding the named returns costs 84.
Making separate allocations costs 90.
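The snippets from that comment were lost in this copy, so here is a hypothetical sketch of the pattern being described: a tiny constructor that exists only to hold an allocation, written with named results. Everything here, including the names and sizes, is illustrative; the quoted costs (72 vs 84 vs 90) belong to the original code, not this sketch.

```go
package main

import "fmt"

// newBuffers outlines a single allocation behind an inlinable
// wrapper. Assigning through the named results and using a bare
// return is the style that, per the comment above, came out cheapest
// for the inliner in the original code.
func newBuffers() (a, b []byte) {
	p := make([]byte, 64) // one backing allocation for both results
	a, b = p[:32], p[32:]
	return
}

func main() {
	a, b := newBuffers()
	fmt.Println(len(a), len(b))
}
```

The "separate allocations" variant would call make twice, and the "idiomatic" variant would declare locals and return them explicitly; both reportedly cost more under the current model.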
As Than McIntosh already mentioned, it's a common practice to boost inlining in FOR loops, since the callsite could be "hotter". This patch implements that functionality. The implementation uses a stack of FORs to recognise which calls are in a loop. The stack is maintained alongside the work of the inlnode function and contains information about ancestor FORs relative to the current node in inlnode. The forContext contains a liveCounter which shows how many nodes this FOR is an ancestor of.

The current constants are as follows: a "big" FOR is a FOR which contains >= inlineBigForNodes (50) nodes or has more than inlineBigForCallNodes (5) inlinable call nodes. In such FORs no boost is applied. Other FORs are considered small and boost callsites with an extra budget equal to inlineExtraForBudget (20).

Updates golang#17566

Results on GO1 follow; binary size is not increased significantly (10441232 -> 10465920, less than 0.3%).

goos: linux
goarch: amd64
pkg: test/bench/go1
cpu: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz

name old time/op new time/op delta
BinaryTree17-8 2.15s ± 1% 2.15s ± 1% ~ (p=0.589 n=6+6)
Fannkuch11-8 2.70s ± 0% 2.70s ± 0% -0.08% (p=0.002 n=6+6)
FmtFprintfEmpty-8 31.9ns ± 0% 31.9ns ± 3% ~ (p=0.907 n=6+6)
FmtFprintfString-8 57.0ns ± 0% 57.6ns ± 0% +1.19% (p=0.004 n=5+6)
FmtFprintfInt-8 65.2ns ± 0% 64.1ns ± 0% -1.57% (p=0.002 n=6+6)
FmtFprintfIntInt-8 103ns ± 0% 103ns ± 0% ~ (p=0.079 n=5+4)
FmtFprintfPrefixedInt-8 119ns ± 0% 118ns ± 0% -0.37% (p=0.008 n=5+5)
FmtFprintfFloat-8 169ns ± 0% 173ns ± 0% +2.55% (p=0.004 n=5+6)
FmtManyArgs-8 450ns ± 1% 450ns ± 0% ~ (p=1.000 n=6+6)
GobDecode-8 4.38ms ± 1% 4.35ms ± 1% ~ (p=0.132 n=6+6)
GobEncode-8 3.07ms ± 0% 3.06ms ± 0% -0.38% (p=0.009 n=6+6)
Gzip-8 195ms ± 0% 195ms ± 0% ~ (p=0.095 n=5+5)
Gunzip-8 28.2ms ± 0% 28.4ms ± 0% +0.57% (p=0.004 n=6+6)
HTTPClientServer-8 45.1µs ± 1% 45.3µs ± 1% ~ (p=0.082 n=5+6)
JSONEncode-8 7.98ms ± 1% 7.94ms ± 0% -0.47% (p=0.015 n=6+6)
JSONDecode-8 35.4ms ± 1% 35.1ms ± 0% -1.04% (p=0.002 n=6+6)
Mandelbrot200-8 4.50ms ± 0% 4.50ms ± 0% ~ (p=0.699 n=6+6)
GoParse-8 2.98ms ± 0% 2.99ms ± 1% ~ (p=0.095 n=5+5)
RegexpMatchEasy0_32-8 55.5ns ± 1% 52.8ns ± 2% -4.94% (p=0.002 n=6+6)
RegexpMatchEasy0_1K-8 178ns ± 0% 162ns ± 1% -9.18% (p=0.002 n=6+6)
RegexpMatchEasy1_32-8 50.1ns ± 0% 48.4ns ± 2% -3.34% (p=0.002 n=6+6)
RegexpMatchEasy1_1K-8 272ns ± 2% 268ns ± 1% ~ (p=0.065 n=6+6)
RegexpMatchMedium_32-8 907ns ± 5% 897ns ± 7% ~ (p=0.660 n=6+6)
RegexpMatchMedium_1K-8 26.5µs ± 0% 26.6µs ± 0% +0.41% (p=0.008 n=5+5)
RegexpMatchHard_32-8 1.28µs ± 0% 1.29µs ± 1% ~ (p=0.167 n=6+6)
RegexpMatchHard_1K-8 38.5µs ± 0% 38.6µs ± 0% ~ (p=0.126 n=6+5)
Revcomp-8 398ms ± 0% 395ms ± 0% -0.64% (p=0.010 n=6+4)
Template-8 48.4ms ± 0% 47.8ms ± 0% -1.30% (p=0.008 n=5+5)
TimeParse-8 213ns ± 0% 213ns ± 0% ~ (p=0.108 n=6+6)
TimeFormat-8 294ns ± 0% 259ns ± 0% -11.86% (p=0.000 n=5+6)
[Geo mean] 40.4µs 40.0µs -1.11%

name old speed new speed delta
GobDecode-8 175MB/s ± 1% 176MB/s ± 1% ~ (p=0.132 n=6+6)
GobEncode-8 250MB/s ± 0% 251MB/s ± 0% +0.38% (p=0.009 n=6+6)
Gzip-8 99.3MB/s ± 0% 99.4MB/s ± 0% ~ (p=0.095 n=5+5)
Gunzip-8 687MB/s ± 0% 683MB/s ± 0% -0.57% (p=0.004 n=6+6)
JSONEncode-8 243MB/s ± 1% 244MB/s ± 0% +0.47% (p=0.015 n=6+6)
JSONDecode-8 54.8MB/s ± 1% 55.3MB/s ± 0% +1.04% (p=0.002 n=6+6)
GoParse-8 19.4MB/s ± 0% 19.4MB/s ± 1% ~ (p=0.103 n=5+5)
RegexpMatchEasy0_32-8 576MB/s ± 1% 606MB/s ± 2% +5.21% (p=0.002 n=6+6)
RegexpMatchEasy0_1K-8 5.75GB/s ± 0% 6.33GB/s ± 1% +10.10% (p=0.002 n=6+6)
RegexpMatchEasy1_32-8 639MB/s ± 0% 661MB/s ± 2% +3.47% (p=0.002 n=6+6)
RegexpMatchEasy1_1K-8 3.76GB/s ± 2% 3.82GB/s ± 1% ~ (p=0.065 n=6+6)
RegexpMatchMedium_32-8 35.4MB/s ± 5% 35.7MB/s ± 7% ~ (p=0.615 n=6+6)
RegexpMatchMedium_1K-8 38.6MB/s ± 0% 38.4MB/s ± 0% -0.40% (p=0.008 n=5+5)
RegexpMatchHard_32-8 25.0MB/s ± 0% 24.8MB/s ± 1% ~ (p=0.167 n=6+6)
RegexpMatchHard_1K-8 26.6MB/s ± 0% 26.6MB/s ± 0% ~ (p=0.238 n=5+5)
Revcomp-8 639MB/s ± 0% 643MB/s ± 0% +0.65% (p=0.010 n=6+4)
Template-8 40.1MB/s ± 0% 40.6MB/s ± 0% +1.32% (p=0.008 n=5+5)
[Geo mean] 176MB/s 178MB/s +1.38%
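The patch above targets call sites inside small ("non-big") loops. The kind of shape it would boost can be sketched as follows; this is an illustration of the idea, not one of the patch's own test cases, and `dot`/`mul` are invented names. Here `mul` sits just at the edge of the inline budget, and because its call site is inside a small loop, the loop-aware heuristic would grant it extra budget.

```go
package main

import "fmt"

// dot computes a dot product; the call in its loop body is the kind
// of "hot" call site the FOR-boost heuristic targets.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a { // a small loop: boost applies
		s += mul(a[i], b[i]) // call site granted extra inline budget
	}
	return s
}

// mul stands in for a function slightly over the normal budget.
func mul(x, y float64) float64 { return x * y }

func main() {
	fmt.Println(dot([]float64{1, 2, 3}, []float64{4, 5, 6}))
}
```

A "big" loop (many nodes or many inlinable calls) gets no boost, which bounds the code-size growth the heuristic can cause.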
Change https://golang.org/cl/347732 mentions this issue:
As Than McIntosh already mentioned, it's a common practice to boost inlining in FOR loops, since the callsite could be "hotter". This patch implements that functionality. The implementation uses a stack of FORs to recognise which calls are in a loop. The stack is maintained alongside the work of the inlnode function and contains information about ancestor FORs relative to the current node in inlnode. The forContext contains a liveCounter which shows how many nodes this FOR is an ancestor of.

The current constants are as follows: a "big" FOR is a FOR which contains >= inlineBigForNodes (37) nodes or has more than inlineBigForCallNodes (3) inlinable call nodes. In such FORs no boost is applied. Other FORs are considered small and boost callsites with an extra budget equal to inlineExtraForBudget (13).

Updates golang#17566

Results on GO1 follow; binary size is not increased significantly (10441232 -> 10465920, less than 0.3%).

goos: linux
goarch: amd64
pkg: test/bench/go1
cpu: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz

name old time/op new time/op delta
BinaryTree17-8 2.15s ± 1% 2.16s ± 1% ~ (p=0.132 n=6+6)
Fannkuch11-8 2.70s ± 0% 2.70s ± 0% +0.12% (p=0.004 n=6+5)
FmtFprintfEmpty-8 31.9ns ± 0% 31.3ns ± 0% -2.05% (p=0.008 n=5+5)
FmtFprintfString-8 57.0ns ± 0% 57.7ns ± 1% +1.30% (p=0.002 n=6+6)
FmtFprintfInt-8 65.2ns ± 0% 64.1ns ± 0% -1.63% (p=0.008 n=5+5)
FmtFprintfIntInt-8 103ns ± 0% 102ns ± 0% -1.01% (p=0.000 n=5+6)
FmtFprintfPrefixedInt-8 119ns ± 0% 119ns ± 0% +0.31% (p=0.026 n=5+6)
FmtFprintfFloat-8 169ns ± 0% 169ns ± 0% +0.14% (p=0.008 n=5+5)
FmtManyArgs-8 445ns ± 0% 452ns ± 0% +1.39% (p=0.004 n=6+5)
GobDecode-8 4.37ms ± 1% 4.42ms ± 1% +1.03% (p=0.002 n=6+6)
GobEncode-8 3.07ms ± 0% 3.03ms ± 0% -1.07% (p=0.004 n=5+6)
Gzip-8 195ms ± 0% 195ms ± 0% ~ (p=0.063 n=5+4)
Gunzip-8 28.2ms ± 0% 28.8ms ± 0% +2.13% (p=0.004 n=5+6)
HTTPClientServer-8 45.0µs ± 1% 45.4µs ± 1% +0.94% (p=0.030 n=6+5)
JSONEncode-8 8.01ms ± 0% 8.00ms ± 1% ~ (p=0.429 n=5+6)
JSONDecode-8 35.3ms ± 1% 35.2ms ± 0% ~ (p=0.841 n=5+5)
Mandelbrot200-8 4.50ms ± 0% 4.49ms ± 0% ~ (p=0.093 n=6+6)
GoParse-8 3.03ms ± 1% 2.97ms ± 1% -1.97% (p=0.004 n=6+5)
RegexpMatchEasy0_32-8 55.4ns ± 0% 53.2ns ± 1% -3.89% (p=0.008 n=5+5)
RegexpMatchEasy0_1K-8 178ns ± 0% 162ns ± 1% -8.72% (p=0.004 n=5+6)
RegexpMatchEasy1_32-8 50.1ns ± 0% 47.4ns ± 1% -5.32% (p=0.008 n=5+5)
RegexpMatchEasy1_1K-8 271ns ± 1% 261ns ± 0% -3.67% (p=0.002 n=6+6)
RegexpMatchMedium_32-8 949ns ± 0% 904ns ± 5% -4.81% (p=0.004 n=5+6)
RegexpMatchMedium_1K-8 27.1µs ± 7% 27.3µs ± 6% ~ (p=0.818 n=6+6)
RegexpMatchHard_32-8 1.28µs ± 2% 1.27µs ± 1% ~ (p=0.180 n=6+6)
RegexpMatchHard_1K-8 38.5µs ± 0% 38.5µs ± 0% ~ (p=0.329 n=6+5)
Revcomp-8 397ms ± 0% 396ms ± 0% -0.33% (p=0.026 n=6+6)
Template-8 48.1ms ± 1% 48.2ms ± 1% ~ (p=0.222 n=5+5)
TimeParse-8 213ns ± 0% 214ns ± 0% ~ (p=0.076 n=4+6)
TimeFormat-8 295ns ± 1% 292ns ± 0% -1.13% (p=0.000 n=6+5)
[Geo mean] 40.5µs 40.1µs -0.96%

name old speed new speed delta
GobDecode-8 176MB/s ± 1% 174MB/s ± 1% -1.02% (p=0.002 n=6+6)
GobEncode-8 250MB/s ± 0% 253MB/s ± 0% +1.08% (p=0.004 n=5+6)
Gzip-8 100MB/s ± 0% 100MB/s ± 0% +0.23% (p=0.048 n=5+4)
Gunzip-8 687MB/s ± 0% 673MB/s ± 0% -2.08% (p=0.004 n=5+6)
JSONEncode-8 242MB/s ± 0% 243MB/s ± 1% ~ (p=0.429 n=5+6)
JSONDecode-8 54.9MB/s ± 1% 55.1MB/s ± 0% ~ (p=0.873 n=5+5)
GoParse-8 19.1MB/s ± 1% 19.5MB/s ± 1% +2.01% (p=0.004 n=6+5)
RegexpMatchEasy0_32-8 578MB/s ± 0% 601MB/s ± 1% +4.06% (p=0.008 n=5+5)
RegexpMatchEasy0_1K-8 5.74GB/s ± 1% 6.30GB/s ± 1% +9.90% (p=0.002 n=6+6)
RegexpMatchEasy1_32-8 639MB/s ± 0% 675MB/s ± 1% +5.63% (p=0.008 n=5+5)
RegexpMatchEasy1_1K-8 3.78GB/s ± 1% 3.92GB/s ± 0% +3.81% (p=0.002 n=6+6)
RegexpMatchMedium_32-8 33.7MB/s ± 0% 35.5MB/s ± 5% +5.30% (p=0.004 n=5+6)
RegexpMatchMedium_1K-8 37.9MB/s ± 6% 37.6MB/s ± 5% ~ (p=0.818 n=6+6)
RegexpMatchHard_32-8 24.9MB/s ± 2% 25.2MB/s ± 1% ~ (p=0.167 n=6+6)
RegexpMatchHard_1K-8 26.6MB/s ± 0% 26.6MB/s ± 0% ~ (p=0.355 n=6+5)
Revcomp-8 640MB/s ± 0% 642MB/s ± 0% +0.33% (p=0.026 n=6+6)
Template-8 40.4MB/s ± 1% 40.2MB/s ± 1% ~ (p=0.222 n=5+5)
[Geo mean] 175MB/s 178MB/s +1.69%
As Than McIntosh already mentioned, it is common practice to boost inlining for call sites inside for loops, since such call sites are likely hotter. This patch implements that functionality. The implementation uses a stack of for loops to recognize calls that are inside a loop. The stack is maintained as the inlnode function walks the tree and records the ancestor for loops of the current node in inlnode. A "big" for loop is one whose cost is >= inlineBigForCost (47); in such loops no boost is applied.

Updates golang#17566

The following are GO1 benchmark results; binary size did not increase significantly (10441232 -> 10465920, less than 0.3%).

goos: linux
goarch: amd64
pkg: test/bench/go1
cpu: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz

name old time/op new time/op delta
BinaryTree17-8 2.15s ± 1% 2.17s ± 1% +0.86% (p=0.041 n=6+6)
Fannkuch11-8 2.70s ± 0% 2.72s ± 0% +0.71% (p=0.002 n=6+6)
FmtFprintfEmpty-8 31.9ns ± 0% 31.6ns ± 0% -1.06% (p=0.008 n=5+5)
FmtFprintfString-8 57.0ns ± 0% 58.3ns ± 0% +2.26% (p=0.004 n=6+5)
FmtFprintfInt-8 65.2ns ± 0% 64.1ns ± 0% -1.65% (p=0.000 n=5+4)
FmtFprintfIntInt-8 103ns ± 0% 102ns ± 0% -0.91% (p=0.000 n=5+6)
FmtFprintfPrefixedInt-8 119ns ± 0% 118ns ± 0% -0.60% (p=0.008 n=5+5)
FmtFprintfFloat-8 169ns ± 0% 171ns ± 0% +1.50% (p=0.004 n=5+6)
FmtManyArgs-8 445ns ± 0% 445ns ± 0% ~ (p=0.506 n=6+5)
GobDecode-8 4.37ms ± 1% 4.41ms ± 0% +0.79% (p=0.009 n=6+6)
GobEncode-8 3.07ms ± 0% 3.05ms ± 0% -0.42% (p=0.004 n=5+6)
Gzip-8 195ms ± 0% 194ms ± 0% -0.40% (p=0.009 n=5+6)
Gunzip-8 28.2ms ± 0% 28.9ms ± 0% +2.22% (p=0.004 n=5+6)
HTTPClientServer-8 45.0µs ± 1% 45.4µs ± 0% +0.97% (p=0.030 n=6+5)
JSONEncode-8 8.01ms ± 0% 7.95ms ± 0% -0.78% (p=0.008 n=5+5)
JSONDecode-8 35.3ms ± 1% 35.0ms ± 0% -1.04% (p=0.004 n=5+6)
Mandelbrot200-8 4.50ms ± 0% 4.50ms ± 0% ~ (p=0.662 n=6+5)
GoParse-8 3.03ms ± 1% 2.96ms ± 0% -2.41% (p=0.004 n=6+5)
RegexpMatchEasy0_32-8 55.4ns ± 0% 53.8ns ± 0% -2.83% (p=0.004 n=5+6)
RegexpMatchEasy0_1K-8 178ns ± 0% 162ns ± 1% -8.76% (p=0.004 n=5+6)
RegexpMatchEasy1_32-8 50.1ns ± 0% 49.6ns ± 0% -0.92% (p=0.004 n=5+6)
RegexpMatchEasy1_1K-8 271ns ± 1% 268ns ± 0% -1.15% (p=0.002 n=6+6)
RegexpMatchMedium_32-8 949ns ± 0% 862ns ± 0% -9.20% (p=0.008 n=5+5)
RegexpMatchMedium_1K-8 27.1µs ± 7% 27.4µs ± 7% ~ (p=0.589 n=6+6)
RegexpMatchHard_32-8 1.28µs ± 2% 1.27µs ± 1% ~ (p=0.065 n=6+6)
RegexpMatchHard_1K-8 38.5µs ± 0% 38.5µs ± 0% ~ (p=0.132 n=6+6)
Revcomp-8 397ms ± 0% 397ms ± 0% ~ (p=1.000 n=6+6)
Template-8 48.1ms ± 1% 47.8ms ± 0% -0.48% (p=0.016 n=5+5)
TimeParse-8 213ns ± 0% 213ns ± 0% ~ (p=0.467 n=4+6)
TimeFormat-8 295ns ± 1% 294ns ± 0% ~ (p=0.554 n=6+5)
[Geo mean] 40.5µs 40.2µs -0.81%

name old speed new speed delta
GobDecode-8 176MB/s ± 1% 174MB/s ± 0% -0.79% (p=0.009 n=6+6)
GobEncode-8 250MB/s ± 0% 251MB/s ± 0% +0.42% (p=0.004 n=5+6)
Gzip-8 100MB/s ± 0% 100MB/s ± 0% +0.40% (p=0.009 n=5+6)
Gunzip-8 687MB/s ± 0% 672MB/s ± 0% -2.17% (p=0.004 n=5+6)
JSONEncode-8 242MB/s ± 0% 244MB/s ± 0% +0.78% (p=0.008 n=5+5)
JSONDecode-8 54.9MB/s ± 1% 55.5MB/s ± 0% +1.05% (p=0.004 n=5+6)
GoParse-8 19.1MB/s ± 1% 19.6MB/s ± 0% +2.48% (p=0.004 n=6+5)
RegexpMatchEasy0_32-8 578MB/s ± 0% 594MB/s ± 0% +2.89% (p=0.008 n=5+5)
RegexpMatchEasy0_1K-8 5.74GB/s ± 1% 6.31GB/s ± 1% +9.95% (p=0.002 n=6+6)
RegexpMatchEasy1_32-8 639MB/s ± 0% 645MB/s ± 0% +0.93% (p=0.004 n=5+6)
RegexpMatchEasy1_1K-8 3.78GB/s ± 1% 3.82GB/s ± 0% +1.15% (p=0.002 n=6+6)
RegexpMatchMedium_32-8 33.7MB/s ± 0% 37.1MB/s ± 0% +10.15% (p=0.008 n=5+5)
RegexpMatchMedium_1K-8 37.9MB/s ± 6% 37.5MB/s ± 7% ~ (p=0.697 n=6+6)
RegexpMatchHard_32-8 24.9MB/s ± 2% 25.1MB/s ± 1% ~ (p=0.058 n=6+6)
RegexpMatchHard_1K-8 26.6MB/s ± 0% 26.6MB/s ± 0% ~ (p=0.195 n=6+6)
Revcomp-8 640MB/s ± 0% 641MB/s ± 0% ~ (p=1.000 n=6+6)
Template-8 40.4MB/s ± 1% 40.6MB/s ± 0% +0.47% (p=0.016 n=5+5)
[Geo mean] 175MB/s 178MB/s +1.56%
As Than McIntosh already mentioned, it is common practice to boost inlining for call sites inside for loops, since such call sites are likely hotter. This patch implements that functionality. The implementation uses a stack of for loops to recognize calls that are inside a loop. The stack is maintained as the inlnode function walks the tree and records the ancestor for loops of the current node in inlnode. A "big" for loop is one whose cost is >= inlineBigForCost (105); in such loops no boost is applied.

Updates golang#17566

The following are GO1 benchmark results; binary size did not increase significantly (10454800 -> 10475120, less than 0.3%).

goos: linux
goarch: amd64
pkg: test/bench/go1
cpu: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz

name old time/op new time/op delta
BinaryTree17-8 2.15s ± 1% 2.17s ± 1% ~ (p=0.065 n=6+6)
Fannkuch11-8 2.70s ± 0% 2.69s ± 0% -0.25% (p=0.010 n=6+4)
FmtFprintfEmpty-8 31.9ns ± 0% 31.4ns ± 0% -1.61% (p=0.008 n=5+5)
FmtFprintfString-8 57.0ns ± 0% 57.1ns ± 0% +0.26% (p=0.013 n=6+5)
FmtFprintfInt-8 65.2ns ± 0% 63.9ns ± 0% -1.95% (p=0.008 n=5+5)
FmtFprintfIntInt-8 103ns ± 0% 102ns ± 0% -1.01% (p=0.000 n=5+4)
FmtFprintfPrefixedInt-8 119ns ± 0% 118ns ± 0% -0.50% (p=0.008 n=5+5)
FmtFprintfFloat-8 169ns ± 0% 174ns ± 0% +2.75% (p=0.008 n=5+5)
FmtManyArgs-8 445ns ± 0% 447ns ± 0% +0.46% (p=0.002 n=6+6)
GobDecode-8 4.37ms ± 1% 4.40ms ± 0% +0.62% (p=0.009 n=6+6)
GobEncode-8 3.07ms ± 0% 3.04ms ± 0% -0.78% (p=0.004 n=5+6)
Gzip-8 195ms ± 0% 195ms ± 0% ~ (p=0.429 n=5+6)
Gunzip-8 28.2ms ± 0% 28.2ms ± 0% ~ (p=0.662 n=5+6)
HTTPClientServer-8 45.0µs ± 1% 45.4µs ± 1% ~ (p=0.093 n=6+6)
JSONEncode-8 8.01ms ± 0% 8.03ms ± 0% +0.31% (p=0.008 n=5+5)
JSONDecode-8 35.3ms ± 1% 35.1ms ± 0% -0.72% (p=0.008 n=5+5)
Mandelbrot200-8 4.50ms ± 0% 4.49ms ± 1% ~ (p=0.937 n=6+6)
GoParse-8 3.03ms ± 1% 3.00ms ± 1% ~ (p=0.180 n=6+6)
RegexpMatchEasy0_32-8 55.4ns ± 0% 53.2ns ± 3% -3.92% (p=0.004 n=5+6)
RegexpMatchEasy0_1K-8 178ns ± 0% 175ns ± 1% -1.57% (p=0.004 n=5+6)
RegexpMatchEasy1_32-8 50.1ns ± 0% 48.3ns ± 5% ~ (p=0.082 n=5+6)
RegexpMatchEasy1_1K-8 271ns ± 1% 262ns ± 1% -3.26% (p=0.004 n=6+5)
RegexpMatchMedium_32-8 949ns ± 0% 886ns ± 7% ~ (p=0.329 n=5+6)
RegexpMatchMedium_1K-8 27.1µs ± 7% 28.1µs ± 6% ~ (p=0.394 n=6+6)
RegexpMatchHard_32-8 1.28µs ± 2% 1.29µs ± 0% ~ (p=0.056 n=6+6)
RegexpMatchHard_1K-8 38.5µs ± 0% 38.4µs ± 0% -0.25% (p=0.009 n=6+5)
Revcomp-8 397ms ± 0% 396ms ± 0% ~ (p=0.429 n=6+5)
Template-8 48.1ms ± 1% 48.1ms ± 0% ~ (p=0.222 n=5+5)
TimeParse-8 213ns ± 0% 213ns ± 0% ~ (p=0.210 n=4+6)
TimeFormat-8 295ns ± 1% 259ns ± 0% -12.22% (p=0.002 n=6+6)
[Geo mean] 40.5µs 40.1µs -1.00%

name old speed new speed delta
GobDecode-8 176MB/s ± 1% 174MB/s ± 0% -0.61% (p=0.009 n=6+6)
GobEncode-8 250MB/s ± 0% 252MB/s ± 0% +0.79% (p=0.004 n=5+6)
Gzip-8 100MB/s ± 0% 100MB/s ± 0% ~ (p=0.351 n=5+6)
Gunzip-8 687MB/s ± 0% 687MB/s ± 0% ~ (p=0.662 n=5+6)
JSONEncode-8 242MB/s ± 0% 242MB/s ± 0% -0.31% (p=0.008 n=5+5)
JSONDecode-8 54.9MB/s ± 1% 55.3MB/s ± 0% +0.71% (p=0.008 n=5+5)
GoParse-8 19.1MB/s ± 1% 19.3MB/s ± 1% ~ (p=0.143 n=6+6)
RegexpMatchEasy0_32-8 578MB/s ± 0% 601MB/s ± 3% +4.10% (p=0.004 n=5+6)
RegexpMatchEasy0_1K-8 5.74GB/s ± 1% 5.85GB/s ± 1% +1.90% (p=0.002 n=6+6)
RegexpMatchEasy1_32-8 639MB/s ± 0% 663MB/s ± 4% ~ (p=0.082 n=5+6)
RegexpMatchEasy1_1K-8 3.78GB/s ± 1% 3.91GB/s ± 1% +3.38% (p=0.004 n=6+5)
RegexpMatchMedium_32-8 33.7MB/s ± 0% 36.2MB/s ± 7% ~ (p=0.268 n=5+6)
RegexpMatchMedium_1K-8 37.9MB/s ± 6% 36.5MB/s ± 6% ~ (p=0.411 n=6+6)
RegexpMatchHard_32-8 24.9MB/s ± 2% 24.8MB/s ± 0% ~ (p=0.063 n=6+6)
RegexpMatchHard_1K-8 26.6MB/s ± 0% 26.7MB/s ± 0% +0.25% (p=0.009 n=6+5)
Revcomp-8 640MB/s ± 0% 641MB/s ± 0% ~ (p=0.429 n=6+5)
Template-8 40.4MB/s ± 1% 40.3MB/s ± 0% ~ (p=0.222 n=5+5)
[Geo mean] 175MB/s 177MB/s +1.05%
The current inlining cost model is simplistic: every gc.Node in a function has a cost of one. However, the actual impact of each node varies widely. Some nodes (such as OKEY) are placeholders that never generate any code; others (such as OAPPEND) generate lots of code.
In addition to leading to bad inlining decisions, this design means that any refactoring that changes the AST structure can have unexpected and significant impact on compiled code. See CL 31674 for an example.
Inlining occurs near the beginning of compilation, which makes good predictions hard. For example, `new` or `make` or `&` might allocate (a large runtime call, much code generated) or not (near zero code generated). As another example, code guarded by `if false` still gets counted. As another example, we don't know whether bounds checks (which generate lots of code) will be eliminated or not.

One approach is to hand-write a better cost model: append is very expensive, things that might end up in a runtime call are moderately expensive, pure structure and variable/const declarations are cheap or free.
Another approach is to compile lots of code and generate a simple machine-built model (e.g. linear regression) from it.
I have tried both of these approaches, and believe both of them to be improvements, but did not mail either of them, for two reasons:
Three other related ideas:
cc @dr2chase @randall77 @ianlancetaylor @mdempsky