cmd/compile: SSA performance regression due to array zeroing #15925
/cc @randall77
The SSA backend doesn't handle arrays terribly well, as they are not SSAable. This leads to some extra copies that the old backend was able to avoid.
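As a hedged illustration (the type and function names here are mine, not from the issue), this is the shape of code that hit the extra copies: a small fixed-size array returned by value.

```go
package main

import "fmt"

// vec4 is a small fixed-size array, the kind of value discussed in this
// issue. Before the later fixes landed, returning one by value could go
// through a zeroed stack temporary plus a copy, instead of being written
// directly into the caller's destination.
type vec4 [4]float32

// add performs component-wise addition, returning the result by value.
func add(a, b vec4) vec4 {
	return vec4{a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]}
}

func main() {
	fmt.Println(add(vec4{1, 2, 3, 4}, vec4{4, 3, 2, 1})) // [5 5 5 5]
}
```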
Note the unnecessary zeroings of ~r1 and autotmp_5, and the copy autotmp_5 -> ~r1 -> tmp.
The old compiler does the direct write to the destination. This is on my radar to fix, but it isn't easy. Hopefully for 1.8.
If I changed it to vec4{x, y, z, w float32} (a struct), would that solve the problem?
It might. Another workaround is to not pass vectors around by value; use pointers instead.
I tested it; it's definitely faster with structs.
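A minimal sketch of the struct-based workaround (the names are illustrative, not from the thread): small structs are decomposed into individual SSA values, so the compiler can keep each field in a register.

```go
package main

import "fmt"

// vec4 uses named fields instead of an array. Small structs are blasted
// apart into separate SSA values, so x, y, z, w can live in registers and
// the stack temporaries required by the array form are avoided.
type vec4 struct{ x, y, z, w float32 }

// add performs component-wise addition on the struct form.
func add(a, b vec4) vec4 {
	return vec4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w}
}

func main() {
	fmt.Println(add(vec4{1, 2, 3, 4}, vec4{4, 3, 2, 1})) // {5 5 5 5}
}
```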
Why are structs and small arrays treated differently? Is blasting apart small arrays (the same way small structs are blasted apart) an easy optimization to add? I believe the old back ends did this for arrays of size <= 4 or something like that.
I believe it does not decompose arrays because it doesn't know how to handle non-constant indexing. There is a TODO about this in gc/ssa.go.
Interesting. So the old back end, because it just worked on stack words, saw that these words were being used, never had their address taken, and completely registerized them. And SSA isn't smart enough to do that because it decides about variable tracking early? The implication here, though, is that all the array refs are constant indexes. Maybe if you know the array is only ever accessed by constant index (and address not taken, but that's needed for structs already) then it can be split apart?
Yes, if we knew that all accesses were constant indexes we could registerize it.
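To make the constant-index condition concrete, here is a hedged sketch (function names are mine): only the first function qualifies for the decomposition discussed above, because the second indexes with a runtime value.

```go
package main

import "fmt"

// sumConst indexes the array only with constants, so the compiler can
// (with the later fixes) split it into four independent SSA values and
// keep them in registers.
func sumConst(a [4]int) int {
	return a[0] + a[1] + a[2] + a[3]
}

// pick indexes with a runtime value; the array must be materialized in
// memory so the load can be computed as base address plus scaled offset.
func pick(a [4]int, i int) int {
	return a[i]
}

func main() {
	a := [4]int{10, 20, 30, 40}
	fmt.Println(sumConst(a), pick(a, 2)) // 100 30
}
```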
I was measuring different approaches to an implementation. One specific case stood out: https://play.golang.org/p/vzC344lntX. There is a 20x performance difference.
Exhaustive benchmark of the different approaches: https://play.golang.org/p/6diXIPOWu5
(go version devel +11eaf42 Sat Apr 29 04:15:49 2017 +0000 windows/amd64)
I'm working on a change that allows const-indexed small arrays to be SSA'd. I'm not 100% happy with the approach (storing a flag on the array type instead of on the node to indicate that it was variably indexed), so I'm taking another look, but it does work and will solve this.
@tzneal wouldn't it be possible to rewrite variably indexed statements into switches? Just as a proof of concept: https://play.golang.org/p/m0n1E7Q2kg (I'm sure there is a way to avoid the performance hit introduced by the panic when you have access to codegen).
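A sketch of the rewrite idea from the comment above (the function name is mine, and this is source-level, whereas the suggestion is about codegen): a variable index over a 4-element array becomes a switch over the four constant indices, so every access in the body is a constant index.

```go
package main

import "fmt"

// indexSwitch turns a variable index into a switch over constant indices.
// Each case accesses the array with a constant, so the array itself can
// stay decomposed into registers rather than being spilled to memory.
func indexSwitch(a [4]float32, i int) float32 {
	switch i {
	case 0:
		return a[0]
	case 1:
		return a[1]
	case 2:
		return a[2]
	case 3:
		return a[3]
	default:
		panic("index out of range")
	}
}

func main() {
	fmt.Println(indexSwitch([4]float32{1.5, 2.5, 3.5, 4.5}, 2)) // 3.5
}
```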
I tested a scenario with an unrolled accumulator loop. In the following version, the accumulator values are written back to the stack on every loop iteration:

```go
func SumFloat64Unroll_array(buf []float64) float64 {
	var acc [4]float64
	for i := 0; i < len(buf); i += 4 {
		bb := (*[4]float64)(unsafe.Pointer(&buf[i]))
		acc[0] += bb[0]
		acc[1] += bb[1]
		acc[2] += bb[2]
		acc[3] += bb[3]
	}
	return acc[0] + acc[1] + acc[2] + acc[3]
}
```

Per iteration it generates:

```
movsd (CX)(DX*8), X0
addsd "".acc(SP), X0
movsd X0, "".acc(SP)
movsd 8(CX)(DX*8), X0
addsd "".acc+8(SP), X0
movsd X0, "".acc+8(SP)
movsd 16(CX)(DX*8), X0
addsd "".acc+16(SP), X0
movsd X0, "".acc+16(SP)
movsd 24(CX)(DX*8), X0
addsd "".acc+24(SP), X0
movsd X0, "".acc+24(SP)
```

vs. this version, where each accumulator remains in a register for the whole loop:

```go
func SumFloat64Unroll_vars(buf []float64) float64 {
	var acc0, acc1, acc2, acc3 float64
	for i := 0; i < len(buf); i += 4 {
		bb := (*[4]float64)(unsafe.Pointer(&buf[i]))
		acc0 += bb[0]
		acc1 += bb[1]
		acc2 += bb[2]
		acc3 += bb[3]
	}
	return acc0 + acc1 + acc2 + acc3
}
```

Per iteration:

```
movsd (CX)(DX*8), X4
addsd X4, X0
movsd 8(CX)(DX*8), X4
addsd X4, X1
movsd 16(CX)(DX*8), X4
addsd X4, X2
movsd 24(CX)(DX*8), X4
addsd X4, X3
```
Change https://golang.org/cl/106495 mentions this issue:
Propagate values through some wide Zero/Move operations. Among other things this allows us to optimize some kinds of array initialization. For example, the following code no longer requires a temporary to be allocated on the stack. Instead it writes the values directly into the return value.

```go
func f(i uint32) [4]uint32 {
	return [4]uint32{i, i + 1, i + 2, i + 3}
}
```

The return value is unnecessarily cleared, but removing that is probably a task for dead store analysis (I think it needs to be able to match multiple Store ops to wide Zero ops).

In order to reliably remove stack variables that are rendered unnecessary by these new rules I've added a new generic version of the unread autos elimination pass.

These rules are triggered more than 5000 times when building and testing the standard library.

Updates #15925 (fixes for arrays of up to 4 elements).
Updates #24386 (fixes for up to 4 kept elements).
Updates #24416.

compilebench results:

```
name      old time/op       new time/op       delta
Template  353ms ± 5%  359ms ± 3%  ~        (p=0.143 n=10+10)
Unicode   219ms ± 1%  217ms ± 4%  ~        (p=0.740 n=7+10)
GoTypes   1.26s ± 1%  1.26s ± 2%  ~        (p=0.549 n=9+10)
Compiler  6.00s ± 1%  6.08s ± 1%  +1.42%   (p=0.000 n=9+8)
SSA       15.3s ± 2%  15.6s ± 1%  +2.43%   (p=0.000 n=10+10)
Flate     237ms ± 2%  240ms ± 2%  +1.31%   (p=0.015 n=10+10)
GoParser  285ms ± 1%  285ms ± 1%  ~        (p=0.878 n=8+8)
Reflect   797ms ± 3%  807ms ± 2%  ~        (p=0.065 n=9+10)
Tar       334ms ± 0%  335ms ± 4%  ~        (p=0.460 n=8+10)
XML       419ms ± 0%  423ms ± 1%  +0.91%   (p=0.001 n=7+9)
StdCmd    46.0s ± 0%  46.4s ± 0%  +0.85%   (p=0.000 n=9+9)

name      old user-time/op  new user-time/op  delta
Template  337ms ± 3%   346ms ± 5%  ~        (p=0.053 n=9+10)
Unicode   205ms ±10%   205ms ± 8%  ~        (p=1.000 n=10+10)
GoTypes   1.22s ± 2%   1.21s ± 3%  ~        (p=0.436 n=10+10)
Compiler  5.85s ± 1%   5.93s ± 0%  +1.46%   (p=0.000 n=10+8)
SSA       14.9s ± 1%   15.3s ± 1%  +2.62%   (p=0.000 n=10+10)
Flate     229ms ± 4%   228ms ± 6%  ~        (p=0.796 n=10+10)
GoParser  271ms ± 3%   275ms ± 4%  ~        (p=0.165 n=10+10)
Reflect   779ms ± 5%   775ms ± 2%  ~        (p=0.971 n=10+10)
Tar       317ms ± 4%   319ms ± 5%  ~        (p=0.853 n=10+10)
XML       404ms ± 4%   409ms ± 5%  ~        (p=0.436 n=10+10)

name      old alloc/op      new alloc/op      delta
Template  34.9MB ± 0%  35.0MB ± 0%  +0.26%  (p=0.000 n=10+10)
Unicode   29.3MB ± 0%  29.3MB ± 0%  +0.02%  (p=0.000 n=10+10)
GoTypes   115MB ± 0%   115MB ± 0%   +0.30%  (p=0.000 n=10+10)
Compiler  519MB ± 0%   521MB ± 0%   +0.30%  (p=0.000 n=10+10)
SSA       1.55GB ± 0%  1.57GB ± 0%  +1.34%  (p=0.000 n=10+9)
Flate     24.1MB ± 0%  24.2MB ± 0%  +0.10%  (p=0.000 n=10+10)
GoParser  28.1MB ± 0%  28.1MB ± 0%  +0.07%  (p=0.000 n=10+10)
Reflect   78.7MB ± 0%  78.7MB ± 0%  +0.03%  (p=0.000 n=8+10)
Tar       34.4MB ± 0%  34.5MB ± 0%  +0.12%  (p=0.000 n=10+10)
XML       43.2MB ± 0%  43.2MB ± 0%  +0.13%  (p=0.000 n=10+10)

name      old allocs/op     new allocs/op     delta
Template  330k ± 0%    330k ± 0%    -0.01%  (p=0.017 n=10+10)
Unicode   337k ± 0%    337k ± 0%    +0.01%  (p=0.000 n=9+10)
GoTypes   1.15M ± 0%   1.15M ± 0%   +0.03%  (p=0.000 n=10+10)
Compiler  4.77M ± 0%   4.77M ± 0%   +0.03%  (p=0.000 n=9+10)
SSA       12.5M ± 0%   12.6M ± 0%   +1.16%  (p=0.000 n=10+10)
Flate     221k ± 0%    221k ± 0%    +0.05%  (p=0.000 n=9+10)
GoParser  275k ± 0%    275k ± 0%    +0.01%  (p=0.014 n=10+9)
Reflect   944k ± 0%    944k ± 0%    -0.02%  (p=0.000 n=10+10)
Tar       324k ± 0%    323k ± 0%    -0.12%  (p=0.000 n=10+10)
XML       384k ± 0%    384k ± 0%    -0.01%  (p=0.001 n=10+10)

name      old object-bytes  new object-bytes  delta
Template  476kB ± 0%   476kB ± 0%   -0.04%  (p=0.000 n=10+10)
Unicode   218kB ± 0%   218kB ± 0%   ~       (all equal)
GoTypes   1.58MB ± 0%  1.58MB ± 0%  -0.04%  (p=0.000 n=10+10)
Compiler  6.25MB ± 0%  6.24MB ± 0%  -0.09%  (p=0.000 n=10+10)
SSA       15.9MB ± 0%  16.1MB ± 0%  +1.22%  (p=0.000 n=10+10)
Flate     304kB ± 0%   304kB ± 0%   -0.13%  (p=0.000 n=10+10)
GoParser  370kB ± 0%   370kB ± 0%   -0.00%  (p=0.000 n=10+10)
Reflect   1.27MB ± 0%  1.27MB ± 0%  -0.12%  (p=0.000 n=10+10)
Tar       421kB ± 0%   419kB ± 0%   -0.64%  (p=0.000 n=10+10)
XML       518kB ± 0%   517kB ± 0%   -0.12%  (p=0.000 n=10+10)

name      old export-bytes  new export-bytes  delta
Template  16.7kB ± 0%  16.7kB ± 0%  ~       (all equal)
Unicode   6.52kB ± 0%  6.52kB ± 0%  ~       (all equal)
GoTypes   29.2kB ± 0%  29.2kB ± 0%  ~       (all equal)
Compiler  88.0kB ± 0%  88.0kB ± 0%  ~       (all equal)
SSA       109kB ± 0%   109kB ± 0%   ~       (all equal)
Flate     4.49kB ± 0%  4.49kB ± 0%  ~       (all equal)
GoParser  8.10kB ± 0%  8.10kB ± 0%  ~       (all equal)
Reflect   7.71kB ± 0%  7.71kB ± 0%  ~       (all equal)
Tar       9.15kB ± 0%  9.15kB ± 0%  ~       (all equal)
XML       12.3kB ± 0%  12.3kB ± 0%  ~       (all equal)

name       old text-bytes  new text-bytes  delta
HelloSize  676kB ± 0%   672kB ± 0%   -0.59%  (p=0.000 n=10+10)
CmdGoSize  7.26MB ± 0%  7.24MB ± 0%  -0.18%  (p=0.000 n=10+10)

name       old data-bytes  new data-bytes  delta
HelloSize  10.2kB ± 0%  10.2kB ± 0%  ~       (all equal)
CmdGoSize  248kB ± 0%   248kB ± 0%   ~       (all equal)

name       old bss-bytes   new bss-bytes   delta
HelloSize  125kB ± 0%   125kB ± 0%   ~       (all equal)
CmdGoSize  145kB ± 0%   145kB ± 0%   ~       (all equal)

name       old exe-bytes   new exe-bytes   delta
HelloSize  1.46MB ± 0%  1.45MB ± 0%  -0.31%  (p=0.000 n=10+10)
CmdGoSize  14.7MB ± 0%  14.7MB ± 0%  -0.17%  (p=0.000 n=10+10)
```

Change-Id: Ic72b0c189dd542f391e1c9ab88a76e9148dc4285
Reviewed-on: https://go-review.googlesource.com/106495
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
…oat64 this is for performance reasons, see: - golang/go#15925
Small arrays are not SSA-able. By moving to regular variables these values are kept in registers rather than on the stack. golang/go#15925

```
name         old time/op  new time/op  delta
Hash/100-8   20.2ns ± 3%  17.6ns ± 4%  -12.89%  (p=0.000 n=29+30)
Hash/1024-8  109ns ± 2%   84ns ± 2%    -23.51%  (p=0.000 n=30+26)
Hash/8192-8  824ns ± 3%   594ns ± 1%   -27.98%  (p=0.000 n=29+26)

name         old speed    new speed    delta
Hash/100-8   4.95GB/s ± 2%  5.69GB/s ± 4%   +14.86%  (p=0.000 n=28+29)
Hash/1024-8  9.35GB/s ± 3%  12.24GB/s ± 2%  +30.88%  (p=0.000 n=30+26)
Hash/8192-8  9.94GB/s ± 3%  13.80GB/s ± 1%  +38.83%  (p=0.000 n=29+26)
```
go version devel +bbd1dcd Wed Jun 1 21:32:46 2016 +0000 darwin/amd64
https://github.com/hydroflame/go1.7bench
This very simple program takes two 4-element vectors and performs component-wise addition.
If you run the benchmark with SSA and without it, you will see that it is much faster without SSA.
I expect the binary generated with SSA to be at least as fast as the old one.
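The actual benchmark lives in the linked repository; as a hedged sketch of the pattern it measures (names are mine), here is a by-value vector addition wrapped in Go's `testing.Benchmark` so it runs as a plain program:

```go
package main

import (
	"fmt"
	"testing"
)

// vec is a 4-element vector passed and returned by value, the pattern the
// report says regressed under the SSA backend.
type vec [4]float32

func add(a, b vec) vec {
	return vec{a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]}
}

// sink keeps the result live so the compiler cannot discard the loop body.
var sink vec

func main() {
	x, y := vec{1, 2, 3, 4}, vec{4, 3, 2, 1}
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = add(x, y)
		}
	})
	fmt.Println(res.NsPerOp() >= 0, sink)
}
```

Comparing the per-op timings of the array-based and struct-based variants of `add` under old and new toolchains reproduces the gap described above.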
This particular example is extracted from one of my bigger libraries:
github.com/luxengine/glm
For more examples of poor SSA performance, run the benchmarks in the linked repositories.