runtime: implement GC stack barriers
This commit implements stack barriers to minimize the amount of
stack re-scanning that must be done during mark termination.

Currently the GC scans stacks of active goroutines twice during every
GC cycle: once at the beginning during root discovery and once at the
end during mark termination. The second scan happens while the world
is stopped and guarantees that we've seen all of the roots (since
there are no write barriers on writes to local stack
variables). However, this means pause time is proportional to stack
size. In particularly recursive programs, this can drive pause time up
past our 10ms goal (e.g., it takes about 150ms to scan a 50MB heap).

Re-scanning the entire stack is rarely necessary, especially for large
stacks, because usually most of the frames on the stack were not
active between the first and second scans and hence any changes to
these frames (via non-escaping pointers passed down the stack) were
tracked by write barriers.
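
As a rough illustration of the "pointers passed down the stack" case, the
sketch below has a caller hand the address of one of its own locals to a
callee, which then writes through it. The names setChild and parent are
hypothetical. Whether the compiler inlines the call or actually emits a
barrier depends on the toolchain version; the point is only that the write
lands in a frame that is not the one currently executing.

```go
package main

import "fmt"

type node struct{ val int }

// setChild writes through a pointer received from its caller. From inside
// this function the compiler cannot tell whether slot points into the heap
// or into a caller's frame, which is why (per the commit message) such a
// write is covered by a write barrier.
func setChild(slot **node, n *node) {
	*slot = n
}

// parent keeps a pointer variable in its own frame and passes its address
// down the stack. While setChild runs, parent's frame is not the active one.
func parent() int {
	var local *node
	setChild(&local, &node{val: 42})
	return local.val
}

func main() {
	fmt.Println(parent())
}
```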

To efficiently track how far a stack has been unwound since the first
scan (and, hence, how much needs to be re-scanned), this commit
introduces stack barriers. During the first scan, at exponentially
spaced points in each stack, the scan overwrites return PCs with the
PC of the stack barrier function. When "returned" to, the stack
barrier function records how far the stack has unwound and jumps to
the original return PC for that point in the stack. Then the second
scan only needs to proceed as far as the lowest barrier that hasn't
been hit.
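
The mechanism can be modeled in ordinary Go. This is a simplified sketch,
not the runtime code: it counts whole frames rather than bytes, and the
goroutine and barrier types, barrierPC, and all method names are
assumptions made for illustration. Only savedLRVal and stkbarPos are names
taken from the assembly in this change.

```go
package main

import "fmt"

// barrierPC stands in for the PC of the stack barrier function.
const barrierPC = ^uintptr(0)

// barrier records one overwritten return PC (hypothetical model type).
type barrier struct {
	frame      int     // index of the frame whose return PC was replaced
	savedLRVal uintptr // original return PC
}

// goroutine models a stack as a slice of return PCs, outermost frame
// first, so the last element belongs to the innermost frame.
type goroutine struct {
	retPCs    []uintptr
	stkbar    []barrier
	stkbarPos int // index of the next barrier that will be hit
}

func newGoroutine(frames int) *goroutine {
	g := &goroutine{retPCs: make([]uintptr, frames)}
	for i := range g.retPCs {
		g.retPCs[i] = uintptr(0x1000 + i)
	}
	return g
}

// installBarriers overwrites return PCs at distances 1, 2, 4, ... frames
// from the top of the stack, as the first scan would.
func (g *goroutine) installBarriers() {
	for off := 1; off <= len(g.retPCs); off *= 2 {
		i := len(g.retPCs) - off
		g.stkbar = append(g.stkbar, barrier{frame: i, savedLRVal: g.retPCs[i]})
		g.retPCs[i] = barrierPC
	}
}

// ret pops the innermost frame and returns the PC execution resumes at.
// Hitting a barrier advances stkbarPos and yields the saved PC instead.
func (g *goroutine) ret() uintptr {
	i := len(g.retPCs) - 1
	pc := g.retPCs[i]
	g.retPCs = g.retPCs[:i]
	if pc == barrierPC {
		pc = g.stkbar[g.stkbarPos].savedLRVal
		g.stkbarPos++
	}
	return pc
}

// rescanFrames reports how many frames the second scan must visit: from
// the top of the stack down to the first barrier that has not been hit,
// or the whole stack if every barrier was hit.
func (g *goroutine) rescanFrames() int {
	if g.stkbarPos == len(g.stkbar) {
		return len(g.retPCs)
	}
	return len(g.retPCs) - g.stkbar[g.stkbarPos].frame
}

func main() {
	g := newGoroutine(100)
	g.installBarriers()
	for i := 0; i < 10; i++ {
		g.ret()
	}
	fmt.Println("barriers hit:", g.stkbarPos)
	fmt.Println("frames to re-scan:", g.rescanFrames(), "of", len(g.retPCs))
}
```

In this model, a 100-frame stack that unwinds by 10 frames hits four
barriers, and the second scan visits 6 of the remaining 90 frames.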

For deeply recursive programs, this substantially reduces mark
termination time (and hence pause time). For the goscheme example
linked in issue #10898, prior to this change, mark termination times
were typically between 100 and 500ms; with this change, mark
termination times are typically between 10 and 20ms. As a result of
the reduced stack scanning work, this reduces overall execution time
of the goscheme example by 20%.

Fixes #10898.

The effect of this on programs that are not deeply recursive is
minimal:

name                   old time/op    new time/op    delta
BinaryTree17              3.16s ± 2%     3.26s ± 1%  +3.31%  (p=0.000 n=19+19)
Fannkuch11                2.42s ± 1%     2.48s ± 1%  +2.24%  (p=0.000 n=17+19)
FmtFprintfEmpty          50.0ns ± 3%    49.8ns ± 1%    ~     (p=0.534 n=20+19)
FmtFprintfString          173ns ± 0%     175ns ± 0%  +1.49%  (p=0.000 n=16+19)
FmtFprintfInt             170ns ± 1%     175ns ± 1%  +2.97%  (p=0.000 n=20+19)
FmtFprintfIntInt          288ns ± 0%     295ns ± 0%  +2.73%  (p=0.000 n=16+19)
FmtFprintfPrefixedInt     242ns ± 1%     252ns ± 1%  +4.13%  (p=0.000 n=18+18)
FmtFprintfFloat           324ns ± 0%     323ns ± 0%  -0.36%  (p=0.000 n=20+19)
FmtManyArgs              1.14µs ± 0%    1.12µs ± 1%  -1.01%  (p=0.000 n=18+19)
GobDecode                8.88ms ± 1%    8.87ms ± 0%    ~     (p=0.480 n=19+18)
GobEncode                6.80ms ± 1%    6.85ms ± 0%  +0.82%  (p=0.000 n=20+18)
Gzip                      363ms ± 1%     363ms ± 1%    ~     (p=0.077 n=18+20)
Gunzip                   90.6ms ± 0%    90.0ms ± 1%  -0.71%  (p=0.000 n=17+18)
HTTPClientServer         51.5µs ± 1%    50.8µs ± 1%  -1.32%  (p=0.000 n=18+18)
JSONEncode               17.0ms ± 0%    17.1ms ± 0%  +0.40%  (p=0.000 n=18+17)
JSONDecode               61.8ms ± 0%    63.8ms ± 1%  +3.11%  (p=0.000 n=18+17)
Mandelbrot200            3.84ms ± 0%    3.84ms ± 1%    ~     (p=0.583 n=19+19)
GoParse                  3.71ms ± 1%    3.72ms ± 1%    ~     (p=0.159 n=18+19)
RegexpMatchEasy0_32       100ns ± 0%     100ns ± 1%  -0.19%  (p=0.033 n=17+19)
RegexpMatchEasy0_1K       342ns ± 1%     331ns ± 0%  -3.41%  (p=0.000 n=19+19)
RegexpMatchEasy1_32      82.5ns ± 0%    81.7ns ± 0%  -0.98%  (p=0.000 n=18+18)
RegexpMatchEasy1_1K       505ns ± 0%     494ns ± 1%  -2.16%  (p=0.000 n=18+18)
RegexpMatchMedium_32      137ns ± 1%     137ns ± 1%  -0.24%  (p=0.048 n=20+18)
RegexpMatchMedium_1K     41.6µs ± 0%    41.3µs ± 1%  -0.57%  (p=0.004 n=18+20)
RegexpMatchHard_32       2.11µs ± 0%    2.11µs ± 1%  +0.20%  (p=0.037 n=17+19)
RegexpMatchHard_1K       63.9µs ± 2%    63.3µs ± 0%  -0.99%  (p=0.000 n=20+17)
Revcomp                   560ms ± 1%     522ms ± 0%  -6.87%  (p=0.000 n=18+16)
Template                 75.0ms ± 0%    75.1ms ± 1%  +0.18%  (p=0.013 n=18+19)
TimeParse                 358ns ± 1%     364ns ± 0%  +1.74%  (p=0.000 n=20+15)
TimeFormat                360ns ± 0%     372ns ± 0%  +3.55%  (p=0.000 n=20+18)

Change-Id: If8a9bfae6c128d15a4f405e02bcfa50129df82a2
Reviewed-on: https://go-review.googlesource.com/10314
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
aclements committed Jun 2, 2015
1 parent 724f829 commit faa7a7e
Showing 13 changed files with 481 additions and 23 deletions.
34 changes: 32 additions & 2 deletions src/runtime/asm_386.s
@@ -341,6 +341,22 @@ TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0-0
MOVL $0, DX
JMP runtime·morestack(SB)

TEXT runtime·stackBarrier(SB),NOSPLIT,$0
// We came here via a RET to an overwritten return PC.
// AX may be live. Other registers are available.

// Get the original return PC, g.stkbar[g.stkbarPos].savedLRVal.
get_tls(CX)
MOVL g(CX), CX
MOVL (g_stkbar+slice_array)(CX), DX
MOVL g_stkbarPos(CX), BX
IMULL $stkbar__size, BX // Too big for SIB.
MOVL stkbar_savedLRVal(DX)(BX*1), BX
// Record that this stack barrier was hit.
ADDL $1, g_stkbarPos(CX)
// Jump to the original return PC.
JMP BX

// reflectcall: call a function with the given argument list
// func call(argtype *_type, f *FuncVal, arg *byte, argsize, retoffset uint32).
// we don't have variable-sized frames, so we use a small number
@@ -860,17 +876,31 @@ TEXT runtime·stackcheck(SB), NOSPLIT, $0-0
INT $3
RET

TEXT runtime·getcallerpc(SB),NOSPLIT,$0-8
TEXT runtime·getcallerpc(SB),NOSPLIT,$4-8
MOVL argp+0(FP),AX // addr of first arg
MOVL -4(AX),AX // get calling pc
CMPL AX, runtime·stackBarrierPC(SB)
JNE nobar
// Get original return PC.
CALL runtime·nextBarrierPC(SB)
MOVL 0(SP), AX
nobar:
MOVL AX, ret+4(FP)
RET

TEXT runtime·setcallerpc(SB),NOSPLIT,$0-8
TEXT runtime·setcallerpc(SB),NOSPLIT,$4-8
MOVL argp+0(FP),AX // addr of first arg
MOVL pc+4(FP), BX
MOVL -4(AX), CX
CMPL CX, runtime·stackBarrierPC(SB)
JEQ setbar
MOVL BX, -4(AX) // set calling pc
RET
setbar:
// Set the stack barrier return PC.
MOVL BX, 0(SP)
CALL runtime·setNextBarrierPC(SB)
RET

TEXT runtime·getcallersp(SB), NOSPLIT, $0-8
MOVL argp+0(FP), AX
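
The load in stackBarrier, g.stkbar[g.stkbarPos].savedLRVal, is an ordinary
slice index written out by hand in assembly. The sketch below spells out
the same address arithmetic in Go. It is illustrative only: the savedLRPtr
field is an assumption (only savedLRVal appears in this diff), and
savedLRValAt is a hypothetical helper. It uses unsafe.Add, which needs
Go 1.17 or later.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stkbar is modeled on the record the assembly indexes. savedLRVal is the
// field named in the assembly; savedLRPtr is an assumed extra field.
type stkbar struct {
	savedLRPtr uintptr
	savedLRVal uintptr
}

// savedLRValAt computes base + pos*sizeof(stkbar) + offsetof(savedLRVal)
// and loads the word at that address, mirroring the assembly sequence.
func savedLRValAt(bars []stkbar, pos int) uintptr {
	var elem stkbar
	base := unsafe.Pointer(&bars[0])          // (g_stkbar+slice_array)
	off := uintptr(pos) * unsafe.Sizeof(elem) // multiply by $stkbar__size
	off += unsafe.Offsetof(elem.savedLRVal)   // stkbar_savedLRVal offset
	return *(*uintptr)(unsafe.Add(base, off))
}

func main() {
	bars := []stkbar{{1, 10}, {2, 20}, {3, 30}}
	// Both expressions read the same word.
	fmt.Println(savedLRValAt(bars, 1), bars[1].savedLRVal)
}
```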
34 changes: 32 additions & 2 deletions src/runtime/asm_amd64.s
@@ -336,6 +336,22 @@ TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0
MOVL $0, DX
JMP runtime·morestack(SB)

TEXT runtime·stackBarrier(SB),NOSPLIT,$0
// We came here via a RET to an overwritten return PC.
// AX may be live. Other registers are available.

// Get the original return PC, g.stkbar[g.stkbarPos].savedLRVal.
get_tls(CX)
MOVQ g(CX), CX
MOVQ (g_stkbar+slice_array)(CX), DX
MOVQ g_stkbarPos(CX), BX
IMULQ $stkbar__size, BX // Too big for SIB.
MOVQ stkbar_savedLRVal(DX)(BX*1), BX
// Record that this stack barrier was hit.
ADDQ $1, g_stkbarPos(CX)
// Jump to the original return PC.
JMP BX

// reflectcall: call a function with the given argument list
// func call(argtype *_type, f *FuncVal, arg *byte, argsize, retoffset uint32).
// we don't have variable-sized frames, so we use a small number
@@ -860,17 +876,31 @@ TEXT runtime·stackcheck(SB), NOSPLIT, $0-0
INT $3
RET

TEXT runtime·getcallerpc(SB),NOSPLIT,$0-16
TEXT runtime·getcallerpc(SB),NOSPLIT,$8-16
MOVQ argp+0(FP),AX // addr of first arg
MOVQ -8(AX),AX // get calling pc
CMPQ AX, runtime·stackBarrierPC(SB)
JNE nobar
// Get original return PC.
CALL runtime·nextBarrierPC(SB)
MOVQ 0(SP), AX
nobar:
MOVQ AX, ret+8(FP)
RET

TEXT runtime·setcallerpc(SB),NOSPLIT,$0-16
TEXT runtime·setcallerpc(SB),NOSPLIT,$8-16
MOVQ argp+0(FP),AX // addr of first arg
MOVQ pc+8(FP), BX
MOVQ -8(AX), CX
CMPQ CX, runtime·stackBarrierPC(SB)
JEQ setbar
MOVQ BX, -8(AX) // set calling pc
RET
setbar:
// Set the stack barrier return PC.
MOVQ BX, 0(SP)
CALL runtime·setNextBarrierPC(SB)
RET

TEXT runtime·getcallersp(SB),NOSPLIT,$0-16
MOVQ argp+0(FP), AX
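
The getcallerpc and setcallerpc changes follow one pattern on every
architecture: if the return PC slot holds the stack barrier PC, the real
value lives in the next barrier record, so reads and writes are redirected
there. A hedged Go model of that control flow follows. The g and barrier
types, the constant's value, and the bodies of nextBarrierPC and
setNextBarrierPC are assumptions, since this diff shows only their call
sites.

```go
package main

import "fmt"

// stackBarrierPC stands in for the address of runtime·stackBarrier.
// The value is arbitrary and only meaningful inside this model.
const stackBarrierPC uintptr = 0xbad0

type barrier struct {
	savedLRVal uintptr // original return PC hidden behind the barrier
}

type g struct {
	stkbar    []barrier
	stkbarPos int // index of the next barrier to be hit
}

// nextBarrierPC and setNextBarrierPC are assumed to read and write the
// saved PC of the next barrier; the real bodies are not in this diff.
func (gp *g) nextBarrierPC() uintptr      { return gp.stkbar[gp.stkbarPos].savedLRVal }
func (gp *g) setNextBarrierPC(pc uintptr) { gp.stkbar[gp.stkbarPos].savedLRVal = pc }

// getcallerpc returns the caller's true return PC, looking through a
// barrier if one has been installed in the slot.
func (gp *g) getcallerpc(slot *uintptr) uintptr {
	if pc := *slot; pc != stackBarrierPC {
		return pc
	}
	return gp.nextBarrierPC()
}

// setcallerpc updates the return PC without removing an installed barrier.
func (gp *g) setcallerpc(slot *uintptr, pc uintptr) {
	if *slot == stackBarrierPC {
		gp.setNextBarrierPC(pc)
		return
	}
	*slot = pc
}

func main() {
	gp := &g{stkbar: []barrier{{savedLRVal: 0x1234}}}
	plain := uintptr(0x1111)
	hooked := stackBarrierPC

	fmt.Printf("%#x %#x\n", gp.getcallerpc(&plain), gp.getcallerpc(&hooked))
	gp.setcallerpc(&hooked, 0x5678)
	fmt.Printf("%#x %#x\n", hooked, gp.stkbar[0].savedLRVal)
}
```

Keeping the barrier PC in the slot matters: if setcallerpc overwrote it
directly, the barrier would never fire and stkbarPos would fall out of
step with the stack.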
35 changes: 33 additions & 2 deletions src/runtime/asm_amd64p32.s
@@ -289,6 +289,23 @@ TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0
MOVL $0, DX
JMP runtime·morestack(SB)

TEXT runtime·stackBarrier(SB),NOSPLIT,$0
// We came here via a RET to an overwritten return PC.
// AX may be live. Other registers are available.

// Get the original return PC, g.stkbar[g.stkbarPos].savedLRVal.
get_tls(CX)
MOVL g(CX), CX
MOVL (g_stkbar+slice_array)(CX), DX
MOVL g_stkbarPos(CX), BX
IMULL $stkbar__size, BX // Too big for SIB.
ADDL DX, BX
MOVL stkbar_savedLRVal(BX), BX
// Record that this stack barrier was hit.
ADDL $1, g_stkbarPos(CX)
// Jump to the original return PC.
JMP BX

// reflectcall: call a function with the given argument list
// func call(argtype *_type, f *FuncVal, arg *byte, argsize, retoffset uint32).
// we don't have variable-sized frames, so we use a small number
@@ -616,17 +633,31 @@ TEXT runtime·memclr(SB),NOSPLIT,$0-8
STOSB
RET

TEXT runtime·getcallerpc(SB),NOSPLIT,$0-12
TEXT runtime·getcallerpc(SB),NOSPLIT,$8-12
MOVL argp+0(FP),AX // addr of first arg
MOVL -8(AX),AX // get calling pc
CMPL AX, runtime·stackBarrierPC(SB)
JNE nobar
// Get original return PC.
CALL runtime·nextBarrierPC(SB)
MOVL 0(SP), AX
nobar:
MOVL AX, ret+8(FP)
RET

TEXT runtime·setcallerpc(SB),NOSPLIT,$0-8
TEXT runtime·setcallerpc(SB),NOSPLIT,$8-8
MOVL argp+0(FP),AX // addr of first arg
MOVL pc+4(FP), BX // pc to set
MOVL -8(AX), CX
CMPL CX, runtime·stackBarrierPC(SB)
JEQ setbar
MOVQ BX, -8(AX) // set calling pc
RET
setbar:
// Set the stack barrier return PC.
MOVL BX, 0(SP)
CALL runtime·setNextBarrierPC(SB)
RET

TEXT runtime·getcallersp(SB),NOSPLIT,$0-12
MOVL argp+0(FP), AX
41 changes: 37 additions & 4 deletions src/runtime/asm_arm.s
@@ -309,6 +309,23 @@ TEXT runtime·morestack_noctxt(SB),NOSPLIT,$-4-0
MOVW $0, R7
B runtime·morestack(SB)

TEXT runtime·stackBarrier(SB),NOSPLIT,$0
// We came here via a RET to an overwritten LR.
// R0 may be live. Other registers are available.

// Get the original return PC, g.stkbar[g.stkbarPos].savedLRVal.
MOVW (g_stkbar+slice_array)(g), R4
MOVW g_stkbarPos(g), R5
MOVW $stkbar__size, R6
MUL R5, R6
ADD R4, R6
MOVW stkbar_savedLRVal(R6), R6
// Record that this stack barrier was hit.
ADD $1, R5
MOVW R5, g_stkbarPos(g)
// Jump to the original return PC.
B (R6)

// reflectcall: call a function with the given argument list
// func call(argtype *_type, f *FuncVal, arg *byte, argsize, retoffset uint32).
// we don't have variable-sized frames, so we use a small number
@@ -645,14 +662,30 @@ TEXT setg<>(SB),NOSPLIT,$-4-0
MOVW g, R0
RET

TEXT runtime·getcallerpc(SB),NOSPLIT,$-4-8
MOVW 0(R13), R0
TEXT runtime·getcallerpc(SB),NOSPLIT,$4-8
MOVW 8(R13), R0 // LR saved by caller
MOVW runtime·stackBarrierPC(SB), R1
CMP R0, R1
BNE nobar
// Get original return PC.
BL runtime·nextBarrierPC(SB)
MOVW 4(R13), R0
nobar:
MOVW R0, ret+4(FP)
RET

TEXT runtime·setcallerpc(SB),NOSPLIT,$-4-8
TEXT runtime·setcallerpc(SB),NOSPLIT,$4-8
MOVW pc+4(FP), R0
MOVW R0, 0(R13)
MOVW 8(R13), R1
MOVW runtime·stackBarrierPC(SB), R2
CMP R1, R2
BEQ setbar
MOVW R0, 8(R13) // set LR in caller
RET
setbar:
// Set the stack barrier return PC.
MOVW R0, 4(R13)
BL runtime·setNextBarrierPC(SB)
RET

TEXT runtime·getcallersp(SB),NOSPLIT,$-4-8
41 changes: 37 additions & 4 deletions src/runtime/asm_arm64.s
@@ -307,6 +307,23 @@ TEXT runtime·morestack_noctxt(SB),NOSPLIT,$-4-0
MOVW $0, R26
B runtime·morestack(SB)

TEXT runtime·stackBarrier(SB),NOSPLIT,$0
// We came here via a RET to an overwritten LR.
// R0 may be live (see return0). Other registers are available.

// Get the original return PC, g.stkbar[g.stkbarPos].savedLRVal.
MOVD (g_stkbar+slice_array)(g), R4
MOVD g_stkbarPos(g), R5
MOVD $stkbar__size, R6
MUL R5, R6
ADD R4, R6
MOVD stkbar_savedLRVal(R6), R6
// Record that this stack barrier was hit.
ADD $1, R5
MOVD R5, g_stkbarPos(g)
// Jump to the original return PC.
B (R6)

// reflectcall: call a function with the given argument list
// func call(argtype *_type, f *FuncVal, arg *byte, argsize, retoffset uint32).
// we don't have variable-sized frames, so we use a small number
@@ -743,14 +760,30 @@ TEXT setg_gcc<>(SB),NOSPLIT,$8
MOVD savedR27-8(SP), R27
RET

TEXT runtime·getcallerpc(SB),NOSPLIT,$-8-16
MOVD 0(RSP), R0
TEXT runtime·getcallerpc(SB),NOSPLIT,$8-16
MOVD 16(RSP), R0 // LR saved by caller
MOVD runtime·stackBarrierPC(SB), R1
CMP R0, R1
BNE nobar
// Get original return PC.
BL runtime·nextBarrierPC(SB)
MOVD 8(RSP), R0
nobar:
MOVD R0, ret+8(FP)
RET

TEXT runtime·setcallerpc(SB),NOSPLIT,$-8-16
TEXT runtime·setcallerpc(SB),NOSPLIT,$8-16
MOVD pc+8(FP), R0
MOVD R0, 0(RSP) // set calling pc
MOVD 16(RSP), R1
MOVD runtime·stackBarrierPC(SB), R2
CMP R1, R2
BEQ setbar
MOVD R0, 16(RSP) // set LR in caller
RET
setbar:
// Set the stack barrier return PC.
MOVD R0, 8(RSP)
BL runtime·setNextBarrierPC(SB)
RET

TEXT runtime·getcallersp(SB),NOSPLIT,$0-16
42 changes: 38 additions & 4 deletions src/runtime/asm_ppc64x.s
@@ -304,6 +304,24 @@ TEXT runtime·morestack_noctxt(SB),NOSPLIT,$-8-0
MOVD R0, R11
BR runtime·morestack(SB)

TEXT runtime·stackBarrier(SB),NOSPLIT,$0
// We came here via a RET to an overwritten LR.
// R3 may be live. Other registers are available.

// Get the original return PC, g.stkbar[g.stkbarPos].savedLRVal.
MOVD (g_stkbar+slice_array)(g), R4
MOVD g_stkbarPos(g), R5
MOVD $stkbar__size, R6
MULLD R5, R6
ADD R4, R6
MOVD stkbar_savedLRVal(R6), R6
// Record that this stack barrier was hit.
ADD $1, R5
MOVD R5, g_stkbarPos(g)
// Jump to the original return PC.
MOVD R6, CTR
BR (CTR)

// reflectcall: call a function with the given argument list
// func call(argtype *_type, f *FuncVal, arg *byte, argsize, retoffset uint32).
// we don't have variable-sized frames, so we use a small number
@@ -883,15 +901,31 @@ TEXT setg_gcc<>(SB),NOSPLIT,$-8-0
MOVD R4, LR
RET

TEXT runtime·getcallerpc(SB),NOSPLIT,$-8-16
MOVD 0(R1), R3
TEXT runtime·getcallerpc(SB),NOSPLIT,$8-16
MOVD 16(R1), R3 // LR saved by caller
MOVD runtime·stackBarrierPC(SB), R4
CMP R3, R4
BNE nobar
// Get original return PC.
BL runtime·nextBarrierPC(SB)
MOVD 8(R1), R3
nobar:
MOVD R3, ret+8(FP)
RETURN

TEXT runtime·setcallerpc(SB),NOSPLIT,$-8-16
TEXT runtime·setcallerpc(SB),NOSPLIT,$8-16
MOVD pc+8(FP), R3
MOVD R3, 0(R1) // set calling pc
MOVD 16(R1), R4
MOVD runtime·stackBarrierPC(SB), R5
CMP R4, R5
BEQ setbar
MOVD R3, 16(R1) // set LR in caller
RETURN
setbar:
// Set the stack barrier return PC.
MOVD R3, 8(R1)
BL runtime·setNextBarrierPC(SB)
RET

TEXT runtime·getcallersp(SB),NOSPLIT,$0-16
MOVD argp+0(FP), R3
18 changes: 17 additions & 1 deletion src/runtime/mbarrier.go
@@ -27,6 +27,9 @@ import "unsafe"
// slot is the destination (dst) in go code
// ptr is the value that goes into the slot (src) in the go code
//
//
// Dealing with memory ordering:
//
// Dijkstra pointed out that maintaining the no black to white
// pointers means that white to white pointers need not
// to be noted by the write barrier. Furthermore if either
@@ -54,7 +57,20 @@ import "unsafe"
// Peterson/Dekker algorithms for mutual exclusion). Rather than require memory
// barriers, which will slow down both the mutator and the GC, we always grey
// the ptr object regardless of the slot's color.
//go:nowritebarrier
//
//
// Stack writes:
//
// The compiler omits write barriers for writes to the current frame,
// but if a stack pointer has been passed down the call stack, the
// compiler will generate a write barrier for writes through that
// pointer (because it doesn't know it's not a heap pointer).
//
// One might be tempted to ignore the write barrier if slot points
// into the stack. Don't do it! Mark termination only re-scans
// frames that have potentially been active since the concurrent scan,
// so it depends on write barriers to track changes to pointers in
// stack frames that have not been active.
//go:nowritebarrier
func gcmarkwb_m(slot *uintptr, ptr uintptr) {
switch gcphase {
default: