runtime/cgo: store M for C-created thread in pthread key
This reapplies CL 485500, with a fix drafted in CL 492987 incorporated.

CL 485500 was reverted due to #60004 and #60007. #60004 is fixed in
CL 492743. #60007 is fixed in CL 492987 (incorporated in this CL).

[Original CL 485500 description]

This reapplies CL 481061, with the followup fixes in CL 482975, CL 485315, and
CL 485316 incorporated.

CL 481061, by doujiang24 <doujiang24@gmail.com>, speeds up C to Go
calls by binding the M to the C thread. See below for its
description.
CL 482975 is a followup fix to a C declaration in testprogcgo.
CL 485315 is a followup fix for x_cgo_getstackbound on Illumos.
CL 485316 is a followup cleanup for ppc64 assembly.

CL 479915 passed the G to _cgo_getstackbound for direct updates to
gp.stack.lo. A G can be reused on a new thread after the previous thread
exited. This could trigger the C TSAN race detector because it couldn't
see the synchronization in Go (lockextra) preventing the same G from
being used on multiple threads at the same time.

We work around this by passing the address of a stack variable to
_cgo_getstackbound rather than the G. The stack is generally unique per
thread, so TSAN won't see the same address from multiple threads. Even
if stacks are reused across threads by pthread, C TSAN should see the
synchronization in the stack allocator.

A regression test is added to misc/cgo/testsanitizer.

[Original CL 481061 description]

This reapplies CL 392854, with the followup fixes in CL 479255,
CL 479915, and CL 481057 incorporated.

CL 392854, by doujiang24 <doujiang24@gmail.com>, speeds up C to Go
calls by binding the M to the C thread. See below for its
description.
CL 479255 is a followup fix for a small bug in ARM assembly code.
CL 479915 is another followup fix to address C to Go calls after
the C code uses some stack, but that CL is also buggy.
CL 481057, by Michael Knyszek, is a followup fix for a memory leak
bug in CL 479915.

[Original CL 392854 description]

When a C thread invokes a Go function, it must first acquire an extra M by calling needm. However, needm and dropm are expensive because of their signal-related syscalls.
So we stop calling dropm when returning to C. The extra M stays bound to the C thread until that thread exits, which avoids a needm and dropm pair on every C to Go call.
Instead, we call dropm only when the C thread exits, so the extra M does not leak.

When invoking a Go function from C:
Allocate a pthread key using pthread_key_create, only once per shared object, and register a thread-exit-time destructor.
Then store the g0 of the current M as the thread-specific value of the pthread key, only once per C thread, so that the destructor puts the extra M back onto the extra M list when the C thread exits.

When returning to C:
Skip dropm in cgocallback when the pthread key has been created, so that the extra M is reused the next time a Go function is invoked from C.
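
The key-plus-destructor pattern described above can be sketched in plain C. This is only an illustration of the pthread mechanism, not the runtime's code: extra_m, enter_go, release_m, and run_c_thread are hypothetical names, and the real runtime stores a g0 pointer and calls dropm from its destructor.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the runtime's extra M; not the real type. */
typedef struct {
	int bound;    /* 1 while bound to a C thread */
	int released; /* number of times the destructor released it */
} extra_m;

static pthread_key_t m_key;
static pthread_once_t m_key_once = PTHREAD_ONCE_INIT;

/* Thread-exit destructor: runs only if the thread's key value is non-NULL.
 * This is where the sketch "puts the M back". */
static void release_m(void *p)
{
	extra_m *m = p;
	m->bound = 0;
	m->released++;
}

/* Create the key once per process (error handling omitted for brevity). */
static void make_key(void)
{
	pthread_key_create(&m_key, release_m);
}

/* Called on every C-to-"Go" entry. Binds m on the thread's first call
 * and returns 1; later calls take the fast path and return 0. */
static int enter_go(extra_m *m)
{
	pthread_once(&m_key_once, make_key);
	if (pthread_getspecific(m_key) != NULL)
		return 0;
	m->bound = 1;
	pthread_setspecific(m_key, m);
	return 1;
}

/* Thread body: enter "Go" 1000 times and count how often we had to bind. */
static void *worker(void *arg)
{
	intptr_t binds = 0;
	for (int i = 0; i < 1000; i++)
		binds += enter_go(arg);
	return (void *)binds;
}

/* Run worker on a fresh C thread; returns the bind count, or -1 on error. */
static int run_c_thread(extra_m *m)
{
	pthread_t t;
	void *res;

	if (pthread_create(&t, NULL, worker, m) != 0)
		return -1;
	if (pthread_join(t, &res) != 0)
		return -1;
	return (int)(intptr_t)res;
}
```

Calling run_c_thread with a zeroed extra_m binds once for the 1000 entries, and the destructor releases the M exactly once when the thread exits, which is the cost profile the optimization is after.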

This is purely a performance optimization. The old version, in which needm & dropm happen on each cgo call, is still correct too, and we have to keep the old version on systems with cgo but without pthreads, like Windows.

The speedup is significant. The exact figure depends on the OS and CPU, but a simple Go function call from a C thread is roughly 10x faster in general.

Some benchmark results for the newly added BenchmarkCGoInCThread:
1. 28x faster, from 3395 ns/op to 121 ns/op, on darwin with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2. 6.5x faster, from 1495 ns/op to 230 ns/op, on Linux with an Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz

[CL 479915 description]

Currently, when C calls into Go the first time, we grab an M
using needm, which sets m.g0's stack bounds using the SP. We don't
know how big the stack is, so we simply assume 32K. Previously,
when the Go function returns to C, we drop the M, and the next
time C calls into Go, we put a new stack bound on the g0 based on
the current SP. After CL 392854, we don't drop the M, and the next
time C calls into Go, we reuse the same g0, without recomputing
the stack bounds. If the C code uses quite a bit of stack space
before calling into Go, the SP may be well below the 32K stack
bound we assumed, so the runtime thinks the g0 stack overflows.

This CL makes needm get a more accurate stack bound from
pthread. (On some platforms this may still be a guess, as we don't
know exactly where we are in the C stack, but it is probably
better than simply assuming 32K.)

[CL 492987 description]

On the first call into Go from a C thread, currently we set the g0
stack's high bound imprecisely based on the SP. With CL 485500, we
keep the M and don't recompute the stack bounds when it calls into
Go again. If the first call is made when the C thread uses some
deep stack, but a subsequent call is made with a shallower stack,
the SP may be above g0.stack.hi.
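
The two failure modes (the lower-bound overflow from CL 479915 and the upper-bound miss described here) can be made concrete with a small simulation. This is a sketch under stated assumptions: the 32K figure comes from the description above, the 1024-byte slack above the SP is an illustrative assumption, and stack_bounds, guess_bounds, and on_stack are hypothetical names rather than runtime APIs. The stack is assumed to grow downward.

```c
#include <stdint.h>

/* Simulated g0 stack bounds: valid SPs satisfy lo <= sp < hi. */
typedef struct {
	uintptr_t lo;
	uintptr_t hi;
} stack_bounds;

/* Guess bounds from the SP seen on the first call into Go.
 * 32K below the SP is the assumption named in the text; the
 * 1024 bytes above the SP is an assumption made for this sketch. */
static stack_bounds guess_bounds(uintptr_t sp)
{
	stack_bounds b;
	b.hi = sp + 1024;
	b.lo = sp - 32 * 1024;
	return b;
}

/* Reports whether sp lies within the recorded bounds. */
static int on_stack(stack_bounds b, uintptr_t sp)
{
	return b.lo <= sp && sp < b.hi;
}
```

With bounds guessed at a shallow first call, a later call made after about 40000 bytes of C stack use (the int x[10000] frame in main9.c) falls below lo. With bounds guessed at a deep first call, a later shallow call lands above hi. Neither happened when the bounds were recomputed on every call.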

This is usually okay, as we usually don't check stack.hi. One place
where we do check stack.hi is in the signal handler, in
adjustSignalStack. In particular, C TSAN delivers signals on the
g0 stack (instead of the usual signal stack). If the SP is above
g0.stack.hi, we don't see that it is on the g0 stack, and throw.

This CL makes it get an accurate stack upper bound with the
pthread API (on the platforms where it is available).
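
As a rough illustration of asking pthread for the real bounds, the sketch below uses pthread_getattr_np, which is a glibc/Linux extension (other platforms offer different calls, and some offer none). get_stack_bounds is a hypothetical helper written for this example; it is not the runtime's x_cgo_getstackbound.

```c
#define _GNU_SOURCE /* needed for pthread_getattr_np on glibc */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Stores the calling thread's lowest stack address in bounds[0] and
 * one past its highest stack address in bounds[1].
 * Returns 0 on success or a pthread error number. */
static int get_stack_bounds(uintptr_t bounds[2])
{
	pthread_attr_t attr;
	void *addr;
	size_t size;
	int err;

	err = pthread_getattr_np(pthread_self(), &attr);
	if (err != 0)
		return err;
	/* addr is the lowest address of the stack region; size is its length. */
	err = pthread_attr_getstack(&attr, &addr, &size);
	pthread_attr_destroy(&attr);
	if (err != 0)
		return err;

	bounds[0] = (uintptr_t)addr;
	bounds[1] = (uintptr_t)addr + size;
	return 0;
}
```

Because these bounds describe the whole thread stack rather than a window around one observed SP, they stay valid no matter how deep the C code was when it first called into Go.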

Also add some debug prints for the "handler not on signal stack"
throw.

Fixes #51676.
Fixes #59294.
Fixes #59678.
Fixes #60007.

Change-Id: Ie51c8e81ade34ec81d69fd7bce1fe0039a470776
Reviewed-on: https://go-review.googlesource.com/c/go/+/495855
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
cherrymui committed May 17, 2023
1 parent 2693ade commit c426c87
Showing 46 changed files with 1,018 additions and 69 deletions.
7 changes: 4 additions & 3 deletions src/cmd/cgo/internal/test/cgo_test.go
@@ -106,6 +106,7 @@ func TestThreadLock(t *testing.T) { testThreadLockFunc(t) }
func TestUnsignedInt(t *testing.T) { testUnsignedInt(t) }
func TestZeroArgCallback(t *testing.T) { testZeroArgCallback(t) }

func BenchmarkCgoCall(b *testing.B) { benchCgoCall(b) }
func BenchmarkGoString(b *testing.B) { benchGoString(b) }
func BenchmarkCGoCallback(b *testing.B) { benchCallback(b) }
func BenchmarkCGoInCThread(b *testing.B) { benchCGoInCthread(b) }
24 changes: 24 additions & 0 deletions src/cmd/cgo/internal/test/cthread_unix.c
@@ -32,3 +32,27 @@ doAdd(int max, int nthread)
	for(i=0; i<nthread; i++)
		pthread_join(thread_id[i], 0);
}

static void*
goDummyCallbackThread(void* p)
{
	int i, max;

	max = *(int*)p;
	for(i=0; i<max; i++)
		goDummy();
	return NULL;
}

int
callGoInCThread(int max)
{
	pthread_t thread;

	if (pthread_create(&thread, NULL, goDummyCallbackThread, (void*)(&max)) != 0)
		return -1;
	if (pthread_join(thread, NULL) != 0)
		return -1;

	return max;
}
22 changes: 22 additions & 0 deletions src/cmd/cgo/internal/test/cthread_windows.c
@@ -35,3 +35,25 @@ doAdd(int max, int nthread)
		CloseHandle((HANDLE)thread_id[i]);
	}
}

__stdcall
static unsigned int
goDummyCallbackThread(void* p)
{
	int i, max;

	max = *(int*)p;
	for(i=0; i<max; i++)
		goDummy();
	return 0;
}

int
callGoInCThread(int max)
{
	uintptr_t thread_id;
	thread_id = _beginthreadex(0, 0, goDummyCallbackThread, &max, 0, 0);
	WaitForSingleObject((HANDLE)thread_id, INFINITE);
	CloseHandle((HANDLE)thread_id);
	return max;
}
14 changes: 14 additions & 0 deletions src/cmd/cgo/internal/test/testx.go
@@ -24,6 +24,7 @@ import (
/*
// threads
extern void doAdd(int, int);
extern int callGoInCThread(int);
// issue 1328
void IntoC(void);
@@ -146,6 +147,10 @@ func Add(x int) {
	*p = 2
}

//export goDummy
func goDummy() {
}

func testCthread(t *testing.T) {
	if (runtime.GOOS == "darwin" || runtime.GOOS == "ios") && runtime.GOARCH == "arm64" {
		t.Skip("the iOS exec wrapper is unable to properly handle the panic from Add")
@@ -159,6 +164,15 @@ func testCthread(t *testing.T) {
	}
}

// Benchmark measuring overhead from C to Go in a C thread.
// Create a new C thread and invoke Go function repeatedly in the new C thread.
func benchCGoInCthread(b *testing.B) {
	n := C.callGoInCThread(C.int(b.N))
	if int(n) != b.N {
		b.Fatal("unmatch loop times")
	}
}

// issue 1328

//export BackIntoGo
54 changes: 54 additions & 0 deletions src/cmd/cgo/internal/testcarchive/carchive_test.go
@@ -1265,3 +1265,57 @@ func TestPreemption(t *testing.T) {
		t.Error(err)
	}
}

// Issue 59294. Test calling Go function from C after using some
// stack space.
func TestDeepStack(t *testing.T) {
	t.Parallel()

	if !testWork {
		defer func() {
			os.Remove("testp9" + exeSuffix)
			os.Remove("libgo9.a")
			os.Remove("libgo9.h")
		}()
	}

	cmd := exec.Command("go", "build", "-buildmode=c-archive", "-o", "libgo9.a", "./libgo9")
	out, err := cmd.CombinedOutput()
	t.Logf("%v\n%s", cmd.Args, out)
	if err != nil {
		t.Fatal(err)
	}
	checkLineComments(t, "libgo9.h")
	checkArchive(t, "libgo9.a")

	// build with -O0 so the C compiler won't optimize out the large stack frame
	ccArgs := append(cc, "-O0", "-o", "testp9"+exeSuffix, "main9.c", "libgo9.a")
	out, err = exec.Command(ccArgs[0], ccArgs[1:]...).CombinedOutput()
	t.Logf("%v\n%s", ccArgs, out)
	if err != nil {
		t.Fatal(err)
	}

	argv := cmdToRun("./testp9")
	cmd = exec.Command(argv[0], argv[1:]...)
	sb := new(strings.Builder)
	cmd.Stdout = sb
	cmd.Stderr = sb
	if err := cmd.Start(); err != nil {
		t.Fatal(err)
	}

	timer := time.AfterFunc(time.Minute,
		func() {
			t.Error("test program timed out")
			cmd.Process.Kill()
		},
	)
	defer timer.Stop()

	err = cmd.Wait()
	t.Logf("%v\n%s", cmd.Args, sb)
	if err != nil {
		t.Error(err)
	}
}
14 changes: 14 additions & 0 deletions src/cmd/cgo/internal/testcarchive/testdata/libgo9/a.go
@@ -0,0 +1,14 @@
// Copyright 2023 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "runtime"

import "C"

func main() {}

//export GoF
func GoF() { runtime.GC() }
24 changes: 24 additions & 0 deletions src/cmd/cgo/internal/testcarchive/testdata/main9.c
@@ -0,0 +1,24 @@
// Copyright 2023 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

#include "libgo9.h"

void use(int *x) { (*x)++; }

void callGoFWithDeepStack() {
	int x[10000];

	use(&x[0]);
	use(&x[9999]);

	GoF();

	use(&x[0]);
	use(&x[9999]);
}

int main() {
	GoF(); // call GoF without using much stack
	callGoFWithDeepStack(); // call GoF with a deep stack
}
53 changes: 53 additions & 0 deletions src/cmd/cgo/internal/testsanitizers/testdata/tsan14.go
@@ -0,0 +1,53 @@
// Copyright 2023 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

// This program failed when run under the C/C++ ThreadSanitizer.
//
// cgocallback on a new thread calls into runtime.needm -> _cgo_getstackbound
// to update gp.stack.lo with the stack bounds. If the G itself is passed to
// _cgo_getstackbound, then writes to the same G can be seen on multiple
// threads (when the G is reused after thread exit). This would trigger TSAN.

/*
#include <pthread.h>
void go_callback();
static void *thr(void *arg) {
	go_callback();
	return 0;
}
static void foo() {
	pthread_t th;
	pthread_attr_t attr;
	pthread_attr_init(&attr);
	pthread_attr_setstacksize(&attr, 256 << 10);
	pthread_create(&th, &attr, thr, 0);
	pthread_join(th, 0);
}
*/
import "C"

import (
	"time"
)

//export go_callback
func go_callback() {
}

func main() {
	for i := 0; i < 2; i++ {
		go func() {
			for {
				C.foo()
			}
		}()
	}

	time.Sleep(1000*time.Millisecond)
}
1 change: 1 addition & 0 deletions src/cmd/cgo/internal/testsanitizers/tsan_test.go
@@ -49,6 +49,7 @@ func TestTSAN(t *testing.T) {
		{src: "tsan11.go", needsRuntime: true},
		{src: "tsan12.go", needsRuntime: true},
		{src: "tsan13.go", needsRuntime: true},
		{src: "tsan14.go", needsRuntime: true},
	}
	for _, tc := range cases {
		tc := tc
41 changes: 35 additions & 6 deletions src/runtime/asm_386.s
@@ -689,7 +689,20 @@ nosave:
TEXT ·cgocallback(SB),NOSPLIT,$12-12 // Frame size must match commented places below
	NO_LOCAL_POINTERS

	// If g is nil, Go did not create the current thread.
	// Skip cgocallbackg, just dropm when fn is nil, and frame is the saved g.
	// It is used to dropm while thread is exiting.
	MOVL	fn+0(FP), AX
	CMPL	AX, $0
	JNE	loadg
	// Restore the g from frame.
	get_tls(CX)
	MOVL	frame+4(FP), BX
	MOVL	BX, g(CX)
	JMP	dropm

loadg:
	// If g is nil, Go did not create the current thread,
	// or if this thread never called into Go on pthread platforms.
	// Call needm to obtain one for temporary use.
	// In this case, we're running on the thread stack, so there's
	// lots of space, but the linker doesn't know. Hide the call from
@@ -707,9 +720,9 @@ TEXT ·cgocallback(SB),NOSPLIT,$12-12 // Frame size must match commented places
	MOVL	BP, savedm-4(SP) // saved copy of oldm
	JMP	havem
needm:
	MOVL	$runtime·needm(SB), AX
	MOVL	$runtime·needAndBindM(SB), AX
	CALL	AX
	MOVL	$0, savedm-4(SP) // dropm on return
	MOVL	$0, savedm-4(SP)
	get_tls(CX)
	MOVL	g(CX), BP
	MOVL	g_m(BP), BP
@@ -784,13 +797,29 @@ havem:
	MOVL	0(SP), AX
	MOVL	AX, (g_sched+gobuf_sp)(SI)

	// If the m on entry was nil, we called needm above to borrow an m
	// for the duration of the call. Since the call is over, return it with dropm.
	// If the m on entry was nil, we called needm above to borrow an m,
	// 1. for the duration of the call on non-pthread platforms,
	// 2. or the duration of the C thread alive on pthread platforms.
	// If the m on entry wasn't nil,
	// 1. the thread might be a Go thread,
	// 2. or it wasn't the first call from a C thread on pthread platforms,
	//    since we skip dropm to reuse the m in the first call.
	MOVL	savedm-4(SP), DX
	CMPL	DX, $0
	JNE	3(PC)
	JNE	droppedm

	// Skip dropm to reuse it in the next call, when a pthread key has been created.
	MOVL	_cgo_pthread_key_created(SB), DX
	// It means cgo is disabled when _cgo_pthread_key_created is a nil pointer, need dropm.
	CMPL	DX, $0
	JEQ	dropm
	CMPL	(DX), $0
	JNE	droppedm

dropm:
	MOVL	$runtime·dropm(SB), AX
	CALL	AX
droppedm:

	// Done!
	RET
38 changes: 33 additions & 5 deletions src/runtime/asm_amd64.s
@@ -918,7 +918,20 @@ GLOBL zeroTLS<>(SB),RODATA,$const_tlsSize
TEXT ·cgocallback(SB),NOSPLIT,$24-24
	NO_LOCAL_POINTERS

	// If g is nil, Go did not create the current thread.
	// Skip cgocallbackg, just dropm when fn is nil, and frame is the saved g.
	// It is used to dropm while thread is exiting.
	MOVQ	fn+0(FP), AX
	CMPQ	AX, $0
	JNE	loadg
	// Restore the g from frame.
	get_tls(CX)
	MOVQ	frame+8(FP), BX
	MOVQ	BX, g(CX)
	JMP	dropm

loadg:
	// If g is nil, Go did not create the current thread,
	// or if this thread never called into Go on pthread platforms.
	// Call needm to obtain one m for temporary use.
	// In this case, we're running on the thread stack, so there's
	// lots of space, but the linker doesn't know. Hide the call from
@@ -956,9 +969,9 @@ needm:
	// a bad value in there, in case needm tries to use it.
	XORPS	X15, X15
	XORQ	R14, R14
	MOVQ	$runtime·needm<ABIInternal>(SB), AX
	MOVQ	$runtime·needAndBindM<ABIInternal>(SB), AX
	CALL	AX
	MOVQ	$0, savedm-8(SP) // dropm on return
	MOVQ	$0, savedm-8(SP)
	get_tls(CX)
	MOVQ	g(CX), BX
	MOVQ	g_m(BX), BX
@@ -1047,11 +1060,26 @@ havem:
	MOVQ	0(SP), AX
	MOVQ	AX, (g_sched+gobuf_sp)(SI)

	// If the m on entry was nil, we called needm above to borrow an m
	// for the duration of the call. Since the call is over, return it with dropm.
	// If the m on entry was nil, we called needm above to borrow an m,
	// 1. for the duration of the call on non-pthread platforms,
	// 2. or the duration of the C thread alive on pthread platforms.
	// If the m on entry wasn't nil,
	// 1. the thread might be a Go thread,
	// 2. or it wasn't the first call from a C thread on pthread platforms,
	//    since we skip dropm to reuse the m in the first call.
	MOVQ	savedm-8(SP), BX
	CMPQ	BX, $0
	JNE	done

	// Skip dropm to reuse it in the next call, when a pthread key has been created.
	MOVQ	_cgo_pthread_key_created(SB), AX
	// It means cgo is disabled when _cgo_pthread_key_created is a nil pointer, need dropm.
	CMPQ	AX, $0
	JEQ	dropm
	CMPQ	(AX), $0
	JNE	done

dropm:
	MOVQ	$runtime·dropm(SB), AX
	CALL	AX
#ifdef GOOS_windows