-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime/pprof: SIGPROF interrupt causes infinite loop #55243
Comments
I would certainly recommend I don't see a strong need for a runtime package here. It seems like a very special purpose requirement. |
That would be great news! So no concerns around profile accuracy or interactions with non-cooperative preemption? We don't want to force usage of CGo, can we call sigprocmask directly? Or do we have use |
I suppose the profile could be slightly inaccurate, but so it goes. Unless your program is dominated by BPF installs I don't see how it matters. Blocking all signals will also block the preemption signal. That won't break anything, although it does mean that if a slow BPF install occurs at just the wrong time other threads may pause for the duration waiting for the garbage collector to wait for the BPF thread. You should be able to call the |
Sounds like there is a workaround. Is there anything still needed to be addressed in the runtime? Thanks. |
No, we'll use Ian's feedback to implement a fix. Thanks very much! |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What did you do?
I work on a library that ends up calling the bpf(2) syscall via
unix.Syscall
. bpf(2) has a bunch of subcommands, one of which allows loading BPF bytecode into the kernel.The kernel does analysis on the bytecode, which can be "slow" going from milliseconds to seconds depending on kernel config. Since Linux 4.20 the kernel interrupts the verification process if the calling process has signals pending.
This leads to an unfortunate interaction with the runtime CPU profiler, since it uses SIGPROF to trigger sampling. If the interval between SIGPROF is smaller than the time it takes to fully analyze a BPF program we enter an infinite loop:
https://github.com/cilium/ebpf/blob/713c8dc84f60f2b599caabd6dfb2fb33b7878bc1/internal/sys/syscall.go#L19-L37
One of our users encountered this in the wild and blogged about it: https://dxuuu.xyz/bpf-go-pprof.html You can find a small reproducer for x86 / arm64 Linux here: https://github.com/lmb/sigprof-repro
There was a similar issue (I think?) in the runtime related to fork: #5517 Other profilers / runtimes also experience similar problems with other syscalls: async-profiler/async-profiler#97 Somewhat related, non-cooperative preemption of goroutines also uses signals, and therefore can trigger the same problem. We've not received bug reports for that though.
We've considered limiting the number of syscall retries on interruption, and returning an error if we retry too often. This has the very unfortunate side effect that enabling CPU profiling can make your application fail, where it otherwise wouldn't have. As a library that is really undesirable, since we have no idea who can trigger profiling. The Go runtime itself also retries an infinite number of times for some syscalls that return EINTR so there is some precedent for infinite retry.
This leaves us with trying to avoid receiving SIGPROF (and other?) signals on the thread calling bpf(2) in the first place, kind of similar to
syscall.runtime_BeforeFork
. We're not sure how best to go about this however, since there is no suitable API from the runtime side to block SIGPROF only for the syscall thread. We could call runtime.LockOSThread and invoke sigprocmask directly, but that has a good chance of breaking in a subtle way due to runtime / CGO / libc interactions.My questions are therefore:
The text was updated successfully, but these errors were encountered: