Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: SIGPROF interrupt causes infinite loop #55243

Closed
lmb opened this issue Sep 20, 2022 · 5 comments
Closed

runtime/pprof: SIGPROF interrupt causes infinite loop #55243

lmb opened this issue Sep 20, 2022 · 5 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@lmb
Copy link
Contributor

lmb commented Sep 20, 2022

What version of Go are you using (go version)?

$ go version
go version go1.19.1 linux/arm64

Does this issue reproduce with the latest release?

Yes

What did you do?

I work on a library that ends up calling the bpf(2) syscall via unix.Syscall. bpf(2) has a bunch of subcommands, one of which allows loading BPF bytecode into the kernel.
The kernel does analysis on the bytecode, which can be "slow" going from milliseconds to seconds depending on kernel config. Since Linux 4.20 the kernel interrupts the verification process if the calling process has signals pending.

This leads to an unfortunate interaction with the runtime CPU profiler, since it uses SIGPROF to trigger sampling. If the interval between SIGPROF is smaller than the time it takes to fully analyze a BPF program we enter an infinite loop:

https://github.com/cilium/ebpf/blob/713c8dc84f60f2b599caabd6dfb2fb33b7878bc1/internal/sys/syscall.go#L19-L37

One of our users encountered this in the wild and blogged about it: https://dxuuu.xyz/bpf-go-pprof.html You can find a small reproducer for x86 / arm64 Linux here: https://github.com/lmb/sigprof-repro

There was a similar issue (I think?) in the runtime related to fork: #5517 Other profilers / runtimes also experience similar problems with other syscalls: async-profiler/async-profiler#97 Somewhat related, non-cooperative preemption of goroutines also uses signals, and therefore can trigger the same problem. We've not received bug reports for that though.

We've considered limiting the number of syscall retries on interruption, and returning an error if we retry too often. This has the very unfortunate side effect that enabling CPU profiling can make your application fail, where it otherwise wouldn't have. As a library that is really undesirable, since we have no idea who can trigger profiling. The Go runtime itself also retries an infinite number of times for some syscalls that return EINTR so there is some precedent for infinite retry.

This leaves us with trying to avoid receiving SIGPROF (and other?) signals on the thread calling bpf(2) in the first place, kind of similar to syscall.runtime_BeforeFork. We're not sure how best to go about this however, since there is no suitable API from the runtime side to block SIGPROF only for the syscall thread. We could call runtime.LockOSThread and invoke sigprocmask directly, but that has a good chance of breaking in a subtle way due to runtime / CGO / libc interactions.

My questions are therefore:

  1. Can we work around this using the proposed LockOSThread + sigprocmask without breaking the runtime?
  2. Does blocking SIGPROF screw up collected profiles significantly?
  3. Do we need to worry about signals from non-cooperative preemption as well?
  4. Could there be a blessed mechanism for "syscall that should be signalled as little as possible" in the runtime somewhere?
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 20, 2022
@ianlancetaylor
Copy link
Contributor

I would certainly recommend runtime.LockOSThread and pthread_sigmask. That sounds like exactly the right approach. The thread should just block all signals for the duration of the BPF system call.

I don't see a strong need for a runtime package here. It seems like a very special purpose requirement.

@lmb
Copy link
Contributor Author

lmb commented Sep 20, 2022

That would be great news! So no concerns around profile accuracy or interactions with non-cooperative preemption?

We don't want to force usage of CGo, can we call sigprocmask directly? Or do we have use pthread_sigmask iff CGo is enabled?

@ianlancetaylor
Copy link
Contributor

I suppose the profile could be slightly inaccurate, but so it goes. Unless your program is dominated by BPF installs I don't see how it matters.

Blocking all signals will also block the preemption signal. That won't break anything, although it does mean that if a slow BPF install occurs at just the wrong time other threads may pause for the duration waiting for the garbage collector to wait for the BPF thread.

You should be able to call the rt_sigprocmask system call directly, rather than calling the C function pthread_sigmask, although your code would be Linux-specific.

@cherrymui
Copy link
Member

Sounds like there is a workaround. Is there anything still needed to be addressed in the runtime? Thanks.

@cherrymui cherrymui added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 21, 2022
@lmb
Copy link
Contributor Author

lmb commented Sep 22, 2022

No, we'll use Ian's feedback to implement a fix. Thanks very much!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

4 participants