don't swallow panics from spawned threads #1763

jyn514 · 2024-07-08T15:02:24Z

pgrx has somewhat complex panic handling. it looks something like this:

when a thread panics, the panic hook captures a backtrace and saves it in a thread-local for later.
the thread unwinds until it hits an FFI boundary (usually run_guarded). that downcasts the panic, takes the backtrace out of the thread-local, and hooks into postgres' longjmp mechanism
i forget what happens after this, i think it resumes unwinding once it's past the FFI barrier

there is a slight problem here: we are using a thread-local to store the backtrace. if the panic does not happen on the main thread (for example, because a spawned thread tries to call into postgres and hits the check in check_active_thread), the backtrace will be lost. worse, if the main thread then unwinds in response to the panic, pgrx will use its backtrace instead of that of the worker thread.

there are two main approaches we considered to fixing this:

fix the backtrace not to use a thread-local, so we can attach panics in spawned threads to a pgrx connection the way we would for the main thread.
stop handling panics in spawned threads altogether (and use the default hook).

the downside of approach 1 is that there may not be a pgrx connection to attach to. the connection may have already closed, or the active connection may not be related to the thread that panicked, or we may be shutting down and will never check for the panic. in those cases the panic information will be missing or wrong.

the downside of approach 2 is that it does not integrate with postgres' error handling mechanism, and in particular is not reported to psql. however, it does allow for developers using pgrx to handle the panic themselves, for example by handling the result from JoinHandle::join, in which case it will be reported to psql.

this takes approach 2. we may want to reconsider this in the future, or perhaps add a helper library so that it's easy for applications to pass the panic into the main thread.

note that the default panic handler in the standard library behaves quite poorly when multiple threads panic at once (it's sound, but the output is very hard to read). this being fixed in a separate PR upstream; see rust-lang/rust#127397.

i tested locally that this correctly shows the backtrace from the right thread now:

thread 'Builder #1' panicked at /home/jyn/work/pgrx/pgrx/src/fcinfo.rs:338:18:
/home/jyn/work/pgrx/pgrx/src/fcinfo.rs:338:18:  postgres FFI may not be called from multiple threads.
stack backtrace:
   0: rust_begin_unwind
             at /rustc/8c127df75fde3d5ad8ef9af664962a7676288b52/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/8c127df75fde3d5ad8ef9af664962a7676288b52/library/core/src/panicking.rs:72:14
   2: pgrx_pg_sys::submodules::thread_check::thread_id_check_failed
             at /home/jyn/work/pgrx/pgrx-pg-sys/src/submodules/thread_check.rs:118:5
   3: pgrx_pg_sys::submodules::thread_check::check_active_thread
             at /home/jyn/work/pgrx/pgrx-pg-sys/src/submodules/thread_check.rs:37:17
   4: pgrx_pg_sys::submodules::ffi::pg_guard_ffi_boundary_impl
             at /home/jyn/work/pgrx/pgrx-pg-sys/src/submodules/ffi.rs:104:5
   5: pgrx_pg_sys::submodules::ffi::pg_guard_ffi_boundary
             at /home/jyn/work/pgrx/pgrx-pg-sys/src/submodules/ffi.rs:94:14
   6: pgrx_pg_sys::include::pg16::palloc0
             at /home/jyn/.local/lib/cargo/target/debug/build/pgrx-pg-sys-43293caed29be4fd/out/pg16.rs:30298:1
   7: pgrx::fcinfo::direct_function_call_as_datum_internal
             at /home/jyn/work/pgrx/pgrx/src/fcinfo.rs:338:18
   8: pgrx::fcinfo::direct_function_call_as_datum
             at /home/jyn/work/pgrx/pgrx/src/fcinfo.rs:330:5
   9: pgrx::fcinfo::direct_function_call
             at /home/jyn/work/pgrx/pgrx/src/fcinfo.rs:283:5
[redacted]

but i am not sure how to write an automated test. i don't mind adding one if there's already a similar test but i would prefer not to write a whole test suite from scratch.

pgrx has somewhat complex panic handling. it looks something like this: 1. when a thread panics, the panic hook captures a backtrace and saves it in a thread-local for later. 2. the thread unwinds until it hits an FFI boundary (usually `run_guarded`). that downcasts the panic, takes the backtrace out of the thread-local, and hooks into postgres' `longjmp` mechanism 3. i forget what happens after this, i think it resumes unwinding once it's past the FFI barrier there is a slight problem here: we are using a thread-local to store the backtrace. if the panic does not happen on the main thread (for example, because a spawned thread tries to call into postgres and hits the check in `check_active_thread`), the backtrace will be lost. worse, if the main thread then unwinds in response to the panic, pgrx will use *its* backtrace instead of that of the worker thread. there are two main approaches we considered to fixing this: 1. fix the backtrace not to use a thread-local, so we can attach panics in spawned threads to a pgrx connection the way we would for the main thread. 2. stop handling panics in spawned threads altogether (and use the default hook). the downside of approach 1 is that there may not *be* a pgrx connection to attach to. the connection may have already closed, or the active connection may not be related to the thread that panicked, or we may be shutting down and will never check for the panic. in those cases the panic information will be missing or wrong. the downside of approach 2 is that it does not integrate with postgres' error handling mechanism, and in particular is not reported to psql. however, it does allow for developers using pgrx to handle the panic themselves, for example by handling the result from `JoinHandle::join`, in which case it *will* be reported to psql. this takes approach 2. we may want to reconsider this in the future, or perhaps add a helper library so that it's easy for applications to pass the panic into the main thread. --- note that the default panic handler in the standard library behaves quite poorly when multiple threads panic at once (it's sound, but the output is very hard to read). this being fixed in a separate PR upstream; see rust-lang/rust 127397.

workingjubilee · 2024-07-08T18:36:34Z

~~Please open an issue describing the necessary test and note that it is blocked on adopting ui_test or equivalent.~~ Scratch that, managed to reabsorb enough context to be sure of what the requirements for the test are.

workingjubilee

Thanks!

jyn514 and others added 2 commits July 3, 2024 13:31

fix typo in thread_check

d49940f

workingjubilee merged commit fd4e0e4 into pgcentralfoundation:develop Jul 8, 2024
12 checks passed

workingjubilee reviewed Jul 8, 2024

View reviewed changes

jyn514 mentioned this pull request Jul 9, 2024

Multithreaded panic test #1764

Open

workingjubilee mentioned this pull request Jul 9, 2024

0.12.0-beta prerelease bug tracking #1769

Open

jyn514 deleted the spawned-thread-panics branch July 15, 2024 22:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

don't swallow panics from spawned threads #1763

don't swallow panics from spawned threads #1763

jyn514 commented Jul 8, 2024

workingjubilee commented Jul 8, 2024 •

edited

Loading

workingjubilee left a comment

don't swallow panics from spawned threads #1763

don't swallow panics from spawned threads #1763

Conversation

jyn514 commented Jul 8, 2024

workingjubilee commented Jul 8, 2024 • edited Loading

workingjubilee left a comment

Choose a reason for hiding this comment

workingjubilee commented Jul 8, 2024 •

edited

Loading