
Seg fault with Postgres #137

Closed

eloff opened this issue Dec 4, 2022 · 17 comments

Comments

eloff commented Dec 4, 2022

I posted this to diesel, but the issue appears to be with r2d2. All the core dumps are for r2d2 workers.

diesel-rs/diesel#3441

Repro (see README for screenshot of one stacktrace): https://github.com/eloff/diesel-segfault

Versions

  • Rust:
    rustc 1.65.0 (897e37553 2022-11-02)
  • Diesel:
    2.0.2
  • Database:
    (PostgreSQL) 15.1 (Ubuntu 15.1-1.pgdg20.04+1)
  • Operating System:
    Ubuntu 20.04 5.15.0-56-generic

I can't reproduce this issue in the docker-compose setup I provided, only on my Ubuntu 20.04 dev machine. Any idea why that might be? Anything I can test?

sfackler (Owner) commented Dec 4, 2022

There is zero unsafe code in this library. It is extremely unlikely the crash is due to a bug in r2d2.

The screenshot of the stacktrace in that repo is of a thread blocked in a futex_wait call, which is not a location that would trigger a segfault. You should find the faulting thread.

eloff (Author) commented Dec 4, 2022

With no unsafe code, it does seem unlikely. It must be related to the interplay between the libpq-based driver in diesel and r2d2. All the core dumps are for r2d2 worker threads.

Note that the crash doesn't happen if I comment out either the min_idle or the max_size line:

Pool::builder()
    .min_idle(Some(10))
    .max_size(96)
    .build(manager)
    .expect("Failed to create pool.")
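
For context, a self-contained sketch of the repro shape (modeled on the linked diesel-segfault repo, not copied from it; the connection URL here is a placeholder). With min_idle set, r2d2's background workers keep opening libpq connections after build() returns, so if main() exits right away, process teardown can race those workers:

use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

fn main() {
    // Placeholder URL for illustration only.
    let manager = ConnectionManager::<PgConnection>::new("postgres://localhost/test");

    let _pool = Pool::builder()
        .min_idle(Some(10)) // background workers pre-open 10 idle connections
        .max_size(96)
        .build(manager)
        .expect("Failed to create pool.");

    // main() returns immediately; worker threads may still be inside
    // PQconnectdb (and hence OpenSSL) while the process is exiting.
}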

sfackler (Owner) commented Dec 4, 2022

> All the core dumps are for r2d2 worker threads.

What specifically do you mean by that? Core dumps are normally a dump of process memory, including all threads in the application at the time of the crash.

eloff (Author) commented Dec 4, 2022

I use Python in my day job, so apologies if I'm uneducated here:

$ cat /proc/sys/kernel/core_pattern
core.%e.%p
$ ls core.*
core.r2d2-worker-0.245652  core.r2d2-worker-1.121711  core.r2d2-worker-1.260255
core.r2d2-worker-0.254359  core.r2d2-worker-1.208703  core.r2d2-worker-2.208734
core.r2d2-worker-0.254489  core.r2d2-worker-1.245542
core.r2d2-worker-0.254602  core.r2d2-worker-1.245628

sfackler (Owner) commented Dec 4, 2022

I believe that indicates that the crashing thread was an r2d2 worker thread (the %e in the core_pattern expands to the comm name of the thread that triggered the dump), but not the one that is included in the screenshot.
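
To locate it, a hedged example of inspecting one of the dumps with lldb (core file name from the listing above; the binary name acme and the thread index are assumptions):

$ lldb ./acme -c core.r2d2-worker-0.245652
(lldb) thread list      # the faulting thread shows a stop reason such as SIGSEGV
(lldb) thread select 2  # switch to that thread by index
(lldb) bt               # print its backtrace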

eloff (Author) commented Dec 4, 2022

All the core dumps are for futex_wait, called from run() -> condvar -> parking_lot:

See: https://github.com/eloff/diesel-segfault/blob/main/coredump.png

sfackler (Owner) commented Dec 4, 2022

There are multiple threads in each core dump. Click the drop-down that says Thread-4.

sfackler (Owner) commented Dec 4, 2022

I have already said that thread-4 is not the one that crashed. I'm not sure how much more explicit I can be.

Thread 1 is trying to take a lock inside of OpenSSL while thread 2 is performing cleanup inside an atexit handler set by OpenSSL. This is probably openssl/openssl#6214.

eloff (Author) commented Dec 4, 2022

> I have already said that thread-4 is not the one that crashed. I'm not sure how much more explicit I can be.

Sorry, I thought the little lightning bolt there meant this was the thread that caused the problem. I don't load core dumps often.

You're right, and the stack trace looks similar to the ones in that OpenSSL issue:

* thread #2, name = 'r2d2-worker-0', stop reason = signal SIGSEGV: invalid address (fault address: 0x18)
  * frame #0: 0x00007ffff7f21d46 libpthread.so.0`__GI___pthread_rwlock_wrlock at pthread_rwlock_common.c:604:7
    frame #1: 0x00007ffff7f21d46 libpthread.so.0`__GI___pthread_rwlock_wrlock(rwlock=0x0000000000000000) at pthread_rwlock_wrlock.c:27
    frame #2: 0x00007ffff7867889 libcrypto.so.1.1`CRYPTO_THREAD_write_lock(lock=<unavailable>) at threads_pthread.c:78
    frame #3: 0x00007ffff7839532 libcrypto.so.1.1`RAND_get_rand_method at rand_lib.c:849
    frame #4: 0x00007ffff78399f9 libcrypto.so.1.1`RAND_status at rand_lib.c:958
    frame #5: 0x00007ffff7f7f545 libpq.so.5`___lldb_unnamed_symbol294$$libpq.so.5 + 37
    frame #6: 0x00007ffff7f5ef2e libpq.so.5`___lldb_unnamed_symbol8$$libpq.so.5 + 142
    frame #7: 0x00007ffff7f76be1 libpq.so.5`___lldb_unnamed_symbol139$$libpq.so.5 + 1873
    frame #8: 0x00007ffff7f6362a libpq.so.5`PQconnectPoll + 3578
    frame #9: 0x00007ffff7f64797 libpq.so.5`___lldb_unnamed_symbol35$$libpq.so.5 + 279
    frame #10: 0x00007ffff7f679e8 libpq.so.5`PQconnectdb + 56
    frame #11: 0x00005555555c78ff acme`diesel::pg::connection::raw::RawConnection::establish::hf66f33a54d350dd8(database_url=(data_ptr = "postgres://acme:1021f155ba05dd9cbfa2a955@/acme?sslmode=disable", length = 62)) at raw.rs:23:39
    frame #12: 0x00005555555cfbf1 acme`_$LT$diesel..pg..connection..PgConnection$u20$as$u20$diesel..connection..Connection$GT$::establish::h55fc3bc2dc70ac2f(database_url=(data_ptr = "postgres://acme:1021f155ba05dd9cbfa2a955@/acme?sslmode=disable", length = 62)) at mod.rs:170:9
    frame #13: 0x000055555557a8aa acme`_$LT$diesel..r2d2..ConnectionManager$LT$T$GT$$u20$as$u20$r2d2..ManageConnection$GT$::connect::hb76e3f16e2530650(self=0x00005555556a0888) at r2d2.rs:192:9
    frame #14: 0x0000555555578e5c acme`r2d2::add_connection::inner::_$u7b$$u7b$closure$u7d$$u7d$::he22fd9af265b9649 at lib.rs:241:24
    frame #15: 0x000055555557a608 acme`scheduled_thread_pool::thunk::Thunk$LT$$LP$$RP$$C$R$GT$::new::_$u7b$$u7b$closure$u7d$$u7d$::hde78cba3c8d5d267((null)=<unavailable>) at thunk.rs:20:35
    frame #16: 0x000055555557a64d acme`_$LT$F$u20$as$u20$scheduled_thread_pool..thunk..Invoke$LT$A$C$R$GT$$GT$::invoke::h6651609b8afd4a0d(self=0x0000555555685680, arg=<unavailable>) at thunk.rs:50:9
    frame #17: 0x00005555555de540 acme`scheduled_thread_pool::thunk::Thunk$LT$A$C$R$GT$::invoke::h0d3f0d3691c28468(self=Thunk<(), ()> @ 0x00007ffff6af7ee0, arg=<unavailable>) at thunk.rs:35:9
    frame #18: 0x00005555555e1007 acme`scheduled_thread_pool::Worker::run_job::h204e5ba542e425b0(self=0x00007ffff6af8940, job=Job @ 0x00007ffff6af83b8) at lib.rs:364:33
    frame #19: 0x00005555555e0a82 acme`scheduled_thread_pool::Worker::run::_$u7b$$u7b$closure$u7d$$u7d$::h9f5de07f20041366 at lib.rs:326:61
    frame #20: 0x00005555555dd750 acme`_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h922b39ce782e6da9(self=<unavailable>, _args=<unavailable>) at unwind_safe.rs:271:9
    frame #21: 0x00005555555e6efa acme`std::panicking::try::do_call::h353343075a44a7a2(data="\b\x88\xaf���) at panicking.rs:492:40
    frame #22: 0x00005555555e7b5b acme`__rust_try + 27
    frame #23: 0x00005555555e6ce6 acme`std::panicking::try::hbd1af4a4897eae11(f=<unavailable>) at panicking.rs:456:19
    frame #24: 0x00005555555dcba1 acme`std::panic::catch_unwind::h8487ef104e1ed0b7(f=<unavailable>) at panic.rs:137:14
    frame #25: 0x00005555555e0a0d acme`scheduled_thread_pool::Worker::run::h0b0fddce95686635(self=0x00007ffff6af8940) at lib.rs:326:21
    frame #26: 0x00005555555e08b0 acme`scheduled_thread_pool::Worker::start::_$u7b$$u7b$closure$u7d$$u7d$::h685180c0a62bc588 at lib.rs:320:30
    frame #27: 0x00005555555dcb6e acme`std::sys_common::backtrace::__rust_begin_short_backtrace::h148f92b33cf1c7fb(f={closure_env#0} @ 0x00007ffff6af8968) at backtrace.rs:122:18
    frame #28: 0x00005555555eb49b acme`std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h6c24d17a05224da0 at mod.rs:514:17
    frame #29: 0x00005555555dd71f acme`_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h329ec1e2ffc93f6a(self=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af89b0, _args=<unavailable>) at unwind_safe.rs:271:9
    frame #30: 0x00005555555e6f61 acme`std::panicking::try::do_call::h7ad9760e4b97e4ac(data="�,jUUU") at panicking.rs:492:40
    frame #31: 0x00005555555e7b5b acme`__rust_try + 27
    frame #32: 0x00005555555e6d8f acme`std::panicking::try::hf0bb4c1fa863e31d(f=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af8a80) at panicking.rs:456:19
    frame #33: 0x00005555555dcbcf acme`std::panic::catch_unwind::h86a8b4bcde1746d1(f=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af8ad0) at panic.rs:137:14
    frame #34: 0x00005555555eb26d acme`std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h2c1a5cd6dd347e00 at mod.rs:513:30
    frame #35: 0x00005555555e482f acme`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h662c6075d9edf8bc((null)=0x00005555556a0990, (null)=<unavailable>) at function.rs:248:5
    frame #36: 0x0000555555623c63 acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h49f797984e2121bf at boxed.rs:1940:9
    frame #37: 0x0000555555623c5d acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::hfa4f3d0ee6440e0b at boxed.rs:1940
    frame #38: 0x0000555555623c56 acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc at thread.rs:108
    frame #39: 0x00007ffff7f1c609 libpthread.so.0`start_thread(arg=<unavailable>) at pthread_create.c:477:8
    frame #40: 0x00007ffff7cec133 libc.so.6`__clone at clone.S:95

eloff (Author) commented Dec 4, 2022

In that issue, a dev states:

> All threads using openssl must be done before main thread exits. I'm not so sure that's unreasonable, or out of step with other things. But it's not going to be addressed in this release. Maybe not ever, just to set expectations accurately.

Is there a way I can join() all the worker threads for the pool r2d2 is using at the end of main() to meet that requirement?

sfackler (Owner) commented Dec 5, 2022

I'd recommend disabling the atexit callback instead. One way to do this is to call openssl-sys's init function at the very beginning of your program.
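
A minimal sketch of that approach, assuming the openssl crate is a dependency (its openssl::init() delegates to the openssl-sys initialization):

fn main() {
    // Initialize OpenSSL ourselves before libpq gets the chance. With a
    // sufficiently new rust-openssl this passes OPENSSL_INIT_NO_ATEXIT, so
    // OpenSSL never registers the atexit cleanup handler that races the
    // worker threads at process exit.
    openssl::init();

    // ... build the ConnectionManager and r2d2 Pool as before ...
}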

eloff (Author) commented Dec 5, 2022

> I'd recommend disabling the atexit callback instead. One way to do this is to call openssl-sys's init function at the very beginning of your program.

I'll give that a try and report back.

eloff (Author) commented Dec 6, 2022

You called it right. Bringing in the latest openssl crate (0.10.43) as an explicit dependency and calling openssl::init() at the top of main() fixes the crash.

This recent commit to rust-openssl is the one that fixes it: https://github.com/sfackler/rust-openssl/pull/1649/files#r1040260747. I guess I have OpenSSL 1.1.1b or newer on my machine. I checked under the debugger that it is that line being executed that disables the atexit handler.
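
One hedged way to check that under lldb (the breakpoint symbols are OpenSSL public API; which entry point rust-openssl calls may vary by version):

$ lldb ./acme
(lldb) b OPENSSL_init_ssl       # or OPENSSL_init_crypto
(lldb) run
(lldb) register read rdi        # first argument (opts) on x86-64; check for
                                # the OPENSSL_INIT_NO_ATEXIT bit (0x00080000)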

eloff closed this as completed Dec 6, 2022
liskin commented Nov 11, 2024

Note that this bug reappeared with the update to OpenSSL 3.x, as there's no flag like OPENSSL_INIT_NO_ATEXIT and thus the original fix for this issue in openssl-sys no longer works.

I don't suppose there's any chance we could ask the worker threads to exit cleanly and join them? (related: #99)

sfackler (Owner) commented:

> Note that this bug reappeared with the update to OpenSSL 3.x, as there's no flag like OPENSSL_INIT_NO_ATEXIT and thus the original fix for this issue in openssl-sys no longer works.

That's not correct. OPENSSL_INIT_NO_ATEXIT exists in OpenSSL 3.x.

liskin commented Nov 12, 2024

> > Note that this bug reappeared with the update to OpenSSL 3.x, as there's no flag like OPENSSL_INIT_NO_ATEXIT and thus the original fix for this issue in openssl-sys no longer works.
>
> That's not correct. OPENSSL_INIT_NO_ATEXIT exists in OpenSSL 3.x.

Oh, indeed, apologies. I should get more sleep before commenting next time. Nevermind then. Sorry again!
