Seg fault with Postgres #137
There is zero unsafe code in this library. It is extremely unlikely the crash is due to a bug in r2d2. The screenshot of the stacktrace in that repo is of a thread blocked in a futex_wait call, which is not a location that would be triggering a segfault. You should find the faulting thread.
With no unsafe code, it does seem unlikely. It must be related to the interplay between the libpq-based driver in diesel and r2d2. All the core dumps are for r2d2 worker threads. Note that the crash doesn't happen if I comment out either of the min_idle or max_size lines:
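A minimal sketch of the kind of pool configuration being described here, assuming diesel's r2d2 feature; the connection string and the specific min_idle/max_size values are placeholders, since the original snippet is not reproduced in the thread:

```rust
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

// Hypothetical reconstruction of the reported setup; the URL and the exact
// pool sizes are assumptions.
fn build_pool(database_url: &str) -> Pool<ConnectionManager<PgConnection>> {
    let manager = ConnectionManager::<PgConnection>::new(database_url);
    Pool::builder()
        .min_idle(Some(2)) // removing this line reportedly avoided the crash
        .max_size(10)      // ...as did removing this one
        .build(manager)
        .expect("failed to build connection pool")
}
```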
What specifically do you mean by that? Core dumps are normally a dump of process memory, including all threads in the application at the time of the crash.
I use Python in my day job, so apologies if I'm uneducated here:
I believe that indicates that the crashing thread was an r2d2 worker thread, but not the one that is included in the screenshot.
All the core dumps show futex_wait, called from run() -> condvar -> parking_lot. See: https://github.com/eloff/diesel-segfault/blob/main/coredump.png
There are multiple threads in each core dump. Click the drop-down that says
That is thread-4, the one that crashed. Here are the others: https://github.com/eloff/diesel-segfault/blob/main/thread-1.png
I have already said that thread-4 is not the one that crashed. I'm not sure how much more explicit I can be. Thread 1 is trying to take a lock inside of OpenSSL while thread 2 is performing cleanup inside of an atexit handler set by OpenSSL. This is probably openssl/openssl#6214.
Sorry, I thought the little lightning bolt there meant this was the thread that caused the problem. I don't load core dumps often. You're right, and the stack trace looks similar to the ones in that openssl thread:
In that issue a dev states: "All threads using openssl must be done before main thread exits. I'm not so sure that's unreasonable, or out of step with other things. But it's not going to be addressed in this release. Maybe not ever, just to set expectations accurately." Is there a way I can join() all the worker threads for the pool r2d2 is using at the end of main() to meet that requirement?
I'd recommend disabling the atexit callback instead. One way to do this is to call openssl-sys's init function.
I'll give that a try and report back.
You called it right. Bringing in the latest openssl crate (0.10.43) for Rust as an explicit dependency and calling openssl::init() at the top of main() fixes the crash. This recent change to rust-openssl is the one that fixes it: https://github.com/sfackler/rust-openssl/pull/1649/files#r1040260747. I guess I have OpenSSL 1.1.1b on my machine. I checked under the debugger that it is that line being executed that disables the atexit handler.
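For reference, a minimal sketch of the workaround confirmed above, assuming `openssl` is added as an explicit dependency as described; the comments reflect what the thread reports (this init path disables OpenSSL's atexit handler), not a guarantee for every OpenSSL version:

```rust
fn main() {
    // Initialize OpenSSL up front, before r2d2 spawns its worker threads.
    // Per the discussion above, with a recent openssl/openssl-sys this
    // initialization disables OpenSSL's atexit cleanup handler, which is
    // what raced with the still-running pool threads at process exit.
    openssl::init();

    // ... build the diesel/r2d2 pool and run the application as usual ...
}
```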
Note that this bug reappeared with the update to OpenSSL 3.x, as there's no flag like OPENSSL_INIT_NO_ATEXIT and thus the original fix for this issue in openssl-sys no longer works. I don't suppose there's any chance we could ask the worker threads to exit cleanly and join them? (related: #99)
That's not correct. OPENSSL_INIT_NO_ATEXIT exists in OpenSSL 3.x.
Oh, indeed, apologies. I should get more sleep before commenting next time. Nevermind then. Sorry again!
I posted this to diesel, but the issue appears to be with r2d2. All the core dumps are r2d2 workers.
diesel-rs/diesel#3441
Repro (see README for screenshot of one stacktrace): https://github.com/eloff/diesel-segfault
Versions
rustc 1.65.0 (897e37553 2022-11-02)
diesel 2.0.2
PostgreSQL 15.1 (Ubuntu 15.1-1.pgdg20.04+1)
Ubuntu 20.04, kernel 5.15.0-56-generic
I can't reproduce this issue in the docker-compose I provided, just on my Ubuntu 20.04 dev machine. Any idea why that might be? Anything I can test?