Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pgrx install causing postgresql coredump #1263

Merged

Conversation

Sasasu
Copy link
Contributor

@Sasasu Sasasu commented Aug 17, 2023

this is because pgrx will overwrite the .so file in place.

dlopen will use MAP_PRIVATE to open the file and map a read-only memory use mmap(2). usually this memory has copy-on-write. if others is modifying the file. the previously mapped memory should not change.

but there is a undefined behavior, from man mmap(2):

MAP_PRIVATE: It is unspecified whether changes made to the file after the mmap()
call are visible in the mapped region.

what actually happens is that the read-only memory in postgresql is modified, and all pointers in the .TEXT segment is mashed up. the call stack looks like

 #0  0x0000000000731510 in ?? ()
 #1  0x00007f7a94515132 in core::sync::atomic::AtomicUsize::store (self=0x7f7a95110318 <pgrx_pg_sys::submodules::thread_check::ACTIVE_THREAD::h60448dcb81097e92>, val=0, order=core::sync::atomic::Ordering::Relaxed) at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/sync/atomic.rs:2291
 #2  0x00007f7a944f574b in pgrx_pg_sys::submodules::thread_check::init_active_thread::clear_in_child () at src/submodules/thread_check.rs:39
 #3  0x00007f7a962f8a88 in __run_postfork_handlers (who=who@entry=atfork_run_child, do_locking=do_locking@entry=false, lastrun=lastrun@entry=2) at register-atfork.c:187
 #4  0x00007f7a962df773 in __libc_fork () at fork.c:109
 #5  0x00005555e66ad948 in fork_process () at fork_process.c:61
 #6  0x00005555e66a5d48 in StartAutoVacWorker () at autovacuum.c:1543
 #7  0x00005555e66c43f7 in StartAutovacuumWorker () at postmaster.c:6155
 #8  0x00005555e66c3d21 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5820
 #9  <signal handler called>
 #10 0x00007f7a9630eb84 in __GI___select (nfds=6, readfds=0x7ffc61c2fa40, writefds=0x0, exceptfds=0x0, timeout=0x7ffc61c2f9b0) at ../sysdeps/unix/sysv/linux/select.c:69
 #11 0x00005555e66be343 in ServerLoop () at postmaster.c:1950
 #12 0x00005555e66bdb0f in PostmasterMain (argc=5, argv=0x5555e8fb5490) at postmaster.c:1631
 #13 0x00005555e6560e41 in main (argc=5, argv=0x5555e8fb5490) at main.c:240

the is pgrx_pg_sys try to run the postfork hook. but the variable ACTIVE_THREAD and the code binary does not in the previous place.

this is because pgrx will overwrite the .so file in place.

dlopen will use mmap with MAP_PRIVATE. but if other process is modifying
the file.

dlopen will use MAP_PRIVATE to open the file and map a read-only memory
use mmap(2). usually this memory has copy-on-write. if others is modifying
the file. the previously mapped memory should not change.

but there is a undefined behavior, from man mmap(2):

```
MAP_PRIVATE: It is unspecified whether changes made to the file after the mmap()
call are visible in the mapped region.
```

what actually happens is that the read-only memory in postgresql is modified,
and all pointers in the .TEXT segment is mashed up. the call stack looks like

```
 #0  0x0000000000731510 in ?? ()
 pgcentralfoundation#1  0x00007f7a94515132 in core::sync::atomic::AtomicUsize::store (self=0x7f7a95110318 <pgrx_pg_sys::submodules::thread_check::ACTIVE_THREAD::h60448dcb81097e92>, val=0, order=core::sync::atomic::Ordering::Relaxed) at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/sync/atomic.rs:2291
 pgcentralfoundation#2  0x00007f7a944f574b in pgrx_pg_sys::submodules::thread_check::init_active_thread::clear_in_child () at src/submodules/thread_check.rs:39
 pgcentralfoundation#3  0x00007f7a962f8a88 in __run_postfork_handlers (who=who@entry=atfork_run_child, do_locking=do_locking@entry=false, lastrun=lastrun@entry=2) at register-atfork.c:187
 pgcentralfoundation#4  0x00007f7a962df773 in __libc_fork () at fork.c:109
 pgcentralfoundation#5  0x00005555e66ad948 in fork_process () at fork_process.c:61
 pgcentralfoundation#6  0x00005555e66a5d48 in StartAutoVacWorker () at autovacuum.c:1543
 pgcentralfoundation#7  0x00005555e66c43f7 in StartAutovacuumWorker () at postmaster.c:6155
 pgcentralfoundation#8  0x00005555e66c3d21 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5820
 pgcentralfoundation#9  <signal handler called>
 pgcentralfoundation#10 0x00007f7a9630eb84 in __GI___select (nfds=6, readfds=0x7ffc61c2fa40, writefds=0x0, exceptfds=0x0, timeout=0x7ffc61c2f9b0) at ../sysdeps/unix/sysv/linux/select.c:69
 pgcentralfoundation#11 0x00005555e66be343 in ServerLoop () at postmaster.c:1950
 pgcentralfoundation#12 0x00005555e66bdb0f in PostmasterMain (argc=5, argv=0x5555e8fb5490) at postmaster.c:1631
 pgcentralfoundation#13 0x00005555e6560e41 in main (argc=5, argv=0x5555e8fb5490) at main.c:240
```

the is `pgrx_pg_sys` try to run the postfork hook. but the variable `ACTIVE_THREAD`
and the code binary does not in the previous place.
@workingjubilee
Copy link
Member

Wow that's cursed.

@workingjubilee
Copy link
Member

workingjubilee commented Aug 17, 2023

Hmm, can you give more exact repro instructions?

This sounds like something we should be testing under valgrind, so if it's possible to replicate this consistently, jamming out a script to use this as a regression test would be nice.

@Sasasu
Copy link
Contributor Author

Sasasu commented Aug 17, 2023

  1. run cargo pgrx install
  2. set shared_preload_libraries = <the pgrx project>
  3. reboot PostgreSQL to take effect
  4. run cargo pgrx install, and wait for auto vacuum or create a new session.
  5. PostgreSQL crash

sorry I forget to mention the repro need shared_preload_libraries

@Sasasu
Copy link
Contributor Author

Sasasu commented Aug 17, 2023

the key point is ensure libc::pthread_atfork(None, None, Some(clear_in_child)) is executed in postmaster. (call it in pg_init with shared_preload_libraries)

and try to add some static variable to the pgrx project between <3> and <4> if the .so has the same memory layout with the previous build


another repro is to write some simple UDF, and install it. load and run it.
then add more UDF, install, and run it again. if PostgreSQL does not reload .so file. the memory layout should be different. will crash when calling UDF or a wrong UDF gets executed.


or try pgml if a non-minimal repro is acceptable

@beeender
Copy link

run cargo pgrx install, and wait for auto vacuum or create a new session.

I think it can be reproduced by simply creating any new session, like psql xxx.

@Sasasu
Copy link
Contributor Author

Sasasu commented Aug 18, 2023

postgresml/postgresml#530

this person copies the .so file. and get crashed on every UDF call. after rebooting PostgreSQL it returns to normal.
he reproduced this bug, but not by pgrx install.

@@ -161,10 +161,14 @@ pub(crate) fn install_extension(
};
dest.push(format!("{}.so", so_name));

if cfg!(target_os = "macos") {
if cfg!(target_os = "macos") || cfg!(target_os = "linux") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me we should just always do it this way, keeping the comments as to why. macos and linux are our two primary target platforms and if they both exhibit similar problems with overwriting the existing .so, I'm sure other platforms do too.

@Sasasu
Copy link
Contributor Author

Sasasu commented Aug 20, 2023

BTW, have you reproduced the problem?

I find install(1) will also remove the .so file before creating it. and PGXS is using install(1)

touch a b
strace -e newfstatat,openat,unlinkat,ioctl install a b  

openat(AT_FDCWD, "b", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
newfstatat(AT_FDCWD, "a", {st_mode=S_IFREG|0644, st_size=4, ...}, 0) = 0
newfstatat(AT_FDCWD, "b", {st_mode=S_IFREG|0755, st_size=4, ...}, AT_SYMLINK_NOFOLLOW) = 0
unlinkat(AT_FDCWD, "b", 0)              = 0
openat(AT_FDCWD, "a", O_RDONLY)         = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=4, ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "b", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = 0

@workingjubilee
Copy link
Member

That is exciting.

@workingjubilee
Copy link
Member

2023-08-22 18:28:34.336 PDT [395747] LOG:  from bgworker: (Some("Hi"), Some(8), Some("(8,42)"))
2023-08-22 18:28:34.336 PDT [395747] LOG:  from bgworker: (Some("Hi"), Some(9), Some("(9,42)"))
2023-08-22 18:28:34.336 PDT [395747] LOG:  from bgworker: (Some("Hi"), Some(10), Some("(10,42)"))
2023-08-22 18:28:44.634 PDT [395740] LOG:  background worker "bgworkers" (PID 395747) was terminated by signal 11: Segmentation fault
2023-08-22 18:28:44.634 PDT [395740] LOG:  terminating any other active server processes
2023-08-22 18:28:44.733 PDT [395740] LOG:  statistics collector process (PID 395746) was terminated by signal 11: Segmentation fault
2023-08-22 18:28:44.733 PDT [395740] LOG:  all server processes terminated; reinitializing
2023-08-22 18:28:45.054 PDT [395740] LOG:  startup process (PID 398010) was terminated by signal 11: Segmentation fault
2023-08-22 18:28:45.054 PDT [395740] LOG:  aborting startup due to startup process failure
2023-08-22 18:28:45.069 PDT [395740] LOG:  database system is shut down

Successful repro.

There seems to have been more than just one segfault, even. Great. Excellent. Fantastic. Merging this and I will look into a regression test after we expand cargo-pgrx's capabilities a bit to make such easier.

@workingjubilee workingjubilee merged commit f659f69 into pgcentralfoundation:develop Aug 23, 2023
8 checks passed
@workingjubilee workingjubilee changed the title Fix pgrx install will case postgresql coredump Fix pgrx install causing postgresql coredump Aug 23, 2023
workingjubilee added a commit that referenced this pull request Aug 23, 2023
The fifth beta release of 0.10.0!

* `#[pg_test]`-compatible proptest support in
#1258 courtesy of moi
* PostgresEq (#1261) and
PostgresOrd (#1262) now
explicitly require their "base" Rust traits, which may hypothetically
break some impls, but in exchange your equality implementations are now
presumed to be commutative.
* Work on improving/simplifying/optimizing our SPI continues, with
@vrmiguel contributing some polish to Datum preparation in
#1256
* We have been finding some extra-spicy edge cases in how PGRX interacts
with a continuously-operational Postgres installation which might
feature extensions being installed and reinstalled lately! @Sasasu fixed
at least one core-dump-causing case from `cargo pgrx install` in
#1263

## New Contributors
* @vrmiguel made their first contribution in
#1256

**Full Changelog**:
v0.10.0-beta.3...v0.10.0-beta.4
@Sasasu Sasasu deleted the sa/fix_install_2s_coredump branch August 23, 2023 02:50
eeeebbbbrrrr pushed a commit to eeeebbbbrrrr/pgrx that referenced this pull request Sep 5, 2023
this is because pgrx will overwrite the .so file in place.

dlopen will use MAP_PRIVATE to open the file and map a read-only memory
using mmap(2). usually this memory is copy-on-write. if others are
modifying the file, the previously mapped memory should not change.

but there is an undefined behavior, from `man mmap(2)`:

> MAP_PRIVATE: It is unspecified whether changes made to the file after the mmap()
> call are visible in the mapped region.

what actually happens is that the read-only memory in postgresql is
modified, and all pointers in the .text segment can get mashed up.
the call stack looks like

```
 0x0000000000731510 in ?? ()
 0x00007f7a94515132 in core::sync::atomic::AtomicUsize::store (self=0x7f7a95110318 <pgrx_pg_sys::submodules::thread_check::ACTIVE_THREAD::h60448dcb81097e92>, val=0, order=core::sync::atomic::Ordering::Relaxed) at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/sync/atomic.rs:2291
 0x00007f7a944f574b in pgrx_pg_sys::submodules::thread_check::init_active_thread::clear_in_child () at src/submodules/thread_check.rs:39
 0x00007f7a962f8a88 in __run_postfork_handlers (who=who@entry=atfork_run_child, do_locking=do_locking@entry=false, lastrun=lastrun@entry=2) at register-atfork.c:187
 0x00007f7a962df773 in __libc_fork () at fork.c:109
 0x00005555e66ad948 in fork_process () at fork_process.c:61
 0x00005555e66a5d48 in StartAutoVacWorker () at autovacuum.c:1543
 0x00005555e66c43f7 in StartAutovacuumWorker () at postmaster.c:6155
 0x00005555e66c3d21 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5820
 <signal handler called>
 0x00007f7a9630eb84 in __GI___select (nfds=6, readfds=0x7ffc61c2fa40, writefds=0x0, exceptfds=0x0, timeout=0x7ffc61c2f9b0) at ../sysdeps/unix/sysv/linux/select.c:69
 0x00005555e66be343 in ServerLoop () at postmaster.c:1950
 0x00005555e66bdb0f in PostmasterMain (argc=5, argv=0x5555e8fb5490) at postmaster.c:1631
 0x00005555e6560e41 in main (argc=5, argv=0x5555e8fb5490) at main.c:240
```

this is `pgrx_pg_sys` trying to run the postfork hook, but the variable
`ACTIVE_THREAD` and the code binary will mismatch.
eeeebbbbrrrr pushed a commit to eeeebbbbrrrr/pgrx that referenced this pull request Sep 5, 2023
The fifth beta release of 0.10.0!

* `#[pg_test]`-compatible proptest support in
pgcentralfoundation#1258 courtesy of moi
* PostgresEq (pgcentralfoundation#1261) and
PostgresOrd (pgcentralfoundation#1262) now
explicitly require their "base" Rust traits, which may hypothetically
break some impls, but in exchange your equality implementations are now
presumed to be commutative.
* Work on improving/simplifying/optimizing our SPI continues, with
@vrmiguel contributing some polish to Datum preparation in
pgcentralfoundation#1256
* We have been finding some extra-spicy edge cases in how PGRX interacts
with a continuously-operational Postgres installation which might
feature extensions being installed and reinstalled lately! @Sasasu fixed
at least one core-dump-causing case from `cargo pgrx install` in
pgcentralfoundation#1263

## New Contributors
* @vrmiguel made their first contribution in
pgcentralfoundation#1256

**Full Changelog**:
pgcentralfoundation/pgrx@v0.10.0-beta.3...v0.10.0-beta.4
eeeebbbbrrrr added a commit that referenced this pull request Sep 5, 2023
This is the final release of v0.10.0. Thanks everyone for the beta
testing, pull requests, issues, and patience.

As always, install `cargo-pgrx` with `cargo install cargo-pgrx --locked`
and update your extension Cargo.toml files to use the `0.10.0` pgrx
dependencies.

This release includes support for Postgres 16RC1. Support for the
previous betas has been removed. As such, a fresh `cargo pgrx init` is
required.

## What's Changed Since v0.10.0-beta.4

* Fix `GetMemoryChunkContext` port by @workingjubilee in
#1273
* Better error messages when `pg_config` isn't found. by @eeeebbbbrrrr
in #1271
* Make `PostgresHash` also need `Eq` by @workingjubilee in
#1264
* Memoize git hash and extension metadata by @levkk in
#1274
* move to pg16rc1 by @eeeebbbbrrrr in
#1276
* Fix bgworker template up to 0.10.0-beta.4 by @workingjubilee in
#1270

## New Contributors
* @levkk made their first contribution in
#1274

**Changelog**:
v0.10.0-beta.4...v0.10.0

---

v0.10.0's full set of changes throughout the entire beta period are:

* Postgres 16beta1 Support by @eeeebbbbrrrr in
#1169
* Support building against macOS universal binaries by @clowder in
#1166
* list specific versions in feature gates by @eeeebbbbrrrr in
#1175
* Fix bug with converting a `pg_sys::Datum` into a `pgrx::Date` by
@eeeebbbbrrrr in #1177
* Fix Arrays with leading nulls by @eeeebbbbrrrr in
#1180
* Disable hello_versioned_so test by @workingjubilee in
#1192
* doc: fix link broken by @yihong0618 in
#1181
* fcinfo: fix incorrect length set in unsafe code by @Sasasu in
#1190
* update to pg16beta2 support by @eeeebbbbrrrr in
#1188
* Array-walking is aligned by @workingjubilee in
#1191
* Implement PGRXSharedMemory for Deque by @feikesteenbergen in
#1170
* Include security labels header by @daamien in
#1189
* Fixes macos-11 tests by @BradyBonnette in
#1197
* Pgcentralfoundation updates again by @eeeebbbbrrrr in
#1200
* Update version to 0.10.0-beta.0 by @eeeebbbbrrrr in
#1201
* Testing help by @eeeebbbbrrrr in
#1203
* Type testability cleanup by @eeeebbbbrrrr in
#1204
* Try to smartly propagate fs errors by @workingjubilee in
#1186
* Fix issue #1209 by @eeeebbbbrrrr in
#1210
* Type roundtrip tests by @eeeebbbbrrrr in
#1185
* Update version to 0.10.0-beta.1 by @eeeebbbbrrrr in
#1213
* Add a workaround for the pg16/homebrew/icu4c situation by @thomcc in
#1206
* Add security policy by @johnrballard in
#1207
* `AnyNumeric` is no longer backed by Postgres-allocated memory by
@eeeebbbbrrrr in #1216
* Modularize pgrx::spi by @workingjubilee in
#1219
* Stop SpiClient soundness from regressing by @workingjubilee in
#1214
* Add foreign table headers by @workingjubilee in
#1226
* Modularize the interior of pgrx-pg-sys by @workingjubilee in
#1227
* Initial valgrind support by @thomcc in
#1218
* Add support for handling SIGINT and SIGCHLD from bgworker by @JelteF
in #1229
* Ignores UI tests for MUSL environments by @BradyBonnette in
#1235
* Add a env flag that can be set to skip `#[pg_test]`-generated tests.
by @thomcc in #1239
* Fix issue #1076: Properly handle dependency graph of `Result<T, _>` by
@eeeebbbbrrrr in #1241
* Cleanup the error when cargo-pgrx version doesn't match Cargo.toml by
@eeeebbbbrrrr in #1240
* Add operator and cache related api by @VoVAllen in
#1242
* Addresses cargo-pgrx error reporting by @BradyBonnette in
#1238
* Update version to 0.10.0-beta.2 by @eeeebbbbrrrr in
#1244
* Bump cargo-metadata and clap-cargo by @thomcc in
#1246
* Derive Clone for Inet by @JelteF in
#1251
* Correct docs for datetime `From` impls by @workingjubilee in
#1253
* Only enable line tables for profile.dev by @thomcc in
#1249
* Remove references to master branch by @thomcc in
#1243
* Ensure bindgen gets all the `cppflags` it needs (on macOS, anyway) by
@thomcc in #1247
* update for pg16beta3 support by @eeeebbbbrrrr in
#1254
* Update version to 0.10.0-beta.3 by @eeeebbbbrrrr in
#1255
* Add proptest support by @workingjubilee in
#1258
* Misc reformatting and typo fixes by @workingjubilee in
#1260
* spi: simplify (optimize?) Datum preparation by @vrmiguel in
#1256
* Assume commutation when deriving PostgresEq by @workingjubilee in
#1261
* Demand Ord for PostgresOrd by @workingjubilee in
#1262
* Fix pgrx install causing postgresql coredump by @Sasasu in
#1263
* Update version to 0.10.0-beta.4 by @workingjubilee in
#1267

## New Contributors
* @clowder made their first contribution in
#1166
* @yihong0618 made their first contribution in
#1181
* @Sasasu made their first contribution in
#1190
* @daamien made their first contribution in
#1189
* @johnrballard made their first contribution in
#1207
* @VoVAllen made their first contribution in
#1242
* @vrmiguel made their first contribution in
#1256


**Full Changelog**:
v0.9.8...v0.10.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants