
ark: Weird crash when restarting R 4.1 repeatedly #720

Closed · Tracked by #770
lionel- opened this issue Jun 9, 2023 · 6 comments

@lionel- (Contributor) commented Jun 9, 2023

Restarting R 4.1 repeatedly causes a segfault (tested locally on arm64 with 4.1.2 and on Davis's Intel computer with 4.1.3; we couldn't reproduce with 4.2 or 4.3).

Screen.Recording.2023-06-09.at.15.32.16.mov

Using posit-dev/ark#22 and #708, I get this backtrace (after a round of c++filt):

2023-06-07T16:01:07.576771000Z [ark-unknown] INFO crates/ark/src/main.rs:146: Loaded connection information from front-end in /var/folders/cr/whn38wbj72zd8h7wwwtnch500000gn/T/kernel-g9DuBp/connection.json
2023-06-07T16:01:07.577814000Z [ark-unknown] DEBUG crates/ark/src/main.rs:150: Connection data: ConnectionFile { control_port: 54973, shell_port: 42280, stdin_port: 60834, iopub_port: 52645, hb_port: 51043, transport: "tcp", signature_scheme: "hmac-sha256", ip: "127.0.0.1", key: "69292e51da84173afb44144559766844" }
2023-06-07T16:01:07.578049000Z [ark-unknown] TRACE crates/ark/src/interface.rs:493: Waiting for R's initial input prompt...
2023-06-07T16:01:07.578123000Z [ark-unknown] TRACE crates/amalthea/src/socket/socket.rs:56: Binding to ZeroMQ 'Shell' socket at tcp://127.0.0.1:42280
2023-06-07T16:01:07.578441000Z [ark-unknown] TRACE crates/ark/src/interface.rs:394: R_HOME: "/Library/Frameworks/R.framework/Resources"
2023-06-07T16:01:07.849303000Z [ark-unknown] INFO crates/stdext/src/spawn.rs:51: >>> Backtrace for signal 11
[R] >>> Backtrace for signal 11
[R] >>> In thread main
[R]    0: backtrace::backtrace::libunwind::trace
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/backtrace/libunwind.rs:93:5
[R]       backtrace::backtrace::trace_unsynchronized
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/backtrace/mod.rs:66:5
[R]    1: backtrace::backtrace::trace
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/backtrace/mod.rs:53:14
[R]    2: backtrace::capture::Backtrace::create
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/capture.rs:176:9
[R]    3: backtrace::capture::Backtrace::new
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/capture.rs:140:22
[R]    4: stdext::spawn::backtrace_handler
[R]              at crates/stdext/src/spawn.rs:50:14
[R]    5: _OSAtomicTestAndClearBarrier
[R]    6: <unknown>
[R]    7: <unknown>
[R]    8: <unknown>
[R]    9: <unknown>
[R]   10: <unknown>
[R]   11: <unknown>
[R]   12: <unknown>
[R]   13: zmq::tcp_address_t::to_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) const
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_address.cpp:146:20
[R]   14: std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > zmq::get_socket_name<zmq::tcp_address_t>(int, zmq::socket_end_t)
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/address.hpp:157:10
[R]   15: zmq::tcp_listener_t::get_socket_name(int, zmq::socket_end_t) const
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_listener.cpp:101:12
[R]   16: zmq::tcp_listener_t::set_local_address(char const*)
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_listener.cpp:183:17
[R]   17: zmq::socket_base_t::bind(char const*)
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/socket_base.cpp:653:24
[R]   18: zmq_bind
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/zmq.cpp:326:15
[R]   19: zmq::Socket::bind
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zmq-0.10.0/src/lib.rs:646:27
[R]   20: amalthea::socket::socket::Socket::new
[R]              at crates/amalthea/src/socket/socket.rs:57:35
[R]   21: amalthea::kernel::Kernel::connect
[R]              at crates/amalthea/src/kernel.rs:118:28
[R]   22: ark::start_kernel
[R]              at crates/ark/src/main.rs:84:11
[R]   23: ark::parse_file
[R]              at crates/ark/src/main.rs:151:13
[R]   24: ark::main
[R]              at crates/ark/src/main.rs:268:9
[R]   25: core::ops::function::FnOnce::call_once
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:250:5
[R]   26: std::sys_common::backtrace::__rust_begin_short_backtrace
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:134:18
[R]   27: std::rt::lang_start::{{closure}}
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:166:18
[R]   28: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:287:13
[R]       std::panicking::try::do_call
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:487:40
[R]       std::panicking::try
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:451:19
[R]       std::panic::catch_unwind
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panic.rs:140:14
[R]       std::rt::lang_start_internal::{{closure}}
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:148:48
[R]       std::panicking::try::do_call
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:487:40
[R]       std::panicking::try
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:451:19
[R]       std::panic::catch_unwind
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panic.rs:140:14
[R]       std::rt::lang_start_internal
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:148:20
[R]   29: std::rt::lang_start
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:165:17
[R]   30: _main

The _OSAtomicTestAndClearBarrier frame is part of the backtrace capture, so the crash happens in the 0MQ stack, precisely here: https://github.com/zeromq/libzmq/blob/5bf04ee2ff207f0eaf34298658fe354ea61e1839/src/tcp_address.cpp#L121 (the line number differs because our vendored 0MQ is 3 years old, but the function hasn't changed).

const int rc = getnameinfo (addr (), addrlen (), hbuf, sizeof (hbuf), NULL,
                            0, NI_NUMERICHOST);

I thought perhaps there was some ordering issue that caused us to pass uninitialised memory to getnameinfo(). I used a local 0MQ build with some printf calls to make sure, and this doesn't seem to be the case:

addrlen: 16

// struct sockaddr_in {
//   __uint8_t       sin_len;
//   sa_family_t     sin_family;
//   in_port_t       sin_port;
//   struct  in_addr sin_addr;
//   char            sin_zero[8];
// };

addr: {
  sin_len: 16
  sin_family: 2
  sin_port: 61348
  sin_addr: { 16777343 }  // 127.0.0.1
  sin_zero: 0
}

I couldn't reproduce the crash when I was attaching Ark on startup, so I thought there might be a race somewhere. That indeed seems to be the case, as sleeping a bit during startup avoids the crash:

modified   Cargo.lock
@@ -51,6 +51,7 @@ dependencies = [
  "stdext",
  "strum",
  "strum_macros",
+ "time 0.3.21",
  "uuid",
  "zmq",
 ]
@@ -303,7 +304,7 @@ dependencies = [
  "js-sys",
  "num-integer",
  "num-traits",
- "time",
+ "time 0.1.45",
  "wasm-bindgen",
  "winapi",
 ]
@@ -2066,6 +2067,22 @@ dependencies = [
  "winapi",
 ]
 
+[[package]]
+name = "time"
+version = "0.3.21"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8f3403384eaacbca9923fa06940178ac13e4edb725486d70e8e15881d0c836cc"
+dependencies = [
+ "serde",
+ "time-core",
+]
+
+[[package]]
+name = "time-core"
+version = "0.1.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7300fbefb4dadc1af235a9cef3737cea692a9d97e1b9cbcd4ebdae6f8868e6fb"
+
 [[package]]
 name = "tinyvec"
 version = "1.6.0"
modified   crates/amalthea/Cargo.toml
@@ -25,6 +25,7 @@ zmq = "0.10.0"
 strum = "0.24"
 strum_macros = "0.24"
 crossbeam = { version = "0.8.2", features = ["crossbeam-channel"] }
+time = "0.3.21"
 
 [dev-dependencies]
 rand = "0.8.5"
modified   crates/amalthea/src/socket/socket.rs
@@ -25,6 +25,8 @@ pub struct Socket {
     socket: zmq::Socket,
 }
 
+use std::time;
+
 impl Socket {
     /// Create a new Socket instance from a kernel session and a ZeroMQ context.
     pub fn new(
@@ -53,6 +55,7 @@ impl Socket {
         // will be the server, and the `connect()` side will be the client.
         match kind {
             zmq::SocketType::ROUTER | zmq::SocketType::PUB | zmq::SocketType::REP => {
+                std::thread::sleep(time::Duration::from_millis(50));
                 trace!("Binding to ZeroMQ '{}' socket at {}", name, endpoint);
                 if let Err(err) = socket.bind(&endpoint) {
                     return Err(Error::SocketBindError(name, endpoint, err));

Since there is nothing wrong with the data passed to getnameinfo(), and since sleeping a bit fixes the crash, I'm tempted to say this is a weird race condition in the macOS network stack. It would be interesting to see whether this reproduces on Linux or Windows.

I also wondered whether properly closing all the 0MQ sockets and the context on shutdown would help, but didn't go through with this because it seems like a lot of work to propagate the shutdown to all the threads and coordinate the closing of everything. Also, since we're exiting the process anyway, it shouldn't matter that we leave the sockets open?

So I have two questions:

  • What else can I try to figure out this crash?
  • Is the sleep workaround a reasonable stopgap that is worth a PR?
@kevinushey (Contributor)

We bumped into some very similar bugs on the RStudio side that effectively came down to various networking function calls using getenv(), and those calls being unsafe in a multithreaded context.

https://github.com/rstudio/rstudio-pro/pull/4648

@kevinushey (Contributor)

https://man7.org/linux/man-pages/man3/getnameinfo.3.html suggests (at least with glibc) that the function is:

MT-Safe env locale

which effectively means the function makes (unsynchronized) calls to access facets of the environment / locale; so things could definitely go awry if we were modifying those in a separate thread.

lionel- self-assigned this Jun 13, 2023
@lionel- (Contributor, Author) commented Jun 14, 2023

Using #742 (well, a local version that actually works; updates incoming), I managed to get a backtrace for the R thread:

__ulock_wait (@__ulock_wait:5)
_os_unfair_lock_lock_slow (@_os_unfair_lock_lock_slow:54)
localeconv_l (@localeconv_l:19)
__vfprintf (@__vfprintf:45)
_vsnprintf (@_vsnprintf:59)
snprintf (@snprintf:20)
R_reInitTempDir (@R_reInitTempDir:50)
Rf_InitTempDir (@Rf_InitTempDir:4)
setup_Rmainloop (@setup_Rmainloop:121)
ark::interface::start_r (/Users/lionel/Sync/Projects/Positron/positron/extensions/positron-r/amalthea/crates/ark/src

I believe this is from this line:

snprintf(buf, len, "R_SESSION_TMPDIR=%s", tmp);

I can't find much information about the thread-safety of localeconv_l(); macOS seems to have a different implementation anyway. Interesting article: https://aras-p.info/blog/2022/02/25/Curious-lack-of-sprintf-scaling/

To fix this I'll just delay R's initialisation until after 0MQ's, as discussed yesterday.

Edit: On the main thread side, the debugger shows a more complete backtrace:

localeconv_l (@localeconv_l:23)
__vfprintf (@__vfprintf:45)
_vsnprintf (@_vsnprintf:59)
snprintf (@snprintf:20)
inet_ntop4 (@inet_ntop4:43)
si_nameinfo (@si_nameinfo:197)
getnameinfo (@getnameinfo:72)
zmq::tcp_address_t::to_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>&) const (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_address.cpp:146)

So it looks like printf() is not thread-safe on macOS...

@lionel- (Contributor, Author) commented Jun 15, 2023

@DavisVaughan ran into this crash with R 4.3 as well.

I used the debugger to pause on invocations of snprintf() outside of the R thread. Many of them happen during 0MQ initialisation, e.g. the call stack in the report above, and things like:

snprintf (@snprintf:3)
inet_ntop4 (@inet_ntop4:30)
si_nameinfo (@si_nameinfo:197)
getnameinfo (@getnameinfo:72)
zmq::tcp_address_t::to_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>&) const (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_address.cpp:146)
std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> zmq::get_socket_name<zmq::tcp_address_t>(int, zmq::socket_end_t) (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/address.hpp:157)
zmq::tcp_listener_t::get_socket_name(int, zmq::socket_end_t) const (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_listener.cpp:101)
zmq::stream_listener_base_t::create_engine(int) (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/stream_listener_base.cpp:103)
zmq::tcp_listener_t::in_event() (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_listener.cpp:94)
zmq::kqueue_t::loop() (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/kqueue.cpp:218)
zmq::worker_poller_base_t::worker_routine(void*) (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/poller_base.cpp:146)
thread_routine(void*) (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/thread.cpp:257)
_pthread_start (@_pthread_start:40)

Once we're past startup, most snprintf() calls in other threads happen within the R lock, so they are safe. I did see these while creating plots:

snprintf (@snprintf:3)
hexNumber(unsigned long long) (@hexNumber(unsigned long long):19)
asString(void const*) (@asString(void const*):1006)
asString(void const*) (@asString(void const*):246)
copyApplicationInformationDictionaryGivenASNUsingLocalCache(__LSSharedMemoryPage const*, __LSASN const*) (@copyApplicationInformationDictionaryGivenASNUsingLocalCache(__LSSharedMemoryPage const*, __LSASN const*):348)
_LSCopyApplicationInformation (@_LSCopyApplicationInformation:207)
_LSCopyApplicationInformationItem (@_LSCopyApplicationInformationItem:40)
__105-[NSApplication(NSPersistentUISupport) _disableRelaunchOnLoginIfNotLaunchedByLaunchServicesOrBlacklisted]_block_invoke (@__105-[NSApplication(NSPersistentUISupport) _disableRelaunchOnLoginIfNotLaunchedByLaunchServicesOrBlacklisted]_block_invoke:14)
___NSPersistentUIDispatchQueueAsync_block_invoke (@___NSPersistentUIDispatchQueueAsync_block_invoke:20)
_dispatch_call_block_and_release (@_dispatch_call_block_and_release:11)
_dispatch_client_callout (@_dispatch_client_callout:8)
_dispatch_root_queue_drain (@_dispatch_root_queue_drain:174)
_dispatch_worker_thread2 (@_dispatch_worker_thread2:44)
_pthread_wqthread (@_pthread_wqthread:60)

---

snprintf (@snprintf:3)
_xpc_connection_create (@_xpc_connection_create:54)
_xpc_connection_mach_event (@_xpc_connection_mach_event:506)
_dispatch_client_callout4 (@_dispatch_client_callout4:8)
_dispatch_mach_msg_invoke (@_dispatch_mach_msg_invoke:120)
_dispatch_lane_serial_drain (@_dispatch_lane_serial_drain:97)
_dispatch_mach_invoke (@_dispatch_mach_invoke:115)
_dispatch_lane_serial_drain (@_dispatch_lane_serial_drain:97)
_dispatch_lane_invoke (@_dispatch_lane_invoke:99)
_dispatch_workloop_worker_thread (@_dispatch_workloop_worker_thread:166)
_pthread_wqthread (@_pthread_wqthread:75)

But these seem to be transient threads spawned by the OS, presumably while R is waiting, so they should be safe; if they weren't, we'd have seen problems outside Ark.

So overall, delaying R's initialisation until 0MQ init is complete should get us out of most trouble.

That said, this investigation only covers snprintf(), and there are other unsafe functions that could cause problems. I worry that Ark users will encounter rare crashes and lose trust in the application. So if we end up seeing random crashes, I wonder if we should consider an alternative design in the longer term, to make Ark as robust as it can be. I was thinking of something like:

  • Ark would be split up in two processes communicating with some RPC protocol. The first process would look much like the current one and would be multithreaded. The second process would run R in a fully single threaded way.

  • Execution and introspection requests would be pulled (rather than pushed) at yield time (read-console and interrupt) from the main process via IPC and run sequentially. This seems mostly equivalent to what we have today with the R lock since there is only one thread that can query R at any time.

@DavisVaughan (Contributor)

I restarted R 4.1.3 roughly 70 times with this fix installed and didn't see the crash anymore. I think we are pretty confident about the source of this, and that posit-dev/ark#43 fixes it, so I'll close. Thanks @lionel-!

@kevinushey (Contributor)

roughly 70 times

I appreciate the dedication 😄

DavisVaughan mentioned this issue Jun 20, 2023
petetronic added this to the Private Alpha milestone Aug 26, 2023
wesm added the lang: r label Feb 29, 2024
github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2024