Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ark: Should Ark be split in two processes? #802

Closed
lionel- opened this issue Jun 29, 2023 · 2 comments
Closed

ark: Should Ark be split in two processes? #802

lionel- opened this issue Jun 29, 2023 · 2 comments
Labels

Comments

@lionel-
Copy link
Contributor

lionel- commented Jun 29, 2023

This issue summarises a discussion we had this week at the Ark-Sync meeting about splitting Ark into two processes:

  • One process would be fully single threaded and run R. In this process, Ark would only run synchronously with R when called by frontend methods.

  • The other process would be multi-threaded and manage the sockets and parts of LSP that require constant work.

The two processes would communicate via an RPC protocol over IPC. The Ark-R process would pull tasks from the Ark-host process and return results using this protocol.

Advantages of a single threaded process

This reorganisation would have costs in terms of overall complexity and heavier message-passing. However it would have several advantages:

  • It would ensure thread-safety, and alleviate any concerns caused by the crash documented in ark: Weird crash when restarting R 4.1 repeatedly #720 and fixed in Wait until 0MQ sockets are created before starting R ark#43. The investigation of this crash indicated that localeconv_l(), called by printf() is not thread-safe on macOS.

    That printf() might not be thread safe is scary but also very surprising, and it's possible a third cause escaping the investigation caused the crash instead. Nevertheless, running R in parallel to other threads always carries a risk because R manipulates global state such as environment variables or locales, and this could cause functions reading this state from other threads to crash.

  • Kevin mentioned signal handling to be another area that is made more difficult by having R running in a multithreaded context.

  • The introspection routines of Ark can't run in parallel to R and so they need to get a lock at yield time (interrupt/read-console). This causes safety issue because we can't currently guarantee that the R API is only called under this lock. Also the details of how the lock works caused various issues in the past (reentrency, fairness). Davis worked on this in Use a ReentrantMutex for the R lock ark#48, POC for r_lock! enforcement on the R API ark#26, and Use bump() to yield the lock in a fair manner ark#49. These lock issues would be solved if our introspection routines were called from R at yield time instead of trying to get a lock.

Routines that need to run on the multithreaded process

  • Some sockets necessarily need their own thread, for instance the control socket. In general socket handling is made easier with a multithreaded design (though it makes it harder to deal with certain message passing races).

  • The LSP maintains in-memory models of the opened documents. It is constantly receiving document changes events from the frontend and processing them. This work can't realistically be done at interrupt time and also we don't want to pass all this data across another IPC layer.

Conclusion

The main reason to make that change is thread safety. But overall it doesn't seem worth it to make this change just to be defensive. The crash mentioned above was fixed by ensuring R only starts at a safer point, after initialisation of 0MQ state. But we should be on the lookout for random crashes in the wild that could be explained by thread unsafety.

@kevinushey
Copy link
Contributor

kevinushey commented Jun 29, 2023

For what it's worth, the RStudio rsession works in a somewhat similar way, in that the main thread runs R, and the majority of the tooling around it relies on R's event loop (e.g. R_PolledEvents) to e.g. respond to HTTP / JSON RPC requests and whatnot. We do spawn some separate child threads, but those never try to interact with R. The main problem with RStudio's model is that, since all RPCs are handled by the R main thread, you're stuck if R is busy -- e.g. you might not be able to save a document if R is busy.

We later resolved this by having some subset of RPCs be handled by a separate thread which listens to RPC requests -- it can either handle those eagerly, or push those into a queue for the main R thread to handle when ready.

Ideally, I think the R main thread and the LSP should live in the same process (or, potentially, run on the same thread), since the LSP will want to introspect R state -- e.g. the debugger might need to peek into the R context stack, and (if R and the LSP were in different processes) serializing + deserializing that could be expensive + more cumbersome to work with.

Kevin mentioned signal handling to be another area that is made more difficult by having R running in a multithreaded context.

FWIW from https://linux.die.net/man/7/pthreads:

POSIX.1 distinguishes the notions of signals that are directed to the process as a whole and signals that are directed to individual threads. According to POSIX.1, a process-directed signal (sent using kill(2), for example) should be handled by a single, arbitrarily selected thread within the process. LinuxThreads does not support the notion of process-directed signals: signals may only be sent to specific threads.

We'd need to be cognizant about how we handle signals in the various threads we're spawning. There are some different conventions, but some people recommend having a dedicated signal-handling thread, and to block signals in all other threads. See e.g. https://stackoverflow.com/a/22005820/1342082.

See also:

https://github.com/posit-dev/amalthea/blob/06dda252f99ccfe2f53065a3d7a8b20269d57d3c/crates/ark/src/interface.rs#L116-L134

https://github.com/posit-dev/amalthea/blob/06dda252f99ccfe2f53065a3d7a8b20269d57d3c/crates/ark/src/interface.rs#L602-L612

@lionel-
Copy link
Contributor Author

lionel- commented Feb 23, 2024

Haven't seen any more instability since then. I think we can close this for now and reopen the issue if it comes up again in the future.

@lionel- lionel- closed this as completed Feb 23, 2024
@wesm wesm added the lang: r label Feb 29, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

3 participants