Coprocess Protocol Proposal
April 2021: This is OLD (2018). See Capers
This document sketches a protocol to allow coprocesses to substitute for normal "batch" processes in shell scripts. A coprocess can be thought of as a single-threaded server that reads and writes from pipes.
The goal is to make shell scripts faster. It can also make interactive completion faster, since completion scripts often invoke (multiple) external tools.
Many language runtimes start up slowly, e.g. when they include a JIT compiler or when many libraries are loaded: Python, Ruby, R, Julia, the JVM (including Clojure), etc.
This problem seems to be getting worse. Python 3 is faster than Python 2 in nearly all dimensions except startup time.
Let's call the protocol FCLI for now. There's a rough analogy to FastCGI and CGI: CGI starts one process per request, while FastCGI handles multiple requests in a process. (I think FastCGI is threaded unlike FCLI, but let's ignore that for now.)
Suppose we have a Python command line tool that copies files to a cloud file system. It works like this:
cloudcopy foo.jpg //remote/myhome/mydir/
(This could also be an R tool that does a linear regression, but let's use the `cloudcopy` example to be concrete. The idea is that a lot of the work is "startup time" like initializing libraries, not "actual work".)
It could be converted to an FCLI coprocess by wrapping `main()` in a `while True` loop (see the sketch below).
A shell would invoke such a process with these environment variables:
- `FCLI_VERSION` -- the process should try to become a coprocess. Some scripts may ignore this! That is OK; the shell/client should handle it.
- `FCLI_REQUEST_FIFO` -- read requests from this file system path (a named pipe).
- `FCLI_RESPONSE_FIFO` -- write responses to this file system path (a named pipe).
For worker #9, the shell might set variables like this:
FCLI_REQUEST_FIFO=/tmp/cloudcopy-pool/request-fifo-9 \
FCLI_RESPONSE_FIFO=/tmp/cloudcopy-pool/response-fifo-9 \
cloudcopy # no args; they'll be sent as "argv" requests
The requests and responses will look like this. Note the actual encoding will likely not be JSON, but I'm writing in JSON syntax for convenience.
# written by the shell to request-fifo-9
{ "argv": ["cloudcopy", "bar.jpg", "//remote/myhome/mydir"],
  "env": {"PYTHONPATH": "."}  # optional env to override the actual env; may be ignored by some processes
}

->

# written by the cloudcopy process to response-fifo-9
{ "status": 0 }  # 0 on success, 1 on failure
`stderr` is for logging. `stdin` / `stdout` are used as usual. We probably need to instruct the server to flush its streams in order to properly delimit requests (?). We won't get an EOF because the pipes are open across multiple requests.
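To make the server side concrete, here is a minimal sketch in Python of what a tool like `cloudcopy` might look like after wrapping `main()` in a loop. It assumes a line-delimited JSON encoding (which, as noted above, is not final) and a hypothetical `cloudcopy_main(argv)` function that does the actual work; neither name is part of the protocol.

```python
import json
import os
import sys

def cloudcopy_main(argv):
    # Hypothetical: the existing batch logic (parse argv, copy the file).
    # Returns 0 on success, 1 on failure.
    ...

def main():
    if 'FCLI_VERSION' not in os.environ:
        # Not invoked as a coprocess: behave like a normal batch tool.
        sys.exit(cloudcopy_main(sys.argv))

    with open(os.environ['FCLI_REQUEST_FIFO']) as requests, \
         open(os.environ['FCLI_RESPONSE_FIFO'], 'w') as responses:
        while True:
            line = requests.readline()
            if not line:              # client closed the request FIFO: shut down
                break
            req = json.loads(line)    # e.g. {"argv": [...], "env": {...}}
            try:
                status = cloudcopy_main(req['argv']) or 0
            except Exception:
                status = 1
            sys.stdout.flush()        # delimit this request's output from the next one's
            sys.stderr.flush()
            responses.write(json.dumps({'status': status}) + '\n')
            responses.flush()

if __name__ == '__main__':
    main()
```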
If you wanted to copy 1,000 files, you could start a pool of 20 or so coprocesses and drive them from an event loop. You would only pay the startup time 20 times instead of 1000 times.
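Here is a sketch of what such a driver might look like, again assuming a line-delimited JSON encoding. The pool size, batch scheduling, made-up file names, and `FCLI_VERSION` value are illustrative, not part of the protocol; it also fans requests out in fixed batches rather than using a real event loop.

```python
import json
import os
import subprocess

POOL_DIR = '/tmp/cloudcopy-pool'
NUM_WORKERS = 20

def start_worker(i):
    req_path = '%s/request-fifo-%d' % (POOL_DIR, i)
    resp_path = '%s/response-fifo-%d' % (POOL_DIR, i)
    for path in (req_path, resp_path):
        if not os.path.exists(path):
            os.mkfifo(path)
    env = dict(os.environ,
               FCLI_VERSION='0.1',              # assumed version string
               FCLI_REQUEST_FIFO=req_path,
               FCLI_RESPONSE_FIFO=resp_path)
    subprocess.Popen(['cloudcopy'], env=env)     # no args; they're sent as requests
    # Open the request end first, then the response end, to match the worker's order.
    return open(req_path, 'w'), open(resp_path)

os.makedirs(POOL_DIR, exist_ok=True)
workers = [start_worker(i) for i in range(NUM_WORKERS)]
files = ['img-%04d.jpg' % i for i in range(1000)]    # made-up file names

for start in range(0, len(files), NUM_WORKERS):
    batch = files[start:start + NUM_WORKERS]
    # Fan out one request per worker.
    for (req_fifo, _), name in zip(workers, batch):
        req_fifo.write(json.dumps(
            {'argv': ['cloudcopy', name, '//remote/myhome/mydir/']}) + '\n')
        req_fifo.flush()
    # Then collect the responses.  (A real driver would multiplex the response
    # FIFOs with select()/poll() instead of reading them in order.)
    for (_, resp_fifo), name in zip(workers, batch):
        reply = json.loads(resp_fifo.readline())
        if reply['status'] != 0:
            print('copy failed: %s' % name)

for req_fifo, _ in workers:
    req_fifo.close()    # workers see EOF on the request FIFO and exit
```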
In some cases, it would be possible to add a `--num-threads` option to your `cloudcopy` tool. But there are many cases where something like FCLI would be easier to implement. Wrapping `main()` is a fairly basic change.
The process may also just `exit 1` or `exit 123`, and the exit code will be treated as the status, e.g. `{"status": 123}`. A new coprocess will be started for the next request.
Possible request types:
- `argv` -- run a new command and print a response to the fifo. Use stdin/stdout/stderr as normal.
- `flush` -- flush stdout and stderr. I think this will make it easier to delimit responses from adjacent commands.
- `echo` -- for testing protocol conformance?
- `version` -- maybe?
- `cd` -- instruct the process to change directories? This should be straightforward in most (all?) languages.
- `env` -- should this be a separate request, and not part of the `argv` request? Not sure.
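As a sketch of how a coprocess might dispatch on these request types, extending the server loop above (the encoding of the non-`argv` requests is not specified in this document, so this assumes each request is an object containing exactly one of the keys listed; `cloudcopy_main` is the hypothetical worker function from the earlier sketch):

```python
import os
import sys

def handle_request(req):
    # Assumption: each request object has exactly one of the keys listed above.
    if 'argv' in req:
        if 'env' in req:
            os.environ.update(req['env'])    # optional env override; may be ignored
        return {'status': cloudcopy_main(req['argv']) or 0}
    if 'cd' in req:
        os.chdir(req['cd'])
        return {'status': 0}
    if 'flush' in req:
        sys.stdout.flush()
        sys.stderr.flush()
        return {'status': 0}
    if 'echo' in req:
        return {'status': 0, 'echo': req['echo']}    # for testing protocol conformance
    if 'version' in req:
        return {'status': 0, 'version': os.environ.get('FCLI_VERSION', '')}
    return {'status': 2}                              # unknown request type
```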
Shells are usually thought of as clients that drive coprocess "tools" in parallel. But they can also be servers, i.e. processing multiple invocations of `sh -c` in a single process.
Shells are often invoked recursively (including by redo).
Internally, a shell can use a mechanism similar to subshells like `( myfunc )` and `myfunc | tee foo.txt`. That is, `myfunc` has to be run in a subprocess.

So we can have a proxy process that is passed the file descriptors for a coprocess. And then the shell can interact with the proxy process normally. It can `wait()` on it, and it can redirect its output.
Waiting simultaneously for a process exit and an event from a pipe is somewhat annoying in Unix, requiring DJB's "self-pipe trick". This turns the exit event into an I/O event.
In a sense, the proxy strategy is the opposite: we're turning an I/O event (the coprocess prints `{"status": 0}`) into a process exit event!
The key is that `fork()` is very fast, but starting Python interpreters and JVMs is slow. So this will still be a big win.
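A sketch of the proxy idea in Python (rather than the shell's internals), showing only the `wait()` part. The function name is illustrative, and `req_fifo` / `resp_fifo` are assumed to be the already-open FIFO endpoints for one coprocess:

```python
import json
import os

def run_via_proxy(req_fifo, resp_fifo, argv):
    pid = os.fork()
    if pid == 0:
        # Child: the cheap proxy.  Forward one request, wait for the reply,
        # and turn the I/O event into a process exit event.
        req_fifo.write(json.dumps({'argv': argv}) + '\n')
        req_fifo.flush()
        reply = json.loads(resp_fifo.readline())
        os._exit(reply['status'])
    # Parent: the shell can now wait() on the proxy as if it were an
    # ordinary batch process, redirect its output, put it in a pipeline, etc.
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)
```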
The coprocess is single-threaded because that makes it easier for existing command line tools to implement this protocol. Many tools are written with global variables, or they are written in languages that don't thread well anyway (Python, R, etc.).
Some use cases:
- I could have used this for RAPPOR and several other "data science" projects in R.
- The redo build system starts many short-lived processes.
  - It starts many shell processes to interpret rules, and many "tool" processes.
- Shellac Protocol Proposal -- this protocol for shell-independent command completion can build on top of the coprocess protocol. It has more of a structured request/response flavor than some command line tools, but that's fine. FCLI works for both use cases.
Bash coprocesses communicate structured data over two file descriptors / pipes:
http://wiki.bash-hackers.org/syntax/keywords/coproc
They are not drop-in replacements for command line tools.
FCLI uses at least 4 one-way pipes, in order to separate control (argv, status) from data (stdin/stdout).
It would be nice for adoption to distribute a script like `fcli-lib.sh` or `fcli-lib.bash` that could call coprocesses in a transparent fashion.
However, bash can't even determine the length of a byte string, which limits the kinds of protocols you can construct with it (e.g. length-prefixed ones). (It counts unicode characters, unreliably.)
So bash will not be a client, but it can easily invoke a client, e.g. `fcli-driver`.
Oil can be a "first-class" client. That is, coprocesses can be substituted for batch processes without a syntax change.
foo() { foo-batch "$@"; }
seq 3 | foo x y z >out.txt 2>err.txt  # runs batch job

foo() { foo-coprocess "$@"; }
seq 3 | foo x y z >out.txt 2>err.txt  # runs coprocess
Don't many tools read until EOF? Consider a simple Python filter:
import sys

for line in sys.stdin:
    print(line.upper(), end='')  # line already includes the newline
It is somewhat hard to turn this into a coprocess, because the iterator wants an EOF event. Won't it block forever waiting for the next line? I guess that is why we need the FIFOs.
TODO: Should the shell capture stderr? Or just use it as the normal logging/error stream? Usage errors could be printed there.
Do processes have to change directories? It shouldn't be super hard for them to implement a `cd` command. (The shell can optimize that away in some cases.)
Process startup time is slow on Windows. I think it has named pipes, but they might not be on the file system? They might have their own namespace.
- If you start a coprocess pool, some requests might have affinity for certain replicas, i.e. to try to reuse a certain network connection. The shell could allow the user to specify this logic in a small shell function.
I wrote something like this a few years ago, but it assumed too much about the process. It assumed that you controlled all I/O in the process.
Places where you might not:
- On errors, the Python interpreter prints a stack trace to stderr
- R will randomly print warnings and other info to stderr !!!
- Some libraries print to stderr on errors.
It seems like this is mostly a problem for stderr.