Consoles, consoles, consoles. #1018

cyphar · 2016-09-04T12:01:10Z

This is the big console rewrite. For some design information, see this. TODO list:

Implement the console masterfd passing.
Rewrite parts of tty.go to work with the new setup.
Remove --console and add --console-socket.
- Need to figure out a sane protocol for sending things to --console-socket.
- Write a small implementation of a --console-socket consumer as a sanity check [can be done in a separate PR].
Figure out how to make the test suite behave.
- TestExecInTTY.
- bats integration.
Need to figure out why runC is blocking on recvfd().

Fixes #814, #910, moby/moby#11462, moby/moby#8755 and countless other issues.
~~Depends on #975, #977.~~
moby/moby#9212 and similar issues are not fixed by this patchset, as we still need to consider how to handle /dev/console in certain cases.

Signed-off-by: Aleksa Sarai asarai@suse.de

cyphar · 2016-09-05T14:38:40Z

Alright @opencontainers/runc-maintainers, this code is starting to get ready. I have some questions that we should address:

Currently detach && !process.terminal will result in runC duping its own stdio over the container's stdio. This results in quite a few odd things happening (such as your terminal becoming unusable due to read(2) races). Is there a reason we do this instead of doing something like duping /dev/null over the container's stdio? The OCI spec doesn't really specify anything about what we should do here and IMO duping the current terminal's stdio isn't really a very logical thing to do (not least of all because it's possible to use that to carry out terminal attacks against the session the process was started in).
Does it make any sense to allow --console-socket without detach? Personally I don't see what a user would expect runC to do in such a case (and thus have currently disabled it). The same question goes for --console-socket and !process.terminal.
For the protocol of --console-socket, what should we send and how should we send it? Personally I would prefer that we send things using iov in the message packet to ensure that nothing fishy goes on (and also to not require you to read things in a particular way so you can get the fd). What we should send is another question -- should we just send container.ID() or the whole state JSON?
Can someone please help me fix the tests? They're broken and I've not been able to get them to work (it doesn't help that I'm not sure why they worked in the first place -- what does --console /dev/pts/ptmx even mean given the old console implementation). The more worrying failing test is the ExecInTTY test (which is really meant to be testing the new API and it isn't working ... gulp).
Does anyone have an issue with me adding contrib/cmd/recvtty? IMO we need to have a PoC which actually works with the --console-socket API. Maybe @crosbymichael can add his old console handling code to the PoC so people know how to have a proper terminal setup (though it'll probably be a clone of the runC code).
Should we make --console-socket a slightly more generic interface? Something like --events where we send different file descriptors down the socket to provide information to higher level orchestrators? We wouldn't have to implement anything other than consoles, but if we ever wanted to have this kind of functionality we should probably consider future-proofing this interface now.

cyphar · 2016-09-06T16:45:47Z

ExecInTTY passes on my local machine now. I'm confused why it's failing in Jenkins...

mrunalp · 2016-09-06T17:55:16Z

From my quick testing, there are some issues. runc run works sometimes and just hangs other times. Could be a race. I'll review the code.

mrunalp · 2016-09-06T21:36:35Z

Currently detach && !process.terminal will result in runC duping its own stdio over the container's stdio. This results in quite a few odd things happening (such as your terminal becoming unusable due to read(2) races). Is there a reason we do this instead of doing something like duping /dev/null over the container's stdio? The OCI spec doesn't really specify anything about what we should do here and IMO duping the current terminal's stdio isn't really a very logical thing to do (not least of all because it's possible to use that to carry out terminal attacks against the session the process was started in).

This has been the docker default where we don't need to allocate a tty but just use the parent tty. I don't have strong opinions either way but changing this to dup /dev/null would mean that users would now be expected to set terminal to true when they didn't have to in the past. One more useful thing that we could do is allow setting the stdio to some FDs instead which would simplify logging from containers.

Does it make any sense to allow --console-socket without detach? Personally I don't see what a user would expect runC to do in such a case (and thus have currently disabled it). The same question goes for --console-socket and !process.terminal.

I don't see a reason either.

For the protocol of --console-socket, what should we send and how should we send it? Personally I would prefer that we send things using iov in the message packet to ensure that nothing fishy goes on (and also to not require you to read things in a particular way so you can get the fd). What we should send is another question -- should we just send container.ID() or the whole state JSON?

How about we just send the minimal required which is just the fd and the container ID (for logging)?

Can someone please help me fix the tests? They're broken and I've not been able to get them to work (it doesn't help that I'm not sure why they worked in the first place -- what does --console /dev/pts/ptmx even mean given the old console implementation). The more worrying failing test is the ExecInTTY test (which is really meant to be testing the new API and it isn't working ... gulp).

I don't see how we can do the tests easily in bash. At minimum we will need a binary that calls create, receives the FD over the socket and copies data back and forth between master fd and stdio. Next would be to call start (from a fork/thread). Maybe this could be the example binary that illustrates the new API.

Does anyone have an issue with me adding contrib/cmd/recvtty? IMO we need to have a PoC which actually works with the --console-socket API. Maybe @crosbymichael can add his old console handling code to the PoC so people know how to have a proper terminal setup (though it'll probably be a clone of the runC code).

I think using this binary for tests should work well as an example.

Should we make --console-socket a slightly more generic interface? Something like --events where we send different file descriptors down the socket to provide information to higher level orchestrators? We wouldn't have to implement anything other than consoles, but if we ever wanted to have this kind of functionality we should probably consider future-proofing this interface now.

On the fence for this. Maybe defer to post 1.0.

cyphar · 2016-09-07T00:23:04Z

From my quick testing, there are some issues. runc run works sometimes and just hangs other times. Could be a race. I'll review the code.

That's /very/ odd, since it doesn't happen on my machine. But that probably explains why Jenkins is still timing out even though I fixed the logic issue. Maybe there's some disk access that's killing the synchronisation between the processes? AFAIK the ordering of sendmsg and recvmsg doesn't matter because they'll always block until the other end is ready.

This has been the docker default where we don't need to allocate a tty but just use the parent tty. I don't have strong opinions either way but changing this to dup /dev/null would mean that users would now be expected to set terminal to true when they didn't have to in the past. One more useful thing that we could do is allow setting the stdio to some FDs instead which would simplify logging from containers.

The reason for my concern is that when you're running runc from a terminal with !process.terminal then you're giving the container process access to your current terminal in the host. lxc had some security issues due to a similar setup they were using, and I wouldn't be surprised if something similar was possible here too. Basically because you have access to the tty you can close stdin from inside the container so the only open stdin is the host console, and then you can run commands from the context of the host process.

wking · 2016-09-07T07:55:46Z

On Tue, Sep 06, 2016 at 02:36:37PM -0700, Mrunal Patel wrote:

Currently detach && !process.terminal will result in runC duping
its own stdio over the container's stdio. This results in quite a
few odd things happening (such as your terminal becoming unusable
due to read(2) races). Is there a reason we do this instead of
doing something like duping /dev/null over the container's stdio?
The OCI spec doesn't really specify anything about what we should
do here and IMO duping the current terminal's stdio isn't really a
very logical thing to do (not least of all because it's possible
to use that to carry out terminal attacks against the session the
process was started in).

This has been the docker default where we don't need to allocate a
tty but just use the parent tty. I don't have strong opinions either
way but changing this to dup /dev/null would mean that users would
now be expected to set terminal to true when they didn't have to in
the past.

When detach is true? I can see dupling the runtime stdio over the
container stdio when !detach && !process.terminal (e.g. if you're
using the host mount namespace or for some other reason would be using
the host devpts inside the container anyway); it saves the cost of a
new terminal. But if you're going to detach (presumably so you can
use your original terminal for something else), I don't see the point.

For the protocol of --console-socket, what should we send and how
should we send it?…

How about we just send the minimal required which is just the fd and
the container ID (for logging)?

And also (per the future-proofing point 6 1), some sort of “here's
the pseudoterminal master” token. I think you want some way to
version the socket messages from the beginning, and putting a
message-type string in a consistent place is an easy way to do that.

cyphar · 2016-09-07T08:52:36Z

@mrunalp Alright, I can reproduce the stalling if I use strace. It looks like it might be some sort of race condition triggered by some timing issue (it's blocking on the exec fifo read though).

~~Attached is the strace logs.~~ Github won't let me attach the logs. Oddly enough, if you kill just the strace process the command becomes un-stalled. Weird.

cyphar · 2016-09-07T14:05:59Z

Alright, after staring at strace logs and /proc/<pid>/task/*/status for a couple of hours I think I know what's going on. In the strace logs, the following call is what we stall on (in the runc init process):

[pid  3753] execve("/usr/bin/ls", ["ls", "-la", "/proc/self/fd/"], [/* 3 vars */] <unfinished ...>

Now, the semantics of execve are "fun" for multithreaded processes. Basically, only single-threaded processes are capable of undergoing an execve. So when you do an execve, the kernel is meant to kill every other task in the threadgroup. But that's not happening. Why? Because for some reason all of the other tasks in this thread group appear to be zombies. And if zombie movies have told us anything, you can't kill a zombie. This only happens sometimes and I'm not really sure why:

% grep -i state /proc/<pid>/task/*/status
State:  D (disk sleep)
State:  Z (zombie)
State:  Z (zombie)
State:  Z (zombie)
State:  Z (zombie)
State:  Z (zombie)

But this only happens when you hit the console changes, so clearly something changed in the process management that caused this. I'll have to look into it some more.

cyphar · 2016-09-07T14:11:44Z

@mrunalp Dammit, I just realised that this also happens on master if you use strace. It also happens on cc29e3d (which is what we're shipping in openSUSE right now). So this is not a new issue (though it is one we probably need to deal with). As far as I can tell, it's related to the exec fifo.

/cc @crosbymichael Any ideas? I'll file an issue about this (#1025).

cyphar · 2016-09-07T14:27:19Z

Alright. It looks like the above is a red herring, strace and runc don't seem to play well together before this PR. So now I have to debug this issue without strace... Brilliant.

EDIT: It looks like we're blocked on RecvFd, but SendFd didn't fail (and the child process is blocked on a read from the synchronisation pipe). I'm not entirely sure why this is happening ...

mrunalp · 2016-09-07T22:32:26Z

@cyphar One thing that could be causing issue is the socketpair using SOCK_STREAM. We could try using SOCK_SEQPACKET instead. (We might even need two different pairs).

mrunalp · 2016-09-07T23:37:21Z

@cyphar Here is the strace:

[pid 26445] 15:52:16 recvmsg(5,  <unfinished ...>
[pid 26444] 15:52:16 futex(0xc42006a110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26443] 15:52:16 futex(0xb16b00, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26442] 15:52:16 futex(0xc420034910, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26441] 15:52:16 restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 26440] 15:52:16 futex(0xafc070, FUTEX_WAIT, 0, NULL

I am pretty certain now that the data is consumed by an earlier receive. We should probably look into another socketpair with SOCK_SEQPACKET for passing the fd.

cyphar · 2016-09-08T10:16:07Z

@mrunalp Actually, it's not a recv* which is gobbling up the out-of-band message. It's the synchronisation read (note the B at the end of the JSON payload, that is the iov I'm sending which is 0x42):

[pid  7981] write(3, "{\"type\":\"procConsole\"}", 22) = 22
[pid  7981] sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="B", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[6]}], msg_controllen=24, msg_flags=0}, 0 <unfinished ...>
[pid  7982] <... select resumed> )      = 0 (Timeout)
[pid  7981] <... sendmsg resumed> )     = 1
[pid  7982] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  7981] read(3,  <unfinished ...>
[pid  7977] <... write resumed> )       = 21
[pid  7974] <... select resumed> )      = 0 (Timeout)
[pid  7977] read(5,  <unfinished ...>
[pid  7982] <... select resumed> )      = 0 (Timeout)
[pid  7974] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  7982] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  7977] <... read resumed> "{\"type\":\"procConsole\"}B", 512) = 23
[pid  7982] <... select resumed> )      = 0 (Timeout)
[pid  7982] futex(0xc820021108, FUTEX_WAKE, 1) = 1
[pid  7986] <... futex resumed> )       = 0
[pid  7982] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  7986] futex(0xc820021108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  7982] <... select resumed> )      = 0 (Timeout)
[pid  7977] recvmsg(5,  <unfinished ...>
[pid  7982] futex(0xc9cc10, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid  7974] <... select resumed> )      = 0 (Timeout)
[pid  7974] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)

So to fix this I can add a two-way synchronisation before sending the console fd. No need to add another pipe (which would just make the code even uglier).

cyphar · 2016-09-08T10:29:05Z

@mrunalp I've fixed it with this commit. PTAL.

commit 86f76c33d818d929be75afc9a1c12a8e374d42ae
Author: Aleksa Sarai <asarai@suse.de>
Date:   Thu Sep 8 20:27:09 2016 +1000

    [squashme] console: use two-stage synchronisation

    This ensures that when the syscalls are reordered in a suboptimal way,
    an out-of-place read() on the parentPipe will not gobble the ancilliary
    information.

    Signed-off-by: Aleksa Sarai <asarai@suse.de>

mrunalp · 2016-09-08T14:44:56Z

@cyphar Makes sense 👍 I'll try this out later today.

mrunalp · 2016-09-08T22:32:24Z

This works fine for me with runc run/runc exec 👏

cyphar · 2016-09-09T01:11:44Z

Awesome. Now I just need to fix the bats stuff, and properly formalise the API for sending the master pty to the --console-socket. Currently cgo is being annoying (C.CString is breaking things). 😄

mrunalp · 2016-09-09T01:32:05Z

I can help with testing that as I am writing code for that integration in a monitoring agent.

On Sep 8, 2016, at 6:11 PM, Aleksa Sarai notifications@github.com wrote:

Awesome. Now I just need to fix the bats stuff, and properly formalise the API for sending the master pty to the --console-socket. Currently cgo is being annoying (C.CString is breaking things). 😄

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

cyphar · 2016-09-09T02:06:09Z

If you can help me figure out why 9ae9463 is currently segfaulting, I''ll buy you a drink when we eventually meet f2f. :P

mrunalp · 2016-09-09T02:15:20Z

Can't get to it before tomorrow 🙁
Do you have the line where it is set faulting?

On Sep 8, 2016, at 7:06 PM, Aleksa Sarai notifications@github.com wrote:

If you can help me figure out why 9ae9463 is currently segfaulting, I''ll buy you a drink when we eventually meet f2f. :P

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

cyphar · 2016-09-09T03:20:09Z

gdb tells me it's fauling on this line:

if (!fdptr || *fdptr < 0)

Basically, it looks like fdptr is 0x10 (the address, not the dereferenced value). Which I don't understand. Maybe I'm misusing the API?

wking · 2016-09-09T03:58:16Z

libcontainer/utils/cmsg.c

+	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+
+	fdptr = (int *) CMSG_DATA(cmsg);
+	memcpy(fdptr, &file.fd, sizeof(int));


I think you want this to be:

*fdptr = file.fd;

Never mind, both of those are equivalent.

memcpy is the superior assignment method. :P

Switch to the actual source of the official Docker library of images, so that we have a proper source for the test filesystem. In addition, update to the latest released version (1.25.0 [2016-06-23]) so that we can use more up-to-date applets in our tests (such as stat(3)). This patch is part of the console rewrite patchset. Signed-off-by: Aleksa Sarai <asarai@suse.de>

This fixes all of the tests that were broken as part of the console rewrite. This includes fixing the integration tests that used TTY handling inside libcontainer, as well as the bats integration tests that needed to be rewritten to use recvtty (as they rely on detached containers that are running). This patch is part of the console rewrite patchset. Signed-off-by: Aleksa Sarai <asarai@suse.de>

Add some tests to ensure that we always get a proper console (created inside the container). This is done by checking that the /proc/self/fd/[012] "symlinks" are always referencing something in /dev/pts/*. This patch is part of the console rewrite patchset. Signed-off-by: Aleksa Sarai <asarai@suse.de>

cyphar · 2016-12-01T04:52:37Z

Rebased.

/cc @crosbymichael Can you please take a look at this so we can (finally) merge it?

caniszczyk · 2016-12-06T14:58:28Z

ping @crosbymichael :)

crosbymichael · 2016-12-07T22:36:25Z

LGTM

mrunalp · 2016-12-07T22:37:13Z

LGTM

cyphar · 2016-12-07T23:04:30Z

🎉 As discussed on the call, either @crosbymichael or myself will work on removing the --console-socket argument.

# Commands ## create The --bundle [start-pr-bundle] and --pid-file options and ID argument [runc-start-id] match runC's interface. One benefit of the early-exit 'create' is that the exit code does not conflate container process exits with "failed to setup the sandbox" exits. We can take advantage of that and use non-zero 'create' exits to allow stderr writing (so the runtime can log errors while dying without having to successfully connect to syslog or some such). Trevor still likes the long-running 'create' API because it makes collecting the exit code easier, see the entry under rejected-for-now avenues at the end of this commit message. ### --pid-file You can get the PID by calling 'state' [container-pid-from-state], and container PIDs may not be portable [container-pid-not-portable]. But it's a common way for interfacing with init systems like systemd [systemd-pid], and for this first pass at the command line API folks are ok with some Linux-centrism [linux-centric]. ### Document LISTEN_FDS for passing open file descriptors This landed in runC with [runc-listen-fds], but the bundle-author <-> runtime specs explicitly avoided talking about how this is set (since the bundle-author didn't care about the runtime-caller <-> runtime interface) [runtime-spec-caller-api-agnostic]. This commit steps away from that agnosticism. Trevor left LISTEN_PID [sd_listen_fds,listen-fds-description] out, since he doesn't see how the runtime-caller would choose anything other than 1 for its value. It seems like something that a process would have to set for itself (because guessing the PID of a child before spawning it seems racy ;). In any event, the runC implementation seems to set this to 1 regardless of what systemd passes to it [listen-fds-description]. We've borrowed Shishir's wording for the example [listen-fds-description]. ## state [state-pr] Partially catch up with opencontainers/runtime-spec@7117ede7 (Expand on the definition of our ops, 2015-10-13, opencontainers#225, v0.4.0). The state example is adapted from runtime.md, but we defer the actual specification of the JSON to that file. The encoding for the output JSON (and all standard-stream activity) is covered by the "Character encodings" section. In cases where the runtime ignores the SHOULD (still technically compliant), RFC 7159 makes encoding detection reasonably straightforward [rfc7159-s8.1]. The obsolete RFC 4627 has some hints as well [rfc4627-s3] (although these were dropped in RFC 7518 [rfc7518-aA], probably as a result of removing the constraint that "JSON text" be an object or array [rfc7518-aA]). The hints should still apply to the state output, because we know it will be an object. If that ends up being too dicey and we want to certify runtimes that do not respect their operating-system conventions, we can add an --encoding option later. ## kill Partially catch up with opencontainers/runtime-spec@be594153 (Split create and start, 2016-04-01, opencontainers#384). The interface is based on POSIX [posix-kill], util-linux [util-linux-kill], and GNU coreutils [coreutils-kill]. The TERM/KILL requirement is a minimum portability requirement for soft/hard stops. Windows lacks POSIX signals [windows-signals], and currently supports soft stops in Docker with whatever is behind hcsshim.ShutdownComputeSystem [docker-hcsshim]. The docs we're landing here explicitly allow that sort of substitution, because we need to have soft/hard stop on those platforms but *can't* use POSIX signals. They borrow wording from opencontainers/runtime-spec@35b0e9ee (config: Clarify MUST for platform.os and .arch, 2016-05-19, opencontainers#441) to recommend runtime authors document the alternative technology so bundle-authors can prepare (e.g. by installing the equivalent to a SIGTERM signal handler). # Command style Use imperative phrasing for command summaries, to follow the practice recommended by Python's PEP 257 [pep257-docstring]: The docstring is a phrase ending in a period. It prescribes the function or method's effect as a command ("Do this", "Return that"), not as a description; e.g. don't write "Returns the pathname ...". The commands have the following layout: ### {command name} {one-line description} * *Options:* ... ... * *Exit code:* ... {additional notes} #### Example {example} The four-space list indents follow opencontainers/runtime-spec@7795661 (runtime.md: Fix sub-bullet indentation, 2016-06-08, opencontainers#495). From [markdown-syntax]: List items may consist of multiple paragraphs. Each subsequent paragraph in a list item must be indented by either 4 spaces or one tab... Trevor expects that's intended to be read with "block element" instead of "paragraph", in which case it applies to nested lists too. And while GitHub supports two-space indents [github-lists]: You can create nested lists by indenting lines by two spaces. it seems that pandoc does not. # Versioning The command-line interface is largely orthogonal to the config format, and config authors and runtime callers may be entirely different sets of people. Zhang Wei called for more explicit versioning for the CLI [interface-versioning], and the approach taken here follows the approach taken by Python's email package [python-email-version]. Wedging multiple, independently versioned entities into a single repository can be awkward, but earlier proposals to put the CLI in its own repository [separate-repository-proposed] were unsuccessful because compliance testing requires both a CLI and a config specification [separate-repository-refused]. Trevor doesn't think that's a solid reason [separate-repository-refusal-rebutted], but discussion along that line stalled out, so the approach taken here is to keep both independently versioned entities in the same repository. # Global options This section is intended to allow runtimes to extend the command line API with additional options and commands as they see fit without interfering with the commands and options specified in this document. The last line in this section makes it explicit that any later specification (e.g. "MUST print the state JSON to its stdout") do not apply to cases where the caller has included an unspecified option or command (e.g. --format=protobuf). For extensive discussion on this point see [extensions-unspecified]. With regard to the statement "Command names MUST NOT start with hyphens", the rationale behind this decision is to distinguish unrecognized commands from unrecognized options [distinguish-unrecognized-commands] because we want to allow (but not require) runtimes to fail fast when faced with an unrecognized command [optional-fail-fast]. # Long options Use GNU-style long options to avoid ambiguous, one-character options in the spec, while still allowing the runtime to support one-character options with packing. We don't specify one-character options in this spec, because portable callers can use the long form, and not specifying short forms leaves runtimes free to assign those as they see fit. # Character encodings Punt to the operating system for character encodings. Without this, the character set for the state JSON or other command output seemed too ambiguous. Trevor wishes there were cleaner references for the {language}.{encoding} locales like en_US.UTF-8 and UTF-8. But [wikipedia-utf-8,wikipedia-posix-locale] seems too glib, and he can't find a more targetted UTF-8 link than just dropping folks into a Unicode chapter (which is what [wikipedia-utf-8] does): The Unicode Standard, Version 6.0, §3.9 D92, §3.10 D95 (2011) With the current v8.0 (2015-06-17), it's still §3.9 D92 and §3.10 D95. The TR35 link is for: In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding. and the POSIX §6.2 link is for: In other locales, the presence, meaning, and representation of any additional characters are locale-specific. # Standard streams The "MUST NOT attempt to read from its stdin" means a generic caller can safely exec the command with a closed or null stdin, and not have to worry about the command blocking or crashing because of that. The stdout spec for start/delete is more lenient, because runtimes are unlikely to change their behavior because they are unable to write to stdout. If this assumption proves troublesome, we may have to tighten it up later. Aleksa Sarai also raised concerns over the safety of potentially giving the container process access to terminal ioctl escapes [stdio-ioctl] and feels like the stdio file-descriptor pass-through is surprising [stdio-surprise]. # Console socket protocol Based on in-flight work by Aleksa in opencontainers/runc#1018, this commit makes the following choices: * SOCK_SEQPACKET instead of SOCK_STREAM, because this is a message-based protocol, so it seems more natural to use a message-oriented socket type. * A string 'type' field for all messages, so we can add additional message types in the future without breaking backwards compatibility (new console-socket servers will still support old clients). Aleksa favored splitting this identifier into an integer 'type' and 'version' fields [runc-socket-type-version], but I don't see the point if they're both opaque integers without internal structure. And I expect this protocol to be stable enough that it's not worth involving SemVer and its structured versioning. * Response messages, so the client can tell whether the request was received and processed successfully or not. That gives the client a way to bail out if, for example, the server does not support the 'terminal' message type. * Add a sub-package specs-go/socket. Even though there aren't many new types, these are fairly different from the rest of specs-go and that namespace was getting crowded. # Event triggers The "Callers MAY block..." wording is going to be hard to enforce, but with the runC model, clients rely on the command exits to trigger post-create and post-start activity. The longer the runtime hangs around after completing its action, the laggier those triggers will be. For an alternative event trigger approach, see the discussion of an 'event' command in the rejected-for-now avenues at the end of this commit message. # Lifecycle notes These aren't documented in the current runtime-spec, and may no longer be true. But they were true at one point, and informed the development of this specification. ## Process cleanup On IRC on 2015-09-15 (with PDT timestamps): 10:56 < crosbymichael> if the main process dies in the container, all other process are killed ... 10:58 < julz> crosbymichael: I'm assuming what you mean is you kill everything in the cgroup -> everything in the container dies? 10:58 < crosbymichael> julz: yes, that is how its implemented ... 10:59 < crosbymichael> julz: we actually freeze first, send the KILL, then unfreeze so we don't have races ## Container IDs for namespace joiners You can create a config that adds no isolation vs. the runtime namespace or completely joins another set of existing namespaces. It seems odd to call that a new "container", but the ID is really more of a process ID, and less of a container ID. The "container" phrasing is just a useful hint that there might be some isolation going on. And we're always creating a new "container process" with 'create'. # Other changes This commit also moves the file-descriptor docs from runtime-linux.md into runtime.md and the command-line docs. Both affect runtime authors, but: * The runtime.md entry is more useful for bundle authors than the old wording, because it gives them confidence that the runtime caller will have the power to set these up as they see fit (within POSIX's limits). It is also API-agnostic, so bundle authors know they won't have to worry about which API will be used to launch the container before deciding whether it is safe to rely on runtime-caller file-descriptor control. * The command line entry is more useful for runtime-callers than the old wording, because it tells you how to setup the file descriptors instead of just telling you that they MAY be setup. I moved the bundle-author language from runtime-linux.md to runtime.md because it's relying on POSIX primitives that aren't Linux-specific. # Avenues pursued but rejected (for now) * Early versions of this specification had 'start' taking '--config' and '--runtime', but this commit uses '--bundle' [start-pr-bundle]. The single config file change [single-config-proposal] went through, but Trevor would also like to be able to pipe a config into the 'funC start' command (e.g. via a /dev/fd/3 pseudo-filesystem path) [runc-config-via-stdin], and he has a working example that supports this without difficulty [ccon-config-via-stdin]. But since [runc-bundle-option] landed on 2015-11-16, runC has replaced their --config-file and --runtime-file flags with --bundle, and the current goal of this API is "keeping as much similarity with the existing runC command-line as possible", not "makes sense to Trevor" ;). It looks like runC was reacting [runc-required-config-file] to strict wording in the spec [runtime-spec-required-config-file], so we might be able to revisit this if/when we lift that restriction. * Having 'start' (now 'create') take a --state option to write state to a file [start-pr-state]. This is my preferred approach to sharing container state, since it punts a persistent state registry to higher-level tooling [punt-state-registry]. But runtime-spec currently requires the runtime to maintain such a registry [state-registry], and we don't need two ways to do that ;). On systems like Solaris, the kernel maintains a registry of container IDs directly, so they don't need an external registry [solaris-kernel-state]. * Having 'start' (now 'create') take an --id option instead of a required ID argument, and requiring the runtime to generate a unique ID if the option was not set. When there is a long-running host process waiting on the container process to perform cleanup, the runtime-caller may not need to know the container ID. However, runC has been requiring a user-specified ID since [runc-start-id], and this spec follows the early-exit 'create' from [runc-create-start], so we require one here. We can revisit this if we regain a long-running 'create' process. * Having 'create' take a '--console-socket PATH' option (required when process.terminal is true) with a path to a SOCK_SEQPACKET Unix socket for use with the console-socket protocol. The current 'LISTEN_FDS + 3' approach was proposed by Michael Crosby [console-socket-fd], but Trevor doesn't have a clear idea of the motivation for the change and would have preferred '--console-socket FD'. * Having a long-running 'create' process. Trevor is not a big fan of this early-exit 'create', which requires platform-specific magic to collect the container process's exit code. The ptrace idea in this commit is from Mrunal [mrunal-ptrace]. Trevor has a proposal for an 'event' operation [event] which would provide a convenient created trigger. With 'event' in place, we don't need the 'create' process exit to serve as that trigger, and could have a long-running 'create' that collects the container process exit code using the portable waitid() family. But the consensus after this week's meeting was to table that while we land docs for the runC API [mimic-runc]. * Having a 'version' command to make it easy for a caller to report which runtime they're using. But we don't have a use-case that makes it strictly necessary for interop, so we're leaving it out for now [no-version]. * Using 'sh' syntax highlighting [syntax-highlighting] for the fenced code blocks. The 'sh' keyword comes from [linguist-languages]. But the new fenced code blocks are shell sessions, not scripts, and we don't want shell-syntax highlighting in the command output. [ccon-config-via-stdin]: https://github.com/wking/ccon/tree/v0.4.0#configuration [console-socket-fd]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-10-19-21.00.log.html#l-30 [container-pid-from-state]: https://github.com/opencontainers/runtime-spec/pull/511/files#r70353376 Subject: Add initial pass at a cmd line spec [container-pid-not-portable]: opencontainers#459 Subject: [ Runtime ] Allow for excluding pid from state [coreutils-kill]: http://www.gnu.org/software/coreutils/manual/html_node/kill-invocation.html [distinguish-unrecognized-commands]: https://github.com/wking/oci-command-line-api/pull/8/files#r46898167 Subject: Clarity for commands vs global options [docker-hcsshim]: https://github.com/docker/docker/pull/16997/files#diff-5d0b72cccc4809455d52aadc62329817R230 moby/moby@bc503ca8 (Windows: [TP4] docker kill handling, 2015-10-12, moby/moby#16997) [event]: opencontainers#508 Subject: runtime: Add an 'event' operation for subscribing to pushes [extensions-unspecified]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/%23opencontainers.2016-07-27.log.html#t2016-07-27T16:37:56 [github-lists]: https://help.github.com/articles/basic-writing-and-formatting-syntax/#lists [interface-versioning]: opencontainers#513 (comment) [linguist-languages]: https://github.com/github/linguist/blob/master/lib/linguist/languages.yml [linux-centric]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-13-17.03.log.html#l-39 [listen-fds-description]: opencontainers/runc#231 (comment) Subject: Systemd integration with runc, for on-demand socket activation [markdown-syntax]: http://daringfireball.net/projects/markdown/syntax#list [mimic-runc]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-13-17.03.log.html#l-15 [mrunal-ptrace]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/%23opencontainers.2016-07-13.log.html#t2016-07-13T18:58:54 [no-version]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-20-21.03.log.html#l-75 [optional-fail-fast]: wking/oci-command-line-api@527f3c6#commitcomment-14835617 Subject: Use RFC 2119's keywords (MUST, MAY, ...) [pep257-docstring]: https://www.python.org/dev/peps/pep-0257/#one-line-docstrings [posix-kill]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/kill.html [punt-state-registry]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2015/opencontainers.2015-12-02-18.01.log.html#l-79 [python-email-version]: https://docs.python.org/3/library/email.html#package-history [rfc4627-s3]: https://tools.ietf.org/html/rfc4627#section-3 [rfc7158-aA]: https://tools.ietf.org/html/rfc7158#appendix-A RFC 7518 is currently identical to 7519. [rfc7159-s8.1]: https://tools.ietf.org/html/rfc7159#section-8.1 [runc-bundle-option]: opencontainers/runc#373 Subject: adding support for --bundle [runc-config-via-stdin]: opencontainers/runc#202 Subject: Can runc take its configuration on stdin? [runc-listen-fds]: opencontainers/runc#231 Subject: Systemd integration with runc, for on-demand socket activation [runc-required-config-file]: opencontainers/runc#310 (comment) Subject: specifying a spec file on cmd line? [runc-socket-type-version]: opencontainers/runc#1018 (comment) Subject: Consoles, consoles, consoles. [runc-start-id]: opencontainers/runc#541 opencontainers/runc@a7278cad (Require container id as arg1, 2016-02-08, opencontainers/runc#541) [runtime-spec-caller-api-agnostic]: opencontainers#113 (comment) Subject: Add fd section for linux container process [runtime-spec-required-config-file]: https://github.com/opencontainers/runtime-spec/pull/210/files#diff-8b310563f1c6f616aa98e6aeffc4d394R14 106ec2d (Cleanup bundle.md, 2015-10-02, opencontainers#210) [sd_listen_fds]: http://www.freedesktop.org/software/systemd/man/sd_listen_fds.html [separate-repository-proposed]: opencontainers#513 (comment) [separate-repository-refused]: opencontainers#513 (comment) [separate-repository-refusal-rebutted]: opencontainers#513 (comment) [single-config-proposal]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/0QbyJDM9fWY Subject: Single, unified config file (i.e. rolling back specs#88) Date: Wed, 4 Nov 2015 09:53:20 -0800 Message-ID: <20151104175320.GC24652@odin.tremily.us> [solaris-kernel-state]: wking/oci-command-line-api#3 (comment) Subject: Drop exec, pause, resume, and signal [start-pr-bundle]: wking/oci-command-line-api#11 Subject: start: Change --config and --runtime to --bundle [start-pr-state]: wking/oci-command-line-api#14 Subject: start: Add a --state option [state-pr]: wking/oci-command-line-api#16 Subject: runtime: Add a 'state' command [state-registry]: https://github.com/opencontainers/runtime-spec/pull/225/files#diff-b84a8d65d8ed53f4794cd2db7e8ea731R61 7117ede (Expand on the definition of our ops, 2015-10-13, opencontainers#225) [stdio-ioctl]: opencontainers#513 (comment) [stdio-surprise]: opencontainers#513 (comment) [syntax-highlighting]: https://help.github.com/articles/github-flavored-markdown/#syntax-highlighting [systemd-pid]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-20-21.03.log.html#l-69 [util-linux-kill]: http://man7.org/linux/man-pages/man1/kill.1.html [wikipedia-utf-8]: https://en.wikipedia.org/wiki/UTF-8 [wikipedia-posix-locale]: https://en.wikipedia.org/wiki/Locale#POSIX_platforms [windows-singals]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/PlGKu7QUwLE Subject: Fwd: Windows support for OCI stop/signal/kill (runtime-spec#356) Date: Thu, 26 May 2016 11:03:29 -0700 Message-ID: <20160526180329.GL17496@odin.tremily.us> Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com> Hopefully-Signed-off-by: Mike Brown <brownwm@us.ibm.com> Signed-off-by: W. Trevor King <wking@tremily.us> Reviewed-by: Jesse Butler <jeeves.butler@gmail.com>

Based on work by Aleksa in opencontainers/runc#1018, this commit makes the following choices: * SOCK_SEQPACKET instead of SOCK_STREAM, because this is a message-based protocol, so it seems more natural to use a message-oriented socket type. * A string 'type' field for all messages, so we can add additional message types in the future without breaking backwards compatibility (new console-socket servers will still support old clients). Aleksa favored splitting this identifier into an integer 'type' and 'version' fields [runc-socket-type-version], but I don't see the point if they're both opaque integers without internal structure. And I expect this protocol to be stable enough that it's not worth involving SemVer and its structured versioning. * Response messages, so the client can tell whether the request was received and processed successfully or not. That gives the client a way to bail out if, for example, the server does not support the 'terminal' message type. Signed-off-by: W. Trevor King <wking@tremily.us>

williammartin · 2017-03-27T14:28:46Z

@hqhq @cyphar Sorry to resurrect...did y'all ever come up with a solution for the weird terminal formatting mentioned in #1018 (comment) ?

These were added in 244c9fc (*: console rewrite, 2016-06-04, opencontainers#1018) alongside procConsole and the associated handling. procConsole and that handling were removed in 00a0ecf (Add separate console socket, 2017-03-02, opencontainers#1356), but 00a0ecf missed this comment. Signed-off-by: W. Trevor King <wking@tremily.us>

GordonTheTurtle added the status/0-triage label Sep 4, 2016

cyphar added this to the 1.0.0 milestone Sep 4, 2016

cyphar mentioned this pull request Sep 7, 2016

runc run stalls under strace #1025

Open

yifan-gu mentioned this pull request Sep 8, 2016

stage0: attach to the app's stdin/stdout/stderr rkt/rkt#1799

Closed

wking mentioned this pull request Sep 8, 2016

Add Runtime CLI Spec opencontainers/runtime-spec#513

Closed

wking reviewed Sep 9, 2016
View reviewed changes

cyphar added 3 commits December 1, 2016 15:49

cyphar mentioned this pull request Dec 6, 2016

conmon: implement logging cri-o/cri-o#162

Merged

mrunalp merged commit 34f23cb into opencontainers:master Dec 7, 2016

cyphar deleted the console-rewrite branch December 7, 2016 23:03

cyphar mentioned this pull request Dec 9, 2016

Add recvtty to install path #1220

Closed

cyphar mentioned this pull request Dec 18, 2016

Replace --console-socket argument #1234

Closed

sandyskies mentioned this pull request Dec 30, 2016

Docker exec and Orphan Process problem moby/moby#29700

Closed

tianon mentioned this pull request Jan 11, 2017

Tmux is broken in alpine:3.5 docker images gliderlabs/docker-alpine#247

Closed

AkihiroSuda mentioned this pull request Jan 12, 2017

Add simple quick-start for developers containerd/containerd#400

Merged

hqhq mentioned this pull request Feb 24, 2017

1.0.0-rc3 spec changed tty output while terminal = true #1345

Closed

cyphar mentioned this pull request Feb 27, 2017

tty: dup /dev/null over stdio for (detach && !terminal) #1349

Closed

cyphar mentioned this pull request May 3, 2017

WIP - ignore * when reading from /dev/console #1434

Closed

cyphar mentioned this pull request Feb 18, 2018

runc create hangs when passed a pipe for stdout or stderr #1721

Open

wking mentioned this pull request Feb 23, 2018

libcontainer/sync: Drop procConsole transaction from comments #1737

Merged

cyphar added the rootless-containers label Mar 17, 2018

kolyshkin mentioned this pull request Sep 16, 2020

runc run: fix panic on failed init start projectatomic/runc#33

Merged

kolyshkin mentioned this pull request Apr 3, 2023

libctr/cgroups: don't take init's cgroup into account #3784

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consoles, consoles, consoles. #1018

Consoles, consoles, consoles. #1018

cyphar commented Sep 4, 2016 •

edited

Loading

cyphar commented Sep 5, 2016 •

edited

Loading

cyphar commented Sep 6, 2016

mrunalp commented Sep 6, 2016

mrunalp commented Sep 6, 2016

cyphar commented Sep 7, 2016 •

edited

Loading

wking commented Sep 7, 2016

cyphar commented Sep 7, 2016 •

edited

Loading

cyphar commented Sep 7, 2016 •

edited

Loading

cyphar commented Sep 7, 2016 •

edited

Loading

cyphar commented Sep 7, 2016 •

edited

Loading

mrunalp commented Sep 7, 2016

mrunalp commented Sep 7, 2016

cyphar commented Sep 8, 2016

cyphar commented Sep 8, 2016

mrunalp commented Sep 8, 2016

mrunalp commented Sep 8, 2016

cyphar commented Sep 9, 2016

mrunalp commented Sep 9, 2016

cyphar commented Sep 9, 2016

mrunalp commented Sep 9, 2016

cyphar commented Sep 9, 2016

wking Sep 9, 2016

wking Sep 9, 2016

cyphar Sep 9, 2016

cyphar commented Dec 1, 2016

caniszczyk commented Dec 6, 2016

crosbymichael commented Dec 7, 2016 •

edited by caniszczyk

Loading

mrunalp commented Dec 7, 2016 •

edited by caniszczyk

Loading

cyphar commented Dec 7, 2016

williammartin commented Mar 27, 2017

Consoles, consoles, consoles. #1018

Consoles, consoles, consoles. #1018

Conversation

cyphar commented Sep 4, 2016 • edited Loading

cyphar commented Sep 5, 2016 • edited Loading

cyphar commented Sep 6, 2016

mrunalp commented Sep 6, 2016

mrunalp commented Sep 6, 2016

cyphar commented Sep 7, 2016 • edited Loading

wking commented Sep 7, 2016

cyphar commented Sep 7, 2016 • edited Loading

cyphar commented Sep 7, 2016 • edited Loading

cyphar commented Sep 7, 2016 • edited Loading

cyphar commented Sep 7, 2016 • edited Loading

mrunalp commented Sep 7, 2016

mrunalp commented Sep 7, 2016

cyphar commented Sep 8, 2016

cyphar commented Sep 8, 2016

mrunalp commented Sep 8, 2016

mrunalp commented Sep 8, 2016

cyphar commented Sep 9, 2016

mrunalp commented Sep 9, 2016

cyphar commented Sep 9, 2016

mrunalp commented Sep 9, 2016

cyphar commented Sep 9, 2016

wking Sep 9, 2016

Choose a reason for hiding this comment

wking Sep 9, 2016

Choose a reason for hiding this comment

cyphar Sep 9, 2016

Choose a reason for hiding this comment

cyphar commented Dec 1, 2016

caniszczyk commented Dec 6, 2016

crosbymichael commented Dec 7, 2016 • edited by caniszczyk Loading

mrunalp commented Dec 7, 2016 • edited by caniszczyk Loading

cyphar commented Dec 7, 2016

williammartin commented Mar 27, 2017

cyphar commented Sep 4, 2016 •

edited

Loading

cyphar commented Sep 5, 2016 •

edited

Loading

cyphar commented Sep 7, 2016 •

edited

Loading

cyphar commented Sep 7, 2016 •

edited

Loading

cyphar commented Sep 7, 2016 •

edited

Loading

cyphar commented Sep 7, 2016 •

edited

Loading

cyphar commented Sep 7, 2016 •

edited

Loading

crosbymichael commented Dec 7, 2016 •

edited by caniszczyk

Loading

mrunalp commented Dec 7, 2016 •

edited by caniszczyk

Loading