--pid=container:<id> does not reparent zombies to pid 1 #25348

Open
thockin opened this issue Aug 2, 2016 · 15 comments
Labels
area/runtime kind/bug version/1.12

Comments

@thockin
Contributor

thockin commented Aug 2, 2016

Trying to use shared PID namespaces, we observe that zombie processes are not being reparented to pid 1.

From a simple docker run, it works:

# docker run -ti busybox sh
/ #

/ # echo $$
1

/ # grep PPid /proc/1/status
PPid:   0

/ # sh -c 'echo shell 1 is $$; grep PPid /proc/$$/status; sh -c "echo shell 2 is \$\$; grep PPid /proc/\$\$/status; sleep 9376 & echo \$\$ exiting"; echo $$ exiting'
shell 1 is 6
PPid:   1
shell 2 is 8
PPid:   6
8 exiting
6 exiting

/ #  ps auxw | grep 9376
   10 root       0:00 sleep 9376
   12 root       0:00 grep 9376

/ # grep PPid /proc/10/status
PPid:   1

This is a pretty clear success. If I do this same experiment from a container that joins another PID ns, though:

# docker run -d busybox sleep 10000
c641a5e7eb6839719113dcfaeb80bd3b257b4f159553a85638b826134941787f

# docker run -ti --pid container:c641a5e7e busybox sh
/ #

/ # echo $$
5

/ # grep PPid /proc/1/status
PPid:   0

/ # sh -c 'echo shell 1 is $$; grep PPid /proc/$$/status; sh -c "echo shell 2 is \$\$; grep PPid /proc/\$\$/status; sleep 9376 & echo \$\$ exiting"; echo $$ exiting'
shell 1 is 10
PPid:   5
shell 2 is 12
PPid:   10
12 exiting
10 exiting

/ # ps auxw | grep 9376
   14 root       0:00 sleep 9376
   16 root       0:00 grep 9376

/ # grep PPid /proc/14/status 
PPid:   0

This means that any such container is not managed by the in-container init, but by the host's init. Is this intentional or a bug?

@dims

Output of docker version:

# docker version
Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:11:10 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:11:10 2016
 OS/Arch:      linux/amd64

Output of docker info:

# docker info
Containers: 4
 Running: 2
 Paused: 0
 Stopped: 2
Images: 1
Server Version: 1.12.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 9
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-31-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.613 GiB
Name: docker-current
ID: K6EC:G3VW:TPJ2:VHLT:3HBA:YEKW:PNYV:5Z5X:WJF4:EWSD:FC7B:SOBW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8
@justincormack
Contributor

I think this is inevitable, as only pid 1 is special in terms of parent, while the process you create really does have a parent outside the namespace. You might be able to fix this in the kernel. Would it help if there was an extra fork here, so the parent was inside the namespace but immediately reaped?

@thockin
Contributor Author

thockin commented Aug 3, 2016

I guess I had assumed we had found how to do the re-parenting, or that PID namespaces did something sane here. If this is just how PID namespaces work (and I am not inclined to go trawling kernel code for this :)), then we should document the behavior. Much noise has been made of "the pid 1 problem", but this totally avoids it (in this case).
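
For context, "the pid 1 problem" referenced here is that whatever runs as pid 1 of a PID namespace inherits orphaned descendants and must reap them, or they linger as zombies. A minimal reaping init, shown purely as an illustrative sketch (not anything Docker ships), is essentially a waitpid() loop:

/* Illustrative only: a minimal pid-1 reaper. A real init would also
 * spawn and supervise the workload; this just reaps whatever children
 * get re-parented to it. */
#include <errno.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);   /* block until any child exits */
        if (pid > 0)
            printf("reaped pid %d, status %d\n", (int)pid, status);
        else if (pid < 0 && errno == ECHILD)
            sleep(1);                          /* no children yet; try again */
    }
}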

@crosbymichael
Contributor

I see the bug here. It's a bug with how runc handles joining an existing pid ns: it's not doing a double fork to reparent the processes once it is inside the pid ns, so the parent stays outside the container. I'll open an issue on runc and link this.
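
To make the double fork concrete, here is a minimal standalone C sketch of the pattern being described (not runc's actual code; the /proc/<pid>/ns/pid argument and the reaping at the end are illustrative assumptions): after setns() into the container's PID namespace, an intermediate child forks the real workload and exits immediately, so the workload is re-parented to pid 1 inside the namespace instead of to a process on the host.

/* Rough sketch only (not runc's implementation): join an existing PID
 * namespace and double-fork so the workload's parent dies inside the
 * namespace, letting that namespace's pid 1 inherit it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s /proc/<container-init>/ns/pid cmd [args...]\n", argv[0]);
        return 1;
    }

    int nsfd = open(argv[1], O_RDONLY);
    if (nsfd < 0 || setns(nsfd, CLONE_NEWPID) < 0) {
        perror("setns");
        return 1;
    }

    /* First fork: this child is created inside the joined PID namespace,
     * but its parent (us) is still outside it. */
    pid_t child = fork();
    if (child == 0) {
        /* Second fork: the grandchild runs the workload. When the
         * intermediate child exits, the grandchild is orphaned and
         * re-parented to pid 1 of the container's namespace. */
        pid_t grandchild = fork();
        if (grandchild == 0) {
            execvp(argv[2], &argv[2]);
            perror("execvp");
            _exit(127);
        }
        _exit(0);  /* intermediate parent exits immediately */
    }

    /* Reap only the intermediate child. The workload's exit status can no
     * longer be collected with waitpid() from here. */
    waitpid(child, NULL, 0);
    return 0;
}

The cost of the extra fork is that the joining process can no longer waitpid() on the workload directly, so its exit code has to be recovered some other way; that trade-off comes up again later in this thread.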

@crosbymichael
Contributor

Also, we already do this double fork for the exec use case, so we should be able to replicate that functionality here.

@dims
Contributor

dims commented Aug 3, 2016

@crosbymichael Excellent news. thanks!

@thaJeztah added the kind/bug label Aug 3, 2016
@thaJeztah added this to the 1.12.1 milestone Aug 3, 2016
@thockin
Contributor Author

thockin commented Aug 3, 2016

AFAICT exec is broken in the exact same way:

# docker run -d busybox sleep 10000
8be0bbfef17a95d224b75b0fe226456d9c8a7617f1240d12fc9b616385836270

# docker exec -ti 8be0bbfef1 sh
/ #

/ # echo $$
5

/ # grep PPid /proc/$$/status
PPid: 0

/ # sh -c 'echo shell 1 is $$; grep PPid /proc/$$/status; sh -c "echo shell 2 is \$\$; grep PPid /proc/\$\$/status; sleep 9376 & echo \$\$ exiting"; echo $$ exiting'
shell 1 is 10
PPid: 5
shell 2 is 12
PPid: 10
12 exiting
10 exiting

/ # ps auxw | grep 9376
   14 root       0:00 sleep 9376
   16 root       0:00 grep 9376

/ # grep PPid /proc/14/status
PPid: 0

FWIW, this (exec) works correctly in Docker 1.9.1 (I just re-verified). It is the exact same output as above, except the final PPid is 1.

@mrunalp
Contributor

mrunalp commented Aug 3, 2016

The behavior is expected. A double fork should fix it as @crosbymichael pointed out. See diagrams here https://lwn.net/Articles/532748/ that explain the behavior.
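
The relevant semantics from those diagrams, as a small illustrative C sketch (not Docker code): setns(fd, CLONE_NEWPID) leaves the caller in its original PID namespace and only affects children created afterwards, so the first process created inside the joined namespace has a parent the namespace cannot see, and getppid() from inside reports 0, consistent with the "PPid: 0" output above.

/* Rough sketch only: demonstrate that setns(CLONE_NEWPID) does not move
 * the caller; only children forked afterwards enter the target PID
 * namespace, and their parent remains outside it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /proc/<container-init>/ns/pid\n", argv[0]);
        return 1;
    }

    int nsfd = open(argv[1], O_RDONLY);
    if (nsfd < 0 || setns(nsfd, CLONE_NEWPID) < 0) {
        perror("setns");
        return 1;
    }

    printf("parent: still pid %d in its original namespace\n", (int)getpid());

    pid_t child = fork();
    if (child == 0) {
        /* Inside the joined namespace the parent is not visible, so the
         * kernel reports its pid as 0. */
        printf("child: pid %d in joined namespace, ppid %d\n",
               (int)getpid(), (int)getppid());
        _exit(0);
    }
    waitpid(child, NULL, 0);
    return 0;
}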

@thockin
Contributor Author

thockin commented Aug 3, 2016

I don't parse that. If it is expected, it is not a bug, but you then say we can "fix" it, meaning it is a bug. Which is it? :)

@mrunalp
Contributor

mrunalp commented Aug 3, 2016

@thockin Sorry, I meant that I see why it is behaving the way it is because of our implementation.
And, yes, it is a bug that should be fixed :)

@cyphar
Contributor

cyphar commented Aug 3, 2016

Yeah, I will include a fix for this in opencontainers/runc#950. Internally, both the shared namespace and exec cases are identical, so they're both equally broken.

@thockin
Contributor Author

thockin commented Aug 4, 2016

cool!

@thaJeztah
Member

ping @crosbymichael, is there any progress on this issue? Should this stay on the 1.12.2 milestone?

@cyphar
Contributor

cyphar commented Sep 16, 2016

opencontainers/runc#976 is where my fix for this issue is being worked on, but I'm currently swamped with the other nsenter fixes that are necessary to fix other issues with runC. In addition, there's the console rewrite which has been taking a lot of time to do as well.

@thaJeztah
Member

Thanks @cyphar, I think it's reasonable to assume this won't be fixed before a 1.12.x patch release; let me move it to 1.13 (hopefully)

@cyphar
Contributor

cyphar commented Oct 10, 2017

After a bunch of review, it looks like this might be possible with the very-rarely-used cn_proc connectors (the real limitation of implementing this feature is that it requires getting the exit code of the process). The downside is that it breaks container nesting, especially in user namespaces. I've been working on the kernel side of things, and I'm hoping it should be doable in the future. Without cn_proc working as an unprivileged user, this would break rootless containers.

http://netsplit.com/the-proc-connector-and-socket-filters
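
For reference, a rough illustrative sketch of the proc connector interface described in that article (not the proposed runc change): a netlink socket subscribed to process events can observe the exit code of a process it did not parent, but it needs CAP_NET_ADMIN in the initial user namespace, which is why this would break rootless containers as noted above.

/* Rough sketch only: subscribe to kernel process events via the proc
 * connector and print exit codes of processes we did not parent.
 * Requires CAP_NET_ADMIN. */
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int sk = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    if (sk < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_nl addr;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = CN_IDX_PROC;
    addr.nl_pid    = getpid();
    if (bind(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* Ask the kernel to start multicasting proc events to this socket. */
    struct {
        struct nlmsghdr hdr;
        struct cn_msg msg;
        enum proc_cn_mcast_op op;
    } __attribute__((packed)) req;
    memset(&req, 0, sizeof(req));
    req.hdr.nlmsg_len  = sizeof(req);
    req.hdr.nlmsg_type = NLMSG_DONE;
    req.msg.id.idx     = CN_IDX_PROC;
    req.msg.id.val     = CN_VAL_PROC;
    req.msg.len        = sizeof(enum proc_cn_mcast_op);
    req.op             = PROC_CN_MCAST_LISTEN;
    if (send(sk, &req, sizeof(req), 0) < 0) {
        perror("send");
        return 1;
    }

    /* Every process exit arrives as a PROC_EVENT_EXIT event, including
     * exits of processes this program is not the parent of. */
    for (;;) {
        char buf[4096];
        ssize_t n = recv(sk, buf, sizeof(buf), 0);
        if (n <= 0)
            break;
        struct nlmsghdr *nlh  = (struct nlmsghdr *)buf;
        struct cn_msg *cn     = NLMSG_DATA(nlh);
        struct proc_event *ev = (struct proc_event *)cn->data;
        if (ev->what == PROC_EVENT_EXIT)
            printf("pid %d exited, exit_code=%u\n",
                   ev->event_data.exit.process_pid,
                   ev->event_data.exit.exit_code);
    }
    return 0;
}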
