Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsenter: correctly handle pidns orphaning #976

Closed
wants to merge 7 commits into from
Closed

nsenter: correctly handle pidns orphaning #976

wants to merge 7 commits into from

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Aug 8, 2016

The main change here is to make sure that new processes that join a container are correctly reparented. Read this LWN article for more information about what the issue is and how the solution needs to be structured.

TODO:

  • Double-fork inside stage 2 to get reparented to the container's init (if we're an exec process).
    • This is broken because we currently send the PID of the wrong process in stage 1.
  • runc exec doesn't exit after the subprocess exits because of this feature. We'll need to think up a new way of handling this...

Fixes #971

Signed-off-by: Aleksa Sarai asarai@suse.de

Note: This is based on #950, #977 and #975.

This avoids us from running into cases where libcontainer thinks that a
particular namespace file is a different type, and makes it a fatal
error rather than causing broken functionality.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should *always* be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right priviliges within them. This also is very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.

This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.

In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more than we can do.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
In general, it is a bad idea to be unmapped inside a user namespace at
any point (especially when euid=[kuid 0]) as it can lead to security
vulnerabilities. Also, in certain SELinux setups you must also be mapped
in your user namespace when unsharing other namespaces.

Deal with all of this by parsing the {uid,gid}maps and then setresuid(2)
to the right user before and after the critical unshare(CLONE_NEWUSER)
(as well as dealing with setns(2) by changing user to the owner of the
namespace file we're joining).

Fixes: CVE-2015-8709
Reported-by: Andrey Vagin <avagin@virtuozzo.com>
Reported-by: Mrunal Patel <mpatel@redhat.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar cyphar added this to the 1.0.0 milestone Oct 4, 2016
@cyphar
Copy link
Member Author

cyphar commented Oct 4, 2016

@opencontainers/runc-maintainers This sort of works. However, because of the new feature (that exec processes are no longer children of the runC process) there's a lot of issues with waitProcess in the libcontainer integration tests that I can't quite crack. I've managed to sort-of fix the issues with runC (but it requires some not-very nice hacks where I call os.Exit from a goroutine).

Is there a nicer way of handling the exit of a child exec'd process other than just getting an EIO from the stdio of the process and then exiting blindly?

Similar to the already existing proc functions, these just get different
fields from the relevant place in /proc/<pid>/stat. This is part of the
nsenter rewrite patchset (namely the pidns-orphaning part).

A nice extension to this would be to use inotify and make
libcontainer/system/proc.go entirely chan based.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Due to how parent processes are treated in relation to PID namespaces
and zombie reaping[1], it is necessary for attaching (setns) processes
to double-fork inside the PID namespace to orphan themselves. In
addition, this also changes the PID passing code to use the PID
namespace translation features of SCM_CREDENTIALS.

[1]: https://lwn.net/Articles/532748/

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Because we are no longer the parents of exec'd processes, we need to
handle exiting and such things quite differently. There are two main
cases we need to deal with:

1. Now that the exec'd process is no longer a child of runC, we cannot
   wait4 the process. This means that we lose some information (the exit
   code, and signals when it dies), which means that we have to
   explicitly handle the death by polling /proc/<pid>/stat.

2. In addition, because of how the above solution works, we also have to
   include a wait4 goroutine to clean up all of the nsenter processes
   which would be cleaned up by a simple process.Wait() but are no
   longer that simple in the new setup.

All in all, I'm very sorry. I hope that people debugging this in the
future will forgive our hubris.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
This ensures that we don't regress on the current setup, where we are
correctly setting up processes using `runc exec` such that the process'
parent is correctly reset to PID 1 inside the container.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar
Copy link
Member Author

cyphar commented Oct 12, 2016

Alright, I've fixed all of the issues through some fairly nasty code in 247d355 ([wip] libcontainer: handle exit of exec'd processes). Unfortunately, I don't think there's a clean way of fixing it -- we won't ever get a SIGCHLD so we just have to parse everything and poll.

EDIT: Scract the "all the issues" part. Looks like some other things are broken now because of my wait4 hack...

}

parts := strings.Split(string(data), " ")
// the state field is located in pos 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PID of the parent filed is located in pos4. so s/state/parent pid/.

Copy link
Member Author

@cyphar cyphar Oct 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but note that this PR is in quite a lot of flux (it doesn't really work at the moment and I'm trying to figure out why the hell processes are so hard on Linux). So I'd recommend holding off on spending a lot of time reviewing this while I'm still writing this code. :D

return parts[3-1], nil // starts at 1
}

// GetProcessParent reads reads /proc/<pid>/stat to determine the parent of a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two reads words, should delete one.

@@ -548,8 +584,13 @@ void nsexec(void)
bail("missing cloneflags");

/* Pipe so we can tell the child when we've finished setting up. */
if (socketpair(AF_LOCAL, SOCK_STREAM, 0, syncpipe) < 0)
if (socketpair(AF_LOCAL, SOCK_STREAM, 0, parentpipe) < 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for function sendpid and recvpid using AF_UNIX. Although AF_LOCAL and AF_UNIX are equal, I think it should be consistent.

@sandyskies
Copy link

@cyphar I merge ur fixed with runc of docker 1.12.5 but it does't work . An error was reported:
docker run -ti 427deb629ee5 bash
docker: Error response from daemon: containerd: container not started.

@cyphar
Copy link
Member Author

cyphar commented Jan 23, 2017

I'll be honest, I'm not sure that this code can ever be completely sane. I'll need to think about this some more, and I've been very busy with other things recently. The big issue is that the polling system for /proc/[pid]/stat is just not a very good solution for tracking the death of a process, and it's just not fit-for-purpose if we want to get exit codes...

@dqminh
Copy link
Contributor

dqminh commented Aug 1, 2017

@cyphar is this still relevant ?

@cyphar
Copy link
Member Author

cyphar commented Aug 1, 2017

@dqminh I believe it's still relevant (we still have pidns orphaning problems) but I think that the exit-code problem is not really resolvable (and polling /proc/$pid/stat is a horrible way of not completely solving the problem as well).

@cyphar
Copy link
Member Author

cyphar commented Oct 10, 2017

This might be doable with cn_proc but this requires more kernel work to be possible as an unprivileged user. http://netsplit.com/the-proc-connector-and-socket-filters

@cyphar
Copy link
Member Author

cyphar commented Nov 14, 2018

We need more kernel work for this.

@AkihiroSuda
Copy link
Member

What's current status?

@cyphar cyphar closed this May 20, 2021
@cyphar cyphar deleted the nsenter-pidns-orphaning branch May 20, 2021 04:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Joining existing pid ns not reparenting process
6 participants