Docker daemon threads can become affixed to a container's namespace, causing network errors later when the container is stopped #1113
Comments
Thanks @alindeman @lstoll for the research and a very precise bug report ❤️
I spent some time thinking about this and poking around the Go runtime, and it seems like this is unavoidable; there are no controls exposed to manage OS thread creation and management. Judging by things like golang/go#1435, the prevailing attitude is "functionality should be added to Go" as opposed to "more controls should be exposed for those using syscalls". Based on this, I think a potential solution could be passing channels to the function that is invoked inside the namespace switch. I'd be happy to give this a shot; I wanted to put it out there to see if anyone had other proposed solutions.
@lstoll @alindeman Thanks for the detailed analysis. As you observed, this is a fundamental limitation with Go because of the lack of control over OS thread creation and management. The libnetwork code is particularly exposed to this problem because we have to switch into the containers' net namespaces to do the networking plumbing. We do have a mechanism to protect against this, which helps in many cases; please see the usages of InitOSContext in libnetwork. I am trying out a solution using the reexec option Docker has (it basically spawns a new Docker process to do just some specific operations). This guarantees that the Docker daemon is completely isolated.
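For readers unfamiliar with reexec, here is a rough, illustrative sketch of that pattern. The handler name and the work it does are made up; only the `reexec.Register`/`reexec.Init`/`reexec.Command` helpers from `github.com/docker/docker/pkg/reexec` are real, and this is not the actual libnetwork change.

```go
// Minimal sketch of the reexec pattern: the namespace-sensitive work runs in a
// short-lived child process, so no thread in the main daemon ever calls setns.
package main

import (
	"fmt"
	"os"

	"github.com/docker/docker/pkg/reexec"
)

func init() {
	// Registered handlers run when the binary is re-executed under this name.
	reexec.Register("netns-setup", func() {
		// Hypothetical: join the container namespace and do the plumbing here.
		// Any OS threads created belong to this short-lived process only.
		fmt.Println("doing network setup in an isolated process")
		os.Exit(0)
	})
}

func main() {
	if reexec.Init() {
		// We were re-executed as "netns-setup"; the handler already ran.
		return
	}

	// Parent path: spawn the helper instead of switching namespaces in-process.
	cmd := reexec.Command("netns-setup")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "netns-setup failed:", err)
	}
}
```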
Setting up the network in a totally separate, ephemeral process sounds like a good solution to me 👍
Yep, complete isolation makes even more sense.
I think this can be closed; this will be fixed in 1.11.1 through moby/moby#22261 (which was just merged).
Related: golang/go#20676
@cooljit0725 #1115 fixes one specific case where this issue can happen. Since libnetwork switches into different namespaces, there could be some other trigger as well. If you have identified a specific sequence to recreate the issue, can you share it? Instead of reopening this, we should probably create a new issue with the details on how to recreate it.
@sanimej We are trying to find a way to reproduce it more easily and are also investigating why this is still happening; we will share our progress if we have any.
initialize defOsSbox in controller New(): This will initialize defOsSbox once the controller is set. It's a workaround to prevent the problems described in moby/libnetwork#1113 from happening. This issue carries the risk of creating defOsSbox on a non-host-namespace thread, in which case all containers created with --net=host won't work as expected, because defOsSbox has been bound to a container network namespace. Signed-off-by: Deng Guangxing <dengguangxing@huawei.com> See merge request docker/docker!617
Background
I'm able to consistently reproduce an issue where some operating system threads for the Docker daemon start in a network namespace for a container and stay in that namespace indefinitely. If or when the Go runtime decides to schedule code execution on one of these threads at a later point, that network namespace may be in a broken state, especially if the container has since been torn down. In many instances I've observed, the network namespace will no longer have any routes in its route table.
The symptoms can manifest as `docker` commands on the host (e.g., `docker pull` or `docker push`) erroring with `network is unreachable` or `unknown host`.

Requisite Version & System Information
Steps to Reproduce
I find that it's easiest to reproduce when the Docker daemon is started afresh, when the fewest operating system threads have been started by the Go runtime.

I also find that it's easiest to reproduce when starting a decent number of containers all at once. I threw together a `docker-compose.yml` file that starts 10 Redis containers. Start all of the containers at once with `docker-compose up -d`, then tear down the containers and the network with `docker-compose down`.
Navigate to `/proc/<docker pid>/task` and view the network namespace for each thread (i.e., read each task's `ns/net` symlink). If the bug has manifested, there will be a thread or two in a different namespace than the rest. In my example, thread 7044 is in a namespace previously used by one of the containers, but since the container has been killed and the network torn down, the namespace is no longer functional.
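To make that check easier to repeat, here is a small, hypothetical Go helper (not part of the original report) that prints the network namespace of every thread of a given PID by reading each task's `ns/net` symlink:

```go
// Hypothetical helper: print the network namespace of every OS thread of a process.
// Usage: go run nscheck.go <pid>
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: nscheck <pid>")
		os.Exit(1)
	}
	pid := os.Args[1]

	tasks, err := os.ReadDir(filepath.Join("/proc", pid, "task"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	for _, t := range tasks {
		// Each entry under /proc/<pid>/task is a thread (task) ID; ns/net is a
		// symlink naming the network namespace that thread currently occupies.
		link := filepath.Join("/proc", pid, "task", t.Name(), "ns", "net")
		target, err := os.Readlink(link)
		if err != nil {
			fmt.Printf("thread %s: %v\n", t.Name(), err)
			continue
		}
		fmt.Printf("thread %s -> %s\n", t.Name(), target)
	}
}
```

A thread whose `ns/net` target differs from the rest is a candidate for the "stuck" threads described above.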
The bug doesn't always manifest, as there is non-determinism involved (which I try to explain later on). Sometimes I need to restart the Docker daemon and try again to get it to manifest.
Because thread 7044 (or any thread like it) is running as part of the Docker daemon and on top of the Go runtime, the Go runtime is free to schedule code to execute on it. If it does (for instance, to handle a Remote API request to pull an image), that code will fail with a network error.
Analysis
I believe these rogue threads are created when the libnetwork code starts the embedded DNS server. I'll do my best to step through the code path I believe to be problematic.
When a new container in a non-default network is created, the `sandbox.SetKey` function is invoked to, among other duties, set up the DNS resolver. Specifically, the function returned by `sb.resolver.SetupFunc()` is invoked within the network namespace of the container in order to start the resolver.
libnetwork provides functions like `InvokeFunc` and `nsInvoke` to execute code in a given network namespace. `nsInvoke` invokes the function provided as its first argument just before switching namespaces, then switches namespaces, then invokes the function provided as its second argument (now within the new namespace), and finally switches back to the original namespace. Because the Go runtime could ordinarily preempt the code and start execution of a different piece of code on the same operating system thread, `nsInvoke` calls `runtime.LockOSThread()` to make sure the current goroutine is the only one that can use the operating system thread. This prevents other code from inadvertently executing in the network namespace while the operating system thread is switched over to the container namespace.

However, I've determined in my analysis that it is also not safe to start a new goroutine in the critical section where the namespace is switched. Starting a new goroutine in that region can prompt the Go runtime to start a new operating system thread (cloned from the current one, inheriting its network namespace) if the Go scheduler doesn't have any other available threads to run code on.
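To make the mechanism concrete, here is a simplified sketch of this pattern. The helper name and namespace paths are illustrative, it assumes `golang.org/x/sys/unix`, and it is not the actual `nsInvoke` implementation:

```go
// Simplified, illustrative sketch of the nsInvoke-style pattern (not the actual
// libnetwork code). It locks the goroutine to its OS thread, switches that thread
// into a target network namespace, runs a callback, and switches back.
package nsdemo

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// inNamespace runs fn with the current OS thread joined to the network namespace
// at nsPath (e.g. "/proc/<pid>/ns/net" or a bind-mounted sandbox key).
func inNamespace(nsPath string, fn func() error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember this thread's current network namespace so we can return to it.
	origPath := fmt.Sprintf("/proc/self/task/%d/ns/net", unix.Gettid())
	origFd, err := unix.Open(origPath, unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(origFd)

	targetFd, err := unix.Open(nsPath, unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(targetFd)

	if err := unix.Setns(targetFd, unix.CLONE_NEWNET); err != nil {
		return err
	}
	// DANGER ZONE: any goroutine started by fn (directly or indirectly, e.g. via
	// os/exec) may cause the runtime to clone a new OS thread from this one, and
	// that new thread inherits the container namespace and is never switched back.
	fnErr := fn()

	if err := unix.Setns(origFd, unix.CLONE_NEWNET); err != nil {
		return err
	}
	return fnErr
}
```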
And, unfortunately, there are pieces of code that (inadvertently?) start goroutines during the setup of the embedded DNS server. Specifically, the `resolver.SetupFunc` function invokes `iptables.RawCombinedOutputNative`. Following the call chain, we arrive at some code that invokes `exec.Command(...).CombinedOutput()`.

The `exec` package is implemented in Go and is pretty easy to read. The `CombinedOutput` method invokes the `Run` method, which in turn invokes the `Start` method. And within the `Start` method, several goroutines are spawned to monitor for errors (via the `errch` channel).

Unfortunately, these goroutines are spawned during the time when the network namespace is switched; and when a goroutine is spawned, the Go runtime is free to `clone(2)` a new operating system thread to run the goroutine if it doesn't have any other threads available. And when the Docker daemon is early in its lifetime, I believe this can happen easily.
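For illustration, here is roughly what the problematic combination looks like when expressed against the `inNamespace` sketch above. The iptables arguments are made up; only the `os/exec` behavior described in this report is relied on:

```go
// Illustrative only: running an external command while the locked thread is inside
// a container namespace. os/exec's Start method spawns goroutines to copy I/O and
// report errors; if the scheduler has no idle threads, it may clone(2) a new one
// from this thread, and that new thread inherits the container's net namespace.
package nsdemo

import "os/exec"

// setupIptables mimics what resolver setup effectively does: shell out to iptables
// from inside the container namespace. The call itself is fine; the hidden cost is
// the goroutines (and possibly threads) it causes the runtime to create.
func setupIptables() error {
	// CombinedOutput -> Run -> Start, which starts goroutines internally.
	_, err := exec.Command("iptables", "-t", "nat", "-n", "-L").CombinedOutput()
	return err
}

// Usage sketch:
//
//	err := inNamespace("/proc/1234/ns/net", setupIptables)
//
// Any OS thread cloned while setupIptables runs stays in the container namespace
// even after inNamespace switches the original thread back.
```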
To verify this was the issue, I ran `sysdig` to follow the syscalls that led up to a thread being spawned that I observed was 'stuck' in the container namespace.

First, the `setns` call is invoked to switch namespaces on a parent thread, 29917. The thread reports that it is invoking `iptables`. It does a typical clone/execve procedure to start `iptables`; that thread's image is replaced by `iptables`, so it is of no concern.

Next, however, a new operating system thread is created as part of the goroutine(s) spun up in the `exec.Start` method. This thread, 30077, is started as a regular course of the Go runtime. It's not used to `execve` another program: it's used to send to the `errch` channel in `exec.Start`. When the goroutine finishes, the Go runtime is free to use the operating system thread for another purpose. And because it was started while its parent thread had changed its namespace, it's more or less permanently stuck in the container namespace.

The main thread, 29917, goes on to `setns` itself back to the original namespace. All is well for thread 29917. But no such `setns` happens with thread 30077, because the Go runtime abstracts threads from the program. As far as I know, there's no way to even be aware that a new operating system thread has been spawned.

If and when thread 30077 is used to run other code as a normal course of Docker daemon operation, that code will run in the network context of the container. When and if the container and/or network is destroyed, the route table for that network namespace will also be destroyed, leaving it with only a loopback interface and an empty route table (shown above in the `nsenter` commands). At this point, any code running on thread 30077 will be unable to reach the local network or the Internet, so it will return errors for operations like `docker pull` or `docker push`.

Because of the nature of the Go runtime and scheduler, these errors will only occur sporadically, and only if the code is scheduled on an affected thread.
Thanks to @lstoll and @wfarr for helping me debug this issue.