Thousands of sleeping threads due to restarting service task #2388
Possibly related: moby/libnetwork#1113
Some more information on this: a swarm cluster was created, and a broken nginx stack configuration was deployed to it.
We're also verifying this with the same stack running on a single node. This is looking pretty serious, since one broken configuration will kill your node within a few hours.
Some more info... This line shows the processes/threads accumulating; the other counts stay the same.
Ping @thaJeztah - sorry to bug you here, but usually you're the person who knows who can shed some light on such issues :) This is completely keeping us from progressing with swarm mode.
Can you add the output of…?
Also, it may be good to get a stack dump from the daemon; you can obtain that by sending SIGUSR1 to the daemon process.
Are you able to reproduce on a 17.09 daemon? Or perhaps better, 17.10: I see the patch for moby/libnetwork#1934 was included in 17.10, but not (yet) backported to 17.09; docker-archive/docker-ce#264 /cc @fcrisciani
Yes, I tried it. Of the two workers, one gets filled with threads (around 2,900 in 24h) while the other gets filled with logs (~7GB in 24h). Both machines are close to dying, just because of the one broken config from the initial example. Can you reproduce that?
@schmunk42 can you attach the goroutine stack trace? You just have to send the signal SIGUSR1 to the daemon process.
@fcrisciani This is the log generated by SIGUSR1. These are entries in…
Addendum: https://gist.github.com/schmunk42/edfa992128ce3c76004233e76b74ccbc (output from…)
@schmunk42 Found the issue; working on a fix. This is the PR that introduced the regression: moby/moby#34188. There is some logic in https://github.com/moby/moby/blob/master/daemon/cluster/executor/container/controller.go#L186 that expects the type ErrNoSuchNetwork in order to retry creating the network as managed. In that PR the error type was changed, so the issue started popping up. This currently affects all releases starting from 17.09. What you are seeing is a result of that failure: the containers remain in the Created state. That is something else that will have to be checked.
@fcrisciani Any updates on this one? Is a fix already available in a release?
@schmunk42 sorry for not following up; 17.12 contains the fix. Can you please verify on that release?
@fcrisciani I can confirm this is fixed. I tested it with 18.01. Thank you!
@schmunk42 no problem
Hi @fcrisciani, I have a problem which I think is quite similar to this one. Below are the messages in syslog. Do you think this is the same issue, or should I file another ticket?
Output of…
The network does exist at the time:
Thanks! EDIT1: updated log messages and network info.
@cuongnv23 the issue being discussed here was resolved in Docker 18.01, and you're running an older version than that (17.09); it's worth testing if you're still able to reproduce on a current version of Docker.
@thaJeztah in this comment moby/moby#35634 (comment) you said that this issue was backported to 17.09.1, and that's the version I'm using, so I thought it was worth mentioning here.
We have an application which consists of two containers (php, redis). While we have run it dozens of times with `docker/swarm`, we see some problems running it with swarm mode. Due to a misconfiguration we had some containers restarting over and over. We have no `restart:` policy, btw., but the php container is recreated every few seconds, tries to connect to a database, fails, and the setup process is stopped.
The issue we are facing is that on the node on which this is happening, we currently have over 3,700 sleeping threads, which won't go away even after removing everything with `docker service rm`.
Output of `ps -eLf`:
Output of `ps amux`:
Is this a bug or some option we are missing?
Engine version: 17.07.0-ce
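One note on the restart behavior described above: in swarm mode, tasks are restarted by default (`condition: any`) even when no `restart:` policy is set, which is why the php container kept being recreated. As a mitigation for the recreate loop itself, a stack file can tell swarm not to restart failed tasks. A minimal fragment (service name and image are placeholders, not from this issue):

```yaml
version: "3.3"
services:
  php:
    image: php:7-fpm
    deploy:
      restart_policy:
        # Default is "any"; "none" stops swarm from recreating a task that
        # exits with an error, so a broken service cannot flood the node.
        condition: none
```

This does not fix the thread leak, but it limits the damage a permanently failing service can do while the underlying bug is present.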