
Thousands of sleeping threads due to restarting service task #2388

Closed
schmunk42 opened this issue Sep 25, 2017 · 19 comments

@schmunk42

We have an application which consists of two containers (php, redis). While we have run it dozens of times on docker/swarm, we are seeing some problems running it with swarm mode.

Due to a misconfiguration we had some containers restarting over and over. We have no restart: policy btw., but the php container is recreated every few seconds: it tries to connect to a database, which fails, and the setup process is stopped.

The issue we are facing is that, on the node where this is happening, we currently have over 3700 sleeping threads, which won't go away even after removing everything with docker service rm.

Output of ps -eLf

[...]
root     15531     1 11907  0 3605 06:48 ?        00:00:01 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 12769  0 3605 06:50 ?        00:00:02 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 12783  0 3605 06:50 ?        00:00:04 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 12784  0 3605 06:50 ?        00:00:00 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 14874  0 3605 06:52 ?        00:00:01 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 15083  0 3605 06:52 ?        00:00:06 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 15216  0 3605 06:52 ?        00:00:10 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 15237  0 3605 06:52 ?        00:00:09 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 15942  0 3605 06:52 ?        00:00:06 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 17144  0 3605 06:54 ?        00:00:08 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
root     15531     1 17773  0 3605 06:56 ?        00:00:02 /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label node=1 --label provider=generic
[...]

Output of ps amux

[...]
root         -  0.0    -      -     - -        Ssl  05:29   0:07 -
root         -  0.0    -      -     - -        Ssl  05:31   0:03 -
root         -  0.1    -      -     - -        Ssl  05:31   0:10 -
root         -  0.0    -      -     - -        Ssl  05:34   0:01 -
root         -  0.0    -      -     - -        Ssl  05:34   0:07 -
root         -  0.0    -      -     - -        Ssl  05:35   0:04 -
root         -  0.2    -      -     - -        Ssl  05:35   0:19 -
root         -  0.0    -      -     - -        Ssl  05:35   0:03 -
root         -  0.2    -      -     - -        Ssl  05:39   0:18 -
root         -  0.1    -      -     - -        Ssl  05:39   0:09 -
root         -  0.0    -      -     - -        Ssl  05:40   0:00 -
root         -  0.0    -      -     - -        Ssl  05:40   0:04 -
root         -  0.0    -      -     - -        Ssl  05:40   0:00 -
root         -  0.0    -      -     - -        Ssl  05:40   0:00 -
root         -  0.1    -      -     - -        Ssl  05:46   0:07 -
root         -  0.0    -      -     - -        Ssl  05:46   0:06 -
[...]
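
To track how fast the count grows, something like this can be used (a rough sketch; it assumes a single dockerd process on the node):

    # Thread count as reported by the kernel for the dockerd process
    grep Threads /proc/$(pidof dockerd)/status

    # Or count dockerd threads with ps, refreshing every minute
    watch -n 60 'ps -eLf | grep -c "[d]ockerd"'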

Is this a bug, or is there some option we are missing? Engine version: 17.07.0-ce

@schmunk42
Author

Possibly related: moby/libnetwork#1113
CC: @handcode

@schmunk42
Author

schmunk42 commented Sep 26, 2017

Some more information on this...

A: Swarm-cluster created
B: Broken application stack created (restarting due to config errors)
C: Broken application stack removed; broken nginx (see below) stack created for testing
D: Engine restart (due to too many threads?)
E: Recreated swarm-cluster and deployed broken-nginx stack on vanilla setup to verify the issue

image

Detailed view

image-1

Broken nginx stack configuration, deployed with docker stack deploy

[edit]
Note: The host-volume is intentional to simulate a broken config.

version: '3'
services:
  nginx:
    image: nginx
    depends_on:
      - redis
    volumes:
      - /tmp:/etc/nginx
  redis:
    image: redis

We also verified this with the same stack running on a single node with docker-compose, but with additional restart options to simulate a similar behavior; it does not seem to happen there (threads stay at a "reasonable" number).

This is looking pretty serious, since one broken configuration will kill your node within a few hours.

@schmunk42
Author

Some more info...

This is what pstree looks like:

systemd-+-agetty
        |-auditd---{auditd}
        |-cron
        |-dbus-daemon
        |-dhclient
        |-dockerd-+-docker-containe-+-docker-containe-+-redis-server---3*[{redis-server}]
        |         |                 |                 `-9*[{docker-containe}]
        |         |                 |-2*[docker-containe-+-redis-server---3*[{redis-server}]]
        |         |                 |                    `-8*[{docker-containe}]]
        |         |                 `-16*[{docker-containe}]
        |         `-128*[{dockerd}]
        |-fail2ban-server---2*[{fail2ban-server}]
        |-packagekitd-+-{gdbus}
        |             `-{gmain}
        |-polkitd-+-{gdbus}
        |         `-{gmain}
        |-rsyslogd-+-{in:imklog}
        |          |-{in:imuxsock}
        |          `-{rs:main Q:Reg}
        |-sshd-+-sshd---sshd---bash---sudo---bash---watch---watch---sh---pstree
        |      `-sshd---sshd---bash---top
        |-systemd---(sd-pam)
        |-systemd-journal
        |-systemd-logind
        |-systemd-timesyn---{sd-resolve}
        `-systemd-udevd

This line shows the threads accumulating:

    |         `-128*[{dockerd}]

The other counts stay the same.

@schmunk42
Author

Ping @thaJeztah - sorry to bug you here, but usually you're the person who knows who can shed some light on such issues :) This is completely blocking us from progressing with swarm mode.

@thaJeztah
Member

Can you add the output of docker version and docker info as well, to give some context about the version that's running, platform, and configuration? Other than the flags set on the daemon (shown in the top comment), do you have any configuration options set (such as a specific logging driver, etc.)?

@thaJeztah
Member

Also, it may be good to get a stack dump from the daemon; you can obtain that by sending dockerd a SIGUSR1 signal, which will produce a stack dump in the daemon logs (or mention the file that the dump was saved to).
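
For example (assuming dockerd runs under systemd as docker.service, so the dump ends up in the journal or the log mentions the file it was written to):

    # Ask the daemon for a goroutine stack dump
    kill -SIGUSR1 $(pidof dockerd)
    # Then look at the end of the daemon logs
    journalctl -u docker.service -n 100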

@schmunk42
Author

schmunk42 commented Oct 18, 2017

System

  • Docker Swarm (swarm mode)
  • 2 nodes in Debian 9 in VirtualBox VM
  • Host is Debian 8.7

docker version

Client:
 Version:      17.09.0-ce-rc3
 API version:  1.31 (downgraded from 1.32)
 Go version:   go1.8.3
 Git commit:   2357fb2
 Built:        Thu Sep 21 02:30:17 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.07.0-ce
 API version:  1.31 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   8784753
 Built:        Tue Aug 29 17:41:44 2017
 OS/Arch:      linux/amd64
 Experimental: false

docker info

Containers: 5
 Running: 1
 Paused: 0
 Stopped: 4
Images: 3
Server Version: 17.07.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 22
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: mag1ux9hbpu4ht3m7hw9y2k82
 Is Manager: true
 ClusterID: n8udy3doioh76n0nahd0f7n21
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 5.9.132.49
 Manager Addresses:
  5.9.132.49:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-3-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 996.5MiB
Name: 49.hz-3.ci20.de
ID: TK43:Z7UR:7KCT:OGQQ:ZSQJ:NDVC:SBAM:JEG7:H2WZ:TUDT:PNW6:P54Z
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 master=1
 provider=generic
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

syslog, daemon.log

Last lines of the logs - there are literally tens of thousands of lines like:

Oct 18 06:41:15 49 dockerd[16012]: time="2017-10-18T06:41:15.413061421Z" level=warning msg="rmServiceBinding handleEpTableEvent a1000_nginx c0b470da70bdb19458b71739ad8ba25b1b2138ce29d3cb60a79f54e86eeb2b2d aborted c.serviceBindings[skey] !ok"

or

Oct 18 07:30:15 49 dockerd[16012]: time="2017-10-18T07:30:15.794025337Z" level=error msg="failed removing service binding for ff251e3002c4d142200bb9d5d777c007c3108e032dc4bd085aca34ee90c117c6 epRec:{a1000_nginx.1.8mg9zapesvy3pjq5qudl9h0jp a1000_nginx pdtkw7qm533i785xpxsortwcl 10.0.0.2 10.0.0.3 [] [nginx] [a39bb7ea080b]} err:network y2n9z1cvzyvnu7dtmc7k8uehh not found"
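
A rough way to count how many of these have piled up (a sketch; the log path is the one on this Debian host):

    grep -c 'rmServiceBinding handleEpTableEvent' /var/log/daemon.log
    grep -c 'failed removing service binding' /var/log/daemon.log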

Possibly related:

@thaJeztah
Member

Are you able to reproduce on a 17.09 daemon? Or, perhaps better, 17.10, as I see the patch for moby/libnetwork#1934 was included in 17.10, but not (yet) backported to 17.09; docker-archive/docker-ce#264

/cc @fcrisciani

@schmunk42
Author

Yes, I tried it with 17.10 (upgraded all nodes); see this gist for some logs.

Of the two workers, one gets filled with threads (around 2900 in 24h), while the other gets filled with logs (~7GB in 24h). Both machines are close to dying, just because of the one broken config from the initial example.

Can you reproduce that?

@fcrisciani

@schmunk42 can you attach the goroutine stack trace? You just have to send the SIGUSR1 signal to dockerd.

@schmunk42
Author

schmunk42 commented Nov 28, 2017

@fcrisciani This is the log generated by SIGUSR1
https://gist.github.com/schmunk42/2c131c972f95f74ba96cdd38aca34ded

These are entries in daemon.log that pop up every few seconds without any trigger, just from running the broken stack:

Nov 28 12:54:57 60 systemd-udevd[8706]: Could not generate persistent MAC address for vx-001003-txlxs: No such file or directory
Nov 28 12:54:57 60 systemd-udevd[8723]: Could not generate persistent MAC address for veth65e1074: No such file or directory
Nov 28 12:54:57 60 systemd-udevd[8727]: Could not generate persistent MAC address for veth9b931b0: No such file or directory
Nov 28 12:54:57 60 systemd-udevd[8741]: Could not generate persistent MAC address for veth03f4d2e: No such file or directory
Nov 28 12:54:57 60 systemd-udevd[8744]: Could not generate persistent MAC address for veth7240c7b: No such file or directory
Nov 28 12:54:58 60 systemd-udevd[8873]: Could not generate persistent MAC address for vx-001003-txlxs: No such file or directory
Nov 28 12:54:58 60 systemd-udevd[8880]: Could not generate persistent MAC address for veth9b931b0: No such file or directory
Nov 28 12:54:58 60 dockerd[11717]: time="2017-11-28T12:54:58.477791086Z" level=error msg="fatal task error" error="task: non-zero exit (1)" module=node/agent/taskmanager node.id=joew8i91uii2jxxrg4rf5huvc service.id=yc2k69nxb0e93klatluyo68fm task.id=ehw9fz9p90mzscnanw7x35k78
Nov 28 12:55:10 60 systemd-udevd[8956]: Could not generate persistent MAC address for vx-001003-txlxs: No such file or directory
Nov 28 12:55:10 60 systemd-udevd[8972]: Could not generate persistent MAC address for veth3664dfc: No such file or directory
Nov 28 12:55:10 60 systemd-udevd[8975]: Could not generate persistent MAC address for veth78ca53a: No such file or directory
Nov 28 12:55:10 60 systemd-udevd[8990]: Could not generate persistent MAC address for vetha9bbb84: No such file or directory
Nov 28 12:55:10 60 systemd-udevd[8993]: Could not generate persistent MAC address for veth7c1b114: No such file or directory
Nov 28 12:55:11 60 systemd-udevd[9122]: Could not generate persistent MAC address for veth78ca53a: No such file or directory
Nov 28 12:55:11 60 dockerd[11717]: time="2017-11-28T12:55:11.469510925Z" level=error msg="fatal task error" error="task: non-zero exit (1)" module=node/agent/taskmanager node.id=joew8i91uii2jxxrg4rf5huvc service.id=yc2k69nxb0e93klatluyo68fm task.id=e4t55oeef4xgxdhpc6luuuk86
Nov 28 12:55:12 60 dockerd[11717]: time="2017-11-28T12:55:12.617012600Z" level=info msg="Node join event for 389fba4fb358/XXX.XXX.114.243"
Nov 28 12:55:16 60 dockerd[11717]: time="2017-11-28T12:55:16.710447099Z" level=error msg="fatal task error" error="starting container failed: No such network: a1000_default" module=node/agent/taskmanager node.id=joew8i91uii2jxxrg4rf5huvc service.id=yc2k69nxb0e93klatluyo68fm task.id=has8nrdkx1n6s0vyx29vjtjvq
Nov 28 12:55:19 60 dockerd[11717]: time="2017-11-28T12:55:19.339634446Z" level=error msg="network a1000_default remove failed: No such network: a1000_default" module=node/agent node.id=joew8i91uii2jxxrg4rf5huvc
Nov 28 12:55:19 60 dockerd[11717]: time="2017-11-28T12:55:19.339734351Z" level=error msg="remove task failed" error="No such network: a1000_default" module=node/agent node.id=joew8i91uii2jxxrg4rf5huvc task.id=has8nrdkx1n6s0vyx29vjtjvq

Addon: https://gist.github.com/schmunk42/edfa992128ce3c76004233e76b74ccbc (output from docker ps -a on a node, which has over 1500 threads)
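
If useful, the leftover containers can also be counted directly (a sketch; it filters on the Created status, which is what the stuck containers end up in):

    docker ps -a --filter status=created --quiet | wc -l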

@fcrisciani

@schmunk42 I found the issue and am working on a fix; this is the PR that introduced the regression: moby/moby#34188
In particular the line: https://github.com/moby/moby/pull/34188/files#diff-470d751551967a09e51effadaf1d5cf4R59

There is some logic in https://github.com/moby/moby/blob/master/daemon/cluster/executor/container/controller.go#L186 that expects the type ErrNoSuchNetwork in order to retry creating the network as managed. In that PR the error type was changed, and so the issue started popping up. This currently affects all releases starting from 17.09.

What you are seeing is a result of that failure: the containers remain in the Created state. This is something else that will have to be checked.
^ @dperny
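
A quick way to check whether a node is hitting this regression is to look for the resulting task errors in the daemon log (a sketch; assumes the systemd unit is docker.service, a plain log file works the same way with grep):

    journalctl -u docker.service | grep -E 'starting container failed: No such network|remove task failed'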

@schmunk42
Author

@fcrisciani Any updates on this one? Is a fix already available in a release?

@fcrisciani

@schmunk42 sorry for not following up; 17.12 contains the fix. Can you please verify on that release?

@schmunk42
Author

@fcrisciani I can confirm this is fixed. I tested it with 18.01. Thank you!

@fcrisciani

@schmunk42 no problem

@cuongnv23

cuongnv23 commented Jun 11, 2018

Hi @fcrisciani, I have a problem which I think is quite similar to this one. Below are the messages from syslog; do you think this is the same issue, or should I file another ticket?

Jun 11 02:46:34 staging-swarm-02 dockerd: time="2018-06-11T02:46:34.974723636Z" level=error msg="failed removing service binding for e9c4423c7b01a1a53a75d94ed06330ca135084e5cd578c88e3d0f378d5f76feb epRec:{xxx-app.1.afr4cka0a9om9ylpjvg2nnn1h xxx-app i05yzw95ck9sm0p2zcsis5qp8 10.0.0.112 10.0.0.141 [] [] [c0cb37ab618f]} err:network qn6k76sgsbftqa8vbz9s0y7wk not found"
Jun 11 02:46:34 staging-swarm-02 dockerd: time="2018-06-11T02:46:34.974782441Z" level=error msg="failed removing service binding for ee3cf54b71ce997f04f1408bf7b182b0052e7eb6a9c52ee3cb760219199a0e5f epRec:{yyy-app.1.5nsg3tei2bg0nlw6gmjl9vue6 yyy-app ndyzvvs8wimyx4cvgmvhyehzb 10.0.0.84 10.0.0.57 [] [] [4a59eb18d2bc]} err:network qn6k76sgsbftqa8vbz9s0y7wk not found"
Jun 11 02:46:34 staging-swarm-02 dockerd: time="2018-06-11T02:46:34.974838080Z" level=error msg="failed removing service binding for f2b0e57d6bcf19d458f6a34d01d8db6e03a31dcd5957a8bc258361c2a772dd86 epRec:{zzz-app.1.kwbxh1kpo7ni3p4wj79v3qc05 zzz-app fcukban1lle8fiws0vimbh58b 10.0.0.31 10.0.0.138 [] [] [e14b2f90cde5]} err:network qn6k76sgsbftqa8vbz9s0y7wk not found"
Jun 11 02:46:35 staging-swarm-02 dockerd: time="2018-06-11T02:46:35.761660875Z" level=error msg="logs call failed" error="failed to publish log message: EOF" module="node/agent/taskmanager" node.id=qqt7nrh49j98fizne4i641cml
Jun 11 02:46:35 staging-swarm-02 dockerd: time="2018-06-11T02:46:35.761788154Z" level=error msg="logs call failed" error="failed to publish log message: EOF" module="node/agent/taskmanager" node.id=qqt7nrh49j98fizne4i641cml
Jun 11 02:46:36 staging-swarm-02 dockerd: time="2018-06-11T02:46:36.095228177Z" level=error msg="logs call failed" error="failed to publish log message: EOF" module="node/agent/taskmanager" node.id=qqt7nrh49j98fizne4i641cml
Jun 11 02:46:36 staging-swarm-02 dockerd: time="2018-06-11T02:46:36.095326252Z" level=error msg="logs call failed" error="failed to publish log message: EOF" module="node/agent/taskmanager" node.id=qqt7nrh49j98fizne4i641cml

output of docker info:

docker info
Containers: 138
 Running: 4
 Paused: 0
 Stopped: 134
Images: 65
Server Version: 17.09.1-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: qqt7nrh49j98fizne4i641cml
 Is Manager: false
 Node Address: 10.136.57.134
 Manager Addresses:
  10.136.57.89:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.11.6.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.639GiB
Name: staging-swarm-02
ID: W7UX:N2GT:PKF2:62VT:TZZP:LKQR:Q5YF:CCCS:L4RQ:NMXH:I6BC:7TBN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

the network does exist at the time:

$ docker network ls | grep qn6k76
qn6k76sgsbft        appnet              overlay             swarm

thanks

EDIT1: updated log messages and network info

@thaJeztah
Member

@cuongnv23 the issue being discussed here was resolved in Docker 18.01, and you're running an older version than that (17.09); it's worth testing if you're still able to reproduce on a current version of Docker.

@cuongnv23

@thaJeztah in this comment, moby/moby#35634 (comment), you said that this was backported to 17.09.1, and that's the version I'm using, so I thought it was worth mentioning here.
