This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Flux leaks file descriptors and runs out of file descriptors #1639

Closed
agcooke opened this issue Jan 8, 2019 · 11 comments

agcooke commented Jan 8, 2019

We had the weave-flux-agent running for a few days and noticed that it stopped syncing to Weave Cloud.

There was an error message as follows in the logs:

ts=2019-01-08T12:59:51.36780816Z caller=upstream.go:113 component=upstream err="executing websocket wss://cloud.weave.works./api/flux/v10/daemon: dial tcp: lookup cloud.weave.works. on 172.20.0.10:53: dial udp 172.20.0.10:53: socket: too many open files"

We configured our AWS EKS AL2 Nodes to have the following ulimits:

 "default-ulimits": {
   "nofile": {
     "Name": "nofile",
     "Soft": 2048,
     "Hard": 8192
   }
 }
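
For reference, a process can confirm which nofile limit it actually inherited from the container runtime. A minimal sketch in Go (assuming Linux; not part of Flux itself):

// limitcheck.go prints the soft and hard RLIMIT_NOFILE of the calling
// process; useful for verifying that the Docker daemon's default-ulimits
// actually reach the containers it starts.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// With the daemon.json settings above this should report soft=2048 hard=8192.
	fmt.Printf("nofile: soft=%d hard=%d\n", rl.Cur, rl.Max)
}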

On our node we had the following output:

sudo ls -l /proc/20190/fd/ | wc
  2049   22530  131528

From lsof

COMMAND   PID USER   FD      TYPE DEVICE SIZE/OFF     NODE NAME
fluxd   20190 root  cwd       DIR  0,347     4096  3539637 /home/flux
fluxd   20190 root  rtd       DIR  0,347     4096  3539897 /
fluxd   20190 root  txt       REG  0,347 42909823  3539827 /usr/local/bin/fluxd
fluxd   20190 root  mem       REG 202,80           3539827 /usr/local/bin/fluxd (stat: No such file or directory)
fluxd   20190 root    0u     sock    0,8      0t0 38846617 protocol: TCP
fluxd   20190 root    1w     FIFO   0,11      0t0    91816 pipe
fluxd   20190 root    2w     FIFO   0,11      0t0    91817 pipe
fluxd   20190 root    3u     sock    0,8      0t0    91967 protocol: TCP
fluxd   20190 root    4u  a_inode   0,12        0     7747 [eventpoll]
fluxd   20190 root    5u     sock    0,8      0t0   133739 protocol: TCPv6
fluxd   20190 root    6u     sock    0,8      0t0  3685001 protocol: TCP
fluxd   20190 root    7u     sock    0,8      0t0   178126 protocol: TCP
fluxd   20190 root    8u     sock    0,8      0t0   575783 protocol: TCP
fluxd   20190 root    9u     sock    0,8      0t0   178174 protocol: TCP
fluxd   20190 root   10u     sock    0,8      0t0   178549 protocol: TCP
fluxd   20190 root   11u     sock    0,8      0t0   179416 protocol: TCP
fluxd   20190 root   12u     sock    0,8      0t0  2386384 protocol: TCP
fluxd   20190 root   13u     sock    0,8      0t0   218841 protocol: TCPv6
fluxd   20190 root   14u     sock    0,8      0t0  1333669 protocol: TCP
fluxd   20190 root   15u     sock    0,8      0t0   181075 protocol: TCP
.....
fluxd   20190 root 2042u     sock    0,8      0t0 38742366 protocol: TCP
fluxd   20190 root 2043u     sock    0,8      0t0 38894759 protocol: TCP
fluxd   20190 root 2044u     sock    0,8      0t0 38764757 protocol: TCP
fluxd   20190 root 2045u     sock    0,8      0t0 38782203 protocol: TCP
fluxd   20190 root 2046u     sock    0,8      0t0 38793189 protocol: TCP
fluxd   20190 root 2047u     sock    0,8      0t0 39311101 protocol: TCP

We connect to Bitbucket repos.

I had to delete the pod to get Flux going again and unblock our pipeline.

Possibly related to: https://github.com/weaveworks/flux/issues/1602
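
To watch the descriptor count grow before the limit is hit, the counting done by ls | wc above can be run on a timer. A small sketch in Go (Linux only; the PID and interval are placeholders):

// fdcount.go periodically counts the open file descriptors of a target
// process by listing /proc/<pid>/fd, the same thing `ls /proc/<pid>/fd | wc`
// does above.
package main

import (
	"fmt"
	"os"
	"time"
)

func countFDs(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	pid := 20190 // the fluxd PID from the output above
	for {
		n, err := countFDs(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s fds=%d\n", time.Now().Format(time.RFC3339), n)
		time.Sleep(time.Minute)
	}
}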

agcooke changed the title from "Flux leaks file descriptors when connecting to wss://cloud.weave.works./api/flux" to "Flux leaks file descriptors and runs out of file descriptors" on Jan 8, 2019
agcooke (Author) commented Jan 8, 2019

Netstat showed no connected sockets.

@errordeveloper (Contributor)

@agcooke thanks for reporting this, we will look into it.

2opremio added the bug label on Jan 14, 2019
@2opremio (Contributor)

Netstat showed no connected sockets.

It would have been nice to see more details about the sockets, though.

@agcooke have you managed to reproduce it? What version of Flux was the pod running?

@2opremio (Contributor)

Duplicate of #1602?

agcooke (Author) commented Jan 29, 2019

@2opremio I do not think so. I was away for some weeks, but we did see it happen again. I will see if I can find logs for that.

foot commented Feb 1, 2019

I've had another report of this and have some logs I might be able to share to shed some light on it.

squaremo (Member) commented Feb 5, 2019

@foot Can you DM me those logs? Ta

indrekh commented Dec 5, 2019

In my case the problem was unreachable registries.
Since I don't use "Automated deployment of new container images", I added the "- --registry-exclude-image=*" option and the unclosed-socket problem was solved.

@2opremio (Contributor)

So, the problem was caused by registries not being reachable and (probably) the registry client leaking sockets (probably unclosed HTTP response bodies). Has anyone seen this problem recently?

@indrekh Would you be so kind as to re-test this (assuming you are still using Flux)?
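
For illustration, the leak pattern described above, where a registry client only closes HTTP response bodies on the happy path, looks roughly like this in Go (a hypothetical sketch, not fluxd's actual registry code; registry.example.com is a placeholder):

// leakdemo illustrates the difference between a registry fetch that leaks
// its connection on error responses and one that always releases it.
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchTagsLeaky only closes the response body on the happy path; when the
// registry answers with a non-200 status, the body (and its TCP socket) is
// abandoned, which is exactly the kind of slow descriptor leak described above.
func fetchTagsLeaky(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("registry returned %s", resp.Status) // resp.Body never closed
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// fetchTags drains and closes the body on every path, so the transport can
// reuse or tear down the connection instead of leaking it.
func fetchTags(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer func() {
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("registry returned %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Any endpoint that returns errors makes the leaky variant accumulate
	// open descriptors over time; the fixed variant does not.
	if _, err := fetchTags("https://registry.example.com/v2/flux/tags/list"); err != nil {
		fmt.Println("fetch failed:", err)
	}
}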

@2opremio (Contributor)

@squaremo do you recall what happened with this?

@kingdonb (Member)

Possibly related to #3450

Closing; unless we have active reports of this issue or more direct insight into how this happens, it will not be possible to resolve it.
