This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Goroutine leak in fluxd? #1602

Closed
brantb opened this issue Dec 14, 2018 · 7 comments

Comments

@brantb
Contributor

brantb commented Dec 14, 2018

We've observed fluxd's memory usage slowly increasing until it reaches the memory limit (arbitrarily set to 300Mi by us) and is OOMKilled by Kubernetes:

[image: graph of fluxd memory usage climbing until OOMKilled at the 300Mi limit]

With @ncabatoff's guidance I used the profiler to dump the goroutines and found several thousand. Here's a gist with the output of lsof and netstat plus the goroutine list. (10.244.3.20 is the IP address assigned to the flux-memcached pod)

We also have a lot of logs like the following:

ts=2018-12-14T15:11:47.467587321Z caller=warming.go:192 component=warmer canonical_name=mcr.microsoft.com/k8s/metrics/adapter auth={map[]} err="requesting tags: json: cannot unmarshal array into Go value of type struct { Tags []string \"json:\\\"tags\\\"\" }"
ts=2018-12-14T15:11:51.219396002Z caller=warming.go:192 component=warmer canonical_name=mcr.microsoft.com/k8s/aad-pod-identity/mic auth={map[]} err="requesting tags: json: cannot unmarshal array into Go value of type struct { Tags []string \"json:\\\"tags\\\"\" }"
ts=2018-12-14T15:11:54.583460254Z caller=warming.go:192 component=warmer canonical_name=iqsandbox.azurecr.io/gameday/quackserver auth={map[]} err="requesting tags: Get https://iqsandbox.azurecr.io/v2/gameday/quackserver/tags/list: unauthorized: authentication required"
ts=2018-12-14T15:11:55.710904979Z caller=warming.go:192 component=warmer canonical_name=mcr.microsoft.com/k8s/aad-pod-identity/nmi auth={map[]} err="requesting tags: json: cannot unmarshal array into Go value of type struct { Tags []string \"json:\\\"tags\\\"\" }"

iqsandbox.azurecr.io is a container registry which doesn't have a pull secret present in the namespace Flux is running in. Anecdotally, this issue seems to have started (or gotten worse) around the time we started running containers using images from mcr.microsoft.com, so maybe something in that code path is the culprit?

I'll be happy to provide any additional diagnostic info you need. Thanks again to @ncabatoff for walking me through this so far (I don't have any experience with the golang toolchain and his help was invaluable).

@ncabatoff
Contributor

One extra detail: we think the memory increase is due to the goroutines because @brantb also produced a pprof heap SVG, which shows the heap accounting for only ~80 of the 262Mi used.

@brantb
Contributor Author

brantb commented Dec 14, 2018

Whoops, I forgot to include that. Here's the heap graph.

@2opremio
Contributor

@brantb left some more info on Slack today https://weave-community.slack.com/archives/C4U5ATZ9S/p1547737322675900

@2opremio
Contributor

2opremio commented Jan 17, 2019

@brantb I am not 100% sure that #1672 will solve the problem, but it will surely help.

Interestingly, the number of authentication-error log lines from the warmer roughly matches the number of descriptors leaked:

# grep component=warmer "flux-log.txt"  | grep unauthorized | wc -l
    3791
# cat /proc/$FLUXPID/net/sockstat
sockets: used 3808
TCP: inuse 16 orphan 0 tw 27 alloc 3046 mem 2750
UDP: inuse 0 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

@brantb
Contributor Author

brantb commented Jan 29, 2019

We've been running master-2441121d for over a week now and the memory profile looks much healthier:

[image: graph of fluxd memory usage holding steady after the upgrade]

I'm going to call this one a win. Thanks again, @2opremio & @ncabatoff!

@brantb brantb closed this as completed Jan 29, 2019
@davidkarlsen
Contributor

@brantb nice UI - what is it?

@brantb
Contributor Author

brantb commented Jan 29, 2019

@davidkarlsen That's Azure Monitor, for their managed Kubernetes service (AKS). 😄
