Goroutine leak in fluxd? #1602
Comments
One extra detail: the reason we think the memory increase is due to the goroutines is that @brantb also produced a pprof heap SVG, which shows the heap accounting for only ~80Mi of the ~262Mi of memory in use.
Whoops, I forgot to include that. Here's the heap graph.
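For anyone who wants to capture the same profiles: the sketch below is my own illustration, not fluxd's actual code, and the `:3030` listen address is an assumption. It shows how a Go daemon exposes `net/http/pprof`, with the commands for pulling the heap and goroutine profiles in the comments.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Assumption: the daemon serves pprof on :3030 alongside its API.
	// With this in place, the profiles discussed above can be captured with:
	//
	//   go tool pprof http://localhost:3030/debug/pprof/heap
	//   go tool pprof http://localhost:3030/debug/pprof/goroutine
	//
	// or, for a plain-text dump of every goroutine and its stack:
	//
	//   curl 'http://localhost:3030/debug/pprof/goroutine?debug=2'
	log.Fatal(http.ListenAndServe(":3030", nil))
}
```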
@brantb left some more info on Slack today: https://weave-community.slack.com/archives/C4U5ATZ9S/p1547737322675900
@brantb I am not 100% sure that #1672 will solve the problem, but it will surely help. Interestingly, the number of authentication-error logs from the warmer roughly matches the number of descriptors leaked.
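That one-error-one-descriptor correlation is what you'd expect from the classic Go HTTP leak pattern sketched below (illustrative only, not the actual warmer/registry code): an early return on the authentication-error path skips closing the response body, so the connection, its file descriptor, and the transport goroutines servicing it are never released.

```go
package main

import (
	"fmt"
	"net/http"
)

// Illustrative only -- not fluxd's code. The early return on the
// authentication-error path never closes the response body, so the TCP
// connection is never returned to the pool or torn down.
func fetchManifest(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	if resp.StatusCode == http.StatusUnauthorized {
		// BUG: missing resp.Body.Close() -- leaks the connection.
		return fmt.Errorf("authentication failed for %s", url)
	}
	defer resp.Body.Close()
	// ... read and decode the manifest ...
	return nil
}

func main() {
	// Each failing call leaks one connection, so the descriptor count in
	// lsof grows roughly in step with the error logs.
	_ = fetchManifest("https://iqsandbox.azurecr.io/v2/some/image/manifests/latest")
}
```

The fix for that pattern is simply to close the body on every path, e.g. by deferring `resp.Body.Close()` as soon as the request has succeeded.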
We've been running with the fix in place; here's what memory usage looks like now:

[memory usage graph]

I'm going to call this one a win. Thanks again, @2opremio & @ncabatoff!
@brantb nice UI - what is it?
@davidkarlsen That's Azure Monitor, for their managed Kubernetes service (AKS). 😄
We've observed `fluxd`'s memory usage slowly increasing until it reaches the memory limit (arbitrarily set to `300Mi` by us) and it is OOMKilled by Kubernetes.

With @ncabatoff's guidance I used the profiler to dump the goroutines and found several thousand. Here's a gist with the output of `lsof` and `netstat` plus the goroutine list. (`10.244.3.20` is the IP address assigned to the `flux-memcached` pod.)

We also have a lot of logs referencing `iqsandbox.azurecr.io`, a container registry which doesn't have a pull secret present in the namespace Flux is running in. Anecdotally, this issue seems to have started (or gotten worse) around the time we started running containers using images from `mcr.microsoft.com`, so maybe something in that code path is the culprit?

I'll be happy to provide any additional diagnostic info you need. Thanks again to @ncabatoff for walking me through this so far (I don't have any experience with the golang toolchain and his help was invaluable).
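In case it's useful to anyone else chasing this, here's a minimal, hypothetical watcher (my own sketch, not part of fluxd; the threshold is arbitrary) that logs the goroutine count and, once it climbs past the threshold, writes a full stack dump in the same format as `/debug/pprof/goroutine?debug=2`:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchGoroutines logs the goroutine count once a minute and dumps all
// goroutine stacks to stderr whenever the count exceeds the threshold,
// which makes a slow leak easy to spot in the logs.
func watchGoroutines(threshold int) {
	for range time.Tick(time.Minute) {
		n := runtime.NumGoroutine()
		log.Printf("goroutines: %d", n)
		if n > threshold {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}
}

func main() {
	go watchGoroutines(1000) // threshold chosen arbitrarily
	select {}                // stand-in for the daemon's real work
}
```

Running something like this alongside the daemon makes it easy to correlate goroutine growth with the bursts of authentication errors in the logs.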