Goroutine leak #854
It looks like we are leaking goroutines. The number of goroutines keeps growing until the process is killed.
As I understand it, we are leaking HTTP connections. I tried
I think we might be leaking goroutines from keep-alive connections in the http client's Transport struct we use to query Elasticsearch. That client, along with its transport, gets recreated at each reconciliation. I did some experiments (nice way to spend time in the plane 😄):

```go
package main

import (
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"net/http/httptest"
	"runtime"
	"time"
)

func main() {
	// test http server
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "Hello world")
	}))
	// simulate reconciliation loop
	var memStats runtime.MemStats
	for {
		// create a new client and a new transport at every iteration
		transport := &http.Transport{
			// DisableKeepAlives: true,
		}
		c := &http.Client{Transport: transport}
		resp, _ := c.Get(server.URL)
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
		// close idle keep-alive connections
		// transport.CloseIdleConnections()
		runtime.ReadMemStats(&memStats)
		log.Printf("%d goroutines, %d heapalloc", runtime.NumGoroutine(), memStats.HeapAlloc)
		time.Sleep(10 * time.Millisecond)
	}
}
```

Output:
Notice how both the number of goroutines and the heap size increase over time. Now, here is the output of the same code, but this time with the idle keep-alive connections explicitly closed (the commented-out `transport.CloseIdleConnections()` call above):
The number of goroutines stays constant over time (4-5); the heap size still increases, but is regularly reset (the GC doing its job). It looks like if we don't explicitly close idle keep-alive connections, the http client's transport is not terminated and garbage collected as we would expect, and keeps long-lived connections around forever, effectively increasing goroutines and heap over time. Some options to fix that:
Generally it seems the Go http client and transport are meant to be thread-safe and optimized for reuse. I think we should aim for 3. in the long term. Since we also have Observers that persist a connection to the ES cluster, I think we could encapsulate a cluster's client that we keep around to be reused by both observers and reconciliations, for a given cluster. Note that I'm not sure this is the exact (only?) problem we have; I did not test these settings in the operator yet. Edit: I just noticed @pebrc's last comment about how disabling keep-alive connections does not help, so that's probably not the problem :/ Did you disable it through
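(Not from the original thread: a minimal sketch of what option 3. might look like, assuming a hypothetical `clientCache` type keyed by cluster; the operator's real ES client and naming will differ.)

```go
// Hypothetical sketch of option 3: keep one HTTP client per Elasticsearch
// cluster and reuse it across reconciliations and observers instead of
// recreating a transport at every reconciliation.
package esclient

import (
	"net/http"
	"sync"
)

type clientCache struct {
	mu      sync.Mutex
	clients map[string]*http.Client // keyed by cluster namespace/name
}

func newClientCache() *clientCache {
	return &clientCache{clients: map[string]*http.Client{}}
}

// forCluster returns the cached client for the given cluster, creating it on first use.
func (c *clientCache) forCluster(key string) *http.Client {
	c.mu.Lock()
	defer c.mu.Unlock()
	if client, ok := c.clients[key]; ok {
		return client
	}
	client := &http.Client{Transport: &http.Transport{}}
	c.clients[key] = client
	return client
}

// forget closes idle connections and drops the client, e.g. when the cluster is deleted.
func (c *clientCache) forget(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if client, ok := c.clients[key]; ok {
		if t, ok := client.Transport.(*http.Transport); ok {
			t.CloseIdleConnections()
		}
		delete(c.clients, key)
	}
}
```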
I did more testing, in the operator codebase this time. See this debugging branch: master...sebgl:debug-memleak. In that branch I output the current number of goroutines and heap size every 100 ms. I changed the implementation of both the ES client and the process manager client to close leftover idle connections from the transport once the client is no longer required. I now get a stable number of goroutines! 🎉 I think we've spotted the problem :)
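(Again not from the thread: the pattern described above, closing leftover idle connections when a short-lived client is discarded, could look roughly like this; `Client`, `NewClient`, and `Close` are illustrative names, not the operator's actual API.)

```go
// Illustrative sketch of the fix: close leftover idle keep-alive connections
// once a short-lived client is no longer needed, so the transport's connection
// goroutines can exit and the transport can be garbage collected.
package esclient

import "net/http"

type Client struct {
	http      *http.Client
	transport *http.Transport
}

func NewClient() *Client {
	t := &http.Transport{}
	return &Client{http: &http.Client{Transport: t}, transport: t}
}

// Close terminates idle keep-alive connections held by the transport.
func (c *Client) Close() {
	c.transport.CloseIdleConnections()
}
```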
Nice work @sebgl! (and yes I used
When deploying multiple Elasticsearch clusters (here 5), I am seeing the operator process being killed repeatedly because it exceeds its memory limit.