
multi: attempt to fix issue of high number of TCP connections by explicitly closing connections and implementing an idle timer #719

Merged: 3 commits into lightninglabs:main on Dec 20, 2023

Conversation

@Roasbeef (Member) commented Dec 5, 2023

On the instance we've been running, we've seen a very high number of active TCP connections at any given time. After tweaking some Aperture-related settings, it seems the issue lies with our own gRPC server and the way we make outbound connections.

This PR implements two fixes to attempt to mitigate this issue:

  • Close out any outbound connections we make (courier, gRPC syncing, etc.) once we're done with them.
  • Add a default idle timer that'll close out connections that haven't received any requests/responses after an interval of time.
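
As a rough sketch of the first fix (the package name, helper, and credentials handling here are illustrative stand-ins, not the actual taproot-assets code), every outbound gRPC connection gets an explicit Close tied to the lifetime of the work it was dialed for:

package courier

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// pushProof is an illustrative stand-in for the courier/universe sync
// work that runs over the connection.
func pushProof(ctx context.Context, conn *grpc.ClientConn) error {
	return nil
}

// deliverWithConn dials the remote endpoint, runs the work, and always
// closes the connection when done, instead of leaking one TCP
// connection per pushed asset. Real code would use proper TLS
// credentials rather than insecure ones.
func deliverWithConn(ctx context.Context, addr string) error {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(
		insecure.NewCredentials(),
	))
	if err != nil {
		return fmt.Errorf("unable to dial %s: %w", addr, err)
	}
	defer conn.Close()

	return pushProof(ctx, conn)
}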

@guggero (Member) left a comment

Very nice! I think this should fix our issue with the high number of connections.
Some CI steps fail and I have a question around logging; otherwise this looks good!

Resolved review threads: proof/courier.go (outdated), server.go, universe/interface.go, universe/syncer.go
@Roasbeef (Member, Author) commented Dec 9, 2023

PTAL!

@Roasbeef force-pushed the tcp-connection-issue branch 2 times, most recently from ed9b703 to 24eacd6 on December 9, 2023 at 01:44
@guggero (Member) left a comment

Nice, LGTM 🎉

proof/courier.go (outdated):
@@ -467,6 +478,8 @@ func (h *HashMailBox) RecvAck(ctx context.Context, sid streamID) error {

// CleanUp attempts to tear down the mailbox as specified by the passed sid.
func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) error {
defer h.rawConn.Close()
@guggero (Member) commented:

Non-blocking: my request was less about actually logging the error and more about logging it instead of having it as a return value in the interface. Swallowing the error here shows up as a code smell in my IDE (and will probably be flagged by the linter if we decide to turn on some of the rules we have in lnd).
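
For reference, that shape would log inside the defer rather than return the error (a sketch, assuming a package-level log instance as is the convention in lnd-style packages):

defer func() {
    if err := h.rawConn.Close(); err != nil {
        log.Warnf("unable to close raw connection: %v", err)
    }
}()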

@GeorgeTsagk (Member) commented:

Maybe over-engineering:

In such use cases we could return a named error variable,

func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) (err error) {

assign to err the error we want to return within the body, and then have the defer close the connection without swallowing its error:

defer func() {
    closeErr := h.rawConn.Close()
    if err == nil {
        err = closeErr
    }
}()

@GeorgeTsagk (Member) commented:

Or yeah, we could just omit returning the error. Do we care about consistency in the Close func signature?

@GeorgeTsagk (Member) left a comment

Looks Good! 🕸️

@Roasbeef (Member, Author) commented:

itest seems to be failing consistently, will dig in.

@guggero (Member) left a comment

Found the issue with the itest, see inline comment.

proof/courier.go (outdated):
@@ -467,6 +478,8 @@ func (h *HashMailBox) RecvAck(ctx context.Context, sid streamID) error {

// CleanUp attempts to tear down the mailbox as specified by the passed sid.
func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) error {
defer h.rawConn.Close()
@guggero (Member) commented:

We need to move this h.rawConn.Close() out of the CleanUp() method, since that is being called for two mailboxes (sender+receiver). So it needs to be done in DeliverProof as a defer before cleaning up the mailboxes.
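
Roughly, that fix looks like the following sketch (the types and names here are simplified stand-ins, not the actual taproot-assets definitions):

package courier

import (
	"context"
	"io"
)

type streamID [64]byte

type mailbox struct {
	rawConn io.Closer
}

// cleanUp tears down the state for a single stream ID. It deliberately
// does NOT close rawConn, since it runs twice: once for the sender
// mailbox and once for the receiver mailbox.
func (m *mailbox) cleanUp(ctx context.Context, sid streamID) error {
	return nil
}

// deliverProof closes the shared raw connection exactly once, after
// both mailbox clean-ups have run.
func (m *mailbox) deliverProof(ctx context.Context, senderSID,
	receiverSID streamID) error {

	defer m.rawConn.Close()

	if err := m.cleanUp(ctx, senderSID); err != nil {
		return err
	}
	return m.cleanUp(ctx, receiverSID)
}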

@Roasbeef force-pushed the tcp-connection-issue branch from 24eacd6 to de7f577 on December 20, 2023 at 21:14
Commit messages (3 commits):

… syncers

In this commit, we attempt to fix a TCP connection leak by explicitly closing the gRPC connections we create once we're done with the relevant gRPC client. Otherwise, we'll end up making a new connection for each new asset to be pushed, which can add up. In the future, we should also look into the server-side keep-alive options.

This is similar to the prior commit: we add a new method to allow a caller to close down a courier once they're done with it. This ensures that we'll always release the resources once we're done with them.

In this commit, we set a gRPC param that controls how long a connection can be idle for. The goal here is to prune the number of open TCP connections on an active/popular universe server. According to the docs:

> Idleness duration is defined since the most recent time the number of
> outstanding RPCs became zero or the connection establishment.
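
That quote matches the grpc-go documentation for keepalive.ServerParameters.MaxConnectionIdle, so the wiring presumably looks roughly like this (the duration below is illustrative, not necessarily the default the PR chose):

package universe

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer constructs a gRPC server that sends a GOAWAY and tears
// down any connection that has had no outstanding RPCs for longer than
// MaxConnectionIdle, pruning idle TCP connections on a busy universe
// server.
func newServer(opts ...grpc.ServerOption) *grpc.Server {
	opts = append(opts, grpc.KeepaliveParams(
		keepalive.ServerParameters{
			// Illustrative value, not the PR's chosen default.
			MaxConnectionIdle: 2 * time.Minute,
		},
	))
	return grpc.NewServer(opts...)
}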
@Roasbeef force-pushed the tcp-connection-issue branch from de7f577 to b61d6fb on December 20, 2023 at 22:27
@Roasbeef merged commit becddbf into lightninglabs:main on Dec 20, 2023. 14 checks passed.