
multi: attempt to fix issue of high number of TCP connections by explicitly closing connections and implementing an idle timer #719

Merged: 3 commits into lightninglabs:main on Dec 20, 2023

Conversation

@Roasbeef (Member) commented Dec 5, 2023

On the instance we've been running, we've seen a very high number of active TCP connections at any given time. After tweaking some Aperture-related settings, it seems the issue lies with our own gRPC server and the way we make outbound connections.

This PR implements two fixes to attempt to mitigate this issue:

  • Close out any outbound connections we make (courier, gRPC syncing, etc.) once we're done with them.
  • Add a default idle timer that'll close out connections that haven't received any requests/responses after an interval of time.
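
As a rough sketch of the first fix (the package name, helper, and credentials handling here are illustrative stand-ins, not the actual taproot-assets code), every outbound gRPC connection gets an explicit Close tied to the lifetime of the work it was dialed for:

package courier

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// pushProof is an illustrative stand-in for the courier/universe sync
// work that runs over the connection.
func pushProof(ctx context.Context, conn *grpc.ClientConn) error {
	return nil
}

// deliverWithConn dials the remote endpoint, runs the work, and always
// closes the connection when done, instead of leaking one TCP
// connection per pushed asset. Real code would use proper TLS
// credentials rather than insecure ones.
func deliverWithConn(ctx context.Context, addr string) error {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(
		insecure.NewCredentials(),
	))
	if err != nil {
		return fmt.Errorf("unable to dial %s: %w", addr, err)
	}
	defer conn.Close()

	return pushProof(ctx, conn)
}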

@guggero (Member) left a comment

Very nice! I think this should fix our issue with the high number of connections.
Some CI steps fail and I have a question around logging; otherwise this looks good!

Resolved review threads: proof/courier.go (outdated), server.go, universe/interface.go, universe/syncer.go
@Roasbeef (Member, Author) commented Dec 9, 2023

PTAL!

@Roasbeef force-pushed the tcp-connection-issue branch 2 times, most recently from ed9b703 to 24eacd6 on December 9, 2023 at 01:44
@guggero (Member) left a comment

Nice, LGTM 🎉

proof/courier.go (outdated):
@@ -467,6 +478,8 @@ func (h *HashMailBox) RecvAck(ctx context.Context, sid streamID) error {

// CleanUp attempts to tear down the mailbox as specified by the passed sid.
func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) error {
defer h.rawConn.Close()
@guggero (Member) commented:

Non-blocking: my request was less about actually logging the error and more about logging it instead of having it as a return value in the interface. Swallowing the error here shows up as a code smell in my IDE (and will probably be flagged by the linter if we decide to turn on some of the rules we have in lnd).
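
For reference, that shape would log inside the defer rather than return the error (a sketch, assuming a package-level log instance as is the convention in lnd-style packages):

defer func() {
    if err := h.rawConn.Close(); err != nil {
        log.Warnf("unable to close raw connection: %v", err)
    }
}()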

@GeorgeTsagk (Member) commented:

Maybe over-engineering:

In such use cases we could return a named error variable,

func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) (err error) {

assign to err the error we want to return within the body, and then have the defer close the connection without swallowing its error:

defer func() {
    closeErr := h.rawConn.Close()
    if err == nil {
        err = closeErr
    }
}()

@GeorgeTsagk (Member) commented:

Or yeah, we could just omit returning the error. Do we care about consistency in the Close func signature?

@GeorgeTsagk (Member) left a comment

Looks Good! 🕸️

@Roasbeef (Member, Author) commented:

itest seems to be failing consistently, will dig in.

@guggero (Member) left a comment

Found the issue with the itest, see inline comment.

proof/courier.go (outdated):
@@ -467,6 +478,8 @@ func (h *HashMailBox) RecvAck(ctx context.Context, sid streamID) error {

// CleanUp attempts to tear down the mailbox as specified by the passed sid.
func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) error {
defer h.rawConn.Close()
@guggero (Member) commented:

We need to move this h.rawConn.Close() out of the CleanUp() method, since that is being called for two mailboxes (sender+receiver). So it needs to be done in DeliverProof as a defer before cleaning up the mailboxes.
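
Roughly, that fix looks like the following sketch (the types and names here are simplified stand-ins, not the actual taproot-assets definitions):

package courier

import (
	"context"
	"io"
)

type streamID [64]byte

type mailbox struct {
	rawConn io.Closer
}

// cleanUp tears down the state for a single stream ID. It deliberately
// does NOT close rawConn, since it runs twice: once for the sender
// mailbox and once for the receiver mailbox.
func (m *mailbox) cleanUp(ctx context.Context, sid streamID) error {
	return nil
}

// deliverProof closes the shared raw connection exactly once, after
// both mailbox clean-ups have run.
func (m *mailbox) deliverProof(ctx context.Context, senderSID,
	receiverSID streamID) error {

	defer m.rawConn.Close()

	if err := m.cleanUp(ctx, senderSID); err != nil {
		return err
	}
	return m.cleanUp(ctx, receiverSID)
}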

@Roasbeef force-pushed the tcp-connection-issue branch from 24eacd6 to de7f577 on December 20, 2023 at 21:14
Commit messages (3 commits):

… syncers

In this commit, we attempt to fix a TCP connection leak by explicitly closing the gRPC connections we create once we're done with the relevant gRPC client. Otherwise, we'll end up making a new connection for each new asset to be pushed, which can add up. In the future, we should also look into the server-side keep-alive options.

This is similar to the prior commit: we add a new method to allow a caller to close down a courier once they're done with it. This ensures that we'll always release the resources once we're done with them.

In this commit, we set a gRPC param that controls how long a connection can be idle for. The goal here is to prune the number of open TCP connections on an active/popular universe server. According to the docs:

> Idleness duration is defined since the most recent time the number of
> outstanding RPCs became zero or the connection establishment.
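
That quote matches the grpc-go documentation for keepalive.ServerParameters.MaxConnectionIdle, so the wiring presumably looks roughly like this (the duration below is illustrative, not necessarily the default the PR chose):

package universe

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer constructs a gRPC server that sends a GOAWAY and tears
// down any connection that has had no outstanding RPCs for longer than
// MaxConnectionIdle, pruning idle TCP connections on a busy universe
// server.
func newServer(opts ...grpc.ServerOption) *grpc.Server {
	opts = append(opts, grpc.KeepaliveParams(
		keepalive.ServerParameters{
			// Illustrative value, not the PR's chosen default.
			MaxConnectionIdle: 2 * time.Minute,
		},
	))
	return grpc.NewServer(opts...)
}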
@Roasbeef force-pushed the tcp-connection-issue branch from de7f577 to b61d6fb on December 20, 2023 at 22:27
@Roasbeef merged commit becddbf into lightninglabs:main on Dec 20, 2023. 14 checks passed.