multi: attempt to fix issue of high number of TCP connections by explicitly closing connections and implementing an idle timer #719
Conversation
Very nice! I think this should fix our issue with a high number of connections.
Some CI steps fail and I have a question around logging, otherwise this looks good!

PTAL!
force-pushed from ed9b703 to 24eacd6
Nice, LGTM 🎉
proof/courier.go (outdated)

@@ -467,6 +478,8 @@ func (h *HashMailBox) RecvAck(ctx context.Context, sid streamID) error {

// CleanUp attempts to tear down the mailbox as specified by the passed sid.
func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) error {
	defer h.rawConn.Close()
Non-blocking: I guess my request was less about actually logging the error and more about logging it instead of having it as a return value in the interface. Swallowing the error here shows up as a code smell issue in my IDE (and will probably be flagged by the linter if we decide to turn on some of the rules we have in lnd).
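For illustration, a minimal sketch of logging the close error inside the existing method instead of returning it; the package-level log variable is an assumption, not necessarily the logger this package uses:

defer func() {
	// Don't swallow the close error; log it rather than surfacing it
	// through the interface's return value.
	if err := h.rawConn.Close(); err != nil {
		log.Errorf("unable to close mailbox connection: %v", err)
	}
}()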
Maybe over-engineering: in such use cases we could try returning a named variable,

func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) (err error) {

and assign the error we want to return to err within the body. Then our defer would look like this:
defer func() {
	// Always close the connection, but only surface the close error
	// if the body hasn't already set err.
	if closeErr := h.rawConn.Close(); err == nil {
		err = closeErr
	}
}()
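Put together, the whole method under this suggestion could read roughly as follows; the delMailbox helper is purely hypothetical here and stands in for the real teardown logic:

// CleanUp attempts to tear down the mailbox as specified by the passed sid.
func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) (err error) {
	defer func() {
		// Only surface the close error if the body succeeded.
		if closeErr := h.rawConn.Close(); err == nil {
			err = closeErr
		}
	}()

	// Hypothetical teardown call; the real body would assign its error
	// to the named return value err here.
	err = h.delMailbox(ctx, sid)
	return err
}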
Or yeah, we could just omit returning error. Do we care about consistency on the Close func signature?
Looks Good! 🕸️
itest seems to be failing consistently, will dig in.
Found the issue with the itest, see inline comment.
proof/courier.go (outdated)

@@ -467,6 +478,8 @@ func (h *HashMailBox) RecvAck(ctx context.Context, sid streamID) error {

// CleanUp attempts to tear down the mailbox as specified by the passed sid.
func (h *HashMailBox) CleanUp(ctx context.Context, sid streamID) error {
	defer h.rawConn.Close()
We need to move this h.rawConn.Close() out of the CleanUp() method, since that is being called for two mailboxes (sender+receiver). So it needs to be done in DeliverProof as a defer before cleaning up the mailboxes.
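Roughly the shape being suggested, as a sketch only; the courier receiver, the mailbox field and the senderSID/receiverSID names are assumptions for illustration, not the actual code:

func (c *courier) DeliverProof(ctx context.Context /* ... */) error {
	// Close the shared connection exactly once, via a defer, after both
	// the sender and receiver mailboxes have been cleaned up. CleanUp
	// itself no longer closes the connection.
	defer c.mailbox.rawConn.Close()

	// ... send the proof and wait for the receiver's ack ...

	if err := c.mailbox.CleanUp(ctx, senderSID); err != nil {
		return err
	}

	return c.mailbox.CleanUp(ctx, receiverSID)
}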
force-pushed from 24eacd6 to de7f577
… syncers

In this commit, we attempt to fix a TCP connection leak by explicitly closing the gRPC connections we create once we're done with the relevant gRPC client. Otherwise, we'll end up making a new connection for each new asset to be pushed, which can add up. In the future, we should also look into the server-side keep-alive options.
This is similar to the prior commit: we add a new method to allow a caller to close down a courier once they're done with it. This ensures that we'll always release the resources once we're done with them.
In this commit, we set a gRPC param that controls how long a connection can be idle for. The goal here is to prune the number of open TCP connections on an active/popular universe server. According to the docs:

> Idleness duration is defined since the most recent time the number of outstanding RPCs became zero or the connection establishment.
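A sketch of what both client-side changes can look like when dialing out from the daemon; the address, credentials and timeout value are placeholders, and the exact dial options used in the PR may differ:

conn, err := grpc.Dial(
	serverAddr,
	grpc.WithTransportCredentials(creds),
	// Let gRPC tear the connection down once it has had no outstanding
	// RPCs for the configured duration (placeholder value).
	grpc.WithIdleTimeout(5*time.Minute),
)
if err != nil {
	return err
}
// Explicitly release the connection once we're done with the client,
// instead of leaking one TCP connection per pushed asset.
defer conn.Close()

// ... build the RPC client from conn and use it ...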
force-pushed from de7f577 to b61d6fb
On the instance we've run, we've been seeing a very high number of active TCP connections at any given time. After tweaking some Aperture related settings, it seems that the issue is with our own gRPC server, and the way we make outbound connections.
This PR implements two fixes to attempt to mitigate this issue: