Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix frame leak due to undrained send buffer #860

Merged
merged 6 commits into from
Mar 28, 2022
Merged

Conversation

cinchurge
Copy link
Collaborator

Follow up to #845

Currently in Connection.writeFrames(), we return immediately
upon a connection error. However, at that point there may still
be frames stuck in the send buffer (Connection.sendCh) which
haven't been sent due to the connection issue. Leaving it in
such a state results in a frame leak since they will never be
released.

Change writeFrames() to always drain the send buffer upon return
to fix this behavior.

connection.go Show resolved Hide resolved
connection.go Show resolved Hide resolved
connection.go Outdated
Comment on lines 861 to 868
// Remaining frames in sendCh must be drained and released,
// or we will leak frames
defer func() {
for len(c.sendCh) > 0 {
c.opts.FramePool.Release(<-c.sendCh)
}
}()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can act as a "best-effort" release frames, but there's many race conditions which can result in frames being written to sendCh after the drain here. In the past, TChannel used to close(c.sendCh) as part of closing a connection, but it was removed as there was many race conditions that could lead to the close happening while some other goroutine wrote to the channel, see
93ef5c1

The same race conditions that would have caused a panic when closing sendCh will likely cause this drain to "miss" frames from being released.

It's worth keeping in mind that a frame leak from the pool is not a memory leak -- when a connection is closed + removed, the GC will clean up the connection, sencCh, and all frames in that channel.

I think having the best-effort is OK, but the comment should indicate that it's not a complete solution.

If we want this logic, it should ideally happen as late in the lifecycle for a connection as possible, when we're as close to being confident that there's no other writers to sendCh. E.g., when the connection state changes to closed,
https://cs.github.com/uber/tchannel-go/blob/751f26a5d1d895affe75a1edd208d63fe3b7b109/connection.go?q=repo%3Auber%2Ftchannel-go+%22connectionStartClose%22#L847

(We could also keep the logic in writeFrames, and have writeFrames block waiting for stopCh before draining sendCh).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good pointers! even if it's best-effort I think it's still useful since it'll help improve the signal of the tests since it won't mask leaks from other code paths.

moved the drain back into writeFrames and blocked on <-stopCh

connection.go Outdated
@@ -858,6 +858,14 @@ func (c *Connection) checkExchanges() {
func (c *Connection) close(fields ...LogField) error {
c.log.WithFields(fields...).Debug("Connection closing.")

// Remaining frames in sendCh must be drained and released,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was nice that writeFrames was the only place that consumed from sendCh. I don't think this method is the right place since it can be called from many places. More details in the next comment.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, this was my idea - sorry @cinchurge. Seemed like "draining" as part of general shutdown was a reasonable thing to do. But maybe not :)

@codecov
Copy link

codecov bot commented Feb 4, 2022

Codecov Report

Merging #860 (cc77c90) into dev (05c4b2a) will increase coverage by 0.16%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #860      +/-   ##
==========================================
+ Coverage   88.99%   89.16%   +0.16%     
==========================================
  Files          43       43              
  Lines        4435     4439       +4     
==========================================
+ Hits         3947     3958      +11     
+ Misses        368      363       -5     
+ Partials      120      118       -2     
Impacted Files Coverage Δ
connection.go 89.37% <100.00%> (+1.03%) ⬆️
relay.go 85.68% <0.00%> (-0.43%) ⬇️
outbound.go 88.30% <0.00%> (+1.16%) ⬆️
inbound.go 82.19% <0.00%> (+1.57%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05c4b2a...cc77c90. Read the comment docs.

@cinchurge
Copy link
Collaborator Author

bump, i think i've addressed all the issues, can you guys take another look?

relay_test.go Outdated
@@ -831,8 +834,7 @@ func TestRelayStalledConnection(t *testing.T) {
AddLogFilter("Dropping call due to slow connection.", 1, "sendChCapacity", "32").
SetSendBufferSize(32). // We want to hit the buffer size earlier, but also ensure we're only dropping once the sendCh is full.
SetServiceName("s1").
SetRelayOnly().
SetCheckFramePooling()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this removal on purpose?

It's a bit surprising given the diff is expected to make the tests "less leaky", right?

connection.go Outdated
@@ -891,6 +900,14 @@ func (c *Connection) close(fields ...LogField) error {
// Close starts a graceful Close which will first reject incoming calls, reject outgoing calls
// before finally marking the connection state as closed.
func (c *Connection) Close() error {
// Remaining frames in sendCh must be drained and released,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still want this, or is it a left-over from the previous iteration?

The PR description mentions only changes in writeFrames, and your other comment says "as discussed, prefer to keep all code that consume sendCh close together", which seems to contradict this piece of code?

@cinchurge cinchurge changed the base branch from dev to cinchurge/disable-leak-check March 28, 2022 21:56
@cinchurge cinchurge changed the base branch from cinchurge/disable-leak-check to dev March 28, 2022 22:58
Follow up to #845

Currently in `Connection.writeFrames()`, we return immediately
upon a connection error. However, at that point there may still
be frames stuck in the send buffer (`Connection.sendCh`) which
haven't been sent due to the connection issue. Leaving it in
such a state results in a frame leak since they will never be
released.

Change `writeFrames()` to always drain the send buffer upon return
to fix this behavior.
@cinchurge cinchurge merged commit 9e9043b into dev Mar 28, 2022
gandhikrishna pushed a commit that referenced this pull request Apr 27, 2022
* Fix frame leak due to undrained send buffer

Follow up to #845

Currently in `Connection.writeFrames()`, we return immediately
upon a connection error. However, at that point there may still
be frames stuck in the send buffer (`Connection.sendCh`) which
haven't been sent due to the connection issue. Leaving it in
such a state results in a frame leak since they will never be
released.

Change `writeFrames()` to always drain the send buffer upon return
to fix this behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants