kvclient: fix gRPC stream leak in rangefeed client #80705
Conversation
When the DistSender rangefeed client received a `RangeFeedError` message and propagated a retryable error up the stack, it would fail to close the existing gRPC stream, causing stream/goroutine leaks.

Release note (bug fix): Fixed a goroutine leak when internal rangefeed clients received certain kinds of retriable errors.
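For orientation, here is a minimal sketch of the pattern at issue (hypothetical names and signatures, not the actual DistSender code): the client opens a gRPC stream and, on receiving a `RangeFeedError` event, returns a retryable error; the `defer cancel()` introduced by this PR is what releases the stream on that early-return path.

```go
package sketch

import (
	"context"
	"errors"

	"google.golang.org/grpc"
)

var errRetry = errors.New("retryable rangefeed error")

// rangeFeedOnce opens a stream via the supplied constructor and consumes
// events until the stream or a RangeFeedError ends it. The open/recvOne
// callbacks stand in for the generated RangeFeed client methods.
func rangeFeedOnce(
	ctx context.Context,
	open func(context.Context) (grpc.ClientStream, error),
	recvOne func(grpc.ClientStream) (retryable bool, _ error),
) error {
	// The fix: derive a per-call context and cancel it on every return path.
	// Cancelling the context is one of the ways the gRPC stream contract
	// releases a stream's resources; without it, the early returns below
	// leak the stream and its goroutines.
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	stream, err := open(ctx)
	if err != nil {
		return err
	}
	for {
		retryable, err := recvOne(stream)
		if err != nil {
			return err // e.g. io.EOF when the stream ends cleanly
		}
		if retryable {
			// A RangeFeedError event: propagate a retryable error up the
			// stack. Before this PR, returning here left the stream alive
			// until the top-level rangefeed context was cancelled.
			return errRetry
		}
	}
}
```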
```
TIMEOUT: //pkg/kv/kvserver/rangefeed:rangefeed_test (Summary)
=== RUN   TestProcessorMemoryBudgetExceeded
```
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 393 at r1 (raw file):
```go
// Ensure context is cancelled on all errors, to prevent gRPC stream leaks.
ctx, cancel := context.WithCancel(ctx)
defer cancel()
```
- What if there were no errors? In case of EOF, this function returns `args.Timestamp, nil`. Won't canceling in this case affect other goroutines doing `partialRangeFeed`?
- Any ideas as to why this leak wasn't picked up by any of the unit tests? I am guessing it's because we don't have a specific failpoint which forces an error path. Could we add a regression test for this?
- Looking at this code, I am a bit terrified that we probably have many other instances of this bug. What could we do to both reduce the chance of introducing the error and increase the detection rate? E.g., is there a better way of documenting a context that the callee must cancel in case of an error? Could we add a sweeper which cancels goroutines that are found to be blocked on `ctx.Done`?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 393 at r1 (raw file):
> What if there were no errors? In case of EOF, this function returns `args.Timestamp, nil`. Won't canceling in this case affect other goroutines doing `partialRangeFeed`?
Contexts are immutable. We're creating a new context here which is scoped to this function and propagated down to callees. It doesn't affect the caller's context.
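A standalone illustration of that scoping (an example for this discussion, not code from the PR):

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	parent := context.Background()
	// WithCancel derives a child context; cancelling the child never
	// affects the parent or any sibling contexts derived from it.
	child, cancel := context.WithCancel(parent)
	cancel()

	fmt.Println(child.Err())  // context.Canceled
	fmt.Println(parent.Err()) // <nil>: the caller's context is untouched
}
```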
> Any ideas as to why this leak wasn't picked up by any of the unit tests? I am guessing it's because we don't have a specific failpoint which forces an error path. Could we add a regression test for this?
We check for goroutine leaks in tests using `leaktest`, which compares snapshots of goroutines before and after a test. However, the leak here is temporary -- the streams/goroutines will get cleaned up when the overall rangefeed terminates (when the top-level context cancels); they only leak while the rangefeed is running. It is not clear to me how we would detect this in general (how do we distinguish between valid and invalid goroutines during test execution?), especially not when they're spawned by third-party libraries.
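The usage pattern looks roughly like this (assuming CockroachDB's `pkg/util/leaktest` helper). Note that it only inspects goroutines still alive when the test ends, which is why a leak that cleans itself up at rangefeed shutdown slips through:

```go
package kvcoord

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
)

func TestExample(t *testing.T) {
	// AfterTest snapshots the running goroutines now; the returned closure
	// re-snapshots at test end and fails the test if new goroutines remain.
	defer leaktest.AfterTest(t)()
	// ... test body ...
}
```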
We could write a very targeted regression test for `singleRangeFeed` that sets up a convoluted scenario to check for this one specific leak. However, I'm not sure if that test is going to bring any additional value over the code comment, since it would essentially just test whether someone removes that context cancellation line. It seems too narrow in scope, but I'll add one if you think it's valuable.
> Looking at this code, I am a bit terrified that we probably have many other instances of this bug. What could we do to both reduce the chance of introducing the error and increase the detection rate?
`leaktest` already accounts for the most egregious offenses, and is mandated for all tests via a linter. We could possibly write something specific to gRPC streams, maybe using a gRPC interceptor or something to keep track of expected streams -- but the way the caller would signal that it's done with a gRPC stream is to cancel the context, so it's a bit of a catch-22. I don't immediately have any good ideas for how we would go about this without better language support (e.g. in Rust it might be possible to enforce this at compile-time).
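As a rough sketch of that interceptor idea (hypothetical code, not anything in CockroachDB): count streams when they open, and count them as finished once `RecvMsg` returns a non-nil error, which is one of the conditions under which gRPC releases a stream. The catch-22 shows up immediately: a stream abandoned without a context cancellation or a final `RecvMsg` error is never decremented, which is exactly the leak we would want to detect.

```go
package streamcount

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
)

// OpenStreams tracks streams that have been opened but not yet observed to
// finish; a test harness could assert it returns to zero.
var OpenStreams int64

// CountingInterceptor increments OpenStreams when a stream is opened and
// wraps the stream so the counter is decremented once RecvMsg fails.
func CountingInterceptor(
	ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
	method string, streamer grpc.Streamer, opts ...grpc.CallOption,
) (grpc.ClientStream, error) {
	s, err := streamer(ctx, desc, cc, method, opts...)
	if err != nil {
		return nil, err
	}
	atomic.AddInt64(&OpenStreams, 1)
	return &countedStream{ClientStream: s}, nil
}

type countedStream struct {
	grpc.ClientStream
	done int64
}

func (c *countedStream) RecvMsg(m interface{}) error {
	err := c.ClientStream.RecvMsg(m)
	// Any non-nil error (including io.EOF) means the stream is finished.
	if err != nil && atomic.CompareAndSwapInt64(&c.done, 0, 1) {
		atomic.AddInt64(&OpenStreams, -1)
	}
	return err
}
```

It would be installed at dial time via `grpc.WithStreamInterceptor(CountingInterceptor)`; streams that end purely by context cancellation would still need separate handling.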
> Is there a better way of documenting a context that the callee must cancel in case of an error?
This isn't always the case -- if the error came from the gRPC stream itself then the context doesn't need to be cancelled, and it may not always be appropriate/desirable to cancel it. See the gRPC stream contract in grpc-go's `ClientConn.NewStream` documentation.
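For reference, that documentation (paraphrased) says a stream's resources are released only when the caller does one of the following: 1) call `Close` on the `ClientConn`; 2) cancel the context provided; 3) call `RecvMsg` until it returns a non-nil error; or 4) receive a non-nil, non-`io.EOF` error from `Header` or `SendMsg`.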
> Could we add a sweeper which cancels goroutines that are found to be blocked on `ctx.Done`?
How would we distinguish between the goroutines we want to cancel and the ones we don't? How would that be different from just cancelling the relevant context?
I stressed this for 100k runs both with and without the race detector and didn't see any failures. Will bake this on
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 393 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
PS: I feel like this sort of thing might be best tested by long-lived end-to-end full-cluster tests. In a steady state, we would expect goroutine counts to remain relatively stable, so we could e.g. assert that goroutine counts (as exposed via metrics) should not increase by too much over some timespan.
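A crude sketch of that assertion (hypothetical window and slack values; a real end-to-end test would read the cluster's goroutine-count metric rather than `runtime.NumGoroutine` on a single process):

```go
package steadystate

import (
	"fmt"
	"runtime"
	"time"
)

// CheckGoroutineSteadyState samples the goroutine count, waits out the
// steady-state window, and fails if the count grew by more than slack.
func CheckGoroutineSteadyState(window time.Duration, slack int) error {
	before := runtime.NumGoroutine()
	time.Sleep(window) // the steady-state workload runs during this window
	after := runtime.NumGoroutine()
	if after > before+slack {
		return fmt.Errorf("goroutine count grew from %d to %d over %s", before, after, window)
	}
	return nil
}
```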
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 393 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Thanks Erik! From the gRPC stream contract, option 2), "Cancel the context provided", seems to be allowed even in the presence of other events. In fact, that's what the above `defer` is doing. I'll think a bit more about the sweeper for gRPC goroutines. I do agree the long-lived end-to-end cluster testing would be an effective way of detecting this leak.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 393 at r1 (raw file):
> From the gRPC stream contract, option 2), "Cancel the context provided", seems to be allowed even in the presence of other events.
Yes, they're not mutually exclusive. If the stream has already been terminated by other means, cancelling the context has no effect on it.
TFTRs! bors r=knz,tbg,srosenberg
Build succeeded.