kvclient: fix gRPC stream leak in rangefeed client #80705

Merged
merged 1 commit on Apr 29, 2022

Conversation

erikgrinaker
Contributor

When the DistSender rangefeed client received a RangeFeedError message
and propagated a retryable error up the stack, it would fail to close
the existing gRPC stream, causing stream/goroutine leaks.

Release note (bug fix): Fixed a goroutine leak when internal rangefeed
clients received certain kinds of retriable errors.

@erikgrinaker erikgrinaker requested a review from a team April 28, 2022 14:18
@erikgrinaker erikgrinaker self-assigned this Apr 28, 2022
@erikgrinaker erikgrinaker requested a review from a team as a code owner April 28, 2022 14:18
@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member

tbg commented Apr 28, 2022

TIMEOUT: //pkg/kv/kvserver/rangefeed:rangefeed_test (Summary)
/home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-dbg/testlogs/pkg/kv/kvserver/rangefeed/rangefeed_test/test.log

@tbg
Member

tbg commented Apr 28, 2022

=== RUN TestProcessorMemoryBudgetExceeded
-- Test timed out at 2022-04-28 15:23:13 UTC --

Member

@srosenberg srosenberg left a comment


Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):

	// Ensure context is cancelled on all errors, to prevent gRPC stream leaks.
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
  • What if there were no errors? In case of EOF, this function returns args.Timestamp, nil. Won't canceling in this case affect other goroutines doing partialRangeFeed?
  • Any ideas as to why this leak wasn't picked up by any of the unit tests? I am guessing it's because we don't have a specific failpoint which forces an error path. Could we add a regression test for this?
  • Looking at this code, I am a bit terrified that we probably have many other instances of this bug. What could we do to both reduce the chance of introducing the error and increase the detection rate? E.g., is there a better way of documenting a context that the callee must cancel in case of an error? Could we add a sweeper which cancels groups which are found to be blocked on ctx.Done?

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):

What if there were no errors? In case of EOF, this function returns args.Timestamp, nil. Won't canceling in this case affect other goroutines doing partialRangeFeed?

Contexts are immutable. We're creating a new context here which is scoped to this function and propagated down to callees. It doesn't affect the caller's context.
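For illustration, a minimal sketch (standard library only, not code from this PR) of a child context being cancelled inside a function without affecting the caller's context:

package main

import (
	"context"
	"fmt"
)

// doWork derives a child context and cancels it on return. The cancellation
// only affects the child (and whatever the child was passed to), never the
// parent context owned by the caller.
func doWork(ctx context.Context) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	// ... pass ctx to downstream calls (e.g. open a gRPC stream) ...
	return nil
}

func main() {
	parent, parentCancel := context.WithCancel(context.Background())
	defer parentCancel()

	_ = doWork(parent)
	fmt.Println(parent.Err()) // <nil>: cancelling the child did not touch the parent
}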

Any ideas as to why this leak wasn't picked up by any of the unit tests? I am guessing it's because we don't have a specific failpoint which forces an error path. Could we add a regression test for this?

We check for goroutine leaks in tests using leaktest, which compares snapshots of goroutines before and after a test. However, the leak here is temporary -- the streams/goroutines will get cleaned up when the overall rangefeed terminates (when the top-level context cancels); they only leak while the rangefeed is running. It is not clear to me how we would detect this in general (how do we distinguish between valid and invalid goroutines during test execution?), especially not when they're spawned by third-party libraries.
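For reference, this is roughly how leaktest is wired into a test (a sketch; the test name is hypothetical):

package kvcoord_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
)

// AfterTest snapshots the goroutines that exist when the test starts and
// fails the test if goroutines not in that snapshot are still running when
// the test returns.
func TestRangefeedExample(t *testing.T) {
	defer leaktest.AfterTest(t)()
	// ... test body ...
}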

We could write a very targeted regression test for singleRangeFeed that sets up a convoluted scenario to check for this one specific leak. However, I'm not sure if that test is going to bring any additional value over the code comment, since it would essentially just test whether someone removes that context cancellation line. It seems too narrow in scope, but I'll add one if you think it's valuable.

Looking at this code, I am a bit terrified that we probably have many other instances of this bug. What could we do to both reduce the chance of introducing the error and increase the detection rate?

leaktest already accounts for the most egregious offenses, and is mandated for all tests via a linter. We could possibly write something specific to gRPC streams, maybe using a gRPC interceptor or something to keep track of expected streams -- but the way the caller would signal that it's done with a gRPC stream is to cancel the context, so it's a bit of a catch-22. I don't immediately have any good ideas for how we would go about this without better language support (e.g. in Rust it might be possible to enforce this at compile-time).
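As a rough sketch of that interceptor idea (hypothetical names; note that it can only observe caller-side cancellation, which is exactly the catch-22 above):

package streamtrack

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
)

// openStreams counts client streams that have been opened but whose caller
// context has not yet been cancelled.
var openStreams int64

// TrackStreams is a grpc.StreamClientInterceptor that increments the counter
// when a stream is opened and decrements it once the caller's context is done.
func TrackStreams(
	ctx context.Context,
	desc *grpc.StreamDesc,
	cc *grpc.ClientConn,
	method string,
	streamer grpc.Streamer,
	opts ...grpc.CallOption,
) (grpc.ClientStream, error) {
	cs, err := streamer(ctx, desc, cc, method, opts...)
	if err != nil {
		return nil, err
	}
	atomic.AddInt64(&openStreams, 1)
	go func() {
		// A stream that finishes on its own (server close, drained to EOF) is
		// not detected here until the caller's context is cancelled.
		<-ctx.Done()
		atomic.AddInt64(&openStreams, -1)
	}()
	return cs, nil
}

It would be registered with grpc.WithStreamInterceptor when dialing, and a test or metric could then assert that openStreams returns to a baseline.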

is there a better way of documenting a context that the callee must cancel in case of an error?

This isn't always the case -- if the error came from the gRPC stream itself then the context doesn't need to be cancelled, and it may not always be appropriate/desirable to cancel it. See the gRPC stream contract here:

https://github.com/cockroachdb/vendored/blob/38303b08ef278f367bba4cab699716981ba0038d/google.golang.org/grpc/stream.go#L137-L146
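Paraphrasing that contract in code (a sketch with a stand-in stream type, not the actual rangefeed client): the caller can release the stream either by receiving until a non-nil error or by cancelling the context it was opened with, which is what the defer cancel() in this change guarantees on early-return paths.

package example

import (
	"context"
	"errors"
	"io"
)

// eventStream stands in for a generated gRPC client-stream type; only Recv
// matters for the contract being illustrated.
type eventStream interface {
	Recv() (interface{}, error)
}

func consume(ctx context.Context, open func(context.Context) (eventStream, error)) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // "Cancel the context provided": releases the stream on every return path

	stream, err := open(ctx)
	if err != nil {
		return err
	}
	for {
		if _, err := stream.Recv(); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // drained until a non-nil error; the stream is done
			}
			return err
		}
	}
}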

Could we add a sweeper which cancels groups which are found to be blocked on ctx.Done?

How would we distinguish between the goroutines we want to cancel and the ones we don't? How would that be different from just cancelling the relevant context?

@erikgrinaker
Contributor Author

=== RUN TestProcessorMemoryBudgetExceeded
-- Test timed out at 2022-04-28 15:23:13 UTC --

I stressed this for 100k runs both with and without the race detector and didn't see any failures. Will bake this on master for a bit and see if the nightlies catch anything.

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):


PS: I feel like this sort of thing might be best tested by long-lived end-to-end full-cluster tests. In a steady state, we would expect goroutine counts to remain relatively stable, so we could e.g. assert that goroutine counts (as exposed via metrics) should not increase by too much over some timespan.
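A sketch of that kind of check, using only the standard library (the sampling window and threshold are arbitrary):

package example

import (
	"fmt"
	"runtime"
	"time"
)

// watchGoroutines samples the process-wide goroutine count over a window and
// reports if it grows by more than maxGrowth -- a crude steady-state check of
// the kind a long-lived test (or an alert on the exported metric) could use.
func watchGoroutines(window time.Duration, samples, maxGrowth int) error {
	start := runtime.NumGoroutine()
	for i := 0; i < samples; i++ {
		time.Sleep(window / time.Duration(samples))
		if cur := runtime.NumGoroutine(); cur-start > maxGrowth {
			return fmt.Errorf("goroutine count grew from %d to %d within %s", start, cur, window)
		}
	}
	return nil
}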

Member

@srosenberg srosenberg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

PS: I feel like this sort of thing might be best tested by long-lived end-to-end full-cluster tests. In a steady state, we would expect goroutine counts to remain relatively stable, so we could e.g. assert that goroutine counts (as exposed via metrics) should not increase by too much over some timespan.

Thanks Erik! From the gRPC stream contract, "2) Cancel the context provided" seems to be allowed even in the presence of other events. In fact, that's what the above defer is doing. I'll think a bit more about the sweeper for gRPC goroutines. I do agree that long-lived end-to-end cluster testing would be an effective way of detecting this leak.

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):

From the gRPC stream contract, "2) Cancel the context provided" seems to be allowed even in the presence of other events.

Yes, they're not mutually exclusive. If the stream has already been terminated by other means, then cancelling the context has no effect on it.

@erikgrinaker
Contributor Author

erikgrinaker commented Apr 29, 2022

TFTRs!

bors r=knz,tbg,srosenberg

@craig
Contributor

craig bot commented Apr 29, 2022

Build succeeded:
