-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Throttler: Use tmclient pool for CheckThrottler tabletmanager RPC #14979
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,7 +55,7 @@ var ( | |
) | ||
|
||
func registerFlags(fs *pflag.FlagSet) { | ||
fs.IntVar(&concurrency, "tablet_manager_grpc_concurrency", concurrency, "concurrency to use to talk to a vttablet server for performance-sensitive RPCs (like ExecuteFetchAs{Dba,AllPrivs,App})") | ||
fs.IntVar(&concurrency, "tablet_manager_grpc_concurrency", concurrency, "concurrency to use to talk to a vttablet server for performance-sensitive RPCs (like ExecuteFetchAs{Dba,App} and CheckThrottler)") | ||
fs.StringVar(&cert, "tablet_manager_grpc_cert", cert, "the cert to use to connect") | ||
fs.StringVar(&key, "tablet_manager_grpc_key", key, "the key to use to connect") | ||
fs.StringVar(&ca, "tablet_manager_grpc_ca", ca, "the server ca to use to validate servers when connecting") | ||
|
@@ -94,10 +94,9 @@ type tmc struct { | |
|
||
// grpcClient implements both dialer and poolDialer. | ||
type grpcClient struct { | ||
// This cache of connections is to maximize QPS for ExecuteFetch. | ||
// Note we'll keep the clients open and close them upon Close() only. | ||
// But that's OK because usually the tasks that use them are | ||
// one-purpose only. | ||
// This cache of connections is to maximize QPS for ExecuteFetchAs{Dba,App} and | ||
// CheckThrottler. Note we'll keep the clients open and close them upon Close() only. | ||
// But that's OK because usually the tasks that use them are one-purpose only. | ||
// The map is protected by the mutex. | ||
mu sync.Mutex | ||
rpcClientMap map[string]chan *tmc | ||
|
@@ -115,16 +114,17 @@ type poolDialer interface { | |
// Client implements tmclient.TabletManagerClient. | ||
// | ||
// Connections are produced by the dialer implementation, which is either the | ||
// grpcClient implementation, which reuses connections only for ExecuteFetch and | ||
// otherwise makes single-purpose connections that are closed after use. | ||
// grpcClient implementation, which reuses connections only for ExecuteFetchAs{Dba,App} | ||
// and CheckThrottler, otherwise making single-purpose connections that are closed | ||
// after use. | ||
// | ||
// In order to more efficiently use the underlying tcp connections, you can | ||
// instead use the cachedConnDialer implementation by specifying | ||
// | ||
// -tablet_manager_protocol "grpc-cached" | ||
// --tablet_manager_protocol "grpc-cached" | ||
// | ||
// The cachedConnDialer keeps connections to up to -tablet_manager_grpc_connpool_size distinct | ||
// tablets open at any given time, for faster per-RPC call time, and less | ||
// The cachedConnDialer keeps connections to up to --tablet_manager_grpc_connpool_size | ||
// distinct tablets open at any given time, for faster per-RPC call time, and less | ||
// connection churn. | ||
type Client struct { | ||
dialer dialer | ||
|
@@ -1002,12 +1002,29 @@ func (client *Client) Backup(ctx context.Context, tablet *topodatapb.Tablet, req | |
} | ||
|
||
// CheckThrottler is part of the tmclient.TabletManagerClient interface. | ||
// It always tries to use a cached client via the dialer pool as this is | ||
// called very frequently between tablets when the throttler is enabled in | ||
// a keyspace and the overhead of creating a new gRPC connection/channel | ||
// and dialing the other tablet every time is not practical. | ||
func (client *Client) CheckThrottler(ctx context.Context, tablet *topodatapb.Tablet, req *tabletmanagerdatapb.CheckThrottlerRequest) (*tabletmanagerdatapb.CheckThrottlerResponse, error) { | ||
c, closer, err := client.dialer.dial(ctx, tablet) | ||
if err != nil { | ||
return nil, err | ||
var c tabletmanagerservicepb.TabletManagerClient | ||
var err error | ||
if poolDialer, ok := client.dialer.(poolDialer); ok { | ||
c, err = poolDialer.dialPool(ctx, tablet) | ||
if err != nil { | ||
return nil, err | ||
} | ||
} | ||
defer closer.Close() | ||
|
||
if c == nil { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What are the cases where we don’t have a pooled dialer here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not that we don't have a pooled dialer, it's that dialPool returns a nil TabletManagerClient.client. I would not expect it, from reading the function, but it's a fail-safe to prevent RPC failures in unexpected scenarios. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mattlord but wouldn’t there have been an error then? Can the pooled dialer return nil but also not return an error? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be unexpected, yes. It's a pointer that anyone could set to nil at any time though as it's a shared object. From looking at the code it (TabletManagerClient) could also exist in the map but the member (client) be nil as we don't check for that there. What's so concerning about the failsafe here? I could add a comment that it should not generally happen. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It’s that it seems very unexpected. In general on Go, you don’t nil check the return value if you have an error to indicate if it went right or wrong. If we’d do that everywhere, we would have very noisy code. So if the fallback can’t happen, we should imho remove it since it’s useless code. It also for example means it can never be properly tested or covered either if it can never happen. It’s also that the code being like this caused me to ask these questions which I think is a signal in itself too that it’s confusing 😃. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ajm188 what do you think? I see that this pattern came from this PR: #8368 I can only guess that the idea was we don't want to consider a pooled nil client value as an RPC level error here, but instead create a new dedicated tmclient and gRPC connection when needed for any reason so that the RPC should always succeed even when there's something odd going on with the tmclient pool? I wonder then if we shouldn't at least do this at the end of the dialPool function so that this nil value isn't returned yet again in the future:
I'm OK doing a minor refactor here, although it's unrelated to the issue this PR is meant to address. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mattlord yeah, I think we can tackle this separately for sure. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mattlord cached_client and client are two different paradigms (you linked a PR that added the cached client whereas this pr changes client), so i'm not sure i understand your question. as to the difference, as i understood them at the time of #8368:
hopefully that helps clarify both this thread and @shlomi-noach's above There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To summarize further: this PR is not using (as I would describe it) the cached client. It is updating the plain (non-cached) client to use a pool of connections for the CheckThrottler RPC (which I agree is a good idea). Separately, I'd like us to consider promoting the cached-client implementation to be the default over the non-cached (or, "oneshot") implementation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, @ajm188 ! For the cached client part, I meant that more generally — the tmclients that are cached in the buffered channel. But I agree that it’s not very precise/clear since we also have the |
||
var closer io.Closer | ||
c, closer, err = client.dialer.dial(ctx, tablet) | ||
if err != nil { | ||
return nil, err | ||
} | ||
defer closer.Close() | ||
} | ||
|
||
response, err := c.CheckThrottler(ctx, req) | ||
if err != nil { | ||
return nil, err | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see the same pattern is used in
ExecuteFetchAsDba
and inExecuteFetchAsApp
- so I'm assuming this is correct -- but still unsure -- I'm not sure I understand how this gets a pooled connections; and I don't see the pattern of "get from pool ; defer return to pool", so this does not read like a way of using a cached connection. MaybeExecuteFetchAsApp
's implementation is likewise incorrect?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see the implementation within this file:
vitess/go/vt/vttablet/grpctmclient/client.go
Lines 155 to 202 in a207a69
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vs dial that was previously being used which creates a new tabletmanagerclient with a new gRPC connection each time:
vitess/go/vt/vttablet/grpctmclient/client.go
Lines 140 to 153 in a207a69