clientv3/integration: block grpc.Dial until connection up in client closing tests #8720
Conversation
Have you verified that this commit does fix the issue?
Yeah, if we block until the connection is up.
I mean, have you tried to reproduce the test failure with and without this patch?
integration/cluster.go
@@ -97,6 +97,7 @@ type ClusterConfig struct {
 	ClientTLS    *transport.TLSInfo
 	DiscoveryURL string
 	UseGRPC      bool
+	ClientWithBlock bool
ClientDialWithBlock. Need comment on this config too.
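For context, wiring such a flag through to the dial path could look like the sketch below; the helper name and the insecure-credentials default are assumptions for illustration, not the actual etcd test harness.

package integration

import "google.golang.org/grpc"

// dialOpts is a hypothetical helper that translates a ClientWithBlock-style
// config flag into gRPC dial options. grpc.WithBlock() makes grpc.Dial /
// grpc.DialContext wait until the connection is established (or the context
// expires) instead of returning immediately.
func dialOpts(withBlock bool) []grpc.DialOption {
	opts := []grpc.DialOption{grpc.WithInsecure()}
	if withBlock {
		opts = append(opts, grpc.WithBlock())
	}
	return opts
}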
clientv3/integration: block grpc.Dial until connection up in client closing tests

Test grpc.ErrClientConnClosing, so it must ensure that the connection is up before closing it. Otherwise, it can time out when we introduce back-off to client retry logic.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
@xiang90 This happened with the retry logic PR #8710. It was happening in my PR because ...

We don't have step 2 in the current master branch, and grpc-go always returns:

cc.mu.Lock()
if cc.conns == nil {
	cc.mu.Unlock()
	return ErrClientConnClosing
}

So even if the connection is not up, it will return ErrClientConnClosing.
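A hedged sketch of that behavior (not the actual integration test): with a non-blocking dial, the ClientConn can be closed before the connection ever comes up, and the cc.conns == nil branch quoted above still yields grpc.ErrClientConnClosing.

package main

import (
	"fmt"

	"google.golang.org/grpc"
)

func main() {
	// Non-blocking dial: returns immediately, whether or not a connection
	// to the endpoint is actually established. The endpoint is illustrative.
	conn, err := grpc.Dial("localhost:2379", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}

	// Tear down the ClientConn right away; the connection may never have
	// come up at all.
	conn.Close()

	// A second Close hits the cc.conns == nil branch and returns
	// grpc.ErrClientConnClosing, independent of whether the connection
	// was ever up, which is why the check can pass without blocking.
	fmt.Println(conn.Close() == grpc.ErrClientConnClosing)
}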
Why does it need to wait for 5s? And in the current setup, we also wait for the notify chan before the RPC moves forward.
My PR was just using the dial timeout for the time being. We can change it to something shorter. The timeout for connect notify was added to prevent blocking forever in case the new endpoint never comes up and the context has no deadline (for the ordering tests).
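As a minimal sketch of that bounded wait (the channel, timeout value, and error are illustrative assumptions, not the actual clientv3 code):

package integration

import (
	"context"
	"errors"
	"time"
)

// waitConnect is a hypothetical helper: it blocks until connectNotify is
// closed (the connection is up), the caller's context is done, or the
// timeout elapses. The timeout keeps a test from hanging forever when the
// endpoint never comes up and the context carries no deadline.
func waitConnect(ctx context.Context, connectNotify <-chan struct{}, timeout time.Duration) error {
	select {
	case <-connectNotify:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(timeout):
		return errors.New("timed out waiting for connection")
	}
}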
Right, so I don't think this test would fail with the current master branch.
I still do not quite understand this. Even with your PR, the code still waits on the connect notify channel.
The test was failing by timing out here: https://github.com/coreos/etcd/pull/8710/files#diff-c35d636d8a625a8717da3ce4e9e58107R233. When we do the wait, the expected behavior is that ... But it's possible that when we start waiting on ...

Or ...
I will double-check tomorrow after cleaning up my retry PR in a separate branch.
Why does this only affect #8710? With #8710, it will at least time out after 5 seconds. With current master, it won't even time out since it will simply block forever. I just do not understand why it won't affect current master, but only #8710.
Ah... I was too stuck on #8710, so I was thinking about it totally wrong. We've just reverted the watch API wrapper today, so this is not reproducible in the master branch (which does not do any retry for watch).
For the master branch, ...

We don't have step 4, because readyWait ensures the address is pinned.
Where do we ensure this in the master branch?
Here you said we do not have step 4 (connection wait on ConnectNotify) since readyWait ensures the address is pinned. Here you said readyWait ensures the address is pinned by ... So we do have step 4?
Sorry, I wasn't clear. I re-read the code and you are right.
Expected workflow
I am going to close this one. The root cause of why you cannot reproduce the problem is that we have not enabled retry for watch. We can reevaluate this when retry is enabled on watch.
Yeah, sounds good. We can revisit when we work on the retry for watch.
Test grpc.ErrClientConnClosing, so it must ensure that the connection is up before closing it. Otherwise, it can time out when we introduce back-off to client retry logic.

Separate out from #8710.
c.f. #8691.
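Put together, the intended test shape might look like the sketch below (endpoint, credentials, and timeout are assumptions): block the dial until the connection is up, then close it, so the ErrClientConnClosing check exercises a live connection rather than relying on dial timing.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// grpc.WithBlock() makes DialContext wait until the connection is
	// established (or ctx expires), so the Close below acts on a
	// connection that is actually up.
	conn, err := grpc.DialContext(ctx, "localhost:2379",
		grpc.WithInsecure(),
		grpc.WithBlock(),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("state before close:", conn.GetState())

	// Close the established connection; further use of the ClientConn is
	// expected to surface grpc.ErrClientConnClosing.
	conn.Close()
	fmt.Println("closed:", conn.Close() == grpc.ErrClientConnClosing)
}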