clientv3/integration: block grpc.Dial until connection up in client closing tests #8720

Closed
wants to merge 1 commit

Conversation

@gyuho (Contributor) commented Oct 20, 2017

These tests exercise grpc.ErrClientConnClosing, so they must ensure that the
connection is up before closing it. Otherwise, they can time out once we
introduce back-off to the client retry logic.

Separate out from #8710.
c.f. #8691.

Integration tests use a 5-second dial timeout, and it is possible for the
balancer retry logic to wait the full 5 seconds and for the test to time out,
because gRPC calls grpc.downErr after the connection wait has already started
in the retry path. Just increasing the test timeout should be fine; in most
cases, grpc.downErr is called before the wait starts.

e.g.

=== RUN   TestWatchErrConnClosed
INFO: 2017/10/18 23:55:39 clientv3/balancer: pin "localhost:91847156765553894590"
INFO: 2017/10/18 23:55:39 clientv3/retry: wait 5s for healthy endpoint
INFO: 2017/10/18 23:55:39 clientv3/balancer: unpin "localhost:91847156765553894590" ("grpc: the client connection is closing")
INFO: 2017/10/18 23:55:39 clientv3/health-balancer: "localhost:91847156765553894590" becomes unhealthy ("grpc: the client connection is closing")
--- FAIL: TestWatchErrConnClosed (3.07s)
        watch_test.go:682: wc.Watch took too long
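
The fix is to block the client's dial until the connection is up, so an
endpoint is pinned before the test closes the client. A minimal sketch of the
idea (the helper name is illustrative; the options follow the standard
google.golang.org/grpc API, and the plaintext option assumes a test cluster
without TLS):

package integration

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// dialBlocking dials the endpoint and only returns once the connection
// is up, bounded by the 5s dial timeout used in the integration tests.
func dialBlocking(endpoint string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, endpoint,
		grpc.WithInsecure(), // assumed: plaintext test cluster
		grpc.WithBlock(),    // block until the connection is up
	)
}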

@gyuho requested a review from xiang90 on October 20, 2017 01:15
@xiang90 (Contributor) commented Oct 20, 2017

Have you verified that this commit does fix the issue?

@gyuho (Contributor, Author) commented Oct 20, 2017

Yeah, if we block until the connection is up, we basically make sure an endpoint is pinned before the RPC is sent, so we won't even need to trigger a retry.

@xiang90 (Contributor) commented Oct 20, 2017

I mean, have you tried to reproduce the test failure with and without this patch?

@@ -97,6 +97,7 @@ type ClusterConfig struct {
 	ClientTLS    *transport.TLSInfo
 	DiscoveryURL string
 	UseGRPC      bool
+	ClientWithBlock bool
Review comment (Contributor):
ClientDialWithBlock. Need a comment on this config too.
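
For illustration, the new flag could be consumed roughly like this (a
hypothetical sketch; the helper name and wiring are illustrative, not the
actual integration-test code):

package integration

import "google.golang.org/grpc"

// clientDialOpts translates the (renamed) ClientDialWithBlock flag
// into gRPC dial options for test clients.
func clientDialOpts(dialWithBlock bool) []grpc.DialOption {
	opts := []grpc.DialOption{grpc.WithInsecure()} // assumed: plaintext test cluster
	if dialWithBlock {
		opts = append(opts, grpc.WithBlock()) // block Dial until the connection is up
	}
	return opts
}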

…losing tests

These tests exercise grpc.ErrClientConnClosing, so they must ensure that the
connection is up before closing it. Otherwise, they can time out once we
introduce back-off to the client retry logic.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
@gyuho (Contributor, Author) commented Oct 20, 2017

@xiang90 This happened with the retry logic in PR #8710.

It was happening in my PR because:

  1. there is no pinned address (pinned == "") at the beginning of the RPC
  2. so the retry loop waits 5s for the connection to come up
  3. A is pinned just before the 5-second wait starts, and the test times out at 3 seconds

We don't have step 2 in the current master branch, and grpc-go always returns grpc.ErrClientConnClosing when *grpc.ClientConn.conns == nil:

cc.mu.Lock()
if cc.conns == nil {
	cc.mu.Unlock()
	return ErrClientConnClosing
}

So even if the connection is not up, it will return grpc.ErrClientConnClosing.
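
For illustration, the test pattern under discussion looks roughly like this
(a sketch with the cluster setup elided; the real tests live in
clientv3/integration and may also tolerate context.Canceled):

import (
	"context"
	"testing"

	"github.com/coreos/etcd/clientv3"
	"google.golang.org/grpc"
)

// testErrConnClosed closes the client, then issues an RPC: since
// cc.conns is nil after Close, grpc-go fails fast with
// grpc.ErrClientConnClosing instead of waiting for a connection.
func testErrConnClosed(t *testing.T, cli *clientv3.Client) {
	if err := cli.Close(); err != nil {
		t.Fatal(err)
	}
	if _, err := cli.Get(context.TODO(), "foo"); err != grpc.ErrClientConnClosing {
		t.Fatalf("expected %v, got %v", grpc.ErrClientConnClosing, err)
	}
}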

@xiang90 (Contributor) commented Oct 20, 2017

so the retry loop waits 5s for the connection to come up

Why does it need to wait for 5s?
The initial dial should take much less than 5 seconds, and the notify chan should be fired.

And in the current setup, we also wait on the notify chan before the RPC moves forward.

@gyuho (Contributor, Author) commented Oct 20, 2017

Why does it need to wait for 5s?

My PR was just using the dial timeout for the time being; we can make it shorter. The timeout on the connect notify was added to prevent blocking forever, in case the new endpoint never comes up and the context has no deadline (as in the ordering tests).
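
For reference, the wait in question has roughly this shape (a sketch; the
channel, error, and function names are illustrative, not the actual #8710
code):

import (
	"context"
	"errors"
	"time"
)

// waitForPin waits for the balancer's connect notification, bounding
// the wait with the dial timeout so a client whose context has no
// deadline cannot block forever.
func waitForPin(ctx context.Context, connectNotify <-chan struct{}, dialTimeout time.Duration) error {
	select {
	case <-connectNotify: // closed once an endpoint is pinned
		return nil
	case <-time.After(dialTimeout): // bound the wait
		return errors.New("no endpoint is available within dial timeout")
	case <-ctx.Done():
		return ctx.Err()
	}
}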

And in the current setup, we also wait on the notify chan before the RPC moves forward.

Right, so I don't think this test would fail with the current master branch.
When we add the connect wait in a later retry-fix PR, we can decrease the dial timeout.
So should we address this later?

@xiang90 (Contributor) commented Oct 20, 2017

My PR was just using the dial timeout for the time being; we can make it shorter. The timeout on the connect notify was added to prevent blocking forever, in case the new endpoint never comes up and the context has no deadline (as in the ordering tests).

I still do not quite understand this. Even with your PR, the code still waits on the notify chan (see https://github.com/coreos/etcd/pull/8710/files#diff-c35d636d8a625a8717da3ce4e9e58107R157). There is no difference between your PR and current master in how we wait for the initial connection (see https://github.com/coreos/etcd/blob/master/clientv3/retry.go#L54).

@gyuho (Contributor, Author) commented Oct 20, 2017

There is no difference between your PR and current master in how we wait for the initial connection.

The test was failing by timing out here: https://github.com/coreos/etcd/pull/8710/files#diff-c35d636d8a625a8717da3ce4e9e58107R233. When we do the wait, the expected behavior is that the notify chan is already closed, so we don't actually wait and the test doesn't time out.

But it's possible that by the time we start waiting on the notify chan in the select, the endpoint has been unpinned by the connection close, so the notify chan blocks.

  1. empty endpoint, so we need to wait for the connection to come up
  2. A is pinned
  3. connection waits on ConnectNotify
  4. ConnectNotify is received, so no timeout

Or:

  1. empty endpoint, so we need to wait for the connection to come up
  2. A is pinned
  3. A is unpinned from grpc.ErrClientConnClosing
  4. connection waits on ConnectNotify
  5. ConnectNotify blocks, so the test times out

I will double-check tomorrow after cleaning up my retry PR in a separate branch.

@xiang90 (Contributor) commented Oct 20, 2017

empty endpoint, so we need to wait for the connection to come up
A is pinned
A is unpinned from grpc.ErrClientConnClosing
connection waits on ConnectNotify
ConnectNotify blocks, so the test times out

Why does this only affect #8710? With #8710, it will at least time out after 5 seconds. With current master, it won't even time out, since it will simply block forever. I just do not understand why it wouldn't affect current master, but only #8710.

@gyuho (Contributor, Author) commented Oct 20, 2017

Ah... I was too focused on #8710, so I had it totally wrong. We just reverted the watch API wrapper today, so this is not reproducible in the master branch (which does not do any readyWait for watch).

@gyuho (Contributor, Author) commented Oct 20, 2017

For the master branch, take the earlier sequence:

  1. empty endpoint, so we need to wait for the connection to come up
  2. A is pinned
  3. A is unpinned from grpc.ErrClientConnClosing
  4. connection waits on ConnectNotify
  5. ConnectNotify blocks, so the test times out

We don't have step 4, because readyWait ensures that an address is pinned. So in the master branch, it would be:

  1. empty endpoint, so we need to wait for the connection to come up
  2. A is pinned
  3. A is unpinned from grpc.ErrClientConnClosing
  4. the RPC is sent without a pinned address
  5. gRPC returns grpc.ErrClientConnClosing since there is no connection

@xiang90 (Contributor) commented Oct 20, 2017

We don't have step 4, because readyWait ensures that an address is pinned.

Where do we ensure this in the master branch?

@gyuho (Contributor, Author) commented Oct 20, 2017

https://github.com/coreos/etcd/blob/40b6fcd7614a090c0be9e0be9f8856ad8b0efa89/clientv3/retry.go#L54-L58

readyWait selects on c.balancer.ConnectNotify(), which is balancer.upc, which is closed when an address is pinned.
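
Paraphrased, readyWait has roughly this shape (a sketch based on the linked
clientv3/retry.go; the exact signature may differ):

// readyWait blocks until the balancer signals a pinned address, or
// until either the RPC's or the client's context is done.
func readyWait(rpcCtx, clientCtx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready: // balancer.ConnectNotify(): closed when an address is pinned
		return nil
	case <-rpcCtx.Done():
		return rpcCtx.Err()
	case <-clientCtx.Done():
		return clientCtx.Err()
	}
}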

@xiang90 (Contributor) commented Oct 20, 2017

We don't have step 4, because readyWait ensures that an address is pinned. So in the master branch, it would be:

Here you said we do not have step 4 (connection waits on ConnectNotify) since readyWait ensures the address is pinned.

#8720 (comment)

Here you said readyWait ensures an address is pinned by selecting on c.balancer.ConnectNotify().

So we do have step 4? "Connection waits on ConnectNotify" and "selecting on c.balancer.ConnectNotify()" are exactly the same thing to me.

@gyuho (Contributor, Author) commented Oct 20, 2017

Sorry, I wasn't clear. I re-read the code and you are right.
I was trying to explain this case (see the toy sketch below):

1. A is pinned
2. A is unpinned
3. readyWait blocks until A is pinned again (forever)

The expected workflow:

1. readyWait blocks until A is pinned
2. A is pinned
3. readyWait returns
4. the RPC is sent
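
A toy sketch (not the actual etcd balancer code) of why step 3 can block
forever: pinning closes the notify channel, while unpinning swaps in a fresh
open one, so a waiter that arrives after the unpin blocks until the next pin:

import "sync"

type toyBalancer struct {
	mu  sync.Mutex
	upc chan struct{}
}

func (b *toyBalancer) pin() {
	b.mu.Lock()
	close(b.upc) // wake every ConnectNotify waiter
	b.mu.Unlock() // (toy code: pinning twice without an unpin would panic on double close)
}

func (b *toyBalancer) unpin() {
	b.mu.Lock()
	b.upc = make(chan struct{}) // new waiters block again
	b.mu.Unlock()
}

func (b *toyBalancer) ConnectNotify() <-chan struct{} {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.upc
}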

@xiang90 (Contributor) commented Oct 20, 2017

I am going to close this one. The reason you cannot reproduce the problem is that we have not enabled retry for watch. We can reevaluate this when retry is enabled on watch.

@xiang90 closed this Oct 20, 2017
@gyuho deleted the aaa branch on October 20, 2017 04:02
@gyuho (Contributor, Author) commented Oct 20, 2017

Yeah, sounds good. We can revisit this when we work on watch retries.
