clientv3/integration: block grpc.Dial until connection up in client closing tests #8720

Closed
wants to merge 1 commit

Conversation

@gyuho (Contributor) commented Oct 20, 2017

These tests exercise grpc.ErrClientConnClosing, so they must ensure that the
connection is up before closing it. Otherwise, they can time out once we
introduce back-off to the client retry logic.

Separate out from #8710.
c.f. #8691.

Integration tests use a 5-second dial timeout, and it is possible for the
balancer retry logic to wait the full 5 seconds and for the test to time out,
because gRPC calls grpc.downErr after the connection wait has already started
in the retry path. Just increasing the test timeout should be fine; in most
cases, grpc.downErr is called before the wait starts.

e.g.

=== RUN   TestWatchErrConnClosed
INFO: 2017/10/18 23:55:39 clientv3/balancer: pin "localhost:91847156765553894590"
INFO: 2017/10/18 23:55:39 clientv3/retry: wait 5s for healthy endpoint
INFO: 2017/10/18 23:55:39 clientv3/balancer: unpin "localhost:91847156765553894590" ("grpc: the client connection is closing")
INFO: 2017/10/18 23:55:39 clientv3/health-balancer: "localhost:91847156765553894590" becomes unhealthy ("grpc: the client connection is closing")
--- FAIL: TestWatchErrConnClosed (3.07s)
        watch_test.go:682: wc.Watch took too long
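
The fix is to block the client's dial until the connection is up, so an
endpoint is pinned before the test closes the client. A minimal sketch of the
idea (the helper name is illustrative; the options follow the standard
google.golang.org/grpc API, and the plaintext option assumes a test cluster
without TLS):

package integration

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// dialBlocking dials the endpoint and only returns once the connection
// is up, bounded by the 5s dial timeout used in the integration tests.
func dialBlocking(endpoint string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, endpoint,
		grpc.WithInsecure(), // assumed: plaintext test cluster
		grpc.WithBlock(),    // block until the connection is up
	)
}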

@gyuho requested a review from xiang90 on October 20, 2017 01:15
@xiang90 (Contributor) commented Oct 20, 2017

Have you verified that this commit does fix the issue?

@gyuho (Contributor, Author) commented Oct 20, 2017

Yeah, if we block until the connection is up, we basically make sure an endpoint is pinned before the RPC is sent, so we won't even need to trigger a retry.

@xiang90 (Contributor) commented Oct 20, 2017

I mean, have you tried to reproduce the test failure with and without this patch?

@@ -97,6 +97,7 @@ type ClusterConfig struct {
 	ClientTLS    *transport.TLSInfo
 	DiscoveryURL string
 	UseGRPC      bool
+	ClientWithBlock bool
Review comment (Contributor):
ClientDialWithBlock. Need a comment on this config too.
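
For illustration, the new flag could be consumed roughly like this (a
hypothetical sketch; the helper name and wiring are illustrative, not the
actual integration-test code):

package integration

import "google.golang.org/grpc"

// clientDialOpts translates the (renamed) ClientDialWithBlock flag
// into gRPC dial options for test clients.
func clientDialOpts(dialWithBlock bool) []grpc.DialOption {
	opts := []grpc.DialOption{grpc.WithInsecure()} // assumed: plaintext test cluster
	if dialWithBlock {
		opts = append(opts, grpc.WithBlock()) // block Dial until the connection is up
	}
	return opts
}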

…losing tests

These tests exercise grpc.ErrClientConnClosing, so they must ensure that the
connection is up before closing it. Otherwise, they can time out once we
introduce back-off to the client retry logic.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
@gyuho (Contributor, Author) commented Oct 20, 2017

@xiang90 This happened with the retry logic in PR #8710.

It was happening in my PR because:

  1. there is no pinned address (pinned == "") at the beginning of the RPC
  2. so the retry loop waits 5s for the connection to come up
  3. A is pinned just before the 5-second wait starts, and the test times out at 3 seconds

We don't have step 2 in the current master branch, and grpc-go always returns grpc.ErrClientConnClosing when *grpc.ClientConn.conns == nil:

cc.mu.Lock()
if cc.conns == nil {
	cc.mu.Unlock()
	return ErrClientConnClosing
}

So even if the connection is not up, it will return grpc.ErrClientConnClosing.
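
For illustration, the test pattern under discussion looks roughly like this
(a sketch with the cluster setup elided; the real tests live in
clientv3/integration and may also tolerate context.Canceled):

import (
	"context"
	"testing"

	"github.com/coreos/etcd/clientv3"
	"google.golang.org/grpc"
)

// testErrConnClosed closes the client, then issues an RPC: since
// cc.conns is nil after Close, grpc-go fails fast with
// grpc.ErrClientConnClosing instead of waiting for a connection.
func testErrConnClosed(t *testing.T, cli *clientv3.Client) {
	if err := cli.Close(); err != nil {
		t.Fatal(err)
	}
	if _, err := cli.Get(context.TODO(), "foo"); err != grpc.ErrClientConnClosing {
		t.Fatalf("expected %v, got %v", grpc.ErrClientConnClosing, err)
	}
}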

@xiang90 (Contributor) commented Oct 20, 2017

so the retry loop waits 5s for the connection to come up

Why does it need to wait for 5s?
The initial dial should take much less than 5 seconds, and the notify chan should be fired.

And in the current setup, we also wait on the notify chan before the RPC moves forward.

@gyuho (Contributor, Author) commented Oct 20, 2017

Why does it need to wait for 5s?

My PR was just using the dial timeout for the time being; we can make it shorter. The timeout on the connect notify was added to prevent blocking forever, in case the new endpoint never comes up and the context has no deadline (as in the ordering tests).
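
For reference, the wait in question has roughly this shape (a sketch; the
channel, error, and function names are illustrative, not the actual #8710
code):

import (
	"context"
	"errors"
	"time"
)

// waitForPin waits for the balancer's connect notification, bounding
// the wait with the dial timeout so a client whose context has no
// deadline cannot block forever.
func waitForPin(ctx context.Context, connectNotify <-chan struct{}, dialTimeout time.Duration) error {
	select {
	case <-connectNotify: // closed once an endpoint is pinned
		return nil
	case <-time.After(dialTimeout): // bound the wait
		return errors.New("no endpoint is available within dial timeout")
	case <-ctx.Done():
		return ctx.Err()
	}
}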

And in the current setup, we also wait on the notify chan before the RPC moves forward.

Right, so I don't think this test would fail with the current master branch.
When we add the connect wait in a later retry-fix PR, we can decrease the dial timeout.
So should we address this later?

@xiang90 (Contributor) commented Oct 20, 2017

My PR was just using the dial timeout for the time being; we can make it shorter. The timeout on the connect notify was added to prevent blocking forever, in case the new endpoint never comes up and the context has no deadline (as in the ordering tests).

I still do not quite understand this. Even with your PR, the code still waits on the notify chan (see https://github.com/coreos/etcd/pull/8710/files#diff-c35d636d8a625a8717da3ce4e9e58107R157). There is no difference between your PR and current master in how we wait for the initial connection (see https://github.com/coreos/etcd/blob/master/clientv3/retry.go#L54).

@gyuho (Contributor, Author) commented Oct 20, 2017

There is no difference between your PR and current master in how we wait for the initial connection.

The test was failing by timing out here: https://github.com/coreos/etcd/pull/8710/files#diff-c35d636d8a625a8717da3ce4e9e58107R233. When we do the wait, the expected behavior is that the notify chan is already closed, so we don't actually wait and the test doesn't time out.

But it's possible that by the time we start waiting on the notify chan in the select, the endpoint has been unpinned by the connection close, so the notify chan blocks.

  1. empty endpoint, so we need to wait for the connection to come up
  2. A is pinned
  3. connection waits on ConnectNotify
  4. ConnectNotify is received, so no timeout

Or:

  1. empty endpoint, so we need to wait for the connection to come up
  2. A is pinned
  3. A is unpinned from grpc.ErrClientConnClosing
  4. connection waits on ConnectNotify
  5. ConnectNotify blocks, so the test times out

I will double-check tomorrow after cleaning up my retry PR in a separate branch.

@xiang90 (Contributor) commented Oct 20, 2017

empty endpoint, so we need to wait for the connection to come up
A is pinned
A is unpinned from grpc.ErrClientConnClosing
connection waits on ConnectNotify
ConnectNotify blocks, so the test times out

Why does this only affect #8710? With #8710, it will at least time out after 5 seconds. With current master, it won't even time out, since it will simply block forever. I just do not understand why it wouldn't affect current master, but only #8710.

@gyuho (Contributor, Author) commented Oct 20, 2017

Ah... I was too focused on #8710, so I had it totally wrong. We just reverted the watch API wrapper today, so this is not reproducible in the master branch (which does not do any readyWait for watch).

@gyuho (Contributor, Author) commented Oct 20, 2017

For the master branch, take the earlier sequence:

  1. empty endpoint, so we need to wait for the connection to come up
  2. A is pinned
  3. A is unpinned from grpc.ErrClientConnClosing
  4. connection waits on ConnectNotify
  5. ConnectNotify blocks, so the test times out

We don't have step 4, because readyWait ensures that an address is pinned. So in the master branch, it would be:

  1. empty endpoint, so we need to wait for the connection to come up
  2. A is pinned
  3. A is unpinned from grpc.ErrClientConnClosing
  4. the RPC is sent without a pinned address
  5. gRPC returns grpc.ErrClientConnClosing since there is no connection

@xiang90 (Contributor) commented Oct 20, 2017

We don't have step 4, because readyWait ensures that an address is pinned.

Where do we ensure this in the master branch?

@gyuho (Contributor, Author) commented Oct 20, 2017

https://github.com/coreos/etcd/blob/40b6fcd7614a090c0be9e0be9f8856ad8b0efa89/clientv3/retry.go#L54-L58

readyWait selects on c.balancer.ConnectNotify(), which is balancer.upc, which is closed when an address is pinned.
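
Paraphrased, readyWait has roughly this shape (a sketch based on the linked
clientv3/retry.go; the exact signature may differ):

// readyWait blocks until the balancer signals a pinned address, or
// until either the RPC's or the client's context is done.
func readyWait(rpcCtx, clientCtx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready: // balancer.ConnectNotify(): closed when an address is pinned
		return nil
	case <-rpcCtx.Done():
		return rpcCtx.Err()
	case <-clientCtx.Done():
		return clientCtx.Err()
	}
}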

@xiang90 (Contributor) commented Oct 20, 2017

We don't have step 4, because readyWait ensures that an address is pinned. So in the master branch, it would be:

Here you said we do not have step 4 (connection waits on ConnectNotify) since readyWait ensures the address is pinned.

#8720 (comment)

Here you said readyWait ensures an address is pinned by selecting on c.balancer.ConnectNotify().

So we do have step 4? "Connection waits on ConnectNotify" and "selecting on c.balancer.ConnectNotify()" are exactly the same thing to me.

@gyuho (Contributor, Author) commented Oct 20, 2017

Sorry, I wasn't clear. I re-read the code and you are right.
I was trying to explain this case (see the toy sketch below):

1. A is pinned
2. A is unpinned
3. readyWait blocks until A is pinned again (forever)

The expected workflow:

1. readyWait blocks until A is pinned
2. A is pinned
3. readyWait returns
4. the RPC is sent
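
A toy sketch (not the actual etcd balancer code) of why step 3 can block
forever: pinning closes the notify channel, while unpinning swaps in a fresh
open one, so a waiter that arrives after the unpin blocks until the next pin:

import "sync"

type toyBalancer struct {
	mu  sync.Mutex
	upc chan struct{}
}

func (b *toyBalancer) pin() {
	b.mu.Lock()
	close(b.upc) // wake every ConnectNotify waiter
	b.mu.Unlock() // (toy code: pinning twice without an unpin would panic on double close)
}

func (b *toyBalancer) unpin() {
	b.mu.Lock()
	b.upc = make(chan struct{}) // new waiters block again
	b.mu.Unlock()
}

func (b *toyBalancer) ConnectNotify() <-chan struct{} {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.upc
}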

@xiang90 (Contributor) commented Oct 20, 2017

I am going to close this one. The reason you cannot reproduce the problem is that we have not enabled retry for watch. We can reevaluate this when retry is enabled on watch.

@xiang90 closed this Oct 20, 2017
@gyuho deleted the aaa branch on October 20, 2017 04:02
@gyuho (Contributor, Author) commented Oct 20, 2017

Yeah, sounds good. We can revisit this when we work on watch retries.
