clientv3/balancer: handle network partition in health check #8669
Conversation
(force-pushed from bd030f2 to 1286a4e)
clientv3/balancer.go
Outdated
@@ -44,6 +44,8 @@ type balancer interface {
 	endpoints() []string
 	// pinned returns the current pinned endpoint.
 	pinned() string
+	// errored handles error from server-side.
+	errored(addr string, err error)
Shouldn't we call this through the down func returned from up?
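For context, the down func referred to here comes from gRPC's (since-deprecated) Balancer interface, where Up(addr) returns a down func(error) through which connection failures are reported back to the balancer. A minimal sketch of that pattern, with hypothetical names:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// sketchBalancer illustrates the Up/down pattern: up registers an
// address and returns a `down` closure; the connection layer calls
// that closure with an error to report failures back to the balancer.
type sketchBalancer struct {
	mu        sync.Mutex
	unhealthy map[string]error
}

func (b *sketchBalancer) up(addr string) (down func(error)) {
	b.mu.Lock()
	delete(b.unhealthy, addr) // address is live again
	b.mu.Unlock()
	return func(err error) {
		b.mu.Lock()
		b.unhealthy[addr] = err // gray-list the endpoint on failure
		b.mu.Unlock()
	}
}

func main() {
	b := &sketchBalancer{unhealthy: make(map[string]error)}
	down := b.up("127.0.0.1:2379")
	down(errors.New("connection lost")) // failure reported through down
	fmt.Println(b.unhealthy)
}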
(force-pushed from 5e0f81d to f6a3809)
clientv3/retry.go
Outdated
@@ -87,14 +88,10 @@ func (c *Client) newRetryWrapper(isStop retryStopErrFunc) retryRpcFunc {
func (c *Client) newAuthRetryWrapper() retryRpcFunc {
	return func(rpcCtx context.Context, f rpcFunc) error {
		for {
			pinned := c.balancer.pinned()
I guess this is still useful. Let's keep it for now?
clientv3/retry.go
Outdated
@@ -65,6 +65,7 @@ func (c *Client) newRetryWrapper(isStop retryStopErrFunc) retryRpcFunc {
 	}
 	if logger.V(4) {
 		logger.Infof("clientv3/retry: error %v on pinned endpoint %s", err, pinned)
+	c.balancer.down(pinned, err)
Is a timeout error the only error that can reach here?
func TestNetworkPartitionBalancerWithGet(t *testing.T) {
	testNetworkPartitionBalancer(t, func(cli *clientv3.Client, ctx context.Context) error {
		_, err := cli.Get(ctx, "foo")
For Get, we should see no error; the client should auto-retry.
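A sketch of what that Get-side check could look like (the helper name and timeout are assumptions, as are the usual context/testing/time and clientv3 imports):

// Hypothetical shape of the Get-side assertion (not the PR's final
// code): Get should succeed because the client retries internally
// against the remaining healthy endpoints.
func testGetWithPartition(t *testing.T, cli *clientv3.Client) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "foo"); err != nil {
		t.Errorf("want err = nil, got %v", err)
	}
}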
func TestNetworkPartitionBalancerWithPut(t *testing.T) {
	testNetworkPartitionBalancer(t, func(cli *clientv3.Client, ctx context.Context) error {
		_, err := cli.Put(ctx, "foo", "bar")
For Put, we should expect a deadline-exceeded error.
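And a corresponding Put-side sketch (same assumptions; the later test snippet compares against context.DeadlineExceeded directly, as done here):

// Hypothetical shape of the Put-side assertion (not the PR's final
// code): a write through the isolated member cannot reach quorum, so
// it should fail with context.DeadlineExceeded until the balancer
// switches to a healthy endpoint.
func testPutWithPartition(t *testing.T, cli *clientv3.Client) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if _, err := cli.Put(ctx, "foo", "bar"); err != context.DeadlineExceeded {
		t.Errorf("want %v, got %v", context.DeadlineExceeded, err)
	}
}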
clus.Members[0].InjectPartition(t, clus.Members[1:])
defer clus.Members[0].RecoverPartition(t, clus.Members[1:])

for i := 0; i < 5; i++ {
Get and Put expect different behavior in testing; let's not mix them together. We probably only need to reuse the code for setting up the client and partitioning, not the actual test/check part. A sketch of that split follows.
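One possible shape for the shared helper (names and signatures are assumptions, not the PR's final code; integration.NewClusterV3 and the partition calls are from etcd's integration package):

// Shared setup only; each test supplies its own check callback.
func testNetworkPartitionBalancer(t *testing.T, check func(*testing.T, *clientv3.Client)) {
	clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
	defer clus.Terminate(t)

	// shared part: isolate member 0 from the rest of the cluster
	clus.Members[0].InjectPartition(t, clus.Members[1:])
	defer clus.Members[0].RecoverPartition(t, clus.Members[1:])

	// op-specific expectations (Get: no error; Put: deadline exceeded)
	// live entirely in the callback
	check(t, clus.Client(0))
}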
clientv3/balancer.go
Outdated
@@ -44,6 +44,8 @@ type balancer interface {
 	endpoints() []string
 	// pinned returns the current pinned endpoint.
 	pinned() string
+	// errored handles error from server-side.
+	down(addr string, err error)
ok... endpointError is more appropriate actually.
sorry for the back and forth on this one.
hb.mu.RUnlock()
if skip || !bad {
	return true
}
// prevent isolated member's endpoint from being infinitely retried, as follows:
how does this change affect what we are fixing in this PR?
Often, the timed-out endpoint gets retried, passes this check, and gets pinned again. This is problematic when the member is the isolated one in a network partition but still serves its health server as SERVING. In such a case, we keep it in the unhealthy list so that other endpoints can be prioritized within the next timeout. Basically, we need this as well to fix the network partition issue.
is this testable?
Hard to test, since we can't control the gRPC side. But with debugging enabled, I can manually confirm this happens.
integration/cluster.go
Outdated
-	client, err := NewClientV3(m)
-	if err != nil {
-		t.Fatalf("cannot create client: %v", err)
+	if !cfg.NoClient {
do we really need this for this PR?
Let me remove it in the next push.
thank you.
(force-pushed from 08a9f4f to a043fbc)
_, err := cli.Get(ctx, "a")
cancel()
if err != nil {
	t.Fatal(err)
Use t.Error here. Also we need a better failure description: want err = nil, got %v.
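That is, something along the lines of:

if err != nil {
	t.Errorf("want err = nil, got %v", err)
}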
func TestNetworkPartitionBalancerWithPut(t *testing.T) {
	testNetworkPartitionBalancer(t, func(cli *clientv3.Client) {
		var err error
		for i := 0; i < 5; i++ {
Why 5 times? Wouldn't a simple second retry go through?
Slow CIs take several seconds to switch endpoints. Instead, I increased the wait time and use only 2 tries.
	t.Fatalf("#%d: expected %v, got %v", i, context.DeadlineExceeded, err)
}
if i == 0 { // give enough time for endpoint switch
	time.Sleep(5 * time.Second)
We need to get rid of this eventually; we probably need a wait-for-endpoint-switch func.
We can expose the balancer interface with ConnectNotify. Should we?
For this PR, simply adding a TODO is good enough.
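A possible shape for such a helper, purely hypothetical since the balancer's notification channel is not exposed as of this PR (assumes errors and time imports):

// waitEndpointSwitch blocks until the balancer signals a new pinned
// connection on `connected`, or gives up after `timeout`. The channel
// argument stands in for an exposed ConnectNotify-style channel.
func waitEndpointSwitch(connected <-chan struct{}, timeout time.Duration) error {
	select {
	case <-connected: // balancer pinned a new endpoint
		return nil
	case <-time.After(timeout):
		return errors.New("timed out waiting for endpoint switch")
	}
}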
clientv3/health_balancer.go
Outdated
func (hb *healthBalancer) mayPin(addr grpc.Address) bool {
	hb.mu.RLock()
	skip := len(hb.addrs) == 1 || len(hb.unhealthy) == 0
-	_, bad := hb.unhealthy[addr.Addr]
+	failed, bad := hb.unhealthy[addr.Addr]
failedTime?
clientv3/health_balancer.go
Outdated
// instead, return before grpc-healthcheck if failed within healthcheck timeout
if elapsed := time.Since(failed); elapsed < dur {
	if logger.V(4) {
		logger.Infof("clientv3/health-balancer: %s is up but not pinned (elapsed %v < timeout %v)", addr.Addr, elapsed, dur)
probably also log explicitly that we are not going to pin this since xxx.
LGTM
Only health-check when timeout elapses since last failure.
Otherwise network-partitioned member with active health-check
server would not be gray-listed, making health-balancer stuck
with isolated endpoint. Also clarifies some log messages.
Mark partitioned member as unhealthy in retry code paths.
Previous behavior is when server returns errors, retry
wrapper does not do anything, while passively expecting
balancer to gray-list the isolated endpoint. This is
problematic when multiple endpoints are passed, and
network partition happens.
This patch adds 'endpointError' method to 'balancer' interface
to actively (possibly even before health-check API gets called)
handle RPC errors and gray-list endpoints for the time being,
thus speeding up the endpoint switch.
This is safe in a single-endpoint case, because balancer will
retry no matter what in such case.
Added integration tests fail without this patch.
Also manually tested to confirm this fixes #8660.
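The retry-path change the message describes amounts to roughly the following (a simplified sketch; the interface, types, and signatures are illustrative, not the exact patch; assumes a context import):

// On an RPC error against the pinned endpoint, actively report it to
// the balancer (gray-listing it) instead of waiting for the periodic
// health check to notice.
type balancer interface {
	pinned() string
	endpointError(addr string, err error)
}

func newRetryWrapper(b balancer, isStop func(error) bool) func(ctx context.Context, f func(context.Context) error) error {
	return func(ctx context.Context, f func(context.Context) error) error {
		for {
			pinned := b.pinned()
			err := f(ctx)
			if err == nil {
				return nil
			}
			// gray-list the endpoint that just failed, possibly before
			// the health-check API is ever called
			b.endpointError(pinned, err)
			if isStop(err) || ctx.Err() != nil {
				return err
			}
		}
	}
}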