On the current master branch, with the health balancer (#8545):
1. Set up a 3-node etcd cluster: A, B, and C.
2. Client D sends linearizable requests to A, B, and C:
   - quorum get and mutable operations (put, txn, delete)
   - requests are sent continuously in a for-loop [1]
3. The client balancer pins one endpoint, for instance A.
4. A gets network-partitioned: it is disconnected from B and C, but not from client D.
    sudo iptables -A OUTPUT -p tcp --destination-port PEER_PORT -j DROP
    sudo iptables -A INPUT -p tcp --destination-port PEER_PORT -j DROP
5. Client D gets errors:
    clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded
    INFO: 2017/10/06 20:02:32 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
    INFO: 2017/10/06 20:02:32 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing
    INFO: 2017/10/06 20:02:32 clientv3/balancer: unpin A:2379 (grpc: failed with network I/O error)
    INFO: 2017/10/06 20:02:32 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing
6. Expected: the unpinned endpoint A is gray-listed and is not retried.
7. What actually happens:
   a. Client D is not isolated from A, so the healthpb server on A is still serving.
   b. Thus, A is never gray-listed.
   c. The client D balancer retries endpoint A.
   d. Step 5 repeats.
   e. The client is stuck with endpoint A (no endpoint switch).
See full logs [2].
Solution
Gray-list the pinned endpoint on a context.DeadlineExceeded error (in the retry code path).
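The proposed behavior can be sketched with a toy balancer. This is not the clientv3 implementation: the `balancer` type, its fields, `onError`, and `demo` are hypothetical names used only for illustration, and the 5-second gray-list window is an assumed value. It only shows the idea of treating a request timeout on the pinned endpoint as a reason to gray-list it and pin another endpoint.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// balancer is a hypothetical, simplified stand-in for the client balancer:
// it pins one endpoint and remembers which endpoints are gray-listed.
type balancer struct {
	endpoints []string
	pinned    string
	grayUntil map[string]time.Time // endpoint -> end of its gray-list window
	grayFor   time.Duration        // assumed gray-list duration
}

func newBalancer(eps []string, grayFor time.Duration) *balancer {
	return &balancer{
		endpoints: eps,
		pinned:    eps[0],
		grayUntil: map[string]time.Time{},
		grayFor:   grayFor,
	}
}

// onError gray-lists the pinned endpoint when the request timed out, then
// pins the first endpoint that is not currently gray-listed. Other errors
// are ignored in this sketch.
func (b *balancer) onError(err error, now time.Time) {
	if !errors.Is(err, context.DeadlineExceeded) {
		return
	}
	b.grayUntil[b.pinned] = now.Add(b.grayFor)
	for _, ep := range b.endpoints {
		// A missing map entry is the zero time, so it counts as not gray-listed.
		if now.After(b.grayUntil[ep]) {
			b.pinned = ep
			return
		}
	}
	// Every endpoint is gray-listed: keep the current pin.
}

// demo replays the scenario from the issue and records the pinned endpoint
// after each step.
func demo() []string {
	b := newBalancer([]string{"A", "B", "C"}, 5*time.Second)
	now := time.Unix(0, 0)
	history := []string{b.pinned}
	b.onError(errors.New("some other error"), now) // not a timeout: pin unchanged
	history = append(history, b.pinned)
	b.onError(context.DeadlineExceeded, now) // timeout: A gray-listed, B pinned
	history = append(history, b.pinned)
	return history
}

func main() {
	fmt.Println(demo()) // prints [A A B]
}
```

In the real client the timeout surfaces as a gRPC status error, so the check in the retry path would likely need to inspect the status code rather than rely on `errors.Is` alone.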
Workaround
Passing clientv3.WithRequireLeader(ctx), as in cli.Put(clientv3.WithRequireLeader(ctx), "foo", "bar") or cli.Get(clientv3.WithRequireLeader(ctx), "foo"), makes the etcd server return rpc error: code = Unavailable desc = etcdserver: no leader to the client, which triggers the balancer to gray-list the isolated endpoint and switch to another one.
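Why the workaround helps can be shown with a small model that uses only the standard library. Everything here is a hypothetical simplification: `isolatedMember`, `put`, and `tryPut` are made-up names, the `requireLeader` flag stands in for `clientv3.WithRequireLeader`, and the 50 ms timeout is arbitrary. The model assumes the partitioned member has already noticed that it has no leader.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoLeader reuses the message from the issue; the variable is hypothetical.
var errNoLeader = errors.New("etcdserver: no leader")

// isolatedMember models a member that is cut off from its peers: it still
// accepts client requests but can never reach quorum.
type isolatedMember struct{}

// put models a linearizable write against the isolated member.
func (isolatedMember) put(ctx context.Context, requireLeader bool) error {
	if requireLeader {
		// Fail fast with a distinct error that a balancer can act on.
		return errNoLeader
	}
	// Without the flag the request waits for a quorum that never arrives,
	// so the only outcome is the client's own deadline.
	<-ctx.Done()
	return ctx.Err()
}

// tryPut issues one request with a short client-side timeout.
func tryPut(requireLeader bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return isolatedMember{}.put(ctx, requireLeader)
}

func main() {
	fmt.Println("plain ctx:", tryPut(false))
	fmt.Println("require leader:", tryPut(true))
}
```

The first call ends with the context deadline error, which the balancer in this issue does not treat as a reason to switch endpoints. The second call returns the no-leader error immediately, which matches the unpin and re-pin seen in the GET logs.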
INFO: 2017/10/06 20:01:31 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:01:31 clientv3/balancer: 10.138.0.22:2379 is up but not pinned (already pinned 10.138.0.21:2379)
INFO: 2017/10/06 20:01:31 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
CREATING PUT!
WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
INFO: 2017/10/06 20:02:24 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:24 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:02:32 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2017/10/06 20:02:32 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
INFO: 2017/10/06 20:02:32 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: failed with network I/O error)
INFO: 2017/10/06 20:02:32 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
rpc error: code = Unavailable desc = transport is closing
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
INFO: 2017/10/06 20:02:32 get error from resetTransport grpc: the connection is closing, transportMonitor returning
INFO: 2017/10/06 20:02:32 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:02:32 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:02:32 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WRITING...
INFO: 2017/10/06 20:02:45 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:45 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:02:58 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:58 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:11 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:11 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:19 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2017/10/06 20:03:19 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: failed with network I/O error)
INFO: 2017/10/06 20:03:19 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
INFO: 2017/10/06 20:03:19 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
rpc error: code = Unavailable desc = transport is closing
INFO: 2017/10/06 20:03:19 get error from resetTransport grpc: the connection is closing, transportMonitor returning
INFO: 2017/10/06 20:03:19 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:03:19 clientv3/balancer: pin 10.138.0.21:2379
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WRITING...
INFO: 2017/10/06 20:03:32 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:32 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:45 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:45 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:58 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:58 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
GET
INFO: 2017/10/06 20:10:24 clientv3/balancer: pin 10.138.0.23:2379
INFO: 2017/10/06 20:10:24 clientv3/balancer: 10.138.0.21:2379 is up but not pinned (already pinned 10.138.0.23:2379)
WARNING: 2017/10/06 20:10:24 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:10:24 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
CREATING GET!
INFO: 2017/10/06 20:17:38 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:17:38 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
WARNING: 2017/10/06 20:17:38 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:17:38 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
CREATING GET!
READING...
success!
READING...
success!
READING...
success!
READING...
INFO: 2017/10/06 20:18:10 clientv3/retry: retry for error rpc error: code = Unavailable desc = etcdserver: no leader (10.138.0.21:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: the connection is drained)
INFO: 2017/10/06 20:18:10 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = etcdserver: no leader (10.138.0.21:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: pin 10.138.0.22:2379
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.22:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.21:2379 is up but not pinned (already pinned 10.138.0.22:2379)
success!
READING...