Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client balancer with linearized request does not work under network partition #8660

Closed
gyuho opened this issue Oct 6, 2017 · 1 comment · Fixed by #8669
Closed

Client balancer with linearized request does not work under network partition #8660

gyuho opened this issue Oct 6, 2017 · 1 comment · Fixed by #8669
Assignees
Milestone

Comments

@gyuho
Copy link
Contributor

gyuho commented Oct 6, 2017

Problem

With current master branch with health balancer(#8545):

  1. Set up 3-node etcd cluster: A, B, and C
  2. Client D is sending linearizable request to A,B,C
    • quorum get, mutable operations: put, txn, delete
    • keep sending requests in for-loop [1]
  3. Client balancer pins endpoint A, for instance
  4. Now, A gets network-partitioned
    • disconnected to/from B and C, but not to/from client D
     sudo iptables -A OUTPUT -p tcp --destination-port PEER_PORT -j DROP
     sudo iptables -A INPUT -p tcp --destination-port PEER_PORT -j DROP
    
  5. Client D errors

    clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded
    INFO: 2017/10/06 20:02:32 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
    INFO: 2017/10/06 20:02:32 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing
    INFO: 2017/10/06 20:02:32 clientv3/balancer: unpin A:2379 (grpc: failed with network I/O error)
    INFO: 2017/10/06 20:02:32 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing

  6. Expect unpinned endpoint A is gray-listed, and then A does not get retried
  7. What is actually happening
    a. client D is not isolated from A; healthpb server is still serving
    b. thus, A is never gray-listed
    c. client D balancer retries endpoint A
    d. repeats step 5
    e. stuck with endpoint A (no endpoint switch)

See full logs [2].

Solution

Gray-list pinned endpoint on context.DeadlineExceeded error (in retry code path).

Workaround

Pass cli.Put(clientv3.WithRequireLeader(ctx), "foo", "bar") or cli.Get(clientv3.WithRequireLeader(ctx), "foo") makes etcd server return rpc error: code = Unavailable desc = etcdserver: no leader to client, triggering balancer to gray-list the isolated endpoint and switch to the other.

Reference

Directly related


[1] Click to expand

clientv3.SetLogger(grpclog.NewLoggerV2WithVerbosity(os.Stderr, os.Stderr, os.Stderr, 4))

cli, err := clientv3.New(clientv3.Config{
Endpoints: eps,
DialKeepAliveTime: 2 * time.Second,
DialKeepAliveTimeout: 2 * time.Second,
})
if err != nil {
panic(err)
}
defer cli.Close()

time.Sleep(3 * time.Second)

fmt.Println("CREATING GET!")
println()

for {
fmt.Println("READING...")
time.Sleep(5 * time.Second)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
_, err = cli.Get(ctx, "foo")
cancel()
if err != nil {
fmt.Println(err)
} else {
fmt.Println("success!")
}
time.Sleep(3 * time.Second)
}

[2] Click to expand

PUT

INFO: 2017/10/06 20:01:31 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:01:31 clientv3/balancer: 10.138.0.22:2379 is up but not pinned (already pinned 10.138.0.21:2379)
INFO: 2017/10/06 20:01:31 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
CREATING PUT!

WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
INFO: 2017/10/06 20:02:24 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:24 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:02:32 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2017/10/06 20:02:32 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
INFO: 2017/10/06 20:02:32 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: failed with network I/O error)
INFO: 2017/10/06 20:02:32 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
rpc error: code = Unavailable desc = transport is closing
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
INFO: 2017/10/06 20:02:32 get error from resetTransport grpc: the connection is closing, transportMonitor returning
INFO: 2017/10/06 20:02:32 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:02:32 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:02:32 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WRITING...
INFO: 2017/10/06 20:02:45 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:45 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:02:58 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:58 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:11 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:11 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:19 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2017/10/06 20:03:19 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: failed with network I/O error)
INFO: 2017/10/06 20:03:19 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
INFO: 2017/10/06 20:03:19 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
rpc error: code = Unavailable desc = transport is closing
INFO: 2017/10/06 20:03:19 get error from resetTransport grpc: the connection is closing, transportMonitor returning
INFO: 2017/10/06 20:03:19 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:03:19 clientv3/balancer: pin 10.138.0.21:2379
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WRITING...
INFO: 2017/10/06 20:03:32 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:32 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:45 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:45 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:58 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:58 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...

GET

INFO: 2017/10/06 20:10:24 clientv3/balancer: pin 10.138.0.23:2379
INFO: 2017/10/06 20:10:24 clientv3/balancer: 10.138.0.21:2379 is up but not pinned (already pinned 10.138.0.23:2379)
WARNING: 2017/10/06 20:10:24 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:10:24 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
CREATING GET!

READING...
success!
READING...
success!
READING...
success!
READING...
INFO: 2017/10/06 20:11:01 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
INFO: 2017/10/06 20:11:01 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
context deadline exceeded
READING...
INFO: 2017/10/06 20:11:09 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2017/10/06 20:11:09 clientv3/balancer: unpin 10.138.0.23:2379 (grpc: failed with network I/O error)
INFO: 2017/10/06 20:11:09 clientv3/balancer: 10.138.0.23:2379 is healthy
INFO: 2017/10/06 20:11:09 clientv3/balancer: pin 10.138.0.23:2379
WARNING: 2017/10/06 20:11:09 Failed to dial 10.138.0.21:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:11:09 Failed to dial 10.138.0.21:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:11:09 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:11:09 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
INFO: 2017/10/06 20:11:14 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
INFO: 2017/10/06 20:11:14 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
context deadline exceeded
READING...
INFO: 2017/10/06 20:11:27 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
INFO: 2017/10/06 20:11:27 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
context deadline exceeded
READING...
INFO: 2017/10/06 20:11:40 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
INFO: 2017/10/06 20:11:40 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
context deadline exceeded

GET with require leader

INFO: 2017/10/06 20:17:38 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:17:38 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
WARNING: 2017/10/06 20:17:38 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:17:38 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
CREATING GET!

READING...
success!
READING...
success!
READING...
success!
READING...
INFO: 2017/10/06 20:18:10 clientv3/retry: retry for error rpc error: code = Unavailable desc = etcdserver: no leader (10.138.0.21:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: the connection is drained)
INFO: 2017/10/06 20:18:10 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = etcdserver: no leader (10.138.0.21:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: pin 10.138.0.22:2379
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.22:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.21:2379 is up but not pinned (already pinned 10.138.0.22:2379)
success!
READING...

@xiang90
Copy link
Contributor

xiang90 commented Oct 7, 2017

good analysis. thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants