3.3.7 panic: send on closed channel #9956

Closed
ekle opened this issue Jul 23, 2018 · 31 comments

@ekle

ekle commented Jul 23, 2018

Hi,
I got a panic on 3.3.7.
It happened today and 10 days ago.
version:

 ./etcd --version
etcd Version: 3.3.7
Git SHA: 56536de55
Go Version: go1.9.6
Go OS/Arch: linux/amd64

cmdline:

/opt/etcd/etcd --name xxx --discovery-srv xxx --initial-advertise-peer-urls=https://xxx:2380 --initial-cluster-token xxx --initial-cluster-state new --advertise-client-urls=https://xxxx --listen-client-urls=https://xxx:2379 --listen-peer-urls=https://xxx:2380 --cert-file=/etc/certificates/etcd/server.crt --key-file=/etc/certificates/etcd/server.key --trusted-ca-file=/etc/certificates/etcd/ca.crt --client-cert-auth --peer-cert-file=/etc/certificates/etcd/cluster.crt --peer-key-file=/etc/certificates/etcd/cluster.key --peer-trusted-ca-file=/etc/certificates/etcd/ca.crt --auto-compaction-retention=2 --proxy off

stacktrace:

panic: send on closed channel
goroutine 167547782 [running]:
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc4258a4ba0, 0xc4202b0960, 0x411648, 0x10)
#011/tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:169 +0x115
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).Write(0xc4258a4ba0, 0xc420507900, 0xc424ed134a, 0x5, 0x5, 0xc4201b5650, 0x2b, 0x2b, 0xc424ed1350, 0x5, ...)
#011/tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:255 +0xe6
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*serverStream).SendMsg(0xc422327d80, 0xf4d7a0, 0xc42497baa0, 0x0, 0x0)
#011/tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/stream.go:611 +0x275
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*serverStreamWithCtx).SendMsg(0xc422a85e90, 0xf4d7a0, 0xc42497baa0, 0xc42000d2b8, 0xc4204b2df8)
#011<autogenerated>:1 +0x50
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc422b1dcc0, 0xf4d7a0, 0xc42497baa0, 0x160b820, 0xc425e6f4c0)
#011/tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:179 +0x4b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*leaseLeaseKeepAliveServer).Send(0xc423b2b640, 0xc42497baa0, 0xc425e6f4c0, 0xc8a648e8a853ccb)
#011/tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3687 +0x49
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*LeaseServer).leaseKeepAlive(0xc4244d0580, 0x16110c0, 0xc423b2b640, 0xc4258a4ba0, 0xc420507900)
#011/tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:138 +0x18d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*LeaseServer).LeaseKeepAlive.func1(0xc4258a4c60, 0xc4244d0580, 0x16110c0, 0xc423b2b640)
#011/tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:89 +0x3f
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*LeaseServer).LeaseKeepAlive
#011/tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:88 +0x91
@cfc4n
Contributor

cfc4n commented Jul 24, 2018

It also happened in my case. I know how to reproduce it,
but it's a bug in grpc, not etcd. Is that right?
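
For readers of the trace above: the panicking frame is serverHandlerTransport.do in the vendored grpc code, which sends on an internal writes channel while another goroutine can close it. Below is a minimal Go sketch of that general pattern (my reading of the trace, not the actual grpc source): a check-then-send on a channel is not atomic, so the send can still land on, or be blocked on, a channel that was just closed.

package main

// A minimal sketch (not the actual grpc code) of the general pattern behind
// "panic: send on closed channel": one goroutine checks a "closed" signal and
// then sends on a work channel, while another goroutine closes that channel.
// The check and the send are not atomic, so the send can hit (or be blocked
// on) a channel that has just been closed, and Go panics on that.

import (
	"sync"
	"time"
)

type fakeTransport struct {
	writes   chan func()   // queued write operations, drained by a handler loop
	closedCh chan struct{} // signals that the transport is shutting down
	once     sync.Once
}

func (t *fakeTransport) do(fn func()) {
	select {
	case <-t.closedCh:
		return // already closed, drop the write
	default:
		// Race window: Close may run between the check above and this send.
		t.writes <- fn // panics once Close has closed t.writes
	}
}

func (t *fakeTransport) Close() {
	t.once.Do(func() {
		close(t.closedCh)
		close(t.writes) // closing the send side is what makes the panic possible
	})
}

func main() {
	t := &fakeTransport{
		writes:   make(chan func(), 1),
		closedCh: make(chan struct{}),
	}
	go func() {
		time.Sleep(time.Millisecond) // simulates the stream/client going away
		t.Close()
	}()
	for {
		t.do(func() {}) // eventually: panic: send on closed channel
	}
}

In the traces here the senders are the watch sendLoop and lease keep-alive goroutines, and the close would happen when the client connection goes away, which would explain why this fires during disconnects and hard power-offs.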

@gyuho
Contributor

gyuho commented Jul 24, 2018

@cfc4n Can you provide a way to reproduce?

@cfc4n
Contributor

cfc4n commented Jul 27, 2018

Yeah, I'm trying...

@gyuho
Contributor

gyuho commented Jul 27, 2018

@cfc4n If you can't reproduce, can you tell me more about how this happened (e.g. sending writes and panic, etc.)?

@ticpu

ticpu commented Oct 15, 2018

It seems to happen randomly multiple times a day on the version shipped in Bionic; I had to set up systemd with Restart=on-failure to make sure it stays up.
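
For reference, the override is just a small drop-in (a sketch; the exact unit name and path may differ per distro, e.g. /etc/systemd/system/etcd.service.d/override.conf):

[Service]
# Restart etcd if it exits non-zero, e.g. after this panic.
Restart=on-failure
# RestartSec is optional; 5s is an arbitrary example value.
RestartSec=5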

$ etcd --version
etcd Version: 3.2.17
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.10
Go OS/Arch: linux/amd64

$ apt-cache policy etcd
etcd:
Installed: 3.2.17+dfsg-1

$ journalctl -u etcd

etcd[3186091]: panic: send on closed channel
etcd[3186091]: goroutine 341451 [running]:
etcd[3186091]: google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc420215b20, 0xc4264605a0, 0x412d68, 0x8)
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/transport/handler_server.go:170 +0x10b
etcd[3186091]: google.golang.org/grpc/transport.(*serverHandlerTransport).WriteStatus(0xc420215b20, 0xc423231b00, 0xc426694310, 0xc426f71110, 0xc427a7cf00)
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/transport/handler_server.go:187 +0xbb
etcd[3186091]: google.golang.org/grpc.(*serverStream).RecvMsg.func1(0xc422e6f710, 0xc424c06ca0, 0xe63960, 0xc420206730)
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/stream.go:654 +0xb0
etcd[3186091]: google.golang.org/grpc.(*serverStream).RecvMsg(0xc422e6f710, 0xe63960, 0xc420206730, 0xfc7ae0, 0xc426f71110)
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/stream.go:668 +0x163
etcd[3186091]: github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).RecvMsg(0xc423255500, 0xe63960, 0xc420206730, 0xe65b20, 0xc420d13e01)
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:187 +0x4b
etcd[3186091]: github.com/coreos/etcd/etcdserver/etcdserverpb.(*watchWatchServer).Recv(0xc420206680, 0x0, 0x0, 0x0)
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3258 +0x62
etcd[3186091]: github.com/coreos/etcd/etcdserver/api/v3rpc.(*serverWatchStream).recvLoop(0xc4237cc460, 0xc424cbddc0, 0x0)
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:181 +0x4c
etcd[3186091]: github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func2(0xc4237cc460, 0xfd9be0, 0xc420206680, 0xc4263b1b60)
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:143 +0x40
etcd[3186091]: created by github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
etcd[3186091]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:142 +0x2b1
systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: etcd.service: Failed with result 'exit-code'.

Since this is a testing environment we're doing a lot of test reboots, and this happens once in a while then too; we get more detail in the logs, but the same panic message.

systemd[1]: Started etcd - highly-available key value store.
etcd[3360411]: lost the TCP streaming connection with peer 191ed67d2bee85a8 (stream Message reader)
etcd[3360411]: failed to read 191ed67d2bee85a8 on stream Message (unexpected EOF)
etcd[3360411]: peer 191ed67d2bee85a8 became inactive
etcd[3360411]: lost the TCP streaming connection with peer 191ed67d2bee85a8 (stream MsgApp v2 reader)
etcd[3360411]: peer 191ed67d2bee85a8 became active
etcd[3360411]: closed an existing TCP streaming connection with peer 191ed67d2bee85a8 (stream Message writer)
etcd[3360411]: established a TCP streaming connection with peer 191ed67d2bee85a8 (stream Message writer)
etcd[3360411]: closed an existing TCP streaming connection with peer 191ed67d2bee85a8 (stream MsgApp v2 writer)
etcd[3360411]: established a TCP streaming connection with peer 191ed67d2bee85a8 (stream MsgApp v2 writer)
etcd[3360411]: established a TCP streaming connection with peer 191ed67d2bee85a8 (stream Message reader)
etcd[3360411]: established a TCP streaming connection with peer 191ed67d2bee85a8 (stream MsgApp v2 reader)
etcd[3360411]: 103d3ab12194ae45 [term: 25] received a MsgApp message with higher term from 191ed67d2bee85a8 [term: 26]
etcd[3360411]: 103d3ab12194ae45 became follower at term 26
etcd[3360411]: 103d3ab12194ae45 [logterm: 26, index: 48394, vote: 0] rejected MsgVote from 191ed67d2bee85a8 [logterm: 25, index: 48393] at term 26
etcd[3360411]: panic: send on closed channel
etcd[3360411]: goroutine 776 [running]:
etcd[3360411]: google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc4271007e0, 0xc4266ade80, 0x412d68, 0x8)
etcd[3360411]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/transport/handler_server.go:170 +0x10b
etcd[3360411]: google.golang.org/grpc/transport.(*serverHandlerTransport).WriteStatus(0xc4271007e0, 0xc42710c600, 0xc42468c200, 0xc4240622d0, 0xc4271c16e0)
etcd[3360411]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/transport/handler_server.go:187 +0xbb
etcd[3360411]: google.golang.org/grpc.(*Server).processStreamingRPC(0xc42014a000, 0xfd96a0, 0xc4271007e0, 0xc42710c600, 0xc420230e70, 0x1642960, 0x0, 0x0, 0x0)
etcd[3360411]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/server.go:942 +0x36f
etcd[3360411]: google.golang.org/grpc.(*Server).handleStream(0xc42014a000, 0xfd96a0, 0xc4271007e0, 0xc42710c600, 0x0)
etcd[3360411]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/server.go:1010 +0x12b1
etcd[3360411]: google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc426680020, 0xc42014a000, 0xfd96a0, 0xc4271007e0, 0xc42710c600)
etcd[3360411]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/server.go:552 +0x9f
etcd[3360411]: created by google.golang.org/grpc.(*Server).serveStreams.func1
etcd[3360411]:         /build/etcd-4uSRAK/etcd-3.2.17+dfsg/_build/src/google.golang.org/grpc/server.go:550 +0xa1
systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: etcd.service: Failed with result 'exit-code'.

@ticpu

ticpu commented Oct 16, 2018

In fact, it seems to happen on all servers at once when it happens. Here is a log from our syslog server, grepping only the servers in the quorum.
etcd-crashfest.log

@wenjiaswe
Contributor

/cc @jpbetz

@daxtens

daxtens commented Oct 31, 2018

This happens to me as well. I have etcd providing storage for a Kubernetes master (from hyperkube:v1.11.4), and I've installed etcd from Bionic. It just randomly falls over, almost always in < 1 hour.

My etcd version is:

ubuntu@crustle:~$ etcd --version
etcd Version: 3.2.17
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.10
Go OS/Arch: linux/amd64

@ticpu

ticpu commented Nov 1, 2018

It really feels like I should open an issue on Ubuntu's Launchpad; this feels like a downstream issue.

@mcginne

mcginne commented Nov 8, 2018

We have also hit this issue when issuing a hard power off of a host during failure testing.

We had 1332 etcd clusters (3 members each) spread across 111 nodes.
The 111 nodes are spread across 3 zones (so 37 in each)
Each etcd cluster has 1 member in each zone

We performed a test where we did a hard shutdown of all nodes in 1 of the zones. We expected all etcds to maintain quorum as they had 2 members in the remaining zones.

Out of the 1332 clusters, 32 hit this issue at the time the hosts in one zone were powered off. The other 1300 clusters survived as expected.

The 32 clusters that hit the issue lost quorum: they had lost one member to the power-offs, and one other member hit the panic and terminated, leaving just one surviving member - so this is not good for HA.

The members that hit the panic seemed to be spread across a variety of nodes, and the affected clusters had members running on many different nodes in the zone that was powered down.

We ran a similar test previously, but only powering off 3 nodes in a zone, and none of the etcd clusters had any issues - so this seems to be a timing condition that can occur when a host shuts down or crashes, and it is quite difficult to hit.

We are using etcd-operator to manage the etcdclusters.

In some cases, from the logs it looks like the original peer gets removed from the cluster (as its node was shut down), and etcd-operator starts another peer, but presumably that peer then dies as well (the node it was scheduled on was subsequently shut down too):

2018-11-07 14:07:57.930811 W | etcdserver: read-only range request "key:\"/registry/initializerconfigurations/\" range_end:\"/registry/initializerconfigurations0\" " took too long (129.257275ms) to execute
2018-11-07 14:08:30.832009 W | etcdserver: read-only range request "key:\"/registry/statefulsets\" range_end:\"/registry/statefulsett\" count_only:true " took too long (184.074063ms) to execute
2018-11-07 14:11:55.829403 I | mvcc: store.index: compact 3804958
2018-11-07 14:11:55.831824 I | mvcc: finished scheduled compaction at 3804958 (took 1.473947ms)
2018-11-07 14:13:19.412073 W | etcdserver: read-only range request "key:\"/registry/initializerconfigurations/\" range_end:\"/registry/initializerconfigurations0\" " took too long (126.314268ms) to execute
2018-11-07 14:15:38.989657 I | etcdserver/membership: removed member fbada998cf3a70e5 from cluster 53753c2aa1bcfebe
2018-11-07 14:15:38.989706 I | rafthttp: stopping peer fbada998cf3a70e5...
2018-11-07 14:15:38.991351 I | rafthttp: closed the TCP streaming connection with peer fbada998cf3a70e5 (stream MsgApp v2 writer)
2018-11-07 14:15:38.991383 I | rafthttp: stopped streaming with peer fbada998cf3a70e5 (writer)
2018-11-07 14:15:38.992881 I | rafthttp: closed the TCP streaming connection with peer fbada998cf3a70e5 (stream Message writer)
2018-11-07 14:15:38.992907 I | rafthttp: stopped streaming with peer fbada998cf3a70e5 (writer)
2018-11-07 14:15:38.993208 I | rafthttp: stopped HTTP pipelining with peer fbada998cf3a70e5
2018-11-07 14:15:38.993239 I | rafthttp: stopped streaming with peer fbada998cf3a70e5 (stream MsgApp v2 reader)
2018-11-07 14:15:38.993298 I | rafthttp: stopped streaming with peer fbada998cf3a70e5 (stream Message reader)
2018-11-07 14:15:38.993311 I | rafthttp: stopped peer fbada998cf3a70e5
2018-11-07 14:15:38.993328 I | rafthttp: removed peer fbada998cf3a70e5
2018-11-07 14:16:43.133431 I | etcdserver/membership: added member 88bec81ac7d4f69a [https://etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc:2380] to cluster 53753c2aa1bcfebe
2018-11-07 14:16:43.133485 I | rafthttp: starting peer 88bec81ac7d4f69a...
2018-11-07 14:16:43.133551 I | rafthttp: started HTTP pipelining with peer 88bec81ac7d4f69a
2018-11-07 14:16:43.138484 I | rafthttp: started streaming with peer 88bec81ac7d4f69a (writer)
2018-11-07 14:16:43.139235 I | rafthttp: started peer 88bec81ac7d4f69a
2018-11-07 14:16:43.139273 I | rafthttp: added peer 88bec81ac7d4f69a
2018-11-07 14:16:43.139352 I | rafthttp: started streaming with peer 88bec81ac7d4f69a (writer)
2018-11-07 14:16:43.139386 I | rafthttp: started streaming with peer 88bec81ac7d4f69a (stream MsgApp v2 reader)
2018-11-07 14:16:43.139933 I | rafthttp: started streaming with peer 88bec81ac7d4f69a (stream Message reader)
2018-11-07 14:16:48.140114 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:16:53.140406 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:16:55.850451 I | mvcc: store.index: compact 3805350
2018-11-07 14:16:55.852660 I | mvcc: finished scheduled compaction at 3805350 (took 1.73239ms)
2018-11-07 14:16:58.140694 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:03.141261 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:08.141515 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:13.142290 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:18.142634 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:23.144316 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:28.144576 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:33.145053 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:38.145365 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:43.147031 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:48.147397 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:53.150801 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:17:58.151930 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:18:03.153258 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:18:08.154310 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:18:13.154663 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:18:18.155445 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:18:23.156258 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: i/o timeout
2018-11-07 14:18:28.156580 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: i/o timeout
2018-11-07 14:18:33.156848 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: i/o timeout
2018-11-07 14:18:38.157108 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: read udp 172.16.113.134:42711->172.20.0.10:53: i/o timeout
2018-11-07 14:18:43.158100 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: read udp 172.16.113.134:42711->172.20.0.10:53: i/o timeout
2018-11-07 14:18:48.158404 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:18:53.158949 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:18:58.159497 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:19:03.159769 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
2018-11-07 14:19:08.160055 W | rafthttp: health check for peer 88bec81ac7d4f69a could not connect: dial tcp: lookup etcd-fakecruiser-00300-17-1-tmbt9xd4gc.etcd-fakecruiser-00300-17-1.kubx-etcd-16.svc on 172.20.0.10:53: no such host
panic: send on closed channel

goroutine 1428 [running]:
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc420eb4180, 0xc423963720, 0x411648, 0x2)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:169 +0x115
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).Write(0xc420eb4180, 0xc420cac400, 0xc42415a708, 0x5, 0x5, 0xc422274600, 0x1f1, 0x1f1, 0xc42415a70e, 0x5, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:255 +0xe6
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*serverStream).SendMsg(0xc420c85300, 0xf8bae0, 0xc4239636d0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/stream.go:611 +0x275
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc4217e6800, 0xf8bae0, 0xc4239636d0, 0x3ff0000000000000, 0x3ff0000000000000)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:179 +0x4b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*watchWatchServer).Send(0xc420ccc170, 0xc4239636d0, 0x0, 0xc421b21ae8)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3495 +0x49
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc420261040)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:345 +0x8ad
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc420261040)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:132 +0x2b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:131 +0x29b

But I also have cases where the panic occurs before another member is scheduled:

2018-11-07 14:04:53.782030 I | mvcc: store.index: compact 2242113
2018-11-07 14:04:53.783298 I | mvcc: finished scheduled compaction at 2242113 (took 947.153µs)
2018-11-07 14:09:53.792107 I | mvcc: store.index: compact 2242504
2018-11-07 14:09:53.793893 I | mvcc: finished scheduled compaction at 2242504 (took 1.386502ms)
2018-11-07 14:11:41.279838 W | rafthttp: lost the TCP streaming connection with peer e2be25a05c05f6fa (stream MsgApp v2 reader)
2018-11-07 14:11:41.279916 E | rafthttp: failed to read e2be25a05c05f6fa on stream MsgApp v2 (read tcp 172.16.127.52:33756->172.16.119.144:2380: i/o timeout)
2018-11-07 14:11:41.279927 I | rafthttp: peer e2be25a05c05f6fa became inactive
2018-11-07 14:11:41.280494 W | rafthttp: lost the TCP streaming connection with peer e2be25a05c05f6fa (stream Message reader)
2018-11-07 14:14:53.809462 I | mvcc: store.index: compact 2242885
2018-11-07 14:14:53.810581 I | mvcc: finished scheduled compaction at 2242885 (took 810.65µs)
2018-11-07 14:19:53.822868 I | mvcc: store.index: compact 2243245
2018-11-07 14:19:53.824116 I | mvcc: finished scheduled compaction at 2243245 (took 919.883µs)
2018-11-07 14:23:05.440861 I | etcdserver: start to snapshot (applied: 3330335, lastsnap: 3320334)
2018-11-07 14:23:05.444606 I | etcdserver: saved snapshot at index 3330335
2018-11-07 14:23:05.445155 I | etcdserver: compacted raft log at 3325335
2018-11-07 14:23:05.960706 I | pkg/fileutil: purged file /var/etcd/data/member/snap/0000000000000002-0000000000320dca.snap successfully
panic: send on closed channel

goroutine 2691 [running]:
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc42051f2c0, 0xc42231d4a0, 0x411648, 0x2)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:169 +0x115
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).Write(0xc42051f2c0, 0xc420d3d400, 0xc421d180d8, 0x5, 0x5, 0xc423fac6c0, 0x203, 0x203, 0xc421d180de, 0x5, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:255 +0xe6
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*serverStream).SendMsg(0xc42052b100, 0xf8bae0, 0xc42231d450, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/stream.go:611 +0x275
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc421b33860, 0xf8bae0, 0xc42231d450, 0x3ff0000000000000, 0x3ff0000000000000)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:179 +0x4b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*watchWatchServer).Send(0xc421cc6010, 0xc42231d450, 0x0, 0xc4226ffae8)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3495 +0x49
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc421bd8960)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:345 +0x8ad
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc421bd8960)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:132 +0x2b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:131 +0x29b

Etcd version:

2018-10-18 15:56:26.579220 I | etcdmain: etcd Version: 3.3.1
2018-10-18 15:56:26.579237 I | etcdmain: Git SHA: 28f3f26c0
2018-10-18 15:56:26.579245 I | etcdmain: Go Version: go1.9.4
2018-10-18 15:56:26.579253 I | etcdmain: Go OS/Arch: linux/amd64

@gyuho
Contributor

gyuho commented Nov 8, 2018

In some cases, from the logs it looks like the original peer gets removed from the cluster (as its node was shutdown), and etcd-operator starts another peer, but presumably this peer then dies

So, this panic happens on the new node?

@mcginne

mcginne commented Nov 8, 2018

No, the panic is always on one of the existing nodes that was not shut down (although it is caused by one of its peers being shut down).

@mcginne

mcginne commented Dec 13, 2018

Does anyone have any thoughts on a potential fix for this? I can recreate it quite easily on our test system. We haven't hit it in prod yet, but it's quite serious when it does occur.
The fact that the panic comes from under go-grpc-prometheus/server_metrics.go:179 makes it look like the issue is in the code collecting gRPC metrics - quite frustrating to lose the etcd cluster because of a panic under metrics code...

I'd be quite happy to have the option to sacrifice the gRPC metrics if it meant we could avoid this panic.
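
That said, the go-grpc-prometheus frame in the trace is presumably just a thin pass-through - roughly the shape below (a paraphrase of what such a stream wrapper typically looks like, not the actual go-grpc-prometheus source) - so the send itself happens in the gRPC transport underneath it, and dropping the metrics would probably not make the panic go away:

package metricsdemo

import (
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// monitoredServerStream is an illustrative wrapper: SendMsg delegates to the
// embedded grpc.ServerStream and only bumps a counter on success, so the
// "send on closed channel" originates below it, in the transport layer.
type monitoredServerStream struct {
	grpc.ServerStream
	sent prometheus.Counter // illustrative field, not the real implementation
}

func (s *monitoredServerStream) SendMsg(m interface{}) error {
	err := s.ServerStream.SendMsg(m) // the panic fires inside this call
	if err == nil {
		s.sent.Inc()
	}
	return err
}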

@hexfusion
Contributor

Does anyone have any thoughts on a potential fix for this? I can recreate it quite easily on our test system.

Reviewing this thread, the steps to reproduce are unclear to me - can you outline them? Once we can reproduce this issue we can help resolve it.

@mcginne

mcginne commented Dec 13, 2018

I'm not sure how easy it will be to reproduce, as in our case we hit it in something like 32 out of 1300 cases.
My scenario that hits this regularly requires a 3 member etcd cluster - in our case it is backing store for Kubernetes, so it is processing constant traffic.
If I then do a hard shutdown of one of the hosts (VSI) I sometimes hit the issue on one of the other etcd members in the cluster.
In our case we are actually running 1332 etcd clusters across 111 nodes, and when I shut down 37 of the nodes we usually see the issue on between 10 and 40 of the etcd clusters.

I'm not sure if having multiple etcds on the same node is contributing, or if you would hit this with fewer etcds on each node. The fact that others have seen it probably means it can happen at lower scale; it's just that running at such scale increases the chances of hitting whatever timing condition is causing it.

So in terms of reproduction I think it's just a matter of issuing a hard shutdown - but it seems the window to hit the issue is small.

If there's any debug patch or anything else that would help to diagnose this, I should be able to run it on our test system.
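
For anyone trying to reproduce this without a full Kubernetes control plane, here is a rough sketch of a load generator (a hypothetical helper, not what we actually run - our traffic comes from the apiservers) that keeps watch and lease keep-alive streams open, so a hard power-off has active streams to race against. The import path and endpoint are assumptions for the 3.2/3.3-era client and an insecure local cluster:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3" // import path used by the 3.2/3.3-era client
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // add TLS config for secured clusters
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	for i := 0; i < 100; i++ {
		// Long-lived watch streams: the sendLoop frames in the traces above
		// are pushing watch responses onto streams like these.
		go func(i int) {
			for range cli.Watch(ctx, fmt.Sprintf("/repro/%d", i)) {
			}
		}(i)

		// Lease keep-alives, matching the LeaseKeepAlive trace in the
		// original report.
		lease, err := cli.Grant(ctx, 5)
		if err != nil {
			continue
		}
		ka, err := cli.KeepAlive(ctx, lease.ID)
		if err != nil {
			continue
		}
		go func() {
			for range ka { // drain keep-alive responses
			}
		}()
	}

	// Drive writes so the watches have events to send while a member is
	// being hard power-cycled.
	for {
		_, _ = cli.Put(ctx, "/repro/0", time.Now().String())
		time.Sleep(100 * time.Millisecond)
	}
}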

@hexfusion
Contributor

So in terms of reproduction I think it's just a matter of issuing a hard shutdown - but it seems the window to hit the issue is small.

@mcginne could you try to recreate this in your testing environment at a small scale, such as a single cluster on a single node? Use something to randomly perform the hard shutdown and see if you can capture the panic logging. If you can get that far it would be greatly appreciated and would go a long way toward getting to the bottom of this. Meanwhile I will take a look at the panic and see if @gyuho has any ideas.

@mcginne

mcginne commented Dec 19, 2018

Just so you're aware, I have set up 3 nodes with a single etcd cluster running across them. I have a loop powering one of the hosts off and on in the hope of recreating the issue at a smaller scale. It has currently got through ~50 loops with no issues so far, but I will leave it running.

@hexfusion
Contributor

@mcginne awesome! If anyone is going to recreate this it will be you! Thanks a lot for doing this and please keep us posted if you have any questions.

@relyt0925

Adding to this, we saw it again. This was a 3-node etcd cluster.

peer 1's logs

2019-01-08 23:31:44.935233 W | rafthttp: health check for peer 99a6e5492354482e could not connect: dial tcp 172.16.100.214:2380: i/o timeout
2019-01-08 23:31:45.130200 W | rafthttp: health check for peer 7733eb6f83e0e73 could not connect: dial tcp 172.16.0.112:2380: i/o timeout
2019-01-08 23:31:45.866573 I | embed: rejected connection from "172.16.233.128:12815" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:46.549867 I | embed: rejected connection from "172.16.233.128:22172" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:46.683891 I | embed: rejected connection from "172.16.233.128:14865" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:46.942291 I | embed: rejected connection from "172.16.233.128:53909" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:47.102318 I | embed: rejected connection from "172.16.233.128:17024" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:47.102758 I | embed: rejected connection from "172.16.233.128:41870" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:47.416660 I | embed: rejected connection from "172.16.233.128:40391" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:48.657340 I | embed: rejected connection from "172.16.233.128:63993" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:49.208777 I | embed: rejected connection from "172.16.233.128:38989" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:49.280381 I | embed: rejected connection from "172.16.233.128:36104" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:49.343921 I | embed: rejected connection from "172.16.233.128:45501" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:49.345204 I | embed: rejected connection from "172.16.233.128:5915" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:49.397227 I | embed: rejected connection from "10.127.72.17:34614" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:49.935455 W | rafthttp: health check for peer 99a6e5492354482e could not connect: dial tcp 172.16.100.214:2380: i/o timeout
2019-01-08 23:31:50.130474 W | rafthttp: health check for peer 7733eb6f83e0e73 could not connect: dial tcp 172.16.0.112:2380: i/o timeout
2019-01-08 23:31:50.861716 I | embed: rejected connection from "172.16.233.128:52025" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:51.266883 I | embed: rejected connection from "172.16.233.128:9514" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:51.716401 I | embed: rejected connection from "172.16.233.128:40469" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:53.722939 I | embed: rejected connection from "172.16.233.128:21346" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:53.758902 I | embed: rejected connection from "172.16.233.128:41267" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:53.761387 I | embed: rejected connection from "172.16.233.128:23906" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:54.935664 W | rafthttp: health check for peer 99a6e5492354482e could not connect: dial tcp 172.16.100.214:2380: i/o timeout
2019-01-08 23:31:55.131524 W | rafthttp: health check for peer 7733eb6f83e0e73 could not connect: dial tcp 172.16.0.112:2380: i/o timeout
2019-01-08 23:31:55.529517 I | embed: rejected connection from "172.16.149.128:48568" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:31:55.570072 W | etcdserver/api/v3rpc: failed to send watch response to gRPC stream ("rpc error: code = Unavailable desc = transport is closing")
2019-01-08 23:31:55.570334 W | etcdserver/api/v3rpc: failed to send watch response to gRPC stream ("rpc error: code = Unavailable desc = transport is closing")
panic: send on closed channel

goroutine 2486646159 [running]:
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc431e2c4e0, 0xc433c86b90, 0x411648, 0x2)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:169 +0x115
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).Write(0xc431e2c4e0, 0xc427d71500, 0xc438d58fb8, 0x5, 0x5, 0xc43975c000, 0x5aa, 0x5aa, 0xc438d58fbe, 0x5, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:255 +0xe6
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*serverStream).SendMsg(0xc425b55780, 0xf8bae0, 0xc433c86b40, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/stream.go:611 +0x275
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc429147d00, 0xf8bae0, 0xc433c86b40, 0x3ff0000000000000, 0x3ff0000000000000)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:179 +0x4b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*watchWatchServer).Send(0xc429aeb290, 0xc433c86b40, 0x1, 0xc42aba0548)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3495 +0x49
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc423f6fea0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:345 +0x8ad
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc423f6fea0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:132 +0x2b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:131 +0x29b

peer 2's logs:

2019-01-08 23:22:59.968392 I | mvcc: store.index: compact 203737831
2019-01-08 23:23:00.088395 I | mvcc: finished scheduled compaction at 203737831 (took 114.027972ms)
2019-01-08 23:23:01.819083 I | embed: rejected connection from "172.16.233.128:24494" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:05.180764 I | embed: rejected connection from "172.16.132.132:52830" (error "EOF", ServerName "")
2019-01-08 23:23:13.900669 I | embed: rejected connection from "172.16.233.128:55755" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:20.800852 I | embed: rejected connection from "172.16.233.128:49692" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:21.969389 I | embed: rejected connection from "172.16.233.128:8497" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:22.369821 I | embed: rejected connection from "172.16.233.128:7990" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:28.190057 I | embed: rejected connection from "172.16.233.128:43183" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:28.653442 I | embed: rejected connection from "172.16.233.128:60902" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:29.263253 I | embed: rejected connection from "172.16.233.128:58410" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:38.663006 I | embed: rejected connection from "172.16.233.128:17856" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:43.425465 I | embed: rejected connection from "172.16.233.128:45143" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:45.250362 I | embed: rejected connection from "172.16.132.132:53122" (error "EOF", ServerName "")
2019-01-08 23:23:45.838758 I | embed: rejected connection from "172.16.233.128:4097" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:46.497070 I | embed: rejected connection from "172.16.233.128:15084" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:46.654271 I | embed: rejected connection from "172.16.233.128:65208" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:46.906865 I | embed: rejected connection from "172.16.233.128:46269" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:47.077042 I | embed: rejected connection from "172.16.233.128:34188" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:49.179995 I | embed: rejected connection from "172.16.233.128:23493" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:49.371474 I | embed: rejected connection from "172.16.219.128:33984" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:50.591717 I | embed: rejected connection from "172.16.233.128:51087" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:53.731111 I | embed: rejected connection from "172.16.233.128:19774" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:03.472036 I | embed: rejected connection from "172.16.233.128:39624" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:20.803456 I | embed: rejected connection from "172.16.233.128:50490" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:20.824384 I | embed: rejected connection from "172.16.233.128:6305" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:22.373378 I | embed: rejected connection from "172.16.233.128:8338" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:23.561307 I | embed: rejected connection from "172.16.233.128:8958" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:23.910082 I | embed: rejected connection from "172.16.233.128:4187" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:27.140366 I | embed: rejected connection from "172.16.233.128:17280" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:29.265277 I | embed: rejected connection from "172.16.233.128:44052" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:29.266538 I | embed: rejected connection from "172.16.233.128:58768" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:39.228656 I | embed: rejected connection from "172.16.233.128:27617" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:39.630815 I | embed: rejected connection from "172.16.233.128:64386" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:39.744563 I | embed: rejected connection from "172.16.233.128:64743" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:41.407853 I | wal: segmented wal file /var/etcd/data/member/wal/0000000000000542-00000000014a4b0c.wal is created
2019-01-08 23:24:42.065607 I | pkg/fileutil: purged file /var/etcd/data/member/wal/000000000000053d-0000000001494cba.wal successfully
2019-01-08 23:24:43.327190 I | embed: rejected connection from "172.16.233.128:8928" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:45.362476 I | embed: rejected connection from "172.16.132.132:53500" (error "EOF", ServerName "")
2019-01-08 23:24:46.910891 I | embed: rejected connection from "172.16.233.128:46539" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:47.075454 I | embed: rejected connection from "172.16.233.128:8179" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:49.261786 I | embed: rejected connection from "172.16.233.128:34218" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:49.323385 I | embed: rejected connection from "172.16.233.128:2155" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:55.379410 I | embed: rejected connection from "172.16.132.132:53562" (error "EOF", ServerName "")
2019-01-08 23:24:57.022588 I | embed: rejected connection from "172.16.233.128:51508" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:59.507221 I | embed: rejected connection from "172.16.233.128:45654" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:25:15.415427 I | embed: rejected connection from "172.16.132.132:53694" (error "EOF", ServerName "")
panic: send on closed channel

goroutine 24200312 [running]:
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc427c7e540, 0xc4203f0be0, 0x411648, 0x2)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:169 +0x115
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).Write(0xc427c7e540, 0xc427e86800, 0xc42322d050, 0x5, 0x5, 0xc42a204500, 0x248, 0x248, 0xc42322d056, 0x5, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:255 +0xe6
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*serverStream).SendMsg(0xc42a9be400, 0xf8bae0, 0xc4203f0b90, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/stream.go:611 +0x275
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc422888320, 0xf8bae0, 0xc4203f0b90, 0x3ff0000000000000, 0x3ff0000000000000)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:179 +0x4b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*watchWatchServer).Send(0xc428292d50, 0xc4203f0b90, 0x7, 0xc42881e4a8)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3495 +0x49
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc421098e60)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:345 +0x8ad
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc421098e60)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:132 +0x2b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:131 +0x29b

instance 3's logs

2019-01-08 23:23:33.196517 I | etcdserver: start to snapshot (applied: 21644854, lastsnap: 21634853)
2019-01-08 23:23:33.222556 I | etcdserver: saved snapshot at index 21644854
2019-01-08 23:23:33.222729 I | etcdserver: compacted raft log at 21639854
2019-01-08 23:23:35.232887 I | embed: rejected connection from "172.16.132.132:53024" (error "EOF", ServerName "")
2019-01-08 23:23:36.741398 I | pkg/fileutil: purged file /var/etcd/data/member/snap/0000000000000006-00000000014982e1.snap successfully
2019-01-08 23:23:36.901221 W | etcdserver: read-only range request "key:\"/registry/secrets\" range_end:\"/registry/secrett\" count_only:true " took too long (148.206229ms) to execute
2019-01-08 23:23:38.997467 I | embed: rejected connection from "172.16.233.128:9896" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:39.226457 I | embed: rejected connection from "172.16.233.128:27253" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:39.741538 I | embed: rejected connection from "172.16.233.128:63587" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:40.524709 I | embed: rejected connection from "172.16.233.128:50071" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:47.071693 I | embed: rejected connection from "172.16.233.128:7940" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:51.240236 I | embed: rejected connection from "172.16.233.128:3526" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:51.694661 I | embed: rejected connection from "172.16.233.128:36541" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:55.276754 I | embed: rejected connection from "172.16.132.132:53186" (error "EOF", ServerName "")
2019-01-08 23:23:55.499360 I | embed: rejected connection from "172.16.149.128:47656" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:23:59.504589 I | embed: rejected connection from "172.16.233.128:45542" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:01.822749 I | embed: rejected connection from "172.16.233.128:24696" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:15.312683 I | embed: rejected connection from "172.16.132.132:53324" (error "EOF", ServerName "")
2019-01-08 23:24:18.493390 I | embed: rejected connection from "172.16.233.128:18320" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:20.790775 I | embed: rejected connection from "172.16.233.128:58013" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:23.664415 I | embed: rejected connection from "172.16.233.128:34663" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:24.275505 I | embed: rejected connection from "172.16.233.128:46282" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:25.330540 I | embed: rejected connection from "172.16.132.132:53380" (error "EOF", ServerName "")
2019-01-08 23:24:27.319741 I | embed: rejected connection from "172.16.233.128:14217" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:28.193509 I | embed: rejected connection from "172.16.233.128:43423" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:28.491085 I | embed: rejected connection from "172.16.233.128:55430" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:28.657285 I | embed: rejected connection from "172.16.233.128:61154" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:35.346440 I | embed: rejected connection from "172.16.132.132:53444" (error "EOF", ServerName "")
2019-01-08 23:24:38.666616 I | embed: rejected connection from "172.16.233.128:18206" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:39.268167 I | embed: rejected connection from "172.16.233.128:3692" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:40.526129 I | embed: rejected connection from "172.16.233.128:53155" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:43.427438 I | embed: rejected connection from "172.16.233.128:45449" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:46.499424 I | embed: rejected connection from "172.16.233.128:15488" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:46.658151 I | embed: rejected connection from "172.16.233.128:1027" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:47.388624 I | embed: rejected connection from "172.16.233.128:25539" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:48.630134 I | embed: rejected connection from "172.16.233.128:60341" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:49.183169 I | embed: rejected connection from "172.16.233.128:24229" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:49.321561 I | embed: rejected connection from "172.16.233.128:38309" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:49.375322 I | embed: rejected connection from "172.16.219.128:34060" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:51.245082 I | embed: rejected connection from "172.16.233.128:6478" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:51.696712 I | embed: rejected connection from "172.16.233.128:37119" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:24:56.432382 I | embed: rejected connection from "172.16.233.128:52210" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:25:01.826662 I | embed: rejected connection from "172.16.233.128:24890" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:25:01.846917 I | embed: rejected connection from "172.16.233.128:3861" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:25:03.138818 I | embed: rejected connection from "172.16.233.128:65256" (error "tls: client offered an unsupported, maximum protocol version of 300", ServerName "")
2019-01-08 23:25:05.397012 I | embed: rejected connection from "172.16.132.132:53626" (error "EOF", ServerName "")
2019-01-08 23:25:16.157851 W | rafthttp: lost the TCP streaming connection with peer 99a6e5492354482e (stream Message reader)
2019-01-08 23:25:16.157887 W | rafthttp: lost the TCP streaming connection with peer 99a6e5492354482e (stream MsgApp v2 reader)
2019-01-08 23:25:16.157960 E | rafthttp: failed to read 99a6e5492354482e on stream Message (unexpected EOF)
2019-01-08 23:25:16.157977 I | rafthttp: peer 99a6e5492354482e became inactive
2019-01-08 23:25:16.219846 W | etcdserver/api/v3rpc: failed to send watch response to gRPC stream ("rpc error: code = Unavailable desc = transport is closing")
2019-01-08 23:25:16.219987 W | etcdserver/api/v3rpc: failed to send watch response to gRPC stream ("rpc error: code = Unavailable desc = transport is closing")
panic: send on closed channel

goroutine 875493 [running]:
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc423bb3e60, 0xc4274eba40, 0x411648, 0x2)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:169 +0x115
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).Write(0xc423bb3e60, 0xc422ac8100, 0xc42c90f3e8, 0x5, 0x5, 0xc428843200, 0x5a7, 0x5a7, 0xc42c90f3ee, 0x5, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:255 +0xe6
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*serverStream).SendMsg(0xc4215d4a00, 0xf8bae0, 0xc4274eb9f0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/stream.go:611 +0x275
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc423637a40, 0xf8bae0, 0xc4274eb9f0, 0x3ff0000000000000, 0x3ff0000000000000)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:179 +0x4b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*watchWatchServer).Send(0xc427f514b0, 0xc4274eb9f0, 0x4, 0xc4271ca138)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3495 +0x49
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc42108f7c0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:345 +0x8ad
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc42108f7c0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:132 +0x2b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:131 +0x29b
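
The stack trace above panics inside gRPC's serverHandlerTransport.do, which queues write callbacks on a channel that connection teardown can close concurrently. The following is a minimal, hypothetical Go sketch of that class of race and of the usual shape of the fix (never close a channel that still has concurrent senders; signal shutdown through a separate done channel instead). It is not the actual grpc-go or etcd code, and every identifier in it is made up for illustration.

// race_sketch.go: illustrative only, not grpc-go or etcd source.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConnClosing = errors.New("transport is closing")

// transport mimics a server transport whose single serving goroutine drains
// queued write callbacks.
type transport struct {
	writes    chan func()   // work queue consumed by serve()
	done      chan struct{} // closed exactly once on shutdown
	closeOnce sync.Once
}

func newTransport() *transport {
	return &transport{
		writes: make(chan func(), 16),
		done:   make(chan struct{}),
	}
}

// do queues fn for the serving goroutine. Because Close never closes the
// writes channel, a concurrent do cannot panic with "send on closed channel":
// it either enqueues fn or observes shutdown via done.
func (t *transport) do(fn func()) error {
	select {
	case <-t.done:
		return errConnClosing
	default:
		select {
		case <-t.done:
			return errConnClosing
		case t.writes <- fn:
			return nil
		}
	}
}

// Close signals shutdown. The buggy variant of this pattern also closes
// t.writes here, which panics any sender still blocked inside do.
func (t *transport) Close() {
	t.closeOnce.Do(func() { close(t.done) })
}

// serve drains queued callbacks until shutdown is signalled.
func (t *transport) serve() {
	for {
		select {
		case fn := <-t.writes:
			fn()
		case <-t.done:
			return
		}
	}
}

func main() {
	t := newTransport()
	go t.serve()

	_ = t.do(func() { fmt.Println("queued write ran") })
	t.Close()

	if err := t.do(func() {}); err != nil {
		fmt.Println("after Close:", err) // prints: after Close: transport is closing
	}
}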

@relyt0925
Copy link

Note that instances 2 and 3 die within a second of one another. Instance 2 is the first to die with the panic, followed by instance 3. Then instance 1 dies 6 minutes later with the same error.

@mcginne
Copy link

mcginne commented Jan 16, 2019

I've run hundreds of loops of powering hosts off and on in my simple reproduction setup, and unfortunately haven't been able to recreate it.
I can still hit it regularly in the large-scale environment.

@relyt0925
Copy link

Note we still see this in v3.3.11 as well.

@thsnr
Copy link

thsnr commented Mar 13, 2019

> It really feels like I should open an issue on Ubuntu's launchpad, feels like a downstream issue.

@ticpu Yes, these are two different problems here. However since I am probably not the only one Googling ending up in this thread, I will report my findings.

The original issue stems from a Write call, whereas your stack trace indicates WriteStatus. The latter was already reported in #8627 and "fixed" in etcd 3.2.10 by upgrading the grpc-go dependency to 1.7.0.

However, I also encountered this on the 3.2.17 packaged in Ubuntu Bionic. The problem is:

Built-Using: ..., golang-google-grpc (= 1.6.0-3), ...

Ubuntu still packages and uses an old grpc-go which does not include grpc/grpc-go@22c3f92. I will update https://bugs.launchpad.net/ubuntu/+source/etcd/+bug/1800973 to highlight this.
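
If you need to confirm which grpc-go a given build tree actually vendors, one quick check, assuming the vendored copy is recent enough to export the grpc.Version constant and is importable as google.golang.org/grpc in your build environment, is a tiny program compiled against that same tree:

// grpcver.go: hypothetical helper; build it against the vendored tree you
// want to inspect and it prints the bundled grpc-go version string.
package main

import (
	"fmt"

	"google.golang.org/grpc"
)

func main() {
	fmt.Println("grpc-go version:", grpc.Version)
}

Per the notes above, anything as old as 1.6.0 predates both fixes.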

@xiang90
Copy link
Contributor

xiang90 commented Mar 28, 2019

@gyuho @hexfusion

grpc/grpc-go#2695 is merged.

We should update etcd's gRPC dependency to fix this bug once a gRPC release containing that change is available.

@xiang90 xiang90 added this to the etcd-v3.4 milestone Mar 28, 2019
@hexfusion
Copy link
Contributor

@xiang90 I will make sure to cover this.

@hexfusion hexfusion self-assigned this Mar 28, 2019
aanm added a commit to aanm/etcd that referenced this issue Apr 10, 2019
This version includes patch from
grpc/grpc-go#2695 which
fixes etcd-io#9956

Signed-off-by: André Martins <aanm90@gmail.com>
aanm added a commit to aanm/etcd that referenced this issue Apr 23, 2019
This version includes patch from
grpc/grpc-go#2695 which
fixes etcd-io#9956

Signed-off-by: André Martins <aanm90@gmail.com>
@tgraf
Copy link

tgraf commented May 13, 2019

@hexfusion @xiang90 @aanm grpc-go 1.20 has been released with the fix included [0]. Can we help get the rebase done? We are hitting this daily on any large-scale etcd deployment.

[0] https://github.com/grpc/grpc-go/releases/tag/v1.20.0

aanm added a commit to aanm/etcd that referenced this issue May 13, 2019
This version includes patch from
grpc/grpc-go#2695 which
fixes etcd-io#9956

Signed-off-by: André Martins <aanm90@gmail.com>
@gyuho
Copy link
Contributor

gyuho commented May 13, 2019

@tgraf Sorry for the delay. Yes, @aanm sent out a PR to bump the gRPC dependency, but a lot of our client balancer tests are failing. We will look into this.

aanm added a commit to aanm/etcd that referenced this issue May 18, 2019
This version includes patch from
grpc/grpc-go#2695 which
fixes etcd-io#9956

Signed-off-by: André Martins <aanm90@gmail.com>
@hexfusion
Copy link
Contributor

Going to review and move this forward for the 3.3 backport.

@jingyih
Copy link
Contributor

jingyih commented Jul 29, 2019

This is fixed by #10911, correct?

@hexfusion
Copy link
Contributor

> This is fixed by #10911, correct?

@jingyih It is fixed in master/3.4 but not in release-3.3. I am going to try to bump the gRPC deps in 3.3 to resolve it.

@xiang90
Copy link
Contributor

xiang90 commented Aug 12, 2019

already fixed.
