upgrading etcd version to 3.3.15, doesn't fix API Server crash issue #11078

Closed
pariminaresh opened this issue Aug 26, 2019 · 3 comments

@pariminaresh

We are experiencing the same issue: the API server crashes every time the first etcd server goes down. We were previously running etcd v3.3.13 and upgraded to v3.3.15 after coming across this PR and kubernetes/kubernetes#72102.

But I still see the apiserver crashing. After spending some time making sure I'm using the proper versions, I decided to ask the folks here for help. I believe I might be missing something; please correct me if so. Thanks!

Below are the details:

  1. 3-node cluster: 10.66.0.162, 10.66.0.166, 10.66.0.168
  2. Shut down 10.66.0.166 and ran the tests on 10.66.0.162
1# etcdctl -version
etcdctl version: 3.3.15
API version: 2

2# etcd -version
etcd Version: 3.3.15
Git SHA: 94745a4ee
Go Version: go1.12.9
Go OS/Arch: linux/amd64

3# ETCDCTL_API=3 etcdctl --endpoints=10.66.0.166:2379,10.66.0.162:2379,10.66.0.168:2379 --cacert /etc/kubernetes/pki/etcd/ca.pem --cert /etc/kubernetes/pki/etcd/client.pem put --key /etc/kubernetes/pki/etcd/client-key.pem foo bar
{"level":"warn","ts":"2019-08-22T21:37:36.613Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-3fc8c37e-a22f-4452-a372-9c98664d645e/10.66.0.166:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for 10.66.0.162, not 10.66.0.166\""}

4# ETCDCTL_API=3 etcdctl --endpoints=10.66.0.162:2379,10.66.0.166:2379,10.66.0.168:2379 --cacert /etc/kubernetes/pki/etcd/ca.pem --cert /etc/kubernetes/pki/etcd/client.pem put --key /etc/kubernetes/pki/etcd/client-key.pem foo bar
OK

As you can see, I expected an 'OK' response in the 3rd step as well. Do I need to regenerate the certificates in some specific way? The same certs work fine when the first server is up. Please point me in the right direction.

Any help is highly appreciated. Thanks.
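
A note on reading the warning from step 3: the `target` field names the endpoint the client was addressing (10.66.0.166, the member that was shut down), while the x509 message says the certificate it received is valid only for 10.66.0.162. The sketch below pulls those two facts out of the log line so the mismatch is easy to see. It is a diagnostic aid only: the function name `explain_x509_mismatch` and the returned keys are made up for illustration, and the regular expression assumes the exact error wording shown in step 3.

```python
import json
import re
from typing import Dict, Optional

# The warning printed in step 3, copied from the transcript above.
STEP3_WARNING = (
    r'{"level":"warn","ts":"2019-08-22T21:37:36.613Z",'
    r'"caller":"clientv3/retry_interceptor.go:61",'
    r'"msg":"retrying of unary invoker failed",'
    r'"target":"endpoint://client-3fc8c37e-a22f-4452-a372-9c98664d645e/10.66.0.166:2379",'
    r'"attempt":0,'
    r'"error":"rpc error: code = DeadlineExceeded desc = latest connection error: '
    r'connection error: desc = \"transport: authentication handshake failed: '
    r'x509: certificate is valid for 10.66.0.162, not 10.66.0.166\""}'
)

# Assumed wording of the x509 name-mismatch message (taken from the log above).
_MISMATCH = re.compile(r'certificate is valid for (.+?), not ([^\s"]+)')


def explain_x509_mismatch(log_line: str) -> Optional[Dict[str, object]]:
    """Return the dialed endpoint and the names involved in an x509 mismatch.

    Returns None if the line does not contain the mismatch message.
    """
    entry = json.loads(log_line)
    match = _MISMATCH.search(entry.get("error", ""))
    if match is None:
        return None
    return {
        # Last path segment of "endpoint://client-<id>/<host:port>".
        "dialed": entry.get("target", "").rsplit("/", 1)[-1],
        # Names or IPs the presented certificate covers.
        "valid_for": match.group(1).split(", "),
        # Name or IP the client tried to verify the certificate against.
        "expected": match.group(2),
    }


if __name__ == "__main__":
    print(explain_x509_mismatch(STEP3_WARNING))
```

For the step 3 line, the certificate covers 10.66.0.162 while the client checked it against 10.66.0.166, the first endpoint in the list. That would fit a client-side failover problem rather than a certificate that needs regenerating, though the log line alone does not prove it.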

Originally posted by @pariminaresh in #10911 (comment)

@kulong0105

kulong0105 commented Sep 2, 2019

I tested the etcd 3.4 release and hit the same issue:

[root@skyaxe-app-1 SkyDiscovery-2019-08-23]# systemctl status etcd
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-09-02 12:44:26 CST; 1min 49s ago
     Docs: https://github.com/coreos/etcd
 Main PID: 1449969 (etcd)
    Tasks: 44
   Memory: 41.2M
   CGroup: /system.slice/etcd.service
           └─1449969 /usr/local/bin/etcd --name skyaxe-app-1.localdomain --data-dir /SkyDiscovery/etcd --listen-client-urls https://0.0.0.0:2379 --advertise-client-urls https://192.168.50.111:2379 --listen-peer-urls https://0.0.0.0:238...

Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: raft2019/09/02 12:44:26 INFO: b7edf80c4e688f4d [term: 1] received a MsgVote message with higher term from 9d3f7d3d4b55a9cf [term: 2]
Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: raft2019/09/02 12:44:26 INFO: b7edf80c4e688f4d became follower at term 2
Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: raft2019/09/02 12:44:26 INFO: b7edf80c4e688f4d [logterm: 1, index: 3, vote: 0] cast MsgVote for 9d3f7d3d4b55a9cf [logterm: 1, index: 3] at term 2
Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: raft2019/09/02 12:44:26 INFO: raft.node: b7edf80c4e688f4d elected leader 9d3f7d3d4b55a9cf at term 2
Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: ready to serve client requests
Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: published {Name:skyaxe-app-1.localdomain ClientURLs:[https://192.168.50.111:2379]} to cluster acbb8d883be0e687
Sep 02 12:44:26 skyaxe-app-1.localdomain systemd[1]: Started etcd.
Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: serving client requests on [::]:2379
Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: set the initial cluster version to 3.4
Sep 02 12:44:26 skyaxe-app-1.localdomain etcd[1449969]: enabled capabilities for version 3.4
[root@skyaxe-app-1 SkyDiscovery-2019-08-23]# ETCDCTL_API=3 etcdctl  --cert /etc/kubernetes/pki/etcd/server.pem --key /etc/kubernetes/pki/etcd/server-key.pem --cacert /etc/kubernetes/pki/etcd/ca.pem --endpoints=192.168.50.111:2379,192.168.50.112:2379,192.168.50.113:2379 put foo bar
OK
[root@skyaxe-app-1 SkyDiscovery-2019-08-23]# systemctl stop etcd
[root@skyaxe-app-1 SkyDiscovery-2019-08-23]# ETCDCTL_API=3 etcdctl  --cert /etc/kubernetes/pki/etcd/server.pem --key /etc/kubernetes/pki/etcd/server-key.pem --cacert /etc/kubernetes/pki/etcd/ca.pem --endpoints=192.168.50.111:2379,192.168.50.112:2379,192.168.50.113:2379 put foo bar
{"level":"warn","ts":"2019-09-02T12:46:48.284+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-535efebf-1c2c-4cdb-afe2-54b3b8fe133f/192.168.50.111:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.50.111:2379: connect: connection refused\""}
Error: context deadline exceeded
[root@skyaxe-app-1 SkyDiscovery-2019-08-23]#

@gjcarneiro

For Kubernetes, a workaround that we found (assuming every apiserver replica has a collocated etcd cluster member) is to customise the endpoints list that you give to each apiserver for connecting to the etcd cluster: always put the collocated etcd member first in the list.
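
As a rough sketch of that workaround, the snippet below builds a per-node endpoint list with the collocated member first and renders it as a kube-apiserver `--etcd-servers` value. The helper names `order_endpoints` and `etcd_servers_flag` are hypothetical, and the `https://` scheme and port 2379 on the example addresses are assumptions; adapt them to however your apiserver manifests are generated.

```python
from typing import List


def order_endpoints(local: str, members: List[str]) -> List[str]:
    """Return the etcd endpoints with the collocated member first.

    The remaining members keep their original relative order.
    """
    if local not in members:
        raise ValueError(f"{local} is not in the member list")
    return [local] + [m for m in members if m != local]


def etcd_servers_flag(local: str, members: List[str]) -> str:
    """Render the reordered list as a kube-apiserver --etcd-servers flag."""
    return "--etcd-servers=" + ",".join(order_endpoints(local, members))


if __name__ == "__main__":
    # Example addresses; scheme and port are assumptions.
    members = [
        "https://10.66.0.162:2379",
        "https://10.66.0.166:2379",
        "https://10.66.0.168:2379",
    ]
    # Each node gets its own flag, with its own member listed first.
    for local in members:
        print(etcd_servers_flag(local, members))
```

Each apiserver then tries its local etcd member before any remote one, so losing a remote member does not affect the first endpoint that replica dials.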

Regarding 3.4, I saw "Improvements to client balancer failover logic" in the announcement. Perhaps k8s apiserver needs to be upgraded to embed the new 3.4 client library in order to fix this. It's not enough to upgrade the server, as it's the client that is mostly responsible for this.

@jingyih
Contributor

jingyih commented Sep 3, 2019

The fix was backported to v3.3.14+. Kubernetes 1.16 will include etcd client v3.3.15 [1]. The decision was made not to backport the fix to prior Kubernetes versions. If you need a hotfix, try [2].

[1] kubernetes/kubernetes#81434
[2] kubernetes/kubernetes#72102 (comment)

Closing because the issue was already addressed in kubernetes/kubernetes#72102.

@jingyih jingyih closed this as completed Sep 3, 2019