The latest etcdctl seems buggy: etcdctl will request fail when the first etcd cluster endpoint down #11176

Closed
lining2020x opened this issue Sep 24, 2019 · 11 comments

Comments

@lining2020x

lining2020x commented Sep 24, 2019

This issue is related to kubernetes/kubernetes#72102. I can still reproduce that problem with etcdctl from the latest etcd release (v3.4.1).

Environment:

[root@ln-node0 etcd-test]# ./etcd-v3.4.1/etcdctl version
etcdctl version: 3.4.1
API version: 3.4

[root@ln-node0 etcd-test]# ./etcd-v3.4.1/etcd -version
etcd Version: 3.4.1
Git SHA: a14579fbf
Go Version: go1.12.9
Go OS/Arch: linux/amd64

How to reproduce the problem:
The repro method comes from #10911 (comment) and kubernetes/kubernetes#72102 (comment).

  1. Create a 3-node etcd cluster with TLS enabled. Each certificate should contain only the name/IP of the node that will be serving it.
  2. Stop the first etcd node.
  3. Use etcdctl to send a request.

What you expected to happen:
I expect the third step to succeed.

1 # ETCDCTL_API=3 ./etcd-v3.4.1/etcdctl --endpoints=https://ln-node1:2379,https://ln-node2:2379,https://ln-node3:2379 --cacert=/tmp/etcd/ca.pem --cert=/tmp/etcd/client.pem --key=/tmp/etcd/client-key.pem put bar foo
OK

2 # ssh ln-node1 systemctl stop etcd@ln-node1

3 # ETCDCTL_API=3 ./etcd-v3.4.1/etcdctl --endpoints=https://ln-node1:2379,https://ln-node2:2379,https://ln-node3:2379 --cacert=/tmp/etcd/ca.pem --cert=/tmp/etcd/client.pem --key=/tmp/etcd/client-key.pem put bar foo
{"level":"warn","ts":"2019-09-24T15:59:20.611+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-3b69f8a2-9b9a-4399-aedd-6244012167a0/ln-node1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.150.7.194:2379: connect: connection refused\""}

4 # ETCDCTL_API=3 ./etcd-v3.3.13/etcdctl --endpoints=https://ln-node1:2379,https://ln-node2:2379,https://ln-node3:2379 --cacert=/tmp/etcd/ca.pem --cert=/tmp/etcd/client.pem --key=/tmp/etcd/client-key.pem put bar foo
OK

As shown above, etcdctl v3.4.1 fails once the first endpoint is down, while v3.3.13 still succeeds, which is strange.
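
To narrow down which member a failing request is stuck on, it can help to probe each endpoint on its own instead of passing the full list. The following is a minimal sketch and not part of the original report: `probe_endpoints` and the `ETCDCTL_BIN` override are hypothetical names, and the usage comment assumes the same certificate paths as the commands above. The only etcdctl subcommand it relies on is `endpoint health`.

```shell
# Hypothetical helper: probe every endpoint individually with
# `etcdctl endpoint health`, so a single failing member is easy to spot.
# ETCDCTL_BIN is an assumed override for picking a specific binary,
# e.g. ./etcd-v3.4.1/etcdctl or ./etcd-v3.3.13/etcdctl.
ETCDCTL_BIN="${ETCDCTL_BIN:-etcdctl}"

# usage: probe_endpoints "ep1,ep2,ep3" [extra etcdctl flags...]
probe_endpoints() {
    _endpoints="$1"
    shift
    # Endpoints are comma-separated; split them into separate words.
    for _ep in $(printf '%s' "$_endpoints" | tr ',' ' '); do
        if ETCDCTL_API=3 "$ETCDCTL_BIN" --endpoints="$_ep" "$@" \
            endpoint health >/dev/null 2>&1; then
            echo "$_ep healthy"
        else
            echo "$_ep failed"
        fi
    done
}

# Example invocation (not executed here):
# probe_endpoints "https://ln-node1:2379,https://ln-node2:2379,https://ln-node3:2379" \
#     --cacert=/tmp/etcd/ca.pem --cert=/tmp/etcd/client.pem --key=/tmp/etcd/client-key.pem
```

With the first node stopped, only that node would be expected to report `failed`; if the remaining members report `healthy` on their own, the multi-endpoint failure points at client-side failover rather than at the cluster.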

Here is the script I used to set up the etcd cluster.

[root@ln-node0 etcd-test]# cat etcd-issue-repro.sh 
HOST1=10.150.7.194
HOST2=10.150.7.131
HOST3=10.150.7.132

NAME1=ln-node1
NAME2=ln-node2
NAME3=ln-node3

HOSTS=(${HOST1} ${HOST2} ${HOST3})
NAMES=(${NAME1} ${NAME2} ${NAME3})

rm -rf /tmp/etcd
mkdir -p /tmp/etcd && cd /tmp/etcd

# generate the root CA
cat >ca-config.json <<EOF
{
    "signing": {
        "default": {
            "expiry": "87600h"
        },
        "profiles": {
            "server": {
                "expiry": "87600h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ]
            },
            "client": {
                "expiry": "87600h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "client auth"
                ]
            },
            "peer": {
                "expiry": "87600h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ]
            }
        }
    }
}
EOF
cat >ca-csr.json <<EOF
{
    "CN": "etcd",
    "key": {
        "algo": "rsa",
        "size": 2048
    }
}
EOF
cfssl gencert -initca ca-csr.json | cfssljson -bare ca -

# generate the client certificate, signed by the root CA
cat >client.json <<EOF
{
    "CN": "client cn",
    "key": {
        "algo": "rsa",
        "size": 2048
    },
    "names": [
        {
            "C": "US",
            "L": "CA",
            "ST": "San Francisco"
        }
    ]
}
EOF
cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=client client.json | cfssljson -bare client


# generate server and peer certificates for each etcd cluster node
for i in "${!HOSTS[@]}"; do
	HOST=${HOSTS[$i]}
	NAME=${NAMES[$i]}

	cat >config.json <<EOF
{
    "CN": "etcd cn",
    "hosts": [
        "${NAME}",
        "${HOST}"
    ],
    "key": {
        "algo": "rsa",
        "size": 2048
    },
    "names": [
        {
            "C": "US",
            "L": "CA",
            "ST": "San Francisco"
        }
    ]
}
EOF
	cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=server config.json | cfssljson -bare server-${NAME}
	cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=peer config.json | cfssljson -bare peer-${NAME}

	ssh ${HOST} "rm -rf /etc/etcd/pki && mkdir -p /etc/etcd/pki"
	scp ca.pem server-${NAME}.pem server-${NAME}-key.pem peer-${NAME}.pem peer-${NAME}-key.pem ${HOST}:/etc/etcd/pki/
done


# generate etcd systemd service unit for each node
for i in "${!HOSTS[@]}"; do
	HOST=${HOSTS[$i]}
	NAME=${NAMES[$i]}
	ssh ${HOST} "systemctl stop etcd@${NAME}"
	ssh ${HOST} "rm -rf /var/lib/etcd && mkdir -p /var/lib/etcd"
	ssh ${HOST} "cat > /etc/systemd/system/etcd@.service <<EOF
[Unit]
Description=etcd
Documentation=https://github.com/coreos/etcd

[Service]
Type=notify
Restart=always
RestartSec=5s
LimitNOFILE=40000
TimeoutStartSec=0

ExecStart=/usr/local/bin/etcd \
    --name etcd-%H \
    --data-dir /var/lib/etcd \
    --initial-advertise-peer-urls https://${HOST}:2380 \
    --listen-peer-urls https://${HOST}:2380 \
    --listen-client-urls https://${HOST}:2379 \
    --advertise-client-urls https://${HOST}:2379 \
    --initial-cluster-token etcd-cluster-token \
    --initial-cluster etcd-${NAME1}=https://${HOST1}:2380,etcd-${NAME2}=https://${HOST2}:2380,etcd-${NAME3}=https://${HOST3}:2380 \
    --initial-cluster-state new \
    --cert-file=/etc/etcd/pki/server-%H.pem --key-file=/etc/etcd/pki/server-%H-key.pem \
    --client-cert-auth --trusted-ca-file=/etc/etcd/pki/ca.pem \
    --peer-client-cert-auth --peer-trusted-ca-file=/etc/etcd/pki/ca.pem \
    --peer-cert-file=/etc/etcd/pki/peer-%H.pem --peer-key-file=/etc/etcd/pki/peer-%H-key.pem

[Install]
WantedBy=multi-user.target
EOF"
	ssh ${HOST} "cat  /etc/systemd/system/etcd@.service"
done

# start etcd
ssh ${NAME1} systemctl daemon-reload
ssh ${NAME2} systemctl daemon-reload
ssh ${NAME3} systemctl daemon-reload

ssh ${NAME1} systemctl restart etcd@${NAME1} &
ssh ${NAME2} systemctl restart etcd@${NAME2} &
ssh ${NAME3} systemctl restart etcd@${NAME3}


echo ""
echo "RUN THE FOLLOWING COMMAND TO VERIFY:"
echo "ETCDCTL_API=3 etcdctl --endpoints=https://${NAME1}:2379,https://${NAME2}:2379,https://${NAME3}:2379 --cacert=/tmp/etcd/ca.pem --cert=/tmp/etcd/client.pem --key=/tmp/etcd/client-key.pem  put foo bar"

lining2020x changed the title from "etcdctl seems buggy from v3.3.14: etcdctl will request failed when the first etcd cluster endpoint down" to "etcdctl seems buggy: etcdctl will request failed when the first etcd cluster endpoint down" on Sep 24, 2019
lining2020x changed the title from "etcdctl seems buggy: etcdctl will request failed when the first etcd cluster endpoint down" to "The latest etcdctl seems buggy: etcdctl will request failed when the first etcd cluster endpoint down" on Sep 24, 2019

@kulong0105

I also hit the same issue.

lining2020x changed the title from "The latest etcdctl seems buggy: etcdctl will request failed when the first etcd cluster endpoint down" to "The latest etcdctl seems buggy: etcdctl will request fail when the first etcd cluster endpoint down" on Sep 24, 2019

@gyuho
Contributor

gyuho commented Sep 24, 2019

db61ee1 doesn't handle DNS endpoints...

@lining2020x
Author

lining2020x commented Sep 25, 2019

@gyuho The problem can still be reproduced using IP addresses instead. Am I doing something wrong? Or is there a workaround? 🤔

# ETCDCTL_API=3 etcdctl --endpoints=https://10.150.7.194:2379,https://10.150.7.131:2379,https://10.150.7.132:2379 --cacert=/tmp/etcd/ca.pem --cert=/tmp/etcd/client.pem --key=/tmp/etcd/client-key.pem  put foo bar
OK

# ssh ln-node1 systemctl stop etcd@ln-node1

# ETCDCTL_API=3 etcdctl --endpoints=https://10.150.7.194:2379,https://10.150.7.131:2379,https://10.150.7.132:2379 --cacert=/tmp/etcd/ca.pem --cert=/tmp/etcd/client.pem --key=/tmp/etcd/client-key.pem  put foo bar
{"level":"warn","ts":"2019-09-25T10:26:45.716+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-dfccc1ea-749a-4477-939f-45423cdc9e1d/10.150.7.194:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for 10.150.7.132, not 10.150.7.194\""}
Error: context deadline exceeded
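
Note that this error differs from the first one: instead of a refused connection to the stopped node, the TLS handshake fails because the certificate presented is valid for 10.150.7.132 while it is being verified against 10.150.7.194. As a small sketch (not from the original report), the two identities can be pulled out of such a log line with `sed`; `x509_mismatch` is a hypothetical helper name, and it assumes the "certificate is valid for A, not B" wording shown above.

```shell
# Hypothetical helper: read log lines on stdin and print which identity the
# certificate was issued for and which identity it was verified against.
# Assumes the x509 message format "certificate is valid for A, not B".
x509_mismatch() {
    sed -n 's/.*certificate is valid for \(.*\), not \([0-9A-Za-z.:-]*\).*/cert_valid_for=\1 verified_against=\2/p'
}

# Example invocation (not executed here); 2>&1 captures the warning
# regardless of which stream etcdctl writes it to:
# ETCDCTL_API=3 etcdctl --endpoints=... put foo bar 2>&1 | x509_mismatch
```

When the two values differ and both belong to cluster members, the client reached a live node but checked its certificate against another member's address, which is consistent with certificates that each list only their own node.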

versions:

# ETCDCTL_API=3 etcdctl --endpoints=https://10.150.7.194:2379,https://10.150.7.131:2379,https://10.150.7.132:2379 --cacert=/tmp/etcd/ca.pem --cert=/tmp/etcd/client.pem --key=/tmp/etcd/client-key.pem  endpoint status
https://10.150.7.194:2379, b8f6df17d8142ea6, 3.4.1, 20 kB, true, false, 2, 9, 9, 
https://10.150.7.131:2379, f8f61e713fc4f8c7, 3.4.1, 20 kB, false, false, 2, 9, 9, 
https://10.150.7.132:2379, 90379fb39af4a9c3, 3.4.1, 20 kB, false, false, 2, 9, 9, 

# etcdctl version
etcdctl version: 3.4.1
API version: 3.4

@xtrusia

xtrusia commented Sep 27, 2019

Just FYI, I ran the test program below:
https://pastebin.ubuntu.com/p/pcynNM4W3W/

When I shut down the leader etcd node, there was one "context deadline exceeded" error, but it recovered soon after.
https://pastebin.ubuntu.com/p/ChZG4NxPyC/

@lining2020x
Author

lining2020x commented Sep 27, 2019

@xtrusia Thanks for your info.

I tried your program with a small modification, and I still hit the problem.

Here are the code and results.
code: https://pastebin.ubuntu.com/p/rBRbNZgzg4/
results: https://pastebin.ubuntu.com/p/6cqVWRbfgP/

My steps:

  1. Check out the v3.4.1 tag and place the test program in the etcd source root directory.
  2. Build the test program.
  3. Run it.
  4. Stop the leader etcd service.
  5. Observe the error.

How did you generate the certificates, A or B?
A: Each certificate contains the name/IPs of all etcd nodes
B: Each certificate contains only the name/IP of the node that will be serving it.

Also, could you try it with etcdctl (v3.4.1)?
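
Whether a certificate falls under A or B can be checked by listing its Subject Alternative Name entries. The following is a minimal sketch using standard `openssl x509` options; `cert_sans` is a hypothetical helper name, and the path in the usage comment assumes the `server-${NAME}.pem` files produced by the setup script earlier in this thread.

```shell
# Hypothetical helper: print the Subject Alternative Name entries of a
# PEM certificate. Prints nothing if the certificate has no SAN extension.
cert_sans() {
    openssl x509 -in "$1" -noout -text |
        grep -A1 'Subject Alternative Name' |
        tail -n +2 |
        sed 's/^ *//'
}

# Example invocation (not executed here):
# cert_sans /tmp/etcd/server-ln-node1.pem
```

For case B, the output for each server certificate would be expected to list only that node's own name and IP; for case A, it would list the names and IPs of all members.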

@xtrusia

xtrusia commented Sep 28, 2019

@scott0000
It was created via easyrsa. No domain name or IP is specified, according to the output of the command below:
openssl x509 -text -noout -in etcd-ca

Do I need to check the domain name or IP for this test?

@lining2020x
Author

The certificate of each of my etcd nodes contains only the name/IP of that node, and the problem can be reproduced in my environment. But your etcd cluster can still be accessed when the leader etcd node is shut down, so I just want to know why.

Anyway, thanks for replying 🙂. Let's stop discussing the testing and wait for a fix from the maintainers.

@lining2020x
Author

@xtrusia Can you show me the hosts in your certificate (not the CA file, etcd-ca)?

@xtrusia

xtrusia commented Sep 30, 2019

@xtrusia Can you show me the hosts in your certificate (not the CA file, etcd-ca)?

You mean etcd-cert? There is nothing there either.

@lining2020x
Author

lining2020x commented Sep 30, 2019

@xtrusia Yes, I mean etcd-cert.
OK, thanks 🙂

@lining2020x
Author

This is already fixed in 97388ce, and my test passes, so I am closing this issue.
