etcd pod stuck at bootstrap and kept "failed resolving host" #7798
moved to 3.4 |
I seem to have stumbled upon this issue or at least a similar error output:
When starting an etcd service:
With these env vars:
Somehow I have the feeling that this is a DNS resolution or port problem. But opening ports 2379 and 2380 didn't solve it, and I wasn't able to dig deep enough into Docker DNS and hostname resolution to analyse it further. This is running on a Photon OS host:
|
Hi, @gyuho @xiang90 @hongchaodeng, I have the same issue in a slightly different scenario. The rolling 'failed resolving host' log was expected in my testing environment. But since we had etcd0 and etcd1, I expected that etcd1 could at least come up again and rejoin the cluster after restarting. I'm using an old etcd version, 3.3. Was this issue fixed in a later release, or is it still there? Thanks a lot for any information. |
Hi! Steps to reproduce the problem.
With the current code, the e2 node fails with an error at startup.
The failure is caused by the netutil.URLStringsEqual function (tag v3.5.0). After studying how this function is used, I came to the conclusion that it would be completely valid to add a short path to it: if comparing the URLs before address resolution already succeeds, the function does not need to access the resolver and can return immediately.

To make the problem easier to reproduce, I wrote a bash script that fully automates the steps leading to the problem described in this issue:

```bash
#!/bin/bash
set -xue

HOSTS=${HOSTS:-/etc/hosts}

# The script overwrites the hosts file, so it only runs inside a container.
if ! grep docker /proc/1/cgroup; then
    echo "Run me inside docker"
    exit 1
fi
if ! test -w "$HOSTS"; then
    echo "It is assumed that you have write rights to the $HOSTS"
    exit 1
fi

# Kill every etcd process when the script stops for any reason.
trap 'pkill -9 -x etcd' TERM EXIT ERR

# Name of the directory under / that holds the etcd binaries.
TYPE="${1:-v3.5.0}"

# instance N [STATE]: start member eN with a fresh data directory.
# STATE is passed to --initial-cluster-state and defaults to "new".
instance() {
    rm -vrf "e$1"
    name="e$1"
    declare -A ports=(
        [e1]=2379
        [e2]=22379
        [e3]=32379
    )
    # Client port comes from the table above; the peer port is client port + 1.
    exec "/$TYPE/etcd" --name "$name" \
        --data-dir "$name" \
        --listen-client-urls "http://127.0.0.1$1:${ports[$name]}" \
        --advertise-client-urls "http://${name}.lan:${ports[$name]}" \
        --listen-peer-urls "http://127.0.0.1$1:$((${ports[$name]}+1))" \
        --initial-advertise-peer-urls "http://${name}.lan:$((${ports[$name]}+1))" \
        --initial-cluster e1=http://e1.lan:2380,e2=http://e2.lan:22380,e3=http://e3.lan:32380 \
        --initial-cluster-token tkn \
        --initial-cluster-state "${2:-new}" &> "$name.log"
}

# check: succeeds only when all three endpoints report healthy.
check() {
    ETCDCTL_API=3 "/$TYPE/etcdctl" \
        --endpoints e1.lan:2379,e2.lan:22379,e3.lan:32379 \
        endpoint health
}

ls -lh "/$TYPE/"
DIR="etcd-$TYPE"
mkdir -vp "$DIR"
pushd "$DIR"

# Map all three member names to loopback addresses.
echo "Fix $HOSTS"
echo "127.0.0.11 e1.lan
127.0.0.12 e2.lan
127.0.0.13 e3.lan" > "$HOSTS"
grep -P 'e\d.lan' "$HOSTS"

instance 1 &
E1PID=$!
instance 2 &
E2PID=$!
instance 3 &
E3PID=$!

until check; do
    sleep 5 # wait for cluster
done

# Stop e2 and e3, then drop the e3 record to simulate a lost DNS entry.
kill -9 $E2PID $E3PID
echo "127.0.0.11 e1.lan
127.0.0.12 e2.lan" > "$HOSTS"
cat "$HOSTS"
sleep 5

# Restart e2 against the existing cluster; this is where the problem shows up.
instance 2 existing &
E2PID=$!
until check; do
    sleep 5
    ps axf|grep '\<[e]tcd '
    tail -n1 e*.log
done
```

First you need to prepare two directories with binary files |
If one of the nodes in the cluster has lost a dns record, restarting the second node will break it. This PR makes an attempt to add a comparison without using a resolver, which allows to protect cluster from dns errors and does not break the current logic of comparing urls in the URLStringsEqual function. You can read more in the issue etcd-io#7798 Fixes etcd-io#7798
If one of the nodes in the cluster has lost a dns record, restarting the second node will break it. This PR makes an attempt to add a comparison without using a resolver, which allows to protect cluster from dns errors and does not break the current logic of comparing urls in the URLStringsEqual function. You can read more in the issue etcd-io#7798 Fixes etcd-io#7798 Signed-off-by: Prasad Chandrasekaran <prasadc@vmware.com>
Failure scenario:
pkg/netutil: failed resolving host xxx-00:2380 (lookup xxx-00 on 10.43.240.10:53: no such host); retrying in 1s
Expected output:
"02" was basically stuck at bootstrap. But since both "01" and "02" were up, they were expected to form a quorum and be able to serve requests.
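The quorum expectation above comes down to simple majority arithmetic. The Go sketch below illustrates it, assuming a three-member cluster (xxx-00, "01", and "02"), which the issue implies but does not state outright; `quorum` and `canServe` are hypothetical helper names, not etcd functions.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int {
	return n/2 + 1
}

// canServe reports whether the members that are up form a majority.
func canServe(members, up int) bool {
	return up >= quorum(members)
}

func main() {
	members := 3 // assumed cluster size
	up := 2      // "01" and "02" are running, xxx-00 is unresolvable
	fmt.Printf("members=%d quorum=%d up=%d canServe=%t\n",
		members, quorum(members), up, canServe(members, up))
}
```

With two of three members running, the majority requirement is met, which is why the reporter expected the cluster to serve requests even though one host could not be resolved.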