
etcd pod stuck at bootstrap and kept "failed resolving host" #7798

Closed
Tracked by #14750
hongchaodeng opened this issue Apr 21, 2017 · 5 comments · Fixed by #13224

@hongchaodeng
Contributor

Failure scenario:

  • Tried to add 3 members one by one: "00", "01", "02".
  • After "02" was added, but before "02" started (there is a 5s sleep), "00" was removed.
  • Since "00" was removed, its DNS record was also removed.
  • After "02" started, it kept failing with the error: pkg/netutil: failed resolving host xxx-00:2380 (lookup xxx-00 on 10.43.240.10:53: no such host); retrying in 1s.

Expected behavior:
"02" was essentially stuck at bootstrap. But since both "01" and "02" were up, they were expected to form a quorum and be able to serve requests.

@xiang90
Contributor

xiang90 commented Oct 4, 2017

moved to 3.4

@xiang90 xiang90 modified the milestones: v3.3.0, v3.4.0 Oct 4, 2017
@worp1900

I seem to have stumbled upon this issue or at least a similar error output:

docker container logs percona_etcd_1
2018-04-24 18:54:03.741246 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://galera_etcd:2379,http://galera_etcd:4001
2018-04-24 18:54:03.741314 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/opt/etcd/data
2018-04-24 18:54:03.741343 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://galera_etcd:2380
2018-04-24 18:54:03.741352 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd0=http://galera_etcd:2380
2018-04-24 18:54:03.741356 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
2018-04-24 18:54:03.741363 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
2018-04-24 18:54:03.741371 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
2018-04-24 18:54:03.741381 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2018-04-24 18:54:03.741396 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd0
2018-04-24 18:54:03.741442 I | etcdmain: etcd Version: 3.3.3
2018-04-24 18:54:03.741448 I | etcdmain: Git SHA: e348b1aed
2018-04-24 18:54:03.741451 I | etcdmain: Go Version: go1.9.5
2018-04-24 18:54:03.741454 I | etcdmain: Go OS/Arch: linux/amd64
2018-04-24 18:54:03.741461 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2018-04-24 18:54:03.741473 N | etcdmain: failed to detect default host (could not find default route)
2018-04-24 18:54:03.741581 I | embed: listening for peers on http://0.0.0.0:2380
2018-04-24 18:54:03.741609 I | embed: listening for client requests on 0.0.0.0:2379
2018-04-24 18:54:03.741629 I | embed: listening for client requests on 0.0.0.0:4001
2018-04-24 18:54:13.743978 W | pkg/netutil: failed resolving host galera_etcd:2380 (lookup galera_etcd on 127.0.0.11:53: read udp 127.0.0.1:55975->127.0.0.11:53: i/o timeout); retrying in 1s
2018-04-24 18:54:24.744523 W | pkg/netutil: failed resolving host galera_etcd:2380 (lookup galera_etcd on 127.0.0.11:53: read udp 127.0.0.1:43092->127.0.0.11:53: i/o timeout); retrying in 1s
2018-04-24 18:54:33.743635 W | pkg/netutil: failed resolving host galera_etcd:2380 (i/o timeout); retrying in 1s
2018-04-24 18:54:33.743722 E | pkg/netutil: could not resolve host galera_etcd:2380
2018-04-24 18:54:33.744670 C | etcdmain: failed to resolve http://galera_etcd:2380 to match --initial-cluster=etcd0=http://galera_etcd:2380 (failed to resolve "http://galera_etcd:2380" (i/o timeout))

When starting an etcd service:

etcd:
    image: quay.io/coreos/etcd
    command: etcd
    volumes:
    - etcd_data:/etc/ssl/certs
    ports:
    - "2379:2379"
    - "2380:2380"
    env_file: etcd.env
    networks:
    - etcd

With these env vars:

ETCD_DATA_DIR=/opt/etcd/data
ETCD_NAME=etcd0
ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
ETCD_ADVERTISE_CLIENT_URLS=http://galera_etcd:2379,http://galera_etcd:4001
ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://galera_etcd:2380
ETCD_INITIAL_CLUSTER=etcd0=http://galera_etcd:2380
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1

I suspect this is a DNS resolution or port problem. But opening ports 2379 and 2380 didn't solve it, and I wasn't able to dig deep enough into Docker's DNS and hostname resolution to analyse it further.

This is running on a Photon OS host:

root@test-010 [ /etc/systemd/scripts ]# cat /etc/*-release
DISTRIB_ID="VMware Photon OS"
DISTRIB_RELEASE="2.0"
DISTRIB_CODENAME=Photon
DISTRIB_DESCRIPTION="VMware Photon OS 2.0"
NAME="VMware Photon OS"
VERSION="2.0"
ID=photon
VERSION_ID=2.0
PRETTY_NAME="VMware Photon OS/Linux"
ANSI_COLOR="1;34"
HOME_URL="https://vmware.github.io/photon/"
BUG_REPORT_URL="https://github.com/vmware/photon/issues"
VMware Photon OS 2.0
PHOTON_BUILD_NUMBER=304b817


root@test-010 [ /etc/systemd/scripts ]# docker version
Client:
Version:      17.06.0-ce
API version:  1.30
Go version:   go1.8.1
Git commit:   02c1d87
Built:        Thu Oct 26 06:33:23 2017
OS/Arch:      linux/amd64

Server:
Version:      17.06.0-ce
API version:  1.30 (minimum version 1.12)
Go version:   go1.8.1
Git commit:   02c1d87
Built:        Thu Oct 26 06:34:46 2017
OS/Arch:      linux/amd64
Experimental: false

@gyuho gyuho modified the milestones: etcd-v3.4, etcd-v3.5 Aug 5, 2019
@futangwa

futangwa commented May 7, 2020

Hi, @gyuho @xiang90 @hongchaodeng

I have the same issue, but in a slightly different scenario:
I had a cluster of 3 etcd members, say etcd0, etcd1, etcd2. The cluster was set up successfully. Then I took etcd2 down and did not bring it back; the cluster kept working with etcd0 and etcd1. But when I then rebooted etcd1, it failed to come up and rejoin the cluster. It keeps logging the lines below and restarting:
... ...
2020-05-07 16:23:39.252814 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:40.261997 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:41.272573 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:42.278582 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:43.287286 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:44.300703 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:44.875263 E | pkg/netutil: could not resolve host etcd2.etcd:7001
2020-05-07 16:23:44.877676 C | etcdmain: error validating peerURLs {ClusterID:4e064b7f85ef47b8 Members:[&{ID:27cec8c992d21488 RaftAttributes:{PeerURLs:[http://etcd0.etcd:7001]} Attributes:{Name:etcd0 ClientURLs:[http://10.254.90.234:4001]}} &{ID:4a258c8c3afe2411 RaftAttributes:{PeerURLs:[http://etcd2.etcd:7001]} Attributes:{Name:etcd2 ClientURLs:[http://10.254.10.10:4001]}} &{ID:87fd0805633d77bd RaftAttributes:{PeerURLs:[http://etcd1.etcd:7001]} Attributes:{Name:etcd1 ClientURLs:[http://10.254.168.221:4001]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs (failed to resolve "http://etcd2.etcd:7001" (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving))
2020-05-07 16:23:44,897 INFO exited: etcd (exit status 1; not expected)

The rolling 'failed resolving host' log was expected in my test environment. But since we still had etcd0 and etcd1, I expected etcd1 to at least come back up and rejoin the cluster after restarting.

I'm using an old etcd version, 3.3. Was this issue fixed in a later release, or is it still there?

Thanks a lot for your information.

@sakateka
Contributor

Hi!
I have investigated this problem.
The root cause of the failure is a missing DNS record.

Steps to reproduce the problem.

  1. Run a cluster of 3 nodes (say e1, e2, e3).
  2. Delete the DNS record for one of the nodes (say e3).
  3. Wait for the DNS cache to stop resolving the deleted record.
  4. Shut down one of the remaining nodes (say e2).
  5. Delete the data for this node (the WAL for e2).
  6. Try to start the node (e2) with --initial-cluster-state existing.

With the current code, the result is an error at startup of the e2 node. The last line in the log will be something like the following:

... PeerURLs: no match found for existing member (3d070a4fab288fc1, [http://e3.lan:32380]), last resolver error (failed to resolve \"http://e3.lan:32380\" (context deadline exceeded))" ...

The failure is caused by the netutil.URLStringsEqual function (tag v3.5.0).

After studying the uses of this function, I came to the conclusion that it would be completely valid to add a short path to it: if the comparison of the URLs succeeds before any address resolution, the function does not need to contact the resolver at all and can return true immediately.
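The idea can be sketched as follows. This is a simplified illustration of the proposed short path, not etcd's actual netutil code; the function names here are hypothetical, and the resolver-based fallback is stubbed out:

```go
package main

import (
	"fmt"
	"sort"
)

// stringsEqualSorted reports whether two URL lists contain the same
// strings, ignoring order. No DNS resolution is involved.
func stringsEqualSorted(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

// urlStringsEqual sketches the proposed behavior: if the URL strings
// already match textually, return true without ever touching the
// resolver. Only on a mismatch would the resolver-based comparison run.
func urlStringsEqual(a, b []string) bool {
	if stringsEqualSorted(a, b) {
		return true // short path: no DNS lookup needed
	}
	// Fall back to the resolver-based comparison (omitted in this sketch).
	return false
}

func main() {
	// Identical peer URL lists compare equal even if e3.lan has lost
	// its DNS record, because the resolver is never consulted.
	fmt.Println(urlStringsEqual(
		[]string{"http://e3.lan:32380"},
		[]string{"http://e3.lan:32380"},
	)) // prints true
}
```

With this short circuit, a member whose configured peer URLs already match the cluster's stored URLs can start even when another member's hostname no longer resolves.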

To make the problem easier to reproduce, I wrote a bash script that fully automates the steps leading to the problem described in this issue:

#!/bin/bash
set -xue
HOSTS=${HOSTS:-/etc/hosts}

if ! grep docker /proc/1/cgroup; then
    echo "Run me inside docker"
    exit 1
fi

if ! test -w $HOSTS; then
    echo "It is assumed that you have write rights to the $HOSTS"
    exit 1
fi

trap 'pkill -9 -x etcd' TERM EXIT ERR

TYPE="${1:-v3.5.0}"

instance() {
    rm -vrf "e$1"
    name="e$1"

    declare -A ports=(
        [e1]=2379
        [e2]=22379
        [e3]=32379
    )

    exec /$TYPE/etcd --name $name \
      --data-dir $name \
      --listen-client-urls http://127.0.0.1"$1":${ports[$name]} \
      --advertise-client-urls http://${name}.lan:${ports[$name]} \
      --listen-peer-urls http://127.0.0.1"$1":$((${ports[$name]}+1)) \
      --initial-advertise-peer-urls http://${name}.lan:$((${ports[$name]}+1)) \
      --initial-cluster e1=http://e1.lan:2380,e2=http://e2.lan:22380,e3=http://e3.lan:32380 \
      --initial-cluster-token tkn \
      --initial-cluster-state ${2:-new} &> $name.log
}

check() {
    ETCDCTL_API=3 /$TYPE/etcdctl \
      --endpoints e1.lan:2379,e2.lan:22379,e3.lan:32379 \
      endpoint health
}

ls -lh /$TYPE/
DIR="etcd-$TYPE"
mkdir -vp $DIR
pushd $DIR

echo "Fix $HOSTS"
echo "127.0.0.11 e1.lan
127.0.0.12 e2.lan
127.0.0.13 e3.lan" > $HOSTS
grep -P 'e\d.lan' $HOSTS

instance 1 &
E1PID=$!

instance 2 &
E2PID=$!

instance 3 &
E3PID=$!

until check; do
    sleep 5 # wait for cluster
done

kill -9 $E2PID $E3PID

echo "127.0.0.11 e1.lan
127.0.0.12 e2.lan" > $HOSTS
cat $HOSTS
sleep 5

instance 2 existing &
E2PID=$!
until check; do
    sleep 5
    ps axf|grep '\<[e]tcd '
    tail -n1 e*.log
done

First you need to prepare two directories with the etcd and etcdctl binaries: one for v3.5.0 and one for the version from my PR #13224.
Then run: docker run -it --rm -v $PWD/v3.5.0:/v3.5.0 -v $PWD:/cwd -v $PWD/../bin/:/patched ubuntu /cwd/issue-7798.sh [patched], where patched is an optional argument.
I ran it from the directory ~/github.com/etcd-io/etcd/tmp, after building the code with ~/github.com/etcd-io/etcd/build.sh.

@sakateka
Contributor

Hi, @gyuho @xiang90 @ptabor @spzala!
Could you please take a look at this issue and the related PR?
Thank you!

wilsonwang371 pushed a commit to wilsonwang371/etcd that referenced this issue Oct 29, 2021
If one of the nodes in the cluster has lost a dns record,
restarting the second node will break it.
This PR makes an attempt to add a comparison without using a resolver,
which allows to protect cluster from dns errors and does not break
the current logic of comparing urls in the URLStringsEqual function.
You can read more in the issue etcd-io#7798

Fixes etcd-io#7798
wilsonwang371 pushed a commit to wilsonwang371/etcd that referenced this issue Nov 3, 2021
pchan pushed a commit to pchan/etcd that referenced this issue Oct 11, 2022
pchan pushed a commit to pchan/etcd that referenced this issue Oct 11, 2022
pchan pushed a commit to pchan/etcd that referenced this issue Oct 12, 2022
pchan pushed a commit to pchan/etcd that referenced this issue Oct 12, 2022
pchan pushed a commit to pchan/etcd that referenced this issue Oct 12, 2022
@serathius serathius mentioned this issue Nov 14, 2022
22 tasks
tjungblu pushed a commit to tjungblu/etcd that referenced this issue Jul 26, 2023