Unreachable host removed from membership, but kept as node #371

Closed
lalyos opened this issue Sep 29, 2014 · 8 comments

Labels
type/bug Feature does not function as expected

Comments

@lalyos
Contributor

lalyos commented Sep 29, 2014

Maybe it's just a wrong assumption, but I expected that if a peer is removed from the membership, it would also be removed from the list of nodes.

consul members shows only 1 member

$ consul members

Node   Address          Status  Type    Build  Protocol
node1  172.19.0.3:8301  alive   server  0.4.0  2

while in the catalog there are still 2 nodes

$ curl -s 127.0.0.1:8500/v1/catalog/nodes|jq .
[
  {
    "Node": "node1",
    "Address": "172.19.0.3"
  },
  {
    "Node": "node2",
    "Address": "172.19.0.8"
  }
]

This can be seen in the log:

$ consul monitor|grep 172.19.0.8

2014/09/28 20:11:27 [INFO] raft: Removed peer 172.19.0.8:8300, stopping replication (Index: 510)
2014/09/28 20:11:30 [ERR] raft: Failed to heartbeat to 172.19.0.8:8300: dial tcp 172.19.0.8:8300: no route to host
2014/09/28 20:11:30 [ERR] raft: Failed to AppendEntries to 172.19.0.8:8300: dial tcp 172.19.0.8:8300: no route to host
2014/09/28 20:11:33 [ERR] raft: Failed to heartbeat to 172.19.0.8:8300: dial tcp 172.19.0.8:8300: no route to host
2014/09/28 20:11:33 [ERR] raft: Failed to AppendEntries to 172.19.0.8:8300: dial tcp 172.19.0.8:8300: no route to host

Another strange thing: although node2 is unreachable (no route to host), it still has a passing health status

$ curl -s 127.0.0.1:8500/v1/health/node/node2|jq .
[
  {
    "Node": "node2",
    "CheckID": "serfHealth",
    "Name": "Serf Health Status",
    "Status": "passing",
    "Notes": "",
    "Output": "Agent alive and reachable",
    "ServiceID": "",
    "ServiceName": ""
  }
]
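
For reference, a minimal sketch of how the two views can be compared and a stale catalog entry removed by hand via the catalog deregister endpoint (assuming a local agent on 127.0.0.1:8500 and the default datacenter name dc1; node2 is just the example from above):

# what the membership layer (serf) knows about
$ consul members

# what the catalog knows about
$ curl -s 127.0.0.1:8500/v1/catalog/nodes|jq -r '.[].Node'

# manually deregister a node that is gone from the membership but still in the catalog
$ curl -s -X PUT 127.0.0.1:8500/v1/catalog/deregister -d '{"Datacenter": "dc1", "Node": "node2"}'
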
@armon
Member

armon commented Sep 29, 2014

This might be related to #360. Thanks for reporting!

@armon armon added the type/bug Feature does not function as expected label Oct 14, 2014
@armon armon closed this as completed in e39d2ee Oct 14, 2014
@wallnerryan

+1, seeing this exact same issue on CentOS 6.5 using Consul 0.4.1

# consul version
Consul v0.4.1

@armon
Member

armon commented Nov 26, 2014

Hmm, this should be fixed as of 0.4.1. Can you gist the output of /v1/catalog/nodes, /v1/agent/members, and /v1/health/state/any?
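
A quick sketch for capturing all three endpoints in one go before gisting (assumes the default local agent address; the output file name is just an example):

for path in catalog/nodes agent/members health/state/any; do
  echo "== /v1/$path =="
  curl -s 127.0.0.1:8500/v1/$path | jq .
done > consul-debug.txt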

@lalyos
Contributor Author

lalyos commented Nov 28, 2014

The problem is, I don't know how to reproduce this state. It happened once on an EC2 cloud.

I tried to reproduce it with consul leave or force-leave on a node, but that way the node is still listed by consul members. The same happens with 0.4.0 and 0.4.1.

If I kill a node, it's still listed as failed. Somehow we would need to reach a state where consul members doesn't list a node. @armon, how long does a node stay in the failed state? Does it get removed after a specific time?

@wallnerryan can you reproduce this error?
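
The attempts above boil down to roughly this (a sketch with example node names; none of these reproduces the state where the node is gone from consul members but still in the catalog):

# graceful leave, run on node2: it then shows up as "left" in consul members
$ consul leave

# force-leave from another node: same result, the member ends up as "left"
$ consul force-leave node2

# hard kill on node2: the member stays listed, but as "failed"
$ pkill -9 consul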

@armon
Member

armon commented Nov 29, 2014

The node should stay in the failed state for about 72h. There were a few issues around reaping of nodes that should have been fixed in 0.4.1, so without a repro case it's hard to triage this.

@mapa3m

mapa3m commented Aug 27, 2016

Getting the same issue: 3-node cluster, a 4th node joined temporarily and was removed a few minutes later (the VM was powered off), but Raft keeps spamming the logs with

Aug 27 17:37:47 consul03 consul[2365]: raft: Failed to AppendEntries to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host
Aug 27 17:37:47 consul03 consul[2365]: raft: Failed to heartbeat to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host
Aug 27 17:37:34 consul03 consul[2365]: raft: Failed to heartbeat to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host
Aug 27 17:37:34 consul03 consul[2365]: raft: Failed to AppendEntries to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host
Aug 27 17:37:21 consul03 consul[2365]: raft: Failed to heartbeat to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host

CentOS 6.7, Consul 0.6.4

@slackpad
Contributor

Hi @mapa3m, it probably didn't leave cleanly. I've updated the "Failure of a Server in a Multi-Server Cluster" section of the outage guide with some info on what to do in this case, and Consul 0.7 added some tools to remove a Raft peer manually.
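
For anyone landing here later, the 0.7 tooling referred to is the consul operator raft command; a rough sketch of removing a stale peer (the address is the one from the logs above, and the exact flags may differ between versions):

# show the Raft peer set as the cluster currently sees it
$ consul operator raft -list-peers

# drop the dead server from the Raft configuration
$ consul operator raft -remove-peer -address=10.10.10.155:8300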

@mapa3m

mapa3m commented Sep 21, 2016

@slackpad yep, that was it. Thanks!
