Unreachable host removed from membership, but kept as node #371

Closed
lalyos opened this issue Sep 29, 2014 · 8 comments

Labels
type/bug Feature does not function as expected

Comments

@lalyos
Contributor

lalyos commented Sep 29, 2014

Maybe it's just a wrong assumption, but I expected that if a peer is removed from the membership, it would also be removed from the list of nodes.

consul members shows only 1 member

$ consul members

Node   Address          Status  Type    Build  Protocol
node1  172.19.0.3:8301  alive   server  0.4.0  2

while in the catalog there are still 2 nodes

$ curl -s 127.0.0.1:8500/v1/catalog/nodes|jq .
[
  {
    "Node": "node1",
    "Address": "172.19.0.3"
  },
  {
    "Node": "node2",
    "Address": "172.19.0.8"
  }
]

This can be seen in the log:

$ consul monitor|grep 172.19.0.8

2014/09/28 20:11:27 [INFO] raft: Removed peer 172.19.0.8:8300, stopping replication (Index: 510)
2014/09/28 20:11:30 [ERR] raft: Failed to heartbeat to 172.19.0.8:8300: dial tcp 172.19.0.8:8300: no route to host
2014/09/28 20:11:30 [ERR] raft: Failed to AppendEntries to 172.19.0.8:8300: dial tcp 172.19.0.8:8300: no route to host
2014/09/28 20:11:33 [ERR] raft: Failed to heartbeat to 172.19.0.8:8300: dial tcp 172.19.0.8:8300: no route to host
2014/09/28 20:11:33 [ERR] raft: Failed to AppendEntries to 172.19.0.8:8300: dial tcp 172.19.0.8:8300: no route to host

Another strange thing: although node2 is unreachable (no route to host), it still has a passing health status

$ curl -s 127.0.0.1:8500/v1/health/node/node2|jq .
[
  {
    "Node": "node2",
    "CheckID": "serfHealth",
    "Name": "Serf Health Status",
    "Status": "passing",
    "Notes": "",
    "Output": "Agent alive and reachable",
    "ServiceID": "",
    "ServiceName": ""
  }
]
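
For reference, a minimal sketch of how the two views can be compared and a stale catalog entry removed by hand via the catalog deregister endpoint (assuming a local agent on 127.0.0.1:8500 and the default datacenter name dc1; node2 is just the example from above):

# what the membership layer (serf) knows about
$ consul members

# what the catalog knows about
$ curl -s 127.0.0.1:8500/v1/catalog/nodes|jq -r '.[].Node'

# manually deregister a node that is gone from the membership but still in the catalog
$ curl -s -X PUT 127.0.0.1:8500/v1/catalog/deregister -d '{"Datacenter": "dc1", "Node": "node2"}'
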
@armon
Member

armon commented Sep 29, 2014

This might be related to #360. Thanks for reporting!

@armon armon added the type/bug Feature does not function as expected label Oct 14, 2014
@armon armon closed this as completed in e39d2ee Oct 14, 2014
@wallnerryan

+1, seeing this exact same issue on CentOS 6.5 using Consul 0.4.1

# consul version
Consul v0.4.1

@armon
Member

armon commented Nov 26, 2014

Hmm, this should be fixed as of 0.4.1. Can you gist the output of /v1/catalog/nodes, /v1/agent/members, and /v1/health/state/any?
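
A quick sketch for capturing all three endpoints in one go before gisting (assumes the default local agent address; the output file name is just an example):

for path in catalog/nodes agent/members health/state/any; do
  echo "== /v1/$path =="
  curl -s 127.0.0.1:8500/v1/$path | jq .
done > consul-debug.txt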

@lalyos
Contributor Author

lalyos commented Nov 28, 2014

The problem is, I don't know how to reproduce this state. It happened once on an EC2 cloud.

I tried to reproduce it with consul leave or force-leave on a node, but that way the node is still listed by consul members. The same happens with 0.4.0 and 0.4.1.

If I kill a node, it's still listed as failed. Somehow we would need to reach a state where consul members doesn't list a node. @armon, how long does a node stay in the failed state? Does it get removed after a specific time?

@wallnerryan can you reproduce this error?
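
The attempts above boil down to roughly this (a sketch with example node names; none of these reproduces the state where the node is gone from consul members but still in the catalog):

# graceful leave, run on node2: it then shows up as "left" in consul members
$ consul leave

# force-leave from another node: same result, the member ends up as "left"
$ consul force-leave node2

# hard kill on node2: the member stays listed, but as "failed"
$ pkill -9 consul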

@armon
Member

armon commented Nov 29, 2014

The node should stay in the failed state for about 72h. There were a few issues around reaping of nodes that should have been fixed in 0.4.1, so without a repro case it's hard to triage this.

@mapa3m

mapa3m commented Aug 27, 2016

Getting the same issue: 3-node cluster, a 4th node joined temporarily and was removed a few minutes later (the VM was powered off), but Raft keeps spamming the logs with

Aug 27 17:37:47 consul03 consul[2365]: raft: Failed to AppendEntries to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host
Aug 27 17:37:47 consul03 consul[2365]: raft: Failed to heartbeat to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host
Aug 27 17:37:34 consul03 consul[2365]: raft: Failed to heartbeat to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host
Aug 27 17:37:34 consul03 consul[2365]: raft: Failed to AppendEntries to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host
Aug 27 17:37:21 consul03 consul[2365]: raft: Failed to heartbeat to 10.10.10.155:8300: dial tcp 10.10.10.155:8300: getsockopt: no route to host

CentOS 6.7, Consul 0.6.4

@slackpad
Contributor

Hi @mapa3m, it probably didn't leave cleanly. I've updated the "Failure of a Server in a Multi-Server Cluster" section of the outage guide with some info on what to do in this case, and Consul 0.7 added some tools to remove a Raft peer manually.
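
For anyone landing here later, the 0.7 tooling referred to is the consul operator raft command; a rough sketch of removing a stale peer (the address is the one from the logs above, and the exact flags may differ between versions):

# show the Raft peer set as the cluster currently sees it
$ consul operator raft -list-peers

# drop the dead server from the Raft configuration
$ consul operator raft -remove-peer -address=10.10.10.155:8300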

@mapa3m

mapa3m commented Sep 21, 2016

@slackpad yep, that was it. Thanks!
