Consul Autopilot doesn't seem to want to remove old nodes that have since left (1.0.6) #4118

Closed
seafoodbuffet opened this issue May 13, 2018 · 2 comments

Comments

@seafoodbuffet

Overview of the Issue

We have an issue quite similar to #3611. One of our Consul server nodes failed yesterday. While trying to fix it, the admin deleted the Consul data directory and re-tried the join, which resulted in a still-bad node with a slightly different name in the cluster. We gracefully left one and used force-leave on the other. It's been over eight hours and the two good servers are still trying to reconnect to the bad node.
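
For reference, the removal steps were roughly as follows (the node name below is a placeholder, not the real one):

# On the server being retired: graceful departure
consul leave

# From one of the healthy servers, for the node that could not leave cleanly
consul force-leave bad-node-name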

Reproduction Steps

  1. Three-node server cluster
  2. Leave one of the server nodes
  3. The remaining two servers keep trying to reconnect to the departed node on the WAN serf port (8302)

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 3
        services = 6
build:
        prerelease = 
        revision = 9a494b5f
        version = 1.0.6
consul:
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 172.22.10.51:8300
        server = true
raft:
        applied_index = 86294598
        commit_index = 86294598
        fsm_pending = 0
        last_contact = 38.584408ms
        last_log_index = 86294598
        last_log_term = 52
        last_snapshot_index = 86292835
        last_snapshot_term = 52
        latest_configuration = [{Suffrage:Voter ID:0824f824-2cb5-f756-1210-9542f998bc7e Address:172.22.10.50:8300} {Suffrage:Voter ID:81af471c-2521-4c28-fd37-4ecf24609f3f Address:172.22.10.51:8300}]
        latest_configuration_index = 86169220
        num_peers = 1
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 52
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 477
        max_procs = 2
        os = linux
        version = go1.9.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 21
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 2
        member_time = 363
        members = 67
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 1
        health_score = 0
        intent_queue = 0
        left = 1
        member_time = 27
        members = 4
        query_queue = 0
        query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = 
        revision = 9a494b5f
        version = 1.0.6
consul:
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = 172.22.10.51:8300
        server = true
raft:
        applied_index = 86295057
        commit_index = 86295057
        fsm_pending = 0
        last_contact = 0
        last_log_index = 86295057
        last_log_term = 52
        last_snapshot_index = 86291530
        last_snapshot_term = 52
        latest_configuration = [{Suffrage:Voter ID:0824f824-2cb5-f756-1210-9542f998bc7e Address:172.22.10.50:8300} {Suffrage:Voter ID:81af471c-2521-4c28-fd37-4ecf24609f3f Address:172.22.10.51:8300}]
        latest_configuration_index = 86169220
        num_peers = 1
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 52
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 474
        max_procs = 2
        os = linux
        version = go1.9.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 21
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 2
        member_time = 363
        members = 67
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 2
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 27
        members = 4
        query_queue = 0
        query_time = 1

Operating system and Environment details

RHEL 7.5 on VMware

Log Fragments

On the node that gracefully left:

May 12 22:30:50 consul01 consul[9025]: 2018/05/12 22:30:50 [INFO] consul: server starting leave
May 12 22:30:51 consul01 consul[9025]: 2018/05/12 22:30:51 [INFO] serf: EventMemberLeave: consul01.aus 172.16.10.50
May 12 22:30:51 consul01 consul[9025]: 2018/05/12 22:30:51 [INFO] consul: Handled member-leave event for server "consul01.aus" in area "wan"
May 12 22:30:53 consul01 consul[9025]: 2018/05/12 22:30:53 [INFO] serf: EventMemberLeave: consul01 172.16.10.50
May 12 22:30:53 consul01 consul[9025]: 2018/05/12 22:30:53 [INFO] consul: Removing LAN server consul01 (Addr: tcp/172.16.10.50:8300) (DC: aus)
May 12 22:30:53 consul01 consul[9025]: 2018/05/12 22:30:53 [INFO] consul: Waiting 5s to drain RPC traffic
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Requesting shutdown
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] consul: shutting down server
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] manager: shutting down
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: consul server down
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: shutdown complete
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Stopping DNS server 172.16.10.50:8600 (tcp)
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Stopping DNS server 172.16.10.50:8600 (udp)
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Stopping HTTP server 172.16.10.50:8500 (tcp)
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Waiting for endpoints to shut down
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Endpoints down
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] Exit code: 0

On the remaining server node:

May 13 07:40:22 consul03 consul[5974]: 2018/05/13 07:40:22 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:40:52 consul03 consul[5974]: 2018/05/13 07:40:52 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:41:22 consul03 consul[5974]: 2018/05/13 07:41:22 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:41:52 consul03 consul[5974]: 2018/05/13 07:41:52 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:42:22 consul03 consul[5974]: 2018/05/13 07:42:22 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:42:52 consul03 consul[5974]: 2018/05/13 07:42:52 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
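
The reconnect attempts above target the WAN serf port (8302), which suggests the departed server is still tracked as failed in the WAN pool. The WAN member list can be checked with (assuming the default agent address):

consul members -wan

As I understand it, serf only retries members it considers failed, and those are reaped after the reconnect timeout (72 hours by default).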

Autopilot health reports that the cluster was stable as of when the sick node left:

{
  "Healthy": true,
  "FailureTolerance": 0,
  "Servers": [
    {
      "ID": "0824f824-2cb5-f756-1210-9542f998bc7e",
      "Name": "consul02",
      "Address": "172.22.10.50:8300",
      "SerfStatus": "alive",
      "Version": "1.0.6",
      "Leader": false,
      "LastContact": "48.118308ms",
      "LastTerm": 52,
      "LastIndex": 86285568,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2018-05-13T03:25:40Z"
    },
    {
      "ID": "81af471c-2521-4c28-fd37-4ecf24609f3f",
      "Name": "consul03",
      "Address": "172.22.10.51:8300",
      "SerfStatus": "alive",
      "Version": "1.0.6",
      "Leader": true,
      "LastContact": "0s",
      "LastTerm": 52,
      "LastIndex": 86285568,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2018-05-13T03:25:40Z"
    }
  ]
}
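
For reference, the report above came from the autopilot health endpoint, queried with something like the following (assuming the default HTTP port):

curl http://127.0.0.1:8500/v1/operator/autopilot/health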

I've already tried to force-leave the bad nodes (both by node name and by IP), but that hasn't resolved the issue.

I'm planning to stand up a new node and join it to the cluster to restore fault tolerance, but is there a way to ditch this old, bad node? Is it a problem that it persists?
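
For completeness, these are the pruning commands I'm aware of; in our case the bad node no longer shows up in the raft configuration, so they may not apply here (the address below is a placeholder):

# List the current raft peers; a stale server would still be listed here
consul operator raft list-peers

# If a stale peer were listed, it could be removed explicitly
consul operator raft remove-peer -address=172.22.10.99:8300

# Autopilot's dead-server cleanup setting can be inspected with
consul operator autopilot get-config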

@seafoodbuffet
Author

I wanted to come back and update this issue. We ended up creating a new Consul server node, and when it joined the cluster the old bad node was removed from the list. Oddly, the non-leader is still trying to reconnect to the bad node. Not sure what to do about that.

@seafoodbuffet
Author

I guess this was just a matter of time. Adding new nodes caused the left nodes to be removed, and eventually the Consul servers stopped trying to contact them.
