Consul Autopilot doesn't seem to want to remove old nodes that have since left (1.0.6) #4118

Closed
seafoodbuffet opened this issue May 13, 2018 · 2 comments

Comments

@seafoodbuffet

Overview of the Issue

We have an issue quite similar to #3611. One of our Consul server nodes failed yesterday. While trying to fix it, the admin deleted the Consul data directory and re-tried the join, which resulted in a still-bad node with a slightly different name in the cluster. We gracefully left one and used force-leave on the other. It's been over eight hours and the two good servers are still trying to reconnect to the bad node.
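
For reference, the removal steps were roughly as follows (the node name below is a placeholder, not the real one):

# On the server being retired: graceful departure
consul leave

# From one of the healthy servers, for the node that could not leave cleanly
consul force-leave bad-node-name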

Reproduction Steps

  1. Three-node server cluster
  2. Leave one of the server nodes
  3. The remaining two servers keep trying to reconnect to the departed node on the WAN serf port (8302)

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 3
        services = 6
build:
        prerelease = 
        revision = 9a494b5f
        version = 1.0.6
consul:
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 172.22.10.51:8300
        server = true
raft:
        applied_index = 86294598
        commit_index = 86294598
        fsm_pending = 0
        last_contact = 38.584408ms
        last_log_index = 86294598
        last_log_term = 52
        last_snapshot_index = 86292835
        last_snapshot_term = 52
        latest_configuration = [{Suffrage:Voter ID:0824f824-2cb5-f756-1210-9542f998bc7e Address:172.22.10.50:8300} {Suffrage:Voter ID:81af471c-2521-4c28-fd37-4ecf24609f3f Address:172.22.10.51:8300}]
        latest_configuration_index = 86169220
        num_peers = 1
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 52
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 477
        max_procs = 2
        os = linux
        version = go1.9.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 21
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 2
        member_time = 363
        members = 67
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 1
        health_score = 0
        intent_queue = 0
        left = 1
        member_time = 27
        members = 4
        query_queue = 0
        query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = 
        revision = 9a494b5f
        version = 1.0.6
consul:
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = 172.22.10.51:8300
        server = true
raft:
        applied_index = 86295057
        commit_index = 86295057
        fsm_pending = 0
        last_contact = 0
        last_log_index = 86295057
        last_log_term = 52
        last_snapshot_index = 86291530
        last_snapshot_term = 52
        latest_configuration = [{Suffrage:Voter ID:0824f824-2cb5-f756-1210-9542f998bc7e Address:172.22.10.50:8300} {Suffrage:Voter ID:81af471c-2521-4c28-fd37-4ecf24609f3f Address:172.22.10.51:8300}]
        latest_configuration_index = 86169220
        num_peers = 1
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 52
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 474
        max_procs = 2
        os = linux
        version = go1.9.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 21
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 2
        member_time = 363
        members = 67
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 2
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 27
        members = 4
        query_queue = 0
        query_time = 1

Operating system and Environment details

RHEL 7.5 on VMware

Log Fragments

On the node that gracefully left:

May 12 22:30:50 consul01 consul[9025]: 2018/05/12 22:30:50 [INFO] consul: server starting leave
May 12 22:30:51 consul01 consul[9025]: 2018/05/12 22:30:51 [INFO] serf: EventMemberLeave: consul01.aus 172.16.10.50
May 12 22:30:51 consul01 consul[9025]: 2018/05/12 22:30:51 [INFO] consul: Handled member-leave event for server "consul01.aus" in area "wan"
May 12 22:30:53 consul01 consul[9025]: 2018/05/12 22:30:53 [INFO] serf: EventMemberLeave: consul01 172.16.10.50
May 12 22:30:53 consul01 consul[9025]: 2018/05/12 22:30:53 [INFO] consul: Removing LAN server consul01 (Addr: tcp/172.16.10.50:8300) (DC: aus)
May 12 22:30:53 consul01 consul[9025]: 2018/05/12 22:30:53 [INFO] consul: Waiting 5s to drain RPC traffic
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Requesting shutdown
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] consul: shutting down server
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] manager: shutting down
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: consul server down
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: shutdown complete
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Stopping DNS server 172.16.10.50:8600 (tcp)
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Stopping DNS server 172.16.10.50:8600 (udp)
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Stopping HTTP server 172.16.10.50:8500 (tcp)
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Waiting for endpoints to shut down
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] agent: Endpoints down
May 12 22:30:58 consul01 consul[9025]: 2018/05/12 22:30:58 [INFO] Exit code: 0

On the remaining server node:

May 13 07:40:22 consul03 consul[5974]: 2018/05/13 07:40:22 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:40:52 consul03 consul[5974]: 2018/05/13 07:40:52 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:41:22 consul03 consul[5974]: 2018/05/13 07:41:22 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:41:52 consul03 consul[5974]: 2018/05/13 07:41:52 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:42:22 consul03 consul[5974]: 2018/05/13 07:42:22 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
May 13 07:42:52 consul03 consul[5974]: 2018/05/13 07:42:52 [INFO] serf: attempting reconnect to consul01.aus 172.16.10.50:8302
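
The reconnect attempts above target the WAN serf port (8302), which suggests the departed server is still tracked as failed in the WAN pool. The WAN member list can be checked with (assuming the default agent address):

consul members -wan

As I understand it, serf only retries members it considers failed, and those are reaped after the reconnect timeout (72 hours by default).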

Autopilot health reports that the cluster was stable as of when the sick node left:

{
  "Healthy": true,
  "FailureTolerance": 0,
  "Servers": [
    {
      "ID": "0824f824-2cb5-f756-1210-9542f998bc7e",
      "Name": "consul02",
      "Address": "172.22.10.50:8300",
      "SerfStatus": "alive",
      "Version": "1.0.6",
      "Leader": false,
      "LastContact": "48.118308ms",
      "LastTerm": 52,
      "LastIndex": 86285568,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2018-05-13T03:25:40Z"
    },
    {
      "ID": "81af471c-2521-4c28-fd37-4ecf24609f3f",
      "Name": "consul03",
      "Address": "172.22.10.51:8300",
      "SerfStatus": "alive",
      "Version": "1.0.6",
      "Leader": true,
      "LastContact": "0s",
      "LastTerm": 52,
      "LastIndex": 86285568,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2018-05-13T03:25:40Z"
    }
  ]
}
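
For reference, the report above came from the autopilot health endpoint, queried with something like the following (assuming the default HTTP port):

curl http://127.0.0.1:8500/v1/operator/autopilot/health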

I've already tried to force-leave the bad nodes (both by node name and by IP), but that hasn't resolved the issue.

I'm planning to stand up a new node and join it to the cluster to restore fault tolerance, but is there a way to ditch this old, bad node? Is it a problem that it persists?
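
For completeness, these are the pruning commands I'm aware of; in our case the bad node no longer shows up in the raft configuration, so they may not apply here (the address below is a placeholder):

# List the current raft peers; a stale server would still be listed here
consul operator raft list-peers

# If a stale peer were listed, it could be removed explicitly
consul operator raft remove-peer -address=172.22.10.99:8300

# Autopilot's dead-server cleanup setting can be inspected with
consul operator autopilot get-config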

@seafoodbuffet
Author

I wanted to come back and update this issue. We ended up creating a new Consul server node, and when it joined the cluster the old bad node was removed from the list. Oddly, the non-leader is still trying to reconnect to the bad node. Not sure what to do about that.

@seafoodbuffet
Author

I guess this was just a matter of time. Adding new nodes caused the left nodes to be removed, and eventually the Consul servers stopped trying to contact them.
