Name conflict with terminated Consul servers #4879

Closed
aglover-zendesk opened this issue Oct 31, 2018 · 5 comments
Labels
needs-investigation The issue described is detailed and complex.

Comments

@aglover-zendesk

Overview of the Issue

I recently did a rolling replacement of our Consul servers as part of an OS upgrade to Ubuntu 16.04. When replacing Consul servers in our smaller environments (<500 nodes), terminating and replacing the servers one by one worked without issue. In our larger Consul environments (>1300 nodes), when the replacement servers came online and tried to join the Consul cluster, I would see the following errors on the remaining Consul servers:

2018/09/25 22:52:14 [WARN] serf: Name conflict for 'consul2.<region>.<domain>.com' both 10.x.x.x:8301 and 10.x.x.y:8301 are claiming
memberlist: Conflicting address for consul2.<region>.<domain>.com. Mine: 10.x.x.x:8301 Theirs: 10.x.x.y:8301
serf: Name conflict for 'consul2.<region>.<domain>.com' both 10.x.x.x:8301 and 10.x.x.y:8301 are claiming
2018/09/25 22:52:15 [ERR] memberlist: Conflicting address for consul2.<region>.<domain>.com. Mine: 10.x.x.x:8301 Theirs: 10.x.x.y:8301
... (error messages repeat)

Concurrently, the new host would see the other client nodes (lots of EventMemberJoin events logged), but would not be able to join its peers as a server. Relevant logs from the new Consul server trying to join the local cluster:

serf: EventMemberJoin: some.host.<region>.<domain>.com 10.x.x.x
... (hundreds of EventMemberJoin events)
memberlist: Refuting a suspect message (from: consul2.<region>.<domain>.com)
raft: no known peers, aborting election
agent: failed to sync remote state: No cluster leader
http: Request PUT /v1/session/create, error: No cluster leader from=127.0.0.1:58754
Lock acquisition failed: failed to create session: Unexpected response code: 500 (No cluster leader)
agent: Coordinate update error: No cluster leader
2018/09/25 22:58:03 [ERR] http: Request PUT /v1/session/create, error: No cluster leader from=127.0.0.1:58794
... (error messages repeat)

The errors would occur for 15-20 minutes. Then the cluster would "forget" about the old node, the remaining servers would stop logging the "name conflict" errors, and the new node would join the cluster without issue:

2018/09/25 23:07:34 [ERR] agent: failed to sync remote state: No cluster leader
agent: failed to sync remote state: No cluster leader
2018/09/25 23:07:37 [DEBUG] raft-net: 10.x.x.a:8300 accepted connection from: 10.x.x.b:34336
2018/09/25 23:07:37 [WARN] raft: Failed to get previous log: 223317551 log not found (last: 0)
raft: Failed to get previous log: 223317551 log not found (last: 0)
2018/09/25 23:07:37 [INFO] snapshot: Creating new snapshot at /var/lib/consul/raft/snapshots/83955-223304017-1537916857426.tmp
2018/09/25 23:07:37 [WARN] Unable to get address for server id e7dd315d-bd4c-40cc-93d0-0bbd07b5300c, using fallback address 10.x.x.c:8300: Could not find address for server id e7dd315d-bd4c-40cc-93d0-0bbd07b5300c
2018/09/25 23:07:37 [WARN] Unable to get address for server id ad1d2999-b84e-4bef-a7ae-c6d0773b9afc, using fallback address 10.x.x.d:8300: Could not find address for server id ad1d2999-b84e-4bef-a7ae-c6d0773b9afc
snapshot: Creating new snapshot at /var/lib/consul/raft/snapshots/83955-223304017-1537916857426.tmp
2018/09/25 23:07:37 [DEBUG] raft-net: 10.x.x.a:8300 accepted connection from: 10.x.x.b:52862
2018/09/25 23:07:37 [INFO] raft: Copied 27316642 bytes to local snapshot
raft: Copied 27316642 bytes to local snapshot
2018/09/25 23:07:38 [INFO] raft: Installed remote snapshot

This happened consistently as I replaced each of the 5 servers one by one.

At the time, I tried using consul force-leave to eject the old member information, but consul members still showed the old node as left with the old IP, and the name conflict error was still present. Similarly, I checked the raft data to see if the old node was still there, but it had already been removed cleanly. I also tried gracefully leaving first, executing consul leave and stopping the consul agent before terminating the host, but the stale data persisted either way.
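
For reference, the cleanup attempts looked roughly like this (a sketch, not the exact commands; consul operator raft list-peers stands in for however you prefer to inspect the raft configuration, and the systemd unit name is an assumption):

# On one of the healthy servers: see how the old node is listed
consul members | grep consul1

# Try to eject the stale member information
consul force-leave consul1.<region>.<domain>.com

# Check whether the old server is still in the raft peer set
consul operator raft list-peers

# Graceful variant, run on the outgoing server before terminating the host
consul leave
systemctl stop consul    # assumes a systemd unit named "consul"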

For context, we re-use node names, so when I terminate consul1.<region>.<domain>.com, the new server that replaces it has a new IP and a new node ID, but the same consul1.<region>.<domain>.com FQDN.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 1300+ client nodes and 5 server nodes, using DNS names for retry_join values
  2. Terminate one of the 5 Consul servers
  3. Create a new Consul server with identical Consul configuration, same FQDN, but a new IP address
  4. Run consul monitor -log-level=debug on both the new Consul server and one of the remaining 4 server nodes to view the errors as the new server tries to join the cluster (a rough command sketch follows this list)
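
Roughly, the replacement procedure on the new host looks like this (a sketch; the data directory matches the config below, but the systemd unit name and the clean-wipe step are assumptions about how the box is provisioned):

# On the replacement host (same FQDN as the terminated server, new IP):
# start from an empty data dir so the agent generates a fresh node ID
rm -rf /var/lib/consul/*
systemctl start consul             # assumes a systemd unit named "consul"

# Watch the join attempt from both sides
consul monitor -log-level=debug    # on the new server
consul monitor -log-level=debug    # on one of the remaining 4 servers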

Consul info for both Client and Server

Server configuration (partially redacted)
{
  "datacenter": "<region>",
  "data_dir": "/var/lib/consul",
  "telemetry": {
    "dogstatsd_addr": "127.0.0.1:8125",
    "dogstatsd_tags": [
      "consul_datacenter:<region>"
    ]
  },
  "bind_addr": "10.x.x.x",
  "client_addr": "127.0.0.1",
  "disable_remote_exec": true,
  "dns_config": {
    "allow_stale": true,
    "enable_truncate": true,
    "only_passing": true,
    "service_ttl": {
      "*": "15s"
    },
    "node_ttl": "15s"
  },
  "recursors": [
    "127.0.0.1"
  ],
  "enable_syslog": true,
  "acl_default_policy": "allow",
  "acl_down_policy": "extend-cache",
  "ui": false,
  "verify_incoming": false,
  "verify_outgoing": false,
  "verify_server_hostname": false,
  "watches": [],
  "leave_on_terminate": false,
  "performance": {
    "raft_multiplier": 1
  },
  "acl_token": "<redacted>",
  "acl_datacenter": "<other-region>",
  "server": true,
  "bootstrap_expect": 5,
  "enable_acl_replication": true,
  "acl_replication_token": "<redacted>",
  "retry_join": [
    "consul1",
    "consul2",
    "consul3",
    "consul4",
    "consul5",
    "consul1.<region>.<domain>.com",
    "consul3.<region>.<domain>.com",
    "consul4.<region>.<domain>.com",
    "consul5.<region>.<domain>.com"
  ],
  "enable_script_checks": true
}
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = e716d1b5
	version = 1.2.2
consul:
	bootstrap = false
	known_datacenters = 14
	leader = false
	leader_addr = 10.x.x.a:8300
	server = true
raft:
	applied_index = 239439374
	commit_index = 239439374
	fsm_pending = 0
	last_contact = 31.986512ms
	last_log_index = 239439374
	last_log_term = 84270
	last_snapshot_index = 239436097
	last_snapshot_term = 84270
	latest_configuration = [{Suffrage:Voter ID:ffca1ac7-f0e8-7bb8-fadc-8914bab52f3d Address:10.x.x.a:8300} {Suffrage:Voter ID:e14f1e1d-ae00-0e5e-8392-cd9cf18463be Address:10.x.x.b:8300} {Suffrage:Voter ID:60d35c83-b615-6542-a533-aaf30784967a Address:10.x.x.c:8300} {Suffrage:Voter ID:568e5b61-6c36-5e5f-d513-6d07ea6069d2 Address:10.x.x.d:8300} {Suffrage:Voter ID:df3f7d21-2c5d-9753-f21d-ffb062467db0 Address:10.x.x.e:8300}]
	latest_configuration_index = 237342230
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 84270
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 3083
	max_procs = 8
	os = linux
	version = go1.10.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 346
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 50522
	members = 1384
	query_queue = 0
	query_time = 107
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1508
	members = 70
	query_queue = 0
	query_time = 1

Operating system and Environment details


  • All Consul servers running Ubuntu 16.04
  • Running in AWS
    • across three availability zones
    • c4.xlarge instance type (CPU/memory/network/disk don't show obvious signs of being a bottleneck, utilization fairly low)
    • The two busier environments that had this issue also span multiple VPCs (just the clients, not the Consul servers) that are peered
  • Consul servers are WAN joined to 13 other Consul DCs around the planet

Other questions/data

  • A portion of our Consul clients (~5%) aren't exposing all of the necessary ports to all Consul agents, causing issues like increased member flap rate. Could bad/stale gossip data about the status of the old Consul server prevent it from being removed cleanly?
  • Similarly, could the sheer number of clients gossiping cause this behavior, where old nodes are not removed cleanly or quickly enough?
  • May be related to Duplicate Node IDs after upgrading to 1.2.3 #4741, although we are using 1.2.2 everywhere, not 1.2.3 (a quick way to compare node IDs is sketched below)
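
For the node ID question, this is roughly how I'd compare node IDs across the catalog (a sketch; assumes the local agent's HTTP API on 127.0.0.1:8500 and that jq is installed):

# List node names, IDs and addresses to spot duplicates
consul catalog nodes -detailed

# Or via the HTTP API
curl -s http://127.0.0.1:8500/v1/catalog/nodes | jq '.[] | {Node, ID, Address}'
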
banks added the needs-investigation label Nov 7, 2018
@banks
Member

banks commented Nov 7, 2018

Thanks for the detailed report @aglover-zendesk!

To confirm: did this issue cause you an outage? My reading is that it made each server take a long time to become healthy again, so the rollout took longer than expected, but you maintained availability the whole time. Is that correct?

We've certainly observed issues in larger clusters where leaves take longer to propagate fully and occasionally have appeared to not be honoured even when the outgoing server did actually issue a graceful leave.

In your case, though, your servers are configured with "leave_on_terminate": false, which means they would not actually have attempted to leave the cluster when you shut them down.

We have a few issues in this area that need some more thorough investigation, so hopefully your data and case can assist there. Finding a stable path for replacing a machine in this way (both the IP and the on-disk UUID get lost or changed, but the name doesn't) is a little tricky, especially when we also have folks who have issues with the opposite (renaming nodes where the IP, the UUID, or both don't change) and we need to protect against malicious or accidental clashes of unique identifiers...

@aglover-zendesk
Author

Hey @banks, thanks for the reply. You are correct, this did not cause an outage.

In your case though your servers are configured with "leave_on_terminate": false, which means that they would not actually have attempted to leave the cluster when you shut them down.

That's the peculiar part: the nodes show as left, yet still cause the name conflict somehow. I would assume that a node in the left state wouldn't even be a candidate for name conflicts. I'll try to dig through the Consul code later to see if I can provide a smoking gun, but my proposal is that left nodes be excluded from the name-conflict safeguard and simply be overwritten.

@stale

stale bot commented Oct 21, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days.
If you think this is still an important issue in the latest version of Consul or its documentation, please reply with a comment here, which will cause it to stay open for investigation.
If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well!
Thank you!

stale bot added the waiting-reply label Oct 21, 2019
@stale

stale bot commented Nov 20, 2019

Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍

stale bot closed this as completed Nov 20, 2019
@ghost

ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

ghost locked and limited conversation to collaborators Jan 25, 2020
ghost removed the waiting-reply label Jan 25, 2020