Name conflict with terminated Consul servers #4879
Comments
Thanks for the detailed report @aglover-zendesk! To confirm: did this issue cause you an outage? My reading is that it made each server take a long time to become healthy again, so the rollout took longer than expected, but you maintained availability the whole time. Is that correct? We've certainly observed issues in larger clusters where leaves take longer to propagate fully, and occasionally they appear not to have been honoured even when the outgoing server did actually issue a graceful leave. In your case though your servers are configured with We have a few issues in this area that need some more thorough investigation, so hopefully your data and case can assist there. Finding a stable path for replacing a machine in this way (both IP and UUID on disk get lost/changed but name doesn't) is a little tricky, especially when we also have folks who have issues with the opposite - renaming nodes where either IP or UUID or both don't change - and need to protect against malicious or accidental clashes of unique identifiers...
Hey @banks, thanks for the reply. You are correct, this did not cause an outage.
That's the peculiar part - the nodes show as
Hey there, Feel free to check out the community forum as well!
Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍
Hey there, This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍.
Overview of the Issue
I recently did a rolling replacement of our Consul servers as part of an OS upgrade to Ubuntu 16.04. When replacing Consul servers in our smaller environments (<500 nodes), terminating and replacing the servers one by one worked without issue. In our larger Consul environments (>1300 nodes), when the replacement servers came online and tried to join the Consul cluster, I would see the following errors on the remaining Consul servers:
Concurrently, the new host would see the other client nodes (lots of EventMemberJoin events logged), but would not be able to join its peers as a server. Relevant logs from the new Consul server trying to join the local cluster:
The errors would occur for 15-20 minutes. Then the cluster would "forget" about the old node, the rest of the cluster stopped having the "name conflict" errors, and the new node would join the cluster without issue:
This happened consistently as I replaced each of the 5 servers one by one.
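One way to check whether a stale entry for a re-used node name is still visible is to scan the output of `consul members` for rows carrying that name with a status other than `alive`. The following is a minimal sketch, not part of the original report: the function name `find_stale_members` is hypothetical, and it assumes the default whitespace-separated table that `consul members` prints, with a header row containing `Node`, `Address` and `Status` columns.

```python
from typing import Dict, List


def find_stale_members(members_output: str, node_name: str) -> List[Dict[str, str]]:
    """Return rows for node_name whose Status is not 'alive'.

    members_output is assumed to be the plain-text table printed by
    `consul members` (header row first, whitespace-separated columns).
    """
    lines = [ln for ln in members_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()  # e.g. Node, Address, Status, Type, ...
    stale = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip anything that does not look like a member row
        row = dict(zip(header, fields))
        if row.get("Node") == node_name and row.get("Status") != "alive":
            stale.append(row)
    return stale
```

The text to parse could be captured with `subprocess.run(["consul", "members"], capture_output=True, text=True).stdout`. A non-empty result for the replaced server's name would correspond to the situation described here, where the old entry still appears as `left` with the old IP.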
At the time, I tried using `consul force-leave` to eject the old member information, but `consul members` still showed the old node as `left` with the old IP and the name conflict error was still present. Similarly I checked the raft data to see if the old node was still there, but it was already removed cleanly. I also tried gracefully leaving first, executing `consul leave` and stopping the consul agent before terminating the host, but the stale data persisted either way.
For context, we re-use node names so when I terminate `consul1.<region>.<domain>.com`, the new server that replaces it has a new IP, a new node ID, but the same `consul1.<region>.<domain>.com` FQDN.
Reproduction Steps
Steps to reproduce this issue, eg:
`retry_join` values
`consul monitor -log-level=debug` to view the errors as it tries to join the cluster, both on the new Consul server and one of the remaining 4 server nodes
Consul info for both Client and Server
Server configuration (partially redacted)
Server info
Operating system and Environment details
OS, Architecture, and any other information you can provide about the environment.
`c4.xlarge` instance type (CPU/memory/network/disk don't show obvious signs of being a bottleneck, utilization fairly low)
Other questions/data