Consul autopilot does not leave nodes correctly (version 1.0.0) #3611
Comments
Hi @dpgaspar, Consul 1.0.0 defaults to Raft protocol 3, which introduces an extra stabilization period before servers are promoted to voters. How does your rolling update determine when it's safe to roll another node - does it use feedback from Consul or an open-loop timeout? The Autopilot Read Health endpoint can tell you when it's safe to proceed, e.g. once the reported FailureTolerance is greater than zero.
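(For illustration, a minimal sketch of closing that loop from a provisioning script, assuming the local agent's HTTP API on 127.0.0.1:8500, an ACL token, and the documented /v1/operator/autopilot/health response fields; the timings are arbitrary.)

# Sketch only: poll the Autopilot health endpoint until it is safe to roll.
# The address, token, and timings below are assumptions for illustration.
import json
import time
import urllib.request

CONSUL_ADDR = "http://127.0.0.1:8500"   # local agent HTTP API (assumed)
ACL_TOKEN = "XXXXX"                     # placeholder, as in the config below

def autopilot_health():
    req = urllib.request.Request(
        CONSUL_ADDR + "/v1/operator/autopilot/health",
        headers={"X-Consul-Token": ACL_TOKEN},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

def wait_until_safe(max_wait=300, interval=5):
    """Block until the cluster is Healthy and can tolerate one more failure."""
    deadline = time.time() + max_wait
    while time.time() < deadline:
        try:
            health = autopilot_health()
            if health.get("Healthy") and health.get("FailureTolerance", 0) >= 1:
                return True
        except OSError:
            pass  # endpoint unavailable or returning a non-200 status; retry
        time.sleep(interval)
    return False

if __name__ == "__main__":
    raise SystemExit(0 if wait_until_safe() else 1)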
Hi @slackpad, thank you for the quick reply! Our rolling update kills the old server and launches a new one when the provisioning script finishes, which is after the consul service starts. The issue seems to be that the state of the "old", shut-down instance is not set to "leave" but remains "failed", and it is still listed by 'consul operator raft list-peers'; this behaviour is not observed on 0.9.3. Do you think we should only send a SUCCESS signal once the Autopilot health REST endpoint reports FailureTolerance > 0? Note that the rolling procedure takes some time; I will run another test and gather some logs to help us out.
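(Purely as a hypothetical illustration of gating the CloudFormation SUCCESS signal on that check - this is not the poster's recipe or gist. The stack name, logical resource ID, region, and the wait_until_safe import are made-up placeholders; boto3's signal_resource and the EC2 instance-metadata endpoint, assuming IMDSv1, are the only real APIs used.)

# Hypothetical sketch: send the CloudFormation signal only after the
# autopilot check passes. Stack/resource names and the imported helper
# module are placeholders for illustration.
import urllib.request

import boto3

from consul_health import wait_until_safe  # hypothetical module holding the sketch above

STACK_NAME = "dev-consul-cluster"          # placeholder stack name
LOGICAL_RESOURCE_ID = "ConsulServerGroup"  # placeholder ASG logical ID

def instance_id():
    # EC2 instance metadata service (IMDSv1 assumed available)
    with urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ) as resp:
        return resp.read().decode()

def send_signal(success):
    cfn = boto3.client("cloudformation", region_name="us-west-2")  # region assumed
    cfn.signal_resource(
        StackName=STACK_NAME,
        LogicalResourceId=LOGICAL_RESOURCE_ID,
        UniqueId=instance_id(),
        Status="SUCCESS" if success else "FAILURE",
    )

if __name__ == "__main__":
    send_signal(wait_until_safe())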
That's a good way to close the loop to know it's safe to roll another server. You can also look at the Healthy boolean to make sure the other Autopilot checks are passing as well. With 0.9.3 the old server was listed as failed too, but the new server would get added as a voter right away, so autopilot would clean it up really quickly. With 1.0, while the new server is waiting to stabilize (which usually takes ~10 seconds, but you can tune it with https://www.consul.io/docs/guides/autopilot.html#configuration), the old server will show as failed because autopilot is waiting for the new server to become usable.
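(For reference, that stabilization window is controlled by the agent's autopilot stanza; the snippet below is illustrative, showing roughly the defaults described in the linked guide.)

{
  "autopilot": {
    "cleanup_dead_servers": true,
    "server_stabilization_time": "10s"
  }
}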
After the first new node starts and finishes provisioning, and shutdown of the old node is executed:
At the end:
Logs from CF:
"That's a good way to close the loop to know it's safe to roll another server. You can also look at the Healthy boolean to make sure the other Autopilot checks are passing as well." Will do, but it seems odd, since as you can see from the CF logs, provisioning an instance takes 2 to 3 minutes, way more then 10s On 0.9.3 the instances that are terminated are immediately (or almost) on state "leave" Still convinced the FailureTolerance > 0 wait loop will solve the issue? |
I think that's likely - it's this interval:
In that case it got a SUCCESS signal and terminated the old server ~1 second later, even though that new server wasn't going to be a useful part of the quorum for ~10 seconds or more. During that window when the old server is terminated you are down 1, so you'll go into an outage if one other server is terminated before that process completes. You are right that the long time to spin up the next instance should kind of pave over that window, so it's possible that things are taking longer than expected to converge and update the new server to being a voter. Adding that check should make it safe, and it would be interesting to see how much additional time it's waiting - there might be some things we could look at there.
Hi @slackpad, I've changed the 'recipe' and made the following class to call the API: https://gist.github.com/dpgaspar/9143124a58508bfa7e9675e276030a82 The final step in the recipe is:
Hi @dpgaspar, sorry that didn't fix things. I just realized two things about the autopilot health endpoint. First, it won't be available until you are on Raft protocol 3, so it won't necessarily help for your first transition onto 1.0 (after that it will be available). Second, when you go from 3 to 4 servers, the failure tolerance will already be at 1, so it won't actually delay. I think the right "safe to roll another server" check is to look at that response and make sure you have the expected number of servers (4), marked healthy, and that they are all voters. Trying this locally, though, I think I see a problem if a 1.0 server steps up and becomes leader during the roll. It looks like there's an issue where it won't properly remove the failed servers on the old Raft protocol version, so I'll mark this as a bug and look at that. If you need an immediate workaround, the easiest thing is probably to configure your Consul 1.0 servers with raft_protocol set to 2 so they behave like 0.9.3 during the transition.
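(A sketch of that stricter "safe to roll another server" check, again assuming the local agent address, a placeholder token, and an expected server count of 4 during the roll.)

# Sketch of the stricter check: expected server count, all healthy, all voters.
# Address, token, and the expected count are assumptions for illustration.
import json
import urllib.request

CONSUL_ADDR = "http://127.0.0.1:8500"
ACL_TOKEN = "XXXXX"
EXPECTED_SERVERS = 4  # 3 existing servers plus the newly launched one

def safe_to_roll():
    req = urllib.request.Request(
        CONSUL_ADDR + "/v1/operator/autopilot/health",
        headers={"X-Consul-Token": ACL_TOKEN},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            health = json.load(resp)
    except OSError:
        # Endpoint unavailable (e.g. not yet on Raft protocol 3) or the
        # cluster is reported unhealthy via a non-200 status code.
        return False
    servers = health.get("Servers", [])
    return (
        health.get("Healthy", False)
        and len(servers) == EXPECTED_SERVERS
        and all(s.get("Healthy") and s.get("Voter") for s in servers)
    )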
Seems like we should add a
Hi @slackpad, Thank you for the quick reply, great support! Just a small comment about this: "...it won't be available until you are on Raft protocol 3..." "Trying this locally, though, I think I see a problem if a 1.0 server steps up and becomes leader during the roll. It looks like there's an issue where it won't properly remove the failed servers on the old Raft protocol version, so I'll mark this as a bug and look at that." Great!! Thanks.
Thanks for clarifying that - that narrows it to a possible autopilot bug with 1.0 then, which I think would manifest the same way I repro'd it. Will take a look.
When we defaulted the Raft protocol version to 3 in #3477 we made the numPeers() routine more strict to only count voters (this is more conservative and more correct). This had the side effect of breaking rolling updates because it's at odds with the Autopilot non-voter promotion logic. That logic used to wait to only promote to maintain an odd quorum of servers. During a rolling update (add one new server, wait, and then kill an old server) the dead server cleanup would still count the old server as a peer, which is conservative and the right thing to do, and no longer count the non-voter. This would wait to promote, so you could get into a stalemate. It is safer to promote early than remove early, so by promoting as soon as possible we have chosen that as the solution here. Fixes #3611
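(A toy model - not Consul's actual code - of the stalemate that commit message describes, for a 3-server cluster rolling one node.)

# Toy illustration only. During the roll there are four servers: two healthy
# old voters, one terminated old voter, and one stabilized new non-voter.
from dataclasses import dataclass

@dataclass
class Server:
    voter: bool
    failed: bool

servers = [
    Server(voter=True,  failed=False),  # old server, healthy
    Server(voter=True,  failed=False),  # old server, healthy
    Server(voter=True,  failed=True),   # old server, terminated
    Server(voter=False, failed=False),  # new server, awaiting promotion
]

# Counting only voters, the failed old server is still included while the
# healthy new server is not.
num_peers = sum(s.voter for s in servers)   # == 3

# A promotion rule that only promotes while keeping the voter count odd
# never fires here, so the new server stays a non-voter: a stalemate.
promote_now = (num_peers + 1) % 2 != 0      # False

print(f"voters={num_peers}, promote={promote_now}")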
Figured out what was going on here and teed up a fix in #3623. @kyhavlov and @preetapan can you please take a look? |
* Relaxes Autopilot promotion logic. Fixes #3611
* Gets rid of unnecessary extra not-a-voter check.
Hi,
We are testing Consul 1.0.0, but using CloudFormation rolling updates leaves the cluster without a leader (no quorum).
The same behaviour does not happen when using 0.9.3 and the exact same recipe (and config).
consul version for both Client and Server
Client: 1.0.0
Server: 1.0.0
Operating system and Environment details
OS: AMZN Linux
3 nodes in an autoscaling group
Description of the Issue (and unexpected/desired result)
After performing a CloudFormation rolling update (which replaces each instance one by one), the cluster loses quorum.
Some Consul members (server agents) that are shut down remain in the failed state, and the Raft peer set keeps growing during the rolling update until quorum is no longer possible.
This behaviour does not happen on 0.9.3 using the exact same steps.
Reproduction steps
Shut down and replace each node of a 3-node cluster.
Server config
{
"bootstrap_expect": 3,
"server": true,
"leave_on_terminate": true,
"datacenter": "us-west-2",
"acl_datacenter": "us-west-2",
"acl_master_token": "XXXXX",
"acl_agent_token": "XXXXX",
"acl_default_policy": "deny",
"acl_down_policy": "extend-cache",
"disable_remote_exec": false,
"data_dir": "/opt/consul/data",
"encrypt": "XXXXX",
"log_level": "INFO",
"enable_syslog": true,
"ui": true,
"retry_join": ["provider=aws tag_key=Name tag_value=dev-consul-cluster"],
"telemetry": {
"dogstatsd_addr": "127.0.0.1:8125"
},
"performance": {
"raft_multiplier": 1
},
"ports": {
"http": 8500,
"dns": 8600,
"serf_lan": 8301,
"serf_wan": 8302,
"server": 8300
}
}
Log Fragments or Link to gist