Auto Pilot breaks Quorum Invariant #5922
@nahratzah Thank you for the thorough bug report. This is complicated, but I am sure there is something we could do.

Raft decides its quorum size based on the number of voting peers. Cleaning up dead servers is only allowed if less than half of the number of peers would be affected by the removals. The halving of the peers is rounded down, so with 5 peers autopilot will allow removing 1. At 4 peers it allows removing 1 as well. However, it would not allow you to drop from 5 to 3 all at once. So what is happening is that when you get rid of 2 servers, the calculated quorum size then becomes 2, because there are only 3 voting servers still around. You can check this with the

Obviously this is not the behavior you are looking for, but I have yet to determine whether it is totally incorrect or what the fix should look like. If we didn't allow this behavior, then if you actually wanted to transition from 5 servers to 3 for whatever reason you would have to use the

What seems really wrong is that the code shouldn't auto-cleanup the servers if you go from 5 servers to 3 at once, but it will let you drop from 5 to 4 and then 4 to 3. That alone makes me think that this behavior was unintended and should be fixed.
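To illustrate the arithmetic described above, here is a minimal sketch in Go (this is not the actual autopilot code; the function names are made up):

```go
package main

import "fmt"

// canPrune mirrors the rule described above: dead servers are only pruned
// when fewer than half of the current peers (rounded down) would be removed.
func canPrune(peers, removals int) bool {
	return removals < peers/2
}

// quorum is the usual Raft quorum size for a given number of voters.
func quorum(voters int) int {
	return voters/2 + 1
}

func main() {
	fmt.Println(canPrune(5, 2)) // false: dropping from 5 to 3 in one step is refused
	fmt.Println(canPrune(5, 1)) // true:  5 -> 4 is allowed
	fmt.Println(canPrune(4, 1)) // true:  4 -> 3 is allowed
	fmt.Println(quorum(3))      // 2: with only 3 voters left, quorum shrinks to 2
}
```

With only 3 voters remaining, the recalculated quorum of 2 is why the cluster keeps answering reads and writes in the scenario reported here.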
IIRC that was the original rationale for allowing this; folks would get into a lot of trouble trying to size clusters down, so autopilot makes that safer to do by reducing the quorum size when it's safe to do so, since adding new servers back will increase it automatically. One thing we thought about but didn't do was have a way to configure autopilot with the desired target cluster size, which could maybe give it some context about failed servers vs. assuming you are shrinking your cluster.
@slackpad Yeah, doing some more digging and talking with the rest of the team brought up some similar scenarios, like adding new servers before reaping the old ones, which could temporarily increase the required quorum size when really it shouldn't stay that way. I think the sanest way to prevent too many servers from being cleaned up, while still allowing the cluster's size to temporarily grow and then shrink, would be to have the user specify a minimum quorum size within the autopilot configuration. We could automatically populate this from the
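A minimal sketch of what such a setting could look like in the agent configuration; the `min_quorum` key is an assumption based on how later Consul releases expose a minimum-quorum option, since at the time of this comment it was only a proposal:

```json
{
  "autopilot": {
    "cleanup_dead_servers": true,
    "min_quorum": 3
  }
}
```

With a setting like this, autopilot could keep cleaning up dead servers but refuse any removal that would push the voter count below the configured minimum.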
Not strictly related, but I guess I'm experiencing another kind of possibly unsafe behaviour caused by the overly aggressive dead-server cleanup logic. I have a system-under-test setup upon which I periodically wreak havoc with a Jepsen test. The SUT consists of 5 nodes (consul.json):

```json
{
  "server": true,
  "bind_addr": "0.0.0.0",
  "bootstrap_expect": 5,
  "retry_join": ["n1"],
  "performance": {
    "raft_multiplier": 1
  }
}
```

This one time I observed the following sequence of events. Please note that I picked just the events relevant to the issue from the logs I have at hand.
So there were ~2 seconds during which two processes with the same ID were up and running. I was quite bewildered that autopilot made the decision to consider

Server info
If someone considers this a separate issue, I would happily make it into one and provide more details if needed.
Overview of the Issue
Consul Auto Pilot Dead Server Cleanup allows Consul to serve reads, writes, etc. without quorum.
Reproduction Steps
We are using a cluster of 5 nodes.
A cluster with 5 nodes requires at least 3 nodes to maintain quorum.
Create a Docker Compose file `docker-compose.yml` with these contents.
Start docker:
Wait until the cluster is fully operational and healthy.
Remove 1 server by scaling down the followers to 3.
The server will enter the `failed` status, but Auto Pilot Dead Server Cleanup will change this to the `left` status. This can be observed by looking at the members:
Once the server has entered the `left` status, remove another server by scaling the followers to 2. Once again, this removed server will enter the `failed` status, but Auto Pilot Dead Server Cleanup will change this to the `left` status. This can be observed by looking at the members:
Once again, we wait until the server has entered the `left` status. At this moment, we're running with 3 servers and 2 `left` servers. The `consul members` output will look something like this:
According to quorum rules, our 5-node cluster should have a failure tolerance of `0`, and we can no longer lose any servers without losing quorum. So let's remove another one. :)
Our membership will look like this: 2 `alive` servers, 2 `left` servers, and 1 `failed` server.

If we test setting and getting a value, this succeeds:
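The exact commands were not captured in this report; as an illustration, an equivalent check with the official Go API client could look like the following (the key name `quorum-test` and the local agent address are assumptions):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local agent (http://127.0.0.1:8500 by default).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	kv := client.KV()

	// Write a value; this should fail if the cluster had truly lost quorum.
	if _, err := kv.Put(&api.KVPair{Key: "quorum-test", Value: []byte("hello")}, nil); err != nil {
		log.Fatal(err)
	}

	// Read it back.
	pair, _, err := kv.Get("quorum-test", nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", pair.Key, pair.Value)
}
```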
(Instructions on how to clean up the cluster under this fold.)
Expected Behaviour
The last server loss should have caused a loss of quorum, forced the leader to step down, and made the write operation fail.
Consul info for both Client and Server
Server info
Operating system and Environment details
uname -a says:
Linux 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Log Fragments
Notes
The scenario requires servers to go away slowly enough for Auto Pilot to transition one from the `failed` to the `left` state before the next one fails.
We discovered this during testing of an upgrade script that checked both the `healthy` and `failure_tolerance` telemetry to ascertain whether a cluster was healthy. Both of these health indicators report that everything is fine with servers in the `left` state.

The work-around is to turn off dead server removal and only enable it during brief windows where an upgrade happens.
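As an illustration of the kind of check such a script might perform, here is a minimal sketch that queries the autopilot health endpoint of a local agent (the agent address is an assumption):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// autopilotHealth holds the two fields of interest from the
// /v1/operator/autopilot/health response.
type autopilotHealth struct {
	Healthy          bool
	FailureTolerance int
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8500/v1/operator/autopilot/health")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var h autopilotHealth
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		log.Fatal(err)
	}

	// In the scenario above, both values still report a healthy cluster even
	// though the original 5-node quorum no longer exists, because the dead
	// servers have already been moved to the "left" state.
	fmt.Printf("healthy=%v failure_tolerance=%d\n", h.Healthy, h.FailureTolerance)
}
```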
I think the long-term fix in code would be that dead server removal only occurs in cases where a `failed` server can be replaced by a newly joining server, whereas a failed server that is not replaced would remain in the `failed` state indefinitely, or until a human operator has removed the server using force-leave.

I speculate the same scenario can occur with Dead Server Removal disabled, if the servers stay in the `failed` state for 3 days and the automatic cleanup kicks in to remove them.
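For concreteness, a rough sketch of the stricter condition proposed above (this is not Consul code; the function and parameter names are hypothetical):

```go
package main

import "fmt"

// mayPrune reflects the proposed policy: a failed server is only eligible
// for automatic removal once a newly joined server is available to take its
// place, so the voter count never shrinks below what was provisioned.
func mayPrune(failedServers, replacementServers int) bool {
	return replacementServers >= failedServers
}

func main() {
	fmt.Println(mayPrune(2, 0)) // false: nothing has joined, keep them "failed"
	fmt.Println(mayPrune(2, 2)) // true:  replacements joined, safe to prune
}
```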