raft: deadlock during PreVote migration process #8501
Comments
@irfansharif @bdarnell @siddontang @es-chow Someone probably needs to own the pre-voting feature, and make it stable. I do not have the bandwidth to own this for the near future. But if no one is going to own it, I will put this item on my schedule and make it stable.
I agree, and I apologize for (mostly) implementing this without getting it into a working state. We'll work on fixing this.
We will also do more tests to make it more stable. /cc @javaforfun
Originally wrote the following on #8525 but posting here instead: This addresses the specific deadlock observed in #8501 but given how we're looking to stabilize PreVote overall I was wondering what your thoughts were on the following proposal: The current Pre-Vote implementation works such that a a As it pertains to #8501, consider how it addresses that: Messages from higher terms always cause us to revert to the follower state. See irfansharif/raft.tla@22b0581#L520-L522 which was from the original spec (this was the issue I had before trying to retrofit into). So for the "rogue" candidate with advanced terms pre-PreVote enabling, it would just work out of the box. The message handling at the receiver too is simplified as there's less of case-by-case basis. Right now if we have a NB: With the TLA+ spec adding PreVotes, I did not complete a full run. I tried constraining the search space and tried random searching but after a 5-ish hour run with no failures (albeit on my tiny 4-core machine) I am convinced it's OK. As an aside, I'll be on vacation through the upcoming week (so will Ben). I will be back in school right after but I'll try submitting this by/before then.
When I initially implemented prevote, I tried it both ways (the current implementation with In particular, I don't think making the switch in how we send pre-vote terms would simplify the fix for this deadlock. Sending the true term instead of the future term would avoid the special case when the terms only differ by one, but if there is a greater difference we'd still need a special case to handle MsgPreVote with a term in the past.
This was first uncovered in the context of cockroachdb/cockroach#18151. For some background: we were running a migration from the v1.0.x cockroachdb/cockroach binary to our v1.1-rc binary. The difference between the two branches was that the earlier revision had raft groups configured with PreVote: false and CheckQuorum: true, whereas the new v1.1-rc revision had PreVote: true and CheckQuorum: false. The "migration" process here was a rolling restart of the cluster from the old revision to the new, during which it was entirely likely that, for a given raft group, some replicas were running the new revision with PreVote enabled and some without.
Here's the relevant summary of a raft group that got wedged during this migration process:
Note that of the two replicas shown, we have the first replica with a higher current term than the second replica but with a lower last index. At this point the following manifests:
Now, as for how we could get to such a state where a replica with a lower last index ends up with a higher term, remember that this was a rolling restart process where we had replicas without pre-vote enabled. Replicas at this stage, when calling for elections, advance their current term. This advanced current term is persisted (necessarily so) and when the node is then restarted with PreVote enabled, we're deadlocked as described above.
Here's a test demonstrating this:
+cc @bdarnell @xiang90.