kvserver: Raft prevote does not prevent election despite active leader #92088
cc @cockroachdb/replication |
Maybe I'm missing something, but I don't understand this part:
PreVote is like Vote, except that it doesn't actually crown the new leader. The raft leaders you're seeing here must've actually won the vote. So they would've also won the corresponding So to me this "just" looks like one of the ways in which we don't handle asymmetric network partitions well: we can end up electing a raft leader that is not reachable from the leaseholder, because there's no requirement that the leaseholder has to be a part of the quorum that decides the election. |
It also doesn't seem like CheckQuorum would've helped here, as the new raft leader did get heartbeat responses from a quorum (itself and nX) while the leaseholder was partitioned off. |
Wait, what? Isn't the whole idea of prevote to avoid disrupting an established leader? Why would the followers then choose to vote for a different leader during prevote? Wouldn't that give a 50% chance of disruption?
Yeah, this is clearly it. I was staring at this stuff for way too long. Still doesn't seem helpful that it can win a prevote when it's up to date on its log.
Nah, CheckQuorum is no good here. |
Yeah, prevotes will only be successful if the follower hasn't received a heartbeat from the existing leader for the past election timeout interval:
https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/

So I'm guessing that quiescence messes this up somehow. A quiesced follower may not have seen a heartbeat from the leader in a very long time (in wall time), but quiesced replicas likely track time in ticks, which I don't expect to progress while quiesced. Maybe it's because unquiescing the follower doesn't also unquiesce the leader or something, but I thought the leader was responsible for unquiescing. I'm not familiar with the details here, but it seems worth looking into.

I also see there are a few conditions that cause the range to hold an election when unquiescing: cockroach/pkg/kv/kvserver/replica_raft.go, lines 1929 to 1973 at 55af138.
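For reference, the thesis-style rule being described is roughly the following. This is a conceptual sketch with hypothetical names, not the CRDB code referenced above nor the etcd/raft implementation:

```go
package sketch

import "time"

// Hypothetical follower state; illustrative names only, not etcd/raft or
// CockroachDB identifiers.
type followerState struct {
	hasLeader         bool
	lastLeaderContact time.Time
	electionTimeout   time.Duration
	lastLogTerm       uint64
	lastLogIndex      uint64
}

type preVoteRequest struct {
	lastLogTerm  uint64
	lastLogIndex uint64
}

// shouldGrantPreVote sketches the rule from section 9.6 of the Raft thesis:
// reject a pre-vote while we believe an active leader exists (i.e. we heard
// from it within the last election timeout); otherwise grant it only if the
// candidate's log is at least as up to date as ours.
func shouldGrantPreVote(f *followerState, req preVoteRequest, now time.Time) bool {
	if f.hasLeader && now.Sub(f.lastLeaderContact) < f.electionTimeout {
		return false
	}
	return req.lastLogTerm > f.lastLogTerm ||
		(req.lastLogTerm == f.lastLogTerm && req.lastLogIndex >= f.lastLogIndex)
}
```

The suspicion above is that with ticks frozen during quiescence, the "recently heard from the leader" half of this rule loses its meaning.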
|
Whatever's causing this, it ain't quiescence. :( I ran another experiment with |
Hmm, good point. Here's the code in raft: I thought that we'd maybe reinit the

Could it just be "natural" that there isn't a leader at this point in the incident triggering this issue? The experiment makes me think not: since there is a lease on these ranges, they must've elected a leader at some point. The only node that went down is n5 (the then half-partitioned node). So maybe that node was previously the leader for the affected ranges? In that case, the other members would PreVote for it. I don't know if this is plausible in this experiment, since the lease should typically pull the Raft leader to the leaseholder as well, but it's the only explanation I have. |
The node doesn't hold any Raft leaderships before the restart. The lease preferences pin leases on other nodes, and the Raft leader follows the lease. This is confirmed by metrics. The interesting bit is that it's picking the leaderships up at a regular rate of about 1 every 10 seconds. |
Looks like it is in fact receiving prevotes from n4. Here's the r789 layout:
And we can see that n5 is receiving a prevote from 3 (i.e. n4):
|
Did another run with Raft logging enabled across all nodes. Range r561 with replicas 2:n4 5:n5 6:n3. We first see that n3 becomes the leader at 13:04:12:
n4 confirms this:
When n5 is restarted, it solicits and receives prevotes and then votes from n4, even though n4 already has n3 as leader:
n4 doesn't really say anything about why:
And n3 then happily obeys the new leader:
It's certainly interesting that even n3 casts a prevote for n5, even though n3 is supposedly the current leader (this prevote never reaches n5 due to the partition). The Raft logs above are complete, so there's no indication that n3 lost its leadership. I'll have to go read some Raft code, I think. |
It seems like this is a bug in the Raft library. When n5 sends a PreVote, it sends it for the next term (in this case 12): But a node will always accept a PreVote for a future term: https://github.com/etcd-io/etcd/blob/87258efd90224bc8b59e000f75fe07fdeab68e2d/raft/raft.go#L924 |
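To make the linked condition concrete, here is a simplified paraphrase of the (pre)vote handling at that revision; the types and helpers below are illustrative stand-ins, not the real etcd/raft API:

```go
package sketch

type msgType int

const (
	msgVote msgType = iota
	msgPreVote
)

type voteRequest struct {
	typ      msgType
	from     uint64
	term     uint64
	logTerm  uint64
	logIndex uint64
}

type follower struct {
	term         uint64
	votedFor     uint64 // 0: no vote cast this term
	leader       uint64 // 0: no known leader
	lastLogTerm  uint64
	lastLogIndex uint64
}

func (f *follower) logUpToDate(req voteRequest) bool {
	return req.logTerm > f.lastLogTerm ||
		(req.logTerm == f.lastLogTerm && req.logIndex >= f.lastLogIndex)
}

// canGrantVote mirrors the condition in question: the last clause grants a
// PreVote for any future term, even when we already follow a healthy leader.
func (f *follower) canGrantVote(req voteRequest) bool {
	canVote := f.votedFor == req.from || // repeat vote for the same candidate
		(f.votedFor == 0 && f.leader == 0) || // no vote cast and no known leader
		(req.typ == msgPreVote && req.term > f.term) // any PreVote for a future term
	return canVote && f.logUpToDate(req)
}
```

Since n5 campaigns for the next term and its log is up to date, the last clause lets every reachable follower grant the prevote despite having an established leader.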
Tried out a trivial patch to disable that condition, and the node no longer picks up Raft leaderships. Will submit a proper patch upstream. |
I don't recall seeing code that implements this logic in etcd/raft.
My understanding of prevote is that it's intended to avoid disrupting an established leader when the candidate wouldn't otherwise be able to win the election. I didn't think it said anything about cases where the candidate could win the election. So I didn't think it would provide any protection in cases where the follower is caught up on its log. I can see how the election timeout check you cited would skew pre-vote elections towards established leaders in cases of ties. Is that missing? Is that the patch you plan to submit upstream? |
And yet, mention of the election timeout is right there in section 9.6 of the Raft thesis. This is a nice find! @bdarnell do you recall whether this was intentionally omitted in etcd-io/etcd#6624? I wouldn't be surprised if fixing this causes other fallout. Specifically, there are places where we call Raft elections to safely request a Raft leadership transfer. For instance, that's what we did in #87244. If campaigning doesn't break ties and we need buy-in from the leader then we'll need to rethink the mechanism that we use in cases like these. |
No, the motivation is that a node rejoining a cluster following e.g. a restart or network partition shouldn't immediately trigger an election, instead respecting an already established quorum. This is particularly important in cases with asymmetric or partial network partitions, where a node may otherwise end up continually triggering elections, destabilizing the quorum (one such scenario was described in the link above, https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/).
I think this condition might be sufficient: But I'm not very familiar with the etcd/raft codebase yet, so I'll have to do some reading. I'm hopeful it should be fairly straightforward though.
I believe we can request a cooperative leadership transfer: But I expect there will be some fallout, yeah. |
I believe the relevant check is in https://github.com/etcd-io/etcd/blob/87258efd90224bc8b59e000f75fe07fdeab68e2d/raft/raft.go#L843, where the check for time since hearing from the leader is tied to the CheckQuorum setting, just as it is for regular votes. We deliberately decoupled these things because quiesced replicas would not have a notion of how long it had been since they heard from the leader.

When we originally implemented PreVote, the major problem we were trying to solve was that when a leader receives a MsgVote for a higher term, it immediately reverts to follower, even if the node calling for the vote is doomed to fail. PreVote addresses this by ensuring that there are no side effects until we have a reasonable expectation of winning the election.

Enabling CheckQuorum would prevent some additional categories of disruption, but it would make quiesced replicas much more expensive (and resetting the replica after unquiescing would, I think, make recoveries slower). |
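For context, the lease check described above looks roughly like this (a simplified paraphrase with illustrative names, not the verbatim etcd code); because the whole condition is gated on `checkQuorum`, it never fires when CheckQuorum is disabled, regardless of how ticks behave under quiescence:

```go
package sketch

// Simplified sketch of the vote-lease rule that etcd/raft gates on the
// CheckQuorum setting. Names here are illustrative stand-ins.
type peer struct {
	checkQuorum     bool
	hasLeader       bool
	electionElapsed int // ticks since we last heard from the leader
	electionTimeout int // ticks
}

// ignoresVoteRequest reports whether a (Pre)Vote request is dropped outright
// because we recently heard from a leader. With checkQuorum == false (the
// CRDB configuration at the time), this never triggers, so any up-to-date
// candidate can collect (pre)votes from followers with a healthy leader.
func (p *peer) ignoresVoteRequest(forcedByLeaderTransfer bool) bool {
	inLease := p.checkQuorum && p.hasLeader && p.electionElapsed < p.electionTimeout
	return !forcedByLeaderTransfer && inLease
}
```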
Thanks for the context Ben. I'll need to look closer at this, but my sense is that as a first step we could simply consider time to be frozen while the replica is quiesced -- we don't really need a Raft leader until there are any state changes. As you point out, that would make recoveries slower for quiesced replicas -- this would be somewhat mitigated by #91947, and we can probably come up with other improvements to handle that case. |
You said in #92088 (comment) that quiescence is not the problem. And I think the talk of PreVote is kind of a distraction too. The problem is that in partial partition scenarios, a node that is reachable only by some of its peers can win an election. This is an especially noticeable problem if it ends up stealing leadership from a node that is more broadly reachable (and this is what we're talking about here), but it's also a problem if the partially-reachable node steps up in a leadership vacuum. CheckQuorum and similar rules based on time and the validity of a leader can help prevent the problem of stealing leadership, but they make it harder to recover from the latter case in which a bad leader was elected in the first place.

We've never made much of an investment in improving support for partial partitions because a sufficiently diabolical partition can always cause unavailability (see jepsen's "majority ring" nemesis: each node can see a majority of its peers, but each one sees a different subset). There's likely some low-hanging fruit to detect and mitigate some of the more common cases. For example, the one asymmetric partition that I know we've seen multiple times is when the network is actually fine, but certificate issues mean that connections from A to B work but B to A do not (and this is especially tricky since RPC responses flow back on the same connection as their request, but raft "responses" are modeled as separate one-way messages). It would be useful to detect this case in the RPC heartbeat system. |
Yes, I meant that if we were to implement prevote and checkquorum as commonly understood (including the election timeout criterion) then quiescence could become a problem, e.g. by increasing recovery time.
I mostly agree with what you're saying, but don't understand the last part: aren't PreVote and CheckQuorum complementary pieces of the solution? PreVote (as described in the thesis) will make it harder to unseat an elected leader as long as that leader is able to send heartbeats to a quorum of followers, but CheckQuorum will make the leader itself step down if a quorum of followers are unable to acknowledge those heartbeats. As far as I can tell, these two conditions ensure that a quorum of followers have bidirectional communication links with the leader and that it won't be disrupted by spurious elections.

Of course, that's insufficient in CockroachDB, because the leaseholder can still be partitioned away from the Raft leader, preventing it from proposing writes, and a SQL gateway can be partitioned away from a leaseholder, preventing it from making any RPC calls. These two problems will need to be addressed separately, and my sense is that we'll ultimately have to solve them with mesh routing (hopefully not inside CRDB itself).

I still think there's value in making the underlying Raft implementation more resistant to partial network partitions, but if we don't feel like it's worth the effort at the moment then there's no shortage of other work that needs doing. We may still want to file an issue about this in the upstream library in that case, since other library users may have different expectations of prevote.
Yes, we're pursuing this separately in #84289, and I agree that it should be the primary focus. |
This is confusing because there are three pieces of functionality being described with two different names, and the etcd implementation names things differently from the thesis.
The etcd implementation combines both time-tracking features under the CheckQuorum setting (this is due to cockroach's influence, since we were interested in the ability to quiesce replicas and turn off all time-tracking functionality), while the thesis describes the second feature as a part of the pre-vote. Each of these features solves a partially-overlapping set of problems. In terms of improving behavior during partial partitions, PreVote and CheckQuorum are complementary. But in terms of preventing disruption from a restarted node in the absence of a partition, they're redundant, and this was the original focus when they were implemented, so we've historically seen them more as alternatives than as complements. |
I see, that clears things up -- thank you! |
I'm putting this on the backburner for now. We'll consider this as part of normal release planning, when we're ready to prioritize partial network partitions. |
I'm working on enabling CheckQuorum. This has the effect of setting

There is a lot of ancient history around CheckQuorum and quiescence that we'll need to revisit here:
For now, I'm submitting a PR in #104040 to enable CheckQuorum via envvar, so we can try it out. |
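On the etcd/raft side, this boils down to flipping the corresponding knobs on `raft.Config`. A minimal sketch, with illustrative ID, tick, and size values rather than CockroachDB's actual settings:

```go
package sketch

import "go.etcd.io/raft/v3"

// newRaftConfig sketches an etcd/raft Config with CheckQuorum and PreVote
// enabled. All values here are illustrative, not CRDB's real configuration.
func newRaftConfig() *raft.Config {
	return &raft.Config{
		ID:              1,
		ElectionTick:    10, // basis for the leader's quorum check and for (pre)vote rejection
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(),
		CheckQuorum:     true, // leader steps down if it hasn't heard from a quorum recently
		PreVote:         true, // candidates probe electability before bumping the term
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
}
```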
I ran the

Will add some more targeted integration tests.
I ran |
Added some integration tests for PreVote and CheckQuorum under various scenarios in #104057. They mostly work as expected, except for the case of a partial partition with a quiesced range, where we get a spurious election that prevote should have prevented. I haven't looked into it yet, but I think it's the same case that's discussed in e.g. #9372. |
This appears more straightforward than the situation in #9372: n3's prevote unquiesces n2 (which rejects it), but n2 doesn't wake the leader. Once n2's election timeout elapses, n2 and n3 can hold a new election. I'll try a patch where we wake the leader on receipt of a pre-vote.

The scenario is a partial partition between n1 (leader) and n3:
Relevant logs:
|
Submitted a fix in #104057, where receipt of a pre-vote will wake the leader. I gated this on CheckQuorum for now, since this may cause mass unquiescence under partial network partitions, where the partitioned follower will continually keep the ranges unquiesced. We'll need more testing and auditing here, but at least all the current test cases pass for now. |
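Conceptually, the fix amounts to something like the following sketch; the names are hypothetical and this is not the actual CRDB patch:

```go
package sketch

// Rough sketch of the idea: when a quiesced follower receives a Raft message
// from a peer other than the leader, it unquiesces itself and also pokes the
// leader, so the leader's heartbeats resume before anyone's election timeout
// fires. Names here are hypothetical, not CockroachDB identifiers.

type replicaID uint64

type replica struct {
	quiesced bool
	self     replicaID
	leader   replicaID // 0 if unknown
}

func (r *replica) onRaftMessage(from replicaID, wakeLeader func(replicaID)) {
	if !r.quiesced {
		return
	}
	r.quiesced = false
	// If the message didn't come from the leader, the leader may still be
	// quiesced (e.g. under a partial partition where the sender can't reach
	// it). Wake it so it keeps asserting leadership via heartbeats.
	if r.leader != 0 && from != r.leader && r.leader != r.self {
		wakeLeader(r.leader)
	}
}
```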
103909: ui: add Created SQL Connections chart on SQL dashboard r=koorosh a=koorosh

Added a Created SQL Connections chart that shows the rate of created SQL connections over time. Release note (ui change): added Created SQL Connections chart on the Metrics page, SQL Dashboard in DB Console. Resolves: #93486
![Screenshot 2023-05-25 at 21 24 48](https://github.com/cockroachdb/cockroach/assets/3106437/9fbba32e-3bdc-4bce-ac51-4e57d141b926)

104025: kvserver: add a few Raft proposal trace events r=erikgrinaker a=erikgrinaker

Epic: none. Release note: None

104026: kvserver: don't apply quota pool to lease requests r=erikgrinaker a=erikgrinaker

Lease requests and transfers are latency-sensitive and small, so we allow them to bypass the quota pool. This is particularly important for expiration lease extensions, which happen every 3 seconds. During TPCC imports, the quota pool was often found to delay lease requests by as much as 3 seconds. Resolves #98124. Epic: none. Release note: None

104040: kvserver: add `COCKROACH_RAFT_ENABLE_CHECKQUORUM` r=erikgrinaker a=erikgrinaker

This patch adds `COCKROACH_RAFT_ENABLE_CHECKQUORUM`, which enables Raft CheckQuorum (and as a side effect also fully enables PreVote). See the code comment for details. It defaults to false, since we need to investigate liveness concerns from when CheckQuorum was previously enabled in the ancient past.

Raft uses the field `Progress.RecentActive` to track follower activity for CheckQuorum. When CheckQuorum is disabled, this is always `true`, but when enabled it will be reset to `false` regularly (with an interval that equals the election timeout). There are some uses of this field in CRDB, but we always replace it with our own notion of activity via `updateRaftProgressFromActivity` (tracked in wall time with a timeout equaling the range lease interval).

Touches #92088. Epic: none. Release note: None

104050: cli/zip: avoid including PII-laden details in debug zip r=postamar a=knz

Fixes #104049. Epic: CRDB-27642. Release note (bug fix): The details of errors pertaining to invalid descriptors are no longer included in redacted debug zips.

104053: kvserver: add DisableQuiescence testing knob r=erikgrinaker a=erikgrinaker

This patch turns `COCKROACH_DISABLE_QUIESCENCE` into a testing knob, for better control in tests. Epic: none. Release note: None

Co-authored-by: Andrii Vorobiov <and.vorobiov@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
A fundamental tradeoff here is that we'll now delay explicit campaigning by up to 2 seconds (the election timeout). This typically comes up in two cases where the current leader is dead according to liveness: when we're initializing or unquiescing a replica, and when we're attempting to acquire a lease. In these cases, we want to respect the full PreVote+CheckQuorum condition so as not to cause spurious elections or steal leadership away. This is particularly important under partial or asymmetric network partitions, since the replica could otherwise steal leadership even though the leaseholder can't reach it, causing permanent range unavailability (the case which motivated this issue). We can't trust the liveness or lease information for the leader in this case, because it's likely to be stale or inaccurate. There is one case where we can still campaign immediately: when we removed the leader from the range. See #104189. |
I read over the spurious election case in #9372 (comment), caused by CheckQuorum, and it's covered by PreVote. When both are enabled, the unquiesced node won't be able to call an election that unseats the leader (especially with #104057 which also tests this scenario). |
This is also confirmed by this code comment: https://github.com/etcd-io/raft/blob/e293d6b9d465bd77845b50d40079dbe44996e504/raft.go#L1051-L1073 |
104057: kvserver: wake leader on Raft message from other replica r=erikgrinaker a=erikgrinaker

**kvserver: wake leader on Raft message from other replica**

Previously, in the case of a quiesced range, a follower who received a pre-vote request could unquiesce itself only to reject the pre-vote. Under a partial network partition, where the leader itself did not receive a corresponding pre-vote request, we would now have two unquiesced followers with a quiesced leader. Eventually, both of the followers' election timeouts would lapse and they would hold an election, upsetting the established leader. Furthermore, the leadership would often be immediately transferred back to the leaseholder, i.e. the old leader, and the cycle would repeat. The same holds for any other message that a follower might send to a different follower, without also sending a message to the leader.

This patch makes a quiesced follower unquiesce both itself and the leader when it receives a message from anyone but the leader, allowing the leader to maintain leadership. Note that both before and after this patch, such partial partitions often result in persistent mass unquiescence due to the continuous prevotes.

Touches #92088. Epic: None. Release note: None.

**kvserver: don't unquiesce during snapshot application**

Previously, snapshot application would explicitly unquiesce the newly initialized replica. `handleRaftReady` would also unquiesce the replica once the snapshot is applied, but the former did so without also waking the leader. This can lead to spurious elections if multiple replicas are left unquiesced without waking the leader, e.g. during a partial network partition.

This patch therefore removes the explicit unquiesce during snapshot application, and instead unquiesces normally (along with the leader) during `handleRaftReady`. This is also symmetric with other replica initialization, which does not explicitly unquiesce the replica either.

Touches #92088. Epic: none. Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
104969: kvserver: force-campaign leaseholder on leader removal r=erikgrinaker a=erikgrinaker

**kvserver: clarify Raft campaign behavior with PreVote+CheckQuorum**

This patch tweaks and clarifies explicit campaign behavior when Raft PreVote and CheckQuorum are enabled. In this case, followers will only grant prevotes if they haven't heard from a leader in the past election timeout. It also adds a test covering this.

**kvserver: add `Replica.forceCampaignLocked()`**

This patch adds `forceCampaignLocked()`, which can be used to force an election, transitioning the replica directly to candidate and bypassing PreVote+CheckQuorum.

**kvserver: force-campaign leaseholder on leader removal**

Previously, when the leader was removed from the range via a conf change, the first voter in the range descriptor would campaign to avoid waiting for an election timeout. This had a few drawbacks:

* If the first voter is unavailable or lags, no one will campaign.
* If the first voter isn't the leaseholder, it has to immediately transfer leadership to the leaseholder.
* It used Raft PreVote, so it wouldn't be able to win when CheckQuorum is enabled, since followers won't grant votes when they've recently heard from the leader.

This patch instead campaigns on the current leaseholder. We know there can only be one, avoiding election ties. The conf change is typically proposed by the leaseholder anyway, so it's likely to be up to date, and we want it to be colocated with the leader. If there is no leaseholder, then waiting out the election timeout is less problematic, since either we'll have to wait out the lease interval anyway, or the range is idle.

It also forces an election by transitioning directly to candidate, bypassing PreVote. This is ok, since we know the current leader is dead.

To avoid election ties in mixed 23.1/23.2 clusters, we retain the old voter designation until the upgrade is finalized, but always force an election instead of using pre-vote.

Resolves #104871. Touches #92088. Follows #104189. Epic: none. Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
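For context on how an election can be forced past PreVote in etcd/raft, here is one mechanism (a sketch only; whether `forceCampaignLocked()` uses exactly this approach is an assumption, not confirmed by the PR text):

```go
package sketch

import (
	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

// forceCampaign steps a MsgTimeoutNow at the local node, the same message a
// leader sends to the transfer target during a cooperative leadership
// transfer. The recipient campaigns immediately for the next term, skipping
// the pre-vote phase and the recently-heard-from-a-leader check.
func forceCampaign(rn *raft.RawNode, selfID uint64) error {
	return rn.Step(pb.Message{To: selfID, From: selfID, Type: pb.MsgTimeoutNow})
}
```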
105126: kvserver: don't campaign when unquiescing for a Raft message r=erikgrinaker a=erikgrinaker

We're going to generalize this to all (pre)votes, see #105132. However, I'm merging this as-is to keep a record of the alternative, since we may want to change to this approach later, when e.g. 23.1 compatibility and epoch leases are less of a concern.

---

**DEPS: upgrade Raft to a10cd45**

Adds `ForgetLeader()`, which we need for CheckQuorum handling.

```
a10cd45 2023-06-26: Merge pull request 78 from erikgrinaker/forget-leader [Benjamin Wang]
1159466 2023-06-16: add `ForgetLeader` [Erik Grinaker]
09ea4c5 2023-06-16: rafttest: show term and leader for `raft-state` [Erik Grinaker]
26ce926 2023-06-20: rafttest: add `log-level` argument for `stabilize` [Erik Grinaker]
30e2fa4 2023-06-12: Merge pull request 70 from erikgrinaker/prevote-checkquorum-datadriven [Benjamin Wang]
a042ce3 2023-06-09: Merge pull request 75 from charles-chenzz/update-go-patch [Benjamin Wang]
dd2340f 2023-06-08: update go to patch release 1.19.10 [charles-chenzz]
27dd2c2 2023-06-02: add data-driven tests for PreVote and CheckQuorum [Erik Grinaker]
```

**kvserver: tweak updateLivenessMap comment**

**kvserver: add `Replica.forgetLeaderLocked()`**

**kvserver: don't campaign when unquiescing for a Raft message**

This patch forgets the leader when unquiescing in response to a Raft message and finding a dead leader (according to liveness). We don't campaign, because that could result in election ties, but forgetting the leader allows us to grant (pre)votes even though we've heard from the leader recently, avoiding having to wait out an election timeout.

Touches #92088. Epic: none. Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
105132: kvserver: maybe forget leader on (Pre)Vote requests r=erikgrinaker a=erikgrinaker

This patch will forget the Raft leader when receiving a (Pre)Vote request and finding the leader to be dead (according to liveness) or removed. This allows a quorum of replicas to campaign despite the PreVote+CheckQuorum condition if they independently consider the leader to be dead.

A more targeted alternative was explored, where we forget the leader only when unquiescing to a dead leader. However, this had a number of drawbacks. It would have to use `TransferLeader` to steal leadership away from a Raft leader who can't heartbeat liveness during lease acquisitions, but this could cause unavailability due to races where multiple replicas request leader transfers and some are not up-to-date on the log. It would also need mixed-version logic since 23.1 nodes could otherwise be unable to obtain necessary prevotes.

We instead choose to be lenient for now, and we can consider tightening the conditions later when we're more confident in the PreVote+CheckQuorum handling and may no longer need epoch leases.

Touches #92088. Epic: none. Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
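A rough sketch of the idea, assuming the `RawNode.ForgetLeader()` API added in the Raft upgrade above (the exact signature is assumed) and using a hypothetical `leaderDeadPerLiveness` stand-in for the liveness check:

```go
package sketch

import "go.etcd.io/raft/v3"

// maybeForgetLeader sketches the idea behind this change: before deciding on a
// (pre)vote request, a replica that independently believes the leader is dead
// (per node liveness) forgets its leader, so the PreVote+CheckQuorum recency
// check no longer blocks the vote. leaderDeadPerLiveness is a hypothetical
// stand-in for CRDB's liveness check, and the ForgetLeader signature is an
// assumption based on the upstream change referenced above.
func maybeForgetLeader(rn *raft.RawNode, leaderDeadPerLiveness bool) error {
	if !leaderDeadPerLiveness {
		return nil
	}
	return rn.ForgetLeader()
}
```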
In https://github.com/cockroachlabs/support/issues/1875, we saw indications that range quiescence may interfere with Raft prevote, which in turn can cause range unavailability with asymmetric/partial network partitions where the Raft leader is unreachable from other replicas. This can be reproduced in the following scenario:
We can now see n5 pick up Raft leaderships, even though prevote should have prevented it from doing so. We also see that range leases are mostly not transferred across to the Raft leader, respecting the lease preferences (although not always). Leaseholders on other nodes wouldn't be able to propose a lease transfer or Raft leadership transfer anyway (#92089). Once the partition heals (at 21:10), the Raft leaderships are immediately transferred to the appropriate leaseholder.
This does not happen when the workload is changed to a write-only workload, where the ranges are not quiesced -- the restarted node is not able to acquire any Raft leaderships at all.
It thus appears that range quiescence somehow interferes with Raft prevote, allowing a partitioned replica to acquire Raft leadership. If the restarted node is then not reachable from some of the other nodes, the existing leaseholder may not be able to send proposals to the new Raft leader, which prevents it from reclaiming Raft leadership. I haven't mapped out all the details here yet.
We should ensure the prevote protection still works as expected with quiesced ranges.
Jira issue: CRDB-21567
Epic CRDB-25199