kvserver: force-campaign leaseholder on leader removal #104969
pkg/kv/kvserver/replica_raft.go
// We don't have an initialized, so we can't figure out who is supposed
// to campaign. It's possible that it's us and we're waiting for the
// initial snapshot, but it's hard to tell. Don't do anything.
// No descriptor, so we don't know if the leader has been removed.
How would this case ever be hit? This method is called when you apply a conf change, which implies you're applying from the log, which implies initialized. Or do we hit this on the leader, as it just removed and replicaGC'ed itself? I'm pretty sure `r.Desc()` will still reflect the "old state" then, won't it? But more likely we're not hitting this method in the first place.
I realize this isn't related to this PR, just musings. A comment could be helpful on what actually happens today and whether it matters that we check this.
I agree, it doesn't seem likely, but may as well be defensive. Said as much in the comment.
@@ -162,6 +162,7 @@ type proposer interface {
// The following require the proposer to hold an exclusive lock.
withGroupLocked(func(proposerRaft) error) error
registerProposalLocked(*ProposalData)
campaignLocked(ctx context.Context)
A comment might be helpful - most of the callers have a *RawNode* in scope, but this method also enqueues to the scheduler.
Is that problematic though? This was mostly to avoid having to write the comment all over again at the call site (and yes, I realize the interface will mask the comment anyway, but presumably people would go to the actual implementation and read it there).
pkg/kv/kvserver/client_raft_test.go
// Set an ~infinite election timeout. We don't want replicas to call
// elections due to timeouts, we want them to campaign and obtain
// votes despite PreVote+CheckQuorum.
RaftElectionTimeoutTicks: 1e9,
How about setting this to `20 * time.Second / raftTickInterval` so that it doesn't stall the test indefinitely but is guaranteed to still fail the 10s timeout you're giving it for electing a leader?
I see that you're canceling the ctx but just worry it may not be good enough to avoid a hung test in case of the right kind of flake in the right place.
Done.
pkg/kv/kvserver/replica_raft.go
// Leader unknown. This isn't what we expect in steady state, so we
// don't do anything.
if raftStatus.Lead == 0 {
// Leader unknown. Unexpected in steady state, so we don't do anything.
The state arguably isn't steady. I agree that the Lead field would usually be set in this case, but isn't that largely an implementation detail? For example, if raft started having some mechanism in which the leader actively abdicates (fat chance, I know, but still) you'd get a zero here sometimes.
Maybe we're just trying to change as little as possible, that's fine, but maybe we can add a comment to that effect.
Updated the comment to say that we don't want to risk spurious elections if we don't know what's going on, given that we force an election.
TFTR! bors r+
Build succeeded.
kvserver: clarify Raft campaign behavior with PreVote+CheckQuorum
This patch tweaks and clarifies explicit campaign behavior when Raft PreVote and CheckQuorum are enabled. In this case, followers will only grant prevotes if they haven't heard from a leader in the past election timeout. It also adds a test covering this.
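The pre-vote rule described above can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not the etcd/raft or CockroachDB implementation: the `follower` type, its fields, and `considersVote` are hypothetical names, and real vote handling also checks terms and log positions.

```go
package main

import "fmt"

// follower is a hypothetical, simplified model of a Raft follower's view of
// its leader. It is not the actual etcd/raft or CockroachDB data structure.
type follower struct {
	checkQuorum     bool   // whether CheckQuorum is enabled
	lead            uint64 // known leader ID, 0 if unknown
	electionElapsed int    // ticks since the follower last heard from the leader
	electionTimeout int    // ticks after which the leader is presumed dead
}

// considersVote reports whether the follower would even consider a (pre)vote
// request. With CheckQuorum, a follower that has heard from a leader within
// the election timeout ignores the request unless the campaign is forced.
func (f follower) considersVote(forced bool) bool {
	recentLeader := f.checkQuorum && f.lead != 0 && f.electionElapsed < f.electionTimeout
	return forced || !recentLeader
}

func main() {
	f := follower{checkQuorum: true, lead: 1, electionElapsed: 2, electionTimeout: 10}
	fmt.Println(f.considersVote(false)) // heard from the leader recently: ignored
	fmt.Println(f.considersVote(true))  // forced campaign bypasses the check
	f.electionElapsed = 10
	fmt.Println(f.considersVote(false)) // leader silent for a full timeout: considered
}
```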
kvserver: add `Replica.forceCampaignLocked()`
This patch adds `forceCampaignLocked()`, which can be used to force an election, transitioning the replica directly to candidate and bypassing PreVote+CheckQuorum.
kvserver: force-campaign leaseholder on leader removal
Previously, when the leader was removed from the range via a conf change, the first voter in the range descriptor would campaign to avoid waiting for an election timeout. This had a few drawbacks:
* If the first voter is unavailable or lags, no one will campaign.
* If the first voter isn't the leaseholder, it has to immediately transfer leadership to the leaseholder.
* It used Raft PreVote, so it wouldn't be able to win when CheckQuorum is enabled, since followers won't grant votes when they've recently heard from the leader.
This patch instead campaigns on the current leaseholder. We know there can only be one, avoiding election ties. The conf change is typically proposed by the leaseholder anyway so it's likely to be up-to-date. And we want it to be colocated with the leader. If there is no leaseholder then waiting out the election timeout is less problematic, since either we'll have to wait out the lease interval anyway, or the range is idle.
It also forces an election by transitioning directly to candidate, bypassing PreVote. This is ok, since we know the current leader is dead.
To avoid election ties in mixed 23.1/23.2 clusters, we retain the old voter designation until the upgrade is finalized, but always force an election instead of using pre-vote.
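The selection rule described above might be sketched as follows. This is an illustrative model only, not the code in the patch: `replicaView`, its fields, and `shouldForceCampaign` are hypothetical names, and the real logic reads this information from the replica's lease, range descriptor, and cluster version.

```go
package main

import "fmt"

// replicaView is a hypothetical summary of what a replica knows when it
// applies the conf change that removes the leader.
type replicaView struct {
	replicaID        uint64   // this replica's ID
	leaseholderID    uint64   // current leaseholder, 0 if none
	voterIDs         []uint64 // voters in the range descriptor, in order
	upgradeFinalized bool     // false in a mixed 23.1/23.2 cluster
}

// shouldForceCampaign reports whether this replica should force an election
// after the leader has been removed from the range.
func shouldForceCampaign(v replicaView) bool {
	if !v.upgradeFinalized {
		// Mixed-version cluster: keep the old designation (first voter) so
		// that old and new nodes agree on a single campaigner.
		return len(v.voterIDs) > 0 && v.voterIDs[0] == v.replicaID
	}
	// Otherwise only the leaseholder campaigns. With no leaseholder, nobody
	// does, and the range waits out the election timeout.
	return v.leaseholderID != 0 && v.leaseholderID == v.replicaID
}

func main() {
	v := replicaView{replicaID: 2, leaseholderID: 2, voterIDs: []uint64{1, 2, 3}, upgradeFinalized: true}
	fmt.Println(shouldForceCampaign(v)) // leaseholder campaigns once finalized
	v.upgradeFinalized = false
	fmt.Println(shouldForceCampaign(v)) // mixed version: only the first voter campaigns
}
```

In both branches the chosen replica forces the election rather than using pre-vote, per the description above.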
Resolves #104871.
Touches #92088.
Follows #104189.
Epic: none.
Release note: None