kvserver: force-campaign leaseholder on leader removal #104969

Merged 3 commits into cockroachdb:master from raft-remove-leader-campaign on Jun 16, 2023

@erikgrinaker erikgrinaker commented Jun 15, 2023

kvserver: clarify Raft campaign behavior with PreVote+CheckQuorum

This patch tweaks and clarifies explicit campaign behavior when Raft PreVote and CheckQuorum are enabled. In this case, followers will only grant prevotes if they haven't heard from a leader in the past election timeout. It also adds a test covering this.
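
As a rough illustration of the rule described above, here is a minimal, self-contained sketch of when a follower would grant a prevote with PreVote and CheckQuorum enabled. The `follower` type and its fields are hypothetical and do not mirror the actual Raft or kvserver structs; the sketch also ignores the term and log up-to-dateness checks that a real implementation performs.

```go
package main

import "fmt"

// follower is a toy model of a follower's view of its leader.
// All names here are hypothetical, chosen for illustration only.
type follower struct {
	leaderID            uint64 // 0 means no known leader
	ticksSinceLeaderMsg int    // ticks since the last message from the leader
	electionTimeout     int    // election timeout, in ticks
}

// grantsPreVote reports whether the follower would grant a prevote when
// PreVote and CheckQuorum are both enabled: only if it has no leader, or
// has not heard from its leader within the past election timeout.
func (f follower) grantsPreVote() bool {
	if f.leaderID == 0 {
		return true
	}
	return f.ticksSinceLeaderMsg >= f.electionTimeout
}

func main() {
	recent := follower{leaderID: 1, ticksSinceLeaderMsg: 2, electionTimeout: 10}
	stale := follower{leaderID: 1, ticksSinceLeaderMsg: 10, electionTimeout: 10}
	fmt.Println(recent.grantsPreVote()) // heard from the leader recently
	fmt.Println(stale.grantsPreVote())  // leader silent for a full timeout
}
```

This is why an explicit campaign that goes through PreVote cannot win while followers still consider the old leader live.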

kvserver: add Replica.forceCampaignLocked()

This patch adds forceCampaignLocked(), which can be used to force an election, transitioning the replica directly to candidate and bypassing PreVote+CheckQuorum.
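
To make the difference concrete, the following toy state machine contrasts a regular campaign (which enters the pre-vote phase) with a forced one (which goes straight to candidate). It is only a sketch: the `node` type and the `campaign`/`forceCampaign` methods are hypothetical stand-ins, not the actual `forceCampaignLocked()` implementation or the Raft library API.

```go
package main

import "fmt"

type state int

const (
	follower state = iota
	preCandidate
	candidate
)

// node is a hypothetical, minimal model of a Raft peer.
type node struct {
	state state
	term  uint64
}

// campaign models a regular campaign with PreVote enabled: the node
// becomes a pre-candidate and does not bump its term until it has
// collected a quorum of prevotes.
func (n *node) campaign() {
	n.state = preCandidate
}

// forceCampaign models a forced election: the node skips the pre-vote
// phase, becomes a candidate immediately, and increments its term.
func (n *node) forceCampaign() {
	n.state = candidate
	n.term++
}

func main() {
	a := &node{state: follower, term: 5}
	a.campaign()
	fmt.Println(a.state == preCandidate, a.term)

	b := &node{state: follower, term: 5}
	b.forceCampaign()
	fmt.Println(b.state == candidate, b.term)
}
```

Because the forced path bumps the term right away, it should only be used when the caller knows the old leader is gone; otherwise it can disrupt a healthy leader.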

kvserver: force-campaign leaseholder on leader removal

Previously, when the leader was removed from the range via a conf change, the first voter in the range descriptor would campaign to avoid waiting for an election timeout. This had a few drawbacks:

  • If the first voter is unavailable or lags, no one will campaign.

  • If the first voter isn't the leaseholder, it has to immediately transfer leadership to the leaseholder.

  • It used Raft PreVote, so it wouldn't be able to win when CheckQuorum is enabled, since followers won't grant votes when they've recently heard from the leader.

This patch instead campaigns on the current leaseholder. We know there can only be one, avoiding election ties. The conf change is typically proposed by the leaseholder anyway so it's likely to be up-to-date. And we want it to be colocated with the leader. If there is no leaseholder then waiting out the election timeout is less problematic, since either we'll have to wait out the lease interval anyway, or the range is idle.

It also forces an election by transitioning directly to candidate, bypassing PreVote. This is ok, since we know the current leader is dead.

To avoid election ties in mixed 23.1/23.2 clusters, we retain the old voter designation until the upgrade is finalized, but always force an election instead of using pre-vote.
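
The selection logic described above can be sketched roughly as follows. This is an illustrative model only: `replicaView`, its fields, and `shouldForceCampaign` are hypothetical names, and the real code also handles cases such as an unknown leader or an uninitialized descriptor.

```go
package main

import "fmt"

// replicaView is a hypothetical snapshot of what a replica knows right
// after applying the conf change.
type replicaView struct {
	replicaID        uint64
	leaderRemoved    bool
	leaseholderID    uint64   // 0 if no leaseholder is known
	voters           []uint64 // remaining voters, in descriptor order
	upgradeFinalized bool     // false while the cluster is mixed-version
}

// shouldForceCampaign reports whether this replica should force an
// election after the leader was removed.
func shouldForceCampaign(v replicaView) bool {
	if !v.leaderRemoved {
		return false
	}
	if !v.upgradeFinalized {
		// Mixed-version cluster: keep the old designation (first voter
		// in the descriptor) so old and new nodes agree on who campaigns.
		return len(v.voters) > 0 && v.voters[0] == v.replicaID
	}
	// Finalized: only the leaseholder campaigns. If there is none,
	// everyone waits out the election timeout.
	return v.leaseholderID != 0 && v.leaseholderID == v.replicaID
}

func main() {
	base := replicaView{leaderRemoved: true, leaseholderID: 2, voters: []uint64{3, 2}}

	finalized := base
	finalized.upgradeFinalized = true
	finalized.replicaID = 2
	fmt.Println(shouldForceCampaign(finalized)) // leaseholder campaigns

	mixed := base
	mixed.replicaID = 3
	fmt.Println(shouldForceCampaign(mixed)) // first voter campaigns
}
```

In both branches exactly one replica is designated, which is what avoids election ties.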

Resolves #104871.
Touches #92088.
Follows #104189.
Epic: none.

Release note: None

@erikgrinaker erikgrinaker requested review from pav-kv and tbg June 15, 2023 13:27
@erikgrinaker erikgrinaker requested a review from a team as a code owner June 15, 2023 13:27
@erikgrinaker erikgrinaker self-assigned this Jun 15, 2023
@erikgrinaker erikgrinaker requested a review from a team June 15, 2023 13:27
@erikgrinaker erikgrinaker force-pushed the raft-remove-leader-campaign branch 2 times, most recently from 2ce9e7b to 7fd07a8 Compare June 15, 2023 13:35
// We don't have an initialized, so we can't figure out who is supposed
// to campaign. It's possible that it's us and we're waiting for the
// initial snapshot, but it's hard to tell. Don't do anything.
// No descriptor, so we don't know if the leader has been removed.
Member

How would this case ever be hit? This method is called when you apply a conf change, which implies you're applying from the log, which implies initialized. Or do we hit this on the leader, as it just removed and replicaGC'ed itself? I'm pretty sure r.Desc() will still reflect the "old state" then, won't it? But more likely we're not hitting this method in the first place.

I realize this isn't related to this PR, just musings. A comment could be helpful on what actually happens today and whether it matters that we check this.

Contributor Author

I agree, it doesn't seem likely, but may as well be defensive. Said as much in the comment.

pkg/kv/kvserver/replica.go (outdated)
@@ -162,6 +162,7 @@ type proposer interface {
// The following require the proposer to hold an exclusive lock.
withGroupLocked(func(proposerRaft) error) error
registerProposalLocked(*ProposalData)
campaignLocked(ctx context.Context)
Member

a comment might be helpful - most of the callers have a *RawNode* in scope, but this method also enqueues to the scheduler.

Contributor Author

@erikgrinaker erikgrinaker Jun 15, 2023

Is that problematic though? This was mostly to avoid having to write the comment all over again at the call site (and yes, I realize the interface will mask the comment anyway, but presumably people would go to the actual implementation and read it there).

// Set an ~infinite election timeout. We don't want replicas to call
// elections due to timeouts, we want them to campaign and obtain
// votes despite PreVote+CheckQuorum.
RaftElectionTimeoutTicks: 1e9,
Member

How about setting this to 20 * time.Second / raftTickInterval so that it doesn't stall the test indefinitely but is guaranteed to still fail the 10s timeout you're giving it for electing a leader?
I see that you're canceling the ctx but just worry it may not be good enough to avoid a hung test in case of the right kind of flake in the right place.

Contributor Author

Done.

// Leader unknown. This isn't what we expect in steady state, so we
// don't do anything.
if raftStatus.Lead == 0 {
// Leader unknown. Unexpected in steady state, so we don't do anything.
Member

The state arguably isn't steady. I agree that the Lead field would usually be set in this case, but isn't that largely an implementation detail? For example, if raft started having some mechanism in which the leader actively abdicates (fat chance, I know, but still) you'd get a zero here sometimes.

Maybe we're just trying to change as little as possible, that's fine, but maybe we can add a comment to that effect.

Contributor Author

@erikgrinaker erikgrinaker Jun 15, 2023

Updated the comment to say that we don't want to risk spurious elections if we don't know what's going on, given that we force an election.


erikgrinaker commented Jun 16, 2023

TFTR!

bors r+


craig bot commented Jun 16, 2023

Build succeeded:

@craig craig bot merged commit cf4e6d1 into cockroachdb:master Jun 16, 2023
@erikgrinaker erikgrinaker deleted the raft-remove-leader-campaign branch June 16, 2023 15:47
Successfully merging this pull request may close these issues.

kvserver: designate follower to campaign on Raft leader removal