storage: avoid reading uncommitted tail of Raft log when becoming leader #18601

petermattis · 2017-09-19T18:48:48Z

The work will all be upstream in etcd/raft. Filing an issue here for tracking purposes.

#13231 is mainly talking about reducing some inefficiencies in the scan for uncommitted config changes, but the scan would still be there. It's tricky to eliminate the scan completely given the current raft semantics (if we cached some information about the presence of config changes we'd have to update that when the uncommitted tail gets truncated).
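For readers unfamiliar with the scan being discussed, here is a minimal Go sketch of what counting config changes in the uncommitted tail involves. The `Entry` and `EntryType` types and the `countPendingConf` function are simplified, hypothetical stand-ins written for illustration; they are not the etcd/raft definitions. The point is only that the work is linear in the size of the uncommitted tail.

```go
package main

import "fmt"

// EntryType and Entry are simplified stand-ins for raft log entries.
// They are illustrative assumptions, not the etcd/raft types.
type EntryType int

const (
	EntryNormal EntryType = iota
	EntryConfChange
)

type Entry struct {
	Index uint64
	Type  EntryType
}

// countPendingConf visits every entry above the commit index and counts
// the config changes among them. Its cost grows with the length of the
// uncommitted tail, which is the expense this issue wants to avoid.
func countPendingConf(entries []Entry, committed uint64) int {
	n := 0
	for _, e := range entries {
		if e.Index > committed && e.Type == EntryConfChange {
			n++
		}
	}
	return n
}

func main() {
	entries := []Entry{
		{Index: 1, Type: EntryNormal},
		{Index: 2, Type: EntryConfChange},
		{Index: 3, Type: EntryNormal},
		{Index: 4, Type: EntryConfChange},
	}
	// With commit index 2, only the config change at index 4 is pending.
	fmt.Println(countPendingConf(entries, 2))
}
```

Caching the count would avoid the loop, but, as noted above, the cached value would then have to be corrected whenever the uncommitted tail is truncated.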

Fortunately, with a little refactoring in etcd/raft I think we can avoid the need for a precise count. What we require here is to ensure that we never have more than one config change in flight at a time. If we simply assume pessimistically that the tail of the log has a config change (so that the new leader cannot propose a config change until it has applied all entries up to the point of its election), we can skip the scan.

@petermattis says:

When would you clear raft.pendingConf? Currently that field gets cleared when the conf change is applied. Keeping track of the current last index and watching for when that index is committed seems doable, but a bit tricky. Did you have a simpler idea in mind?

@bdarnell responds:

My idea is to track the last index (instead of a bool pendingConf, it would be configChangeBlockedUntilIndex). I don't think it will be tricky because it doesn't need to persist across leadership changes.
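The index-based idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the etcd/raft implementation: the `leader` struct, its methods, and the error value are hypothetical, and only the field name `configChangeBlockedUntilIndex` comes from the comment above. The sketch assumes a config change may be proposed once the applied index has reached the blocking index.

```go
package main

import (
	"errors"
	"fmt"
)

// leader is a hypothetical, heavily simplified stand-in for raft leader state.
type leader struct {
	lastIndex uint64 // index of the last entry in the log
	applied   uint64 // highest index applied to the state machine
	// Config changes are refused until applied reaches this index.
	configChangeBlockedUntilIndex uint64
}

var errConfChangeBlocked = errors.New("config change blocked: an earlier one may still be pending")

// becomeLeader pessimistically assumes the existing tail may contain a
// config change, so it blocks up to the current last index without scanning.
func (l *leader) becomeLeader() {
	l.configChangeBlockedUntilIndex = l.lastIndex
}

// proposeConfChange appends a config change only if no earlier one can
// still be in flight, then blocks further ones until it is applied.
func (l *leader) proposeConfChange() (uint64, error) {
	if l.applied < l.configChangeBlockedUntilIndex {
		return 0, errConfChangeBlocked
	}
	l.lastIndex++
	l.configChangeBlockedUntilIndex = l.lastIndex
	return l.lastIndex, nil
}

// advanceApplied records that entries up to index have been applied.
func (l *leader) advanceApplied(index uint64) {
	if index > l.applied {
		l.applied = index
	}
}

func main() {
	l := &leader{lastIndex: 10, applied: 7}
	l.becomeLeader()

	_, err := l.proposeConfChange()
	fmt.Println("blocked right after election:", err != nil)

	l.advanceApplied(10) // the tail present at election time is now applied
	idx, err := l.proposeConfChange()
	fmt.Println("proposed at index:", idx, "error:", err)
}
```

Because the blocking index is recomputed in `becomeLeader`, nothing needs to be persisted across leadership changes, which matches the reasoning given above. The cost of a false positive is only that a new leader waits until it has applied the tail it inherited.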


nvanbenschoten · 2017-12-20T22:00:47Z

Are we still planning on getting to this before the 2.0 release?

bdarnell · 2017-12-20T23:06:25Z

It's less crucial thanks to the quota pool (which limits the size of the uncommitted tail of the log), though it still has some value: this issue was created after the quota pool landed, and we still thought it was worth doing.

I don't know if it's going to make the cut for 2.0, though. I think it ranks below fixing PreVote (#18151) as far as raft changes go.

petermattis · 2017-12-21T00:33:04Z

I recall an incident post quota pool where we saw a very large uncommitted tail of the log due to re-proposals. I don't recall the details, but I think @a-robinson looked at this too and he has a fantastic memory for this stuff.

a-robinson · 2017-12-21T06:36:29Z

The incident I looked into (#15702) was pre-quota pool. A 40MB delete operation got re-proposed 66 times, kicking off the infinite cycle of raft elections. Even the first proposal triggered an election due to the high latency / low bandwidth, but if reproposals hadn't been allowed then things presumably wouldn't have spun so out of control.

Around the same time, we also saw it during the uncommon combination of dropping a large database and running a restore at the same time while running on terrible disks (#15681).

#18199 happened post-quota pool, though. I think understanding @bdarnell's questions in #18199 (comment) would be helpful for properly prioritizing this. I'll personally remain worried about it unless we know it actually happened while they were running 1.0.x or we understand how the quota pool didn't prevent it.

tbg · 2017-12-21T15:03:06Z

The quota pool also doesn't prevent reproposals, and the Raft log could grow that way too.

a-robinson · 2017-12-21T15:09:39Z

I guess I should have checked the code. If that's the case, then consider me still fairly worried about this.

bdarnell · 2017-12-30T15:21:23Z

etcd-io/etcd#9073

xiang90 · 2018-01-05T18:48:26Z

@bdarnell

I think it ranks below fixing PreVote (#18151) as far as raft changes go.

Just to let you know, we on the etcd side are also going to put some effort into fixing pre-vote in Q1 2018. We also want to enable it soon.

/cc @gyuho

Scanning the uncommitted portion of the raft log to determine whether there are any pending config changes can be expensive. In cockroachdb/cockroach#18601, we've seen that a new leader can spend so much time scanning its log post-election that it fails to send its first heartbeats in time to prevent a second election from starting immediately. Instead of tracking whether a pending config change exists with a boolean, this commit tracks the latest log index at which a pending config change *could* exist. This is a less expensive solution to the problem, and the impact of false positives should be minimal since a newly-elected leader should be able to quickly commit the tail of its log.

Fixes cockroachdb#18601

Release note (bug fix): Fix a bug in which ranges could get stuck if the uncommitted raft log grew too large


Picks up a cherry-picked version of etcd-io/etcd#9073, to fix cockroachdb#18601

Release note (bug fix): Fixes potential cluster unavailability after raft logs grow too large.

24889: cherrypick-1.1: build: Update etcd r=bdarnell a=bdarnell

Picks up a cherry-picked version of etcd-io/etcd#9073, to fix #18601

Release note (bug fix): Fixes potential cluster unavailability after raft logs grow too large.

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>

petermattis added this to the 1.2 milestone Sep 19, 2017

petermattis assigned bdarnell Sep 19, 2017

bdarnell mentioned this issue Dec 30, 2017

raft: Avoid scanning raft log in becomeLeader etcd-io/etcd#9073

Merged

bdarnell mentioned this issue Jan 9, 2018

build: Update coreos/etcd dependency #21356

Merged

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jan 9, 2018

build: Update coreos/etcd dependency

16faad3


bdarnell closed this as completed in #21356 Jan 9, 2018

bdarnell mentioned this issue Feb 10, 2018

release-1.0: Backports and updates for table lease leak #22563

Merged

bdarnell mentioned this issue Apr 17, 2018

cherrypick-1.1: build: Update etcd #24889

Merged

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 18, 2018

build: Update etcd

54819a3

