Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: avoid reading uncommitted tail of Raft log when becoming leader #18601

Closed
petermattis opened this issue Sep 19, 2017 · 8 comments · Fixed by #21356
Closed

storage: avoid reading uncommitted tail of Raft log when becoming leader #18601

petermattis opened this issue Sep 19, 2017 · 8 comments · Fixed by #21356
Assignees
Milestone

Comments

@petermattis
Copy link
Collaborator

The work will all be upstream in etcd/raft. Filing an issue here for tracking purposes.

Forked from a comment on #18199:

#13231 is mainly talking about reducing some inefficiencies in the scan for uncommitted config changes, but the scan would still be there. It's tricky to eliminate the scan completely given the current raft semantics (if we cached some information about the presence of config changes we'd have to update that when the uncommitted tail gets truncated).

Fortunately, with a little refactoring in etcd/raft I think we can avoid the need for a precise count. What we require here is to ensure that we never have more than one config change in flight at a time. If we simply assume pessimistically that the tail of the log has a config change (so that the new leader cannot propose a config change until it has applied all entries up to the point of its election) and we can skip the scan.

@petermattis says:

When would you clear raft.pendingConf? Currently that field gets cleared when the conf change is applied. Keeping track of the the current last index and watching for when that index is committed seems doable, but a bit tricky. Did you have a simpler idea in mind?

@bdarnell responds:

My idea is to track the last index (instead of a bool pendingConf, it would be configChangeBlockedUntilIndex). I don't think it will be tricky because it doesn't need to persist across leadership changes.

@petermattis petermattis added this to the 1.2 milestone Sep 19, 2017
@nvanbenschoten
Copy link
Member

Are we still planning on getting to this before the 2.0 release?

@bdarnell
Copy link
Contributor

It's less crucial thanks to the quota pool (which limits the size of the uncommitted tail of the log), though it still has some value (we thought this was still worth doing even though this issue was created after the quota pool landed).

I don't know if it's going to make the cut for 2.0, though. I think it ranks below fixing PreVote (#18151) as far as raft changes go.

@petermattis
Copy link
Collaborator Author

I recall an incident post quota pool where we saw a very large uncommitted tail of the log due to re-proposals. I don't recall the details, but I think @a-robinson looked at this too and he has a fantastic memory for this stuff.

@a-robinson
Copy link
Contributor

The incident I looked into (#15702) was pre-quota pool. A 40MB delete operation got re-proposed 66 times, kicking off the infinite cycle of raft elections. Even the first proposal triggered an election due to the high latency / low bandwidth, but if reproposals hadn't been allowed then things presumably wouldn't have spun so out of control.

Around the same time, we also saw it during the uncommon combination of a dropping a large database and running a restore at the same time while running on terrible disks (#15681).

#18199 happened post-quota pool, though. I think understanding @bdarnell's questions in #18199 (comment) would be helpful for properly prioritizing this. I'll personally remain worried about it unless we know it actually happened while they were running 1.0.x or we understand how the quota pool didn't prevent it.

@tbg
Copy link
Member

tbg commented Dec 21, 2017

The quota pool also doesn't prevent reproposals, and the Raft log could grow that way too.

@a-robinson
Copy link
Contributor

I guess I should have checked the code. If that's the case, then consider me still fairly worried about this.

@bdarnell
Copy link
Contributor

etcd-io/etcd#9073

@xiang90
Copy link
Contributor

xiang90 commented Jan 5, 2018

@bdarnell

I think it ranks below fixing PreVote (#18151) as far as raft changes go.

Just let you know that we at etcd side is also going to put some effort to fix pre-vote in Q1 2018. We also want to enable it soon.

/cc @gyuho

xiang90 pushed a commit to etcd-io/etcd that referenced this issue Jan 9, 2018
Scanning the uncommitted portion of the raft log to determine whether
there are any pending config changes can be expensive. In
cockroachdb/cockroach#18601, we've seen that a new leader can spend so
much time scanning its log post-election that it fails to send
its first heartbeats in time to prevent a second election from
starting immediately.

Instead of tracking whether a pending config change exists with a
boolean, this commit tracks the latest log index at which a pending
config change *could* exist. This is a less expensive solution to
the problem, and the impact of false positives should be minimal since
a newly-elected leader should be able to quickly commit the tail of
its log.
gyuho pushed a commit to etcd-io/etcd that referenced this issue Jan 9, 2018
Scanning the uncommitted portion of the raft log to determine whether
there are any pending config changes can be expensive. In
cockroachdb/cockroach#18601, we've seen that a new leader can spend so
much time scanning its log post-election that it fails to send
its first heartbeats in time to prevent a second election from
starting immediately.

Instead of tracking whether a pending config change exists with a
boolean, this commit tracks the latest log index at which a pending
config change *could* exist. This is a less expensive solution to
the problem, and the impact of false positives should be minimal since
a newly-elected leader should be able to quickly commit the tail of
its log.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Jan 9, 2018
Fixes cockroachdb#18601

Release note (bug fix): Fix a bug in which ranges could get stuck if
the uncommitted raft log grew too large
bdarnell added a commit to cockroachdb/etcd that referenced this issue Apr 17, 2018
Scanning the uncommitted portion of the raft log to determine whether
there are any pending config changes can be expensive. In
cockroachdb/cockroach#18601, we've seen that a new leader can spend so
much time scanning its log post-election that it fails to send
its first heartbeats in time to prevent a second election from
starting immediately.

Instead of tracking whether a pending config change exists with a
boolean, this commit tracks the latest log index at which a pending
config change *could* exist. This is a less expensive solution to
the problem, and the impact of false positives should be minimal since
a newly-elected leader should be able to quickly commit the tail of
its log.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 18, 2018
Picks up a cherry-picked version of etcd-io/etcd#9073, to fix cockroachdb#18601

Release note (bug fix): Fixes potential cluster unavailability after
raft logs grow too large.
craig bot pushed a commit that referenced this issue Apr 19, 2018
24889: cherrypick-1.1: build: Update etcd r=bdarnell a=bdarnell

Picks up a cherry-picked version of etcd-io/etcd#9073, to fix #18601

Release note (bug fix): Fixes potential cluster unavailability after
raft logs grow too large.

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants