WIP keep logs around during recovery to ensure we can always recover … #355
Conversation
…however long install snapshot takes
```go
r.leaderState.recoveringPeersLock.Lock()
if r.leaderState.recoveringPeers != nil {
	if caughtUp {
		delete(r.leaderState.recoveringPeers, s.peer.ID)
```
It seems that items in recoveringPeers should also be able to expire some other way. If a recovering peer drops off before catching up and never comes back, log truncation will stay pinned at that peer's last known index indefinitely.
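Not the PR's actual data structure, but a minimal sketch of one way to let entries expire, assuming the tracking map is keyed by ServerID; the field and function names here are invented for illustration:

```go
package raft

import "time"

// recoveringPeer is a hypothetical entry shape (not the PR's actual struct):
// it pairs the oldest log index we must keep for a peer with a deadline after
// which we stop holding truncation back for that peer.
type recoveringPeer struct {
	minIndex  uint64
	expiresAt time.Time
}

// minRetainedIndex returns the lowest index that truncation must not pass,
// dropping entries whose deadline has elapsed so a peer that never returns
// cannot pin the log forever.
func minRetainedIndex(peers map[ServerID]recoveringPeer, now time.Time, fallback uint64) uint64 {
	min := fallback
	for id, p := range peers {
		if now.After(p.expiresAt) {
			delete(peers, id) // expired: stop holding logs for this peer
			continue
		}
		if p.minIndex < min {
			min = p.minIndex
		}
	}
	return min
}
```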
Yeah, great point. I wonder how to resolve this. If it's only time based then it comes back to "how long is it reasonable to expect a snapshot to take", which is kind of the same problem we are trying to solve here. I.e. any timeout, even a configurable one, is just another upper bound on how big snapshots can get relative to restore speed before you end up in the same unrecoverable state.
That said, it might still make sense to cap it somewhere: it keeps the common-case logs smaller while allowing recovery to increase storage only when needed, up to some much higher limit. But what is that limit? Maybe an hour? That's a lot more than the recovery time of the one case we've seen in prod so far, but they are already operating outside of our design envelope, so it's hard to come up with a non-arbitrary limit on how far outside we want to support. And that's only Consul's use case. Other DBs might use this raft implementation for cases where much larger datasets are expected.
For example, I used to have to recover 200GB+ MongoDB replica sets that had exactly this problem since they have a fixed-size OpLog. In some cases they took several hours to recover from snapshot. I guess if it's configurable then people can configure it up for that kind of use case (although it doesn't sound like a great idea to me!).
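If such a cap does end up being configurable, it could look roughly like this; purely a sketch, where the field name and default are assumptions and not part of this PR or the existing Config:

```go
package raft

import "time"

// recoveryOptions is a hypothetical knob for capping how long a recovering
// peer may hold back log truncation.
type recoveryOptions struct {
	MaxRecoverySnapshotTime time.Duration
}

// deadlineFor derives the expiry for a recovering peer from the time its
// snapshot install started.
func (o recoveryOptions) deadlineFor(installStart time.Time) time.Time {
	d := o.MaxRecoverySnapshotTime
	if d == 0 {
		d = time.Hour // illustrative default, echoing the "maybe an hour?" above
	}
	return installStart.Add(d)
}
```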
Couldn't we just remove the entry from recoveringPeers when that peer is removed from the Raft configuration?
I guess that isn't 100% bulletproof either.
What happens when a peer restarts while restoring the snapshot from the leader? Obviously it needs the full snapshot sent again, so at that point we can update the recovering peers entry (probably with a larger min index).
Basically we need a way to know that a Raft peer is in a recovering state vs. just dead. If we can track that information then we can make intelligent decisions about which logs need to be kept around.
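A rough sketch of the configuration-change idea, assuming the tracking map is keyed by ServerID and stores only the oldest index to retain; pruneRecoveringPeers is an invented helper, not code from this PR:

```go
package raft

// pruneRecoveringPeers drops recovering entries for servers that are no longer
// members of the latest configuration, so a removed peer cannot keep pinning
// log truncation.
func pruneRecoveringPeers(recovering map[ServerID]uint64, latest Configuration) {
	members := make(map[ServerID]struct{}, len(latest.Servers))
	for _, s := range latest.Servers {
		members[s.ID] = struct{}{}
	}
	for id := range recovering {
		if _, ok := members[id]; !ok {
			delete(recovering, id)
		}
	}
}
```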
Maybe we could kick off some arbitrary timer when the peer is removed from the Raft config. That way we are less dependent on the time it takes to transfer snapshots, and it would account for the situation where the peer restarts.
When the timer elapses, if the peer is still there we update the index; if it's not, we delete the entry from recoveringPeers.
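As a rough illustration of that timer idea (all names here are invented; this is not code from this branch):

```go
package raft

import (
	"sync"
	"time"
)

// recoveryTracker is a hypothetical holder for recovering peers.
type recoveryTracker struct {
	mu      sync.Mutex
	entries map[ServerID]uint64 // peer ID -> oldest log index we must retain
}

// scheduleCheck fires after the given wait: if the peer is back and still
// recovering, its entry is refreshed; otherwise the entry is dropped.
func (t *recoveryTracker) scheduleCheck(id ServerID, wait time.Duration, stillRecovering func(ServerID) (uint64, bool)) {
	time.AfterFunc(wait, func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		if idx, ok := stillRecovering(id); ok {
			t.entries[id] = idx // peer came back: keep holding logs from its new index
		} else {
			delete(t.entries, id) // peer is gone: release the logs
		}
	})
}
```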
Removing from raft config is a good call.
I guess there is still the case where a follower gets stuck doing something else though and doesn't cleanly crash or get removed from the quorum.
After thinking about this some, I decided we can probably just be more conservative than this PR and instead:
- if installSnapshot fails, remove the peer from recovering
- if it succeeds, set a timer to remove it in X minutes
- don't bother waiting for TrailingLogs/2 appends - by the time install snapshot has succeeded, the follower has not just downloaded but completely restored the snapshot, so we should attempt appendEntries pretty much immediately if we can. A buffer of a few minutes should therefore be plenty in all but the most pathological hanging cases, and those are the ones we want to protect against here I think.
What do you both think?
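A sketch of what that could look like, continuing the invented recoveryTracker type from the earlier illustration (none of this is the PR's actual code):

```go
package raft

import "time"

// onInstallSnapshotDone: on a failed install the peer stops holding logs
// immediately; on success it gets a short grace period and is then removed,
// since appendEntries should be attempted almost immediately once the restore
// has completed.
func (t *recoveryTracker) onInstallSnapshotDone(id ServerID, err error, grace time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if err != nil {
		delete(t.entries, id) // install failed: don't keep pinning truncation
		return
	}
	// Install succeeded: the follower has fully restored, so a few minutes of
	// extra retention covers everything but a pathologically hung follower.
	time.AfterFunc(grace, func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		delete(t.entries, id)
	})
}
```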
I think the timer approach should work. Should we expose X as a configuration option?
Maybe, since any fixed X might not be enough depending on the workload.
On the other hand, if someone is repeatedly running into X not being high enough, then TrailingLogs should probably be bumped up instead. In what case would someone prefer to relax X over TrailingLogs?
Hey there, This issue has been automatically closed because there hasn't been any activity for a while. If you are still experiencing problems, or still have questions, feel free to open a new one 👍
…however long install snapshot takes
This seems reasonable to me, but I haven't actually got the new test passing yet, and even if it does pass it's not really testing everything as thoroughly as it could.
I need to leave this here for now, so it's a draft PR that others can work on if needed.