WIP keep logs around during recovery to ensure we can always recover … #355
Conversation
…however long install snapshot takes
```go
r.leaderState.recoveringPeersLock.Lock()
if r.leaderState.recoveringPeers != nil {
	if caughtUp {
		delete(r.leaderState.recoveringPeers, s.peer.ID)
```
It seems that items in recoveringPeers should also be able to expire some other way. If a recovering peer drops off before catching up and never comes back, log truncation will stay pinned at that peer's last known index indefinitely.
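Not the PR's actual data structure, but a minimal sketch of one way to let entries expire, assuming the tracking map is keyed by ServerID; the field and function names here are invented for illustration:

```go
package raft

import "time"

// recoveringPeer is a hypothetical entry shape (not the PR's actual struct):
// it pairs the oldest log index we must keep for a peer with a deadline after
// which we stop holding truncation back for that peer.
type recoveringPeer struct {
	minIndex  uint64
	expiresAt time.Time
}

// minRetainedIndex returns the lowest index that truncation must not pass,
// dropping entries whose deadline has elapsed so a peer that never returns
// cannot pin the log forever.
func minRetainedIndex(peers map[ServerID]recoveringPeer, now time.Time, fallback uint64) uint64 {
	min := fallback
	for id, p := range peers {
		if now.After(p.expiresAt) {
			delete(peers, id) // expired: stop holding logs for this peer
			continue
		}
		if p.minIndex < min {
			min = p.minIndex
		}
	}
	return min
}
```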
Yeah, great point. I wonder how to resolve this. If it's only time based then it comes back to "how long is it reasonable to expect a snapshot to take", which is kind of the same problem we are trying to solve here. I.e. any timeout, even a configurable one, is just another upper bound on how big snapshots can get relative to restore speed before you end up in the same unrecoverable state.
That said, it might still make sense to cap it somewhere: it keeps the common-case logs smaller while allowing recovery to increase storage only when needed, up to some much higher limit. But what is that limit? Maybe an hour? That's a lot more than the recovery time of the one case we've seen in prod so far, but they are already operating outside of our design envelope, so it's hard to come up with a non-arbitrary limit on how far outside we want to support. And that's only Consul's use case. Other DBs might use this raft implementation for cases where much larger datasets are expected.
For example, I used to have to recover 200GB+ MongoDB replica sets that had exactly this problem since they have a fixed-size OpLog. In some cases they took several hours to recover from snapshot. I guess if it's configurable then people can configure it up for that kind of use case (although it doesn't sound like a great idea to me!).
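If such a cap does end up being configurable, it could look roughly like this; purely a sketch, where the field name and default are assumptions and not part of this PR or the existing Config:

```go
package raft

import "time"

// recoveryOptions is a hypothetical knob for capping how long a recovering
// peer may hold back log truncation.
type recoveryOptions struct {
	MaxRecoverySnapshotTime time.Duration
}

// deadlineFor derives the expiry for a recovering peer from the time its
// snapshot install started.
func (o recoveryOptions) deadlineFor(installStart time.Time) time.Time {
	d := o.MaxRecoverySnapshotTime
	if d == 0 {
		d = time.Hour // illustrative default, echoing the "maybe an hour?" above
	}
	return installStart.Add(d)
}
```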
Couldn't we just remove the entry from recoveringPeers when that peer is removed from the Raft configuration?
I guess that isn't 100% bulletproof either.
What happens when a peer restarts while restoring the snapshot from the leader? Obviously it needs the full snapshot sent again, so at that point we can update the recovering peers entry (probably with a larger min index).
Basically we need a way to know that a Raft peer is in a recovering state vs. just dead. If we can track that information then we can make intelligent decisions about which logs need to be kept around.
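A rough sketch of the configuration-change idea, assuming the tracking map is keyed by ServerID and stores only the oldest index to retain; pruneRecoveringPeers is an invented helper, not code from this PR:

```go
package raft

// pruneRecoveringPeers drops recovering entries for servers that are no longer
// members of the latest configuration, so a removed peer cannot keep pinning
// log truncation.
func pruneRecoveringPeers(recovering map[ServerID]uint64, latest Configuration) {
	members := make(map[ServerID]struct{}, len(latest.Servers))
	for _, s := range latest.Servers {
		members[s.ID] = struct{}{}
	}
	for id := range recovering {
		if _, ok := members[id]; !ok {
			delete(recovering, id)
		}
	}
}
```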
Maybe we could kick off some arbitrary timer when the peer is removed from the Raft config. That way we are less dependent on the time it takes to transfer snapshots, and it would account for the situation where the peer restarts.
When the timer elapses, if the peer is still there we update the index; if it's not, we delete the entry from recoveringPeers.
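As a rough illustration of that timer idea (all names here are invented; this is not code from this branch):

```go
package raft

import (
	"sync"
	"time"
)

// recoveryTracker is a hypothetical holder for recovering peers.
type recoveryTracker struct {
	mu      sync.Mutex
	entries map[ServerID]uint64 // peer ID -> oldest log index we must retain
}

// scheduleCheck fires after the given wait: if the peer is back and still
// recovering, its entry is refreshed; otherwise the entry is dropped.
func (t *recoveryTracker) scheduleCheck(id ServerID, wait time.Duration, stillRecovering func(ServerID) (uint64, bool)) {
	time.AfterFunc(wait, func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		if idx, ok := stillRecovering(id); ok {
			t.entries[id] = idx // peer came back: keep holding logs from its new index
		} else {
			delete(t.entries, id) // peer is gone: release the logs
		}
	})
}
```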
Removing from raft config is a good call.
I guess there is still the case where a follower gets stuck doing something else though and doesn't cleanly crash or get removed from the quorum.
After thinking about this some, I decided we can probably just be more conservative than this PR and instead:
- if installSnapshot fails, remove the peer from recovering
- if it succeeds, set a timer to remove it in X minutes
- don't bother waiting for TrailingLogs/2 appends - by the time install snapshot has succeeded, the follower has not just downloaded but completely restored the snapshot, so we should attempt appendEntries pretty much immediately if we can. A buffer of a few minutes should therefore be plenty in all but the most pathological hanging cases, and those are the ones we want to protect against here I think.
What do you both think?
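A sketch of what that could look like, continuing the invented recoveryTracker type from the earlier illustration (none of this is the PR's actual code):

```go
package raft

import "time"

// onInstallSnapshotDone: on a failed install the peer stops holding logs
// immediately; on success it gets a short grace period and is then removed,
// since appendEntries should be attempted almost immediately once the restore
// has completed.
func (t *recoveryTracker) onInstallSnapshotDone(id ServerID, err error, grace time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if err != nil {
		delete(t.entries, id) // install failed: don't keep pinning truncation
		return
	}
	// Install succeeded: the follower has fully restored, so a few minutes of
	// extra retention covers everything but a pathologically hung follower.
	time.AfterFunc(grace, func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		delete(t.entries, id)
	})
}
```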
I think the timer approach should work. Should we expose X as a configuration option?
Maybe, since any fixed X might not be enough depending on the workload.
On the other hand, if someone is repeatedly running into X not being high enough, then TrailingLogs should probably be bumped up instead. In what case would someone prefer to relax X over TrailingLogs?
Hey there, This issue has been automatically closed because there hasn't been any activity for a while. If you are still experiencing problems, or still have questions, feel free to open a new one 👍
…however long install snapshot takes
This seems reasonable to me, but I haven't actually got the new test passing yet, and even if it does pass it's not really testing everything as thoroughly as it could.
I need to leave this here for now, so it's a draft PR that others can work on if needed.