
WIP keep logs around during recovery to ensure we can always recover … #355

Closed
banks wants to merge 1 commit

Conversation

@banks (Member) commented Jul 19, 2019

…however long install snapshot takes

This seems reasonable to me but I've not actually got the new test to pass yet, and even if it does it's not really testing everything as thoroughly as it could.

I need to leave this here for now so it's a draft PR that others can work on if needed.

r.leaderState.recoveringPeersLock.Lock()
if r.leaderState.recoveringPeers != nil {
	if caughtUp {
		delete(r.leaderState.recoveringPeers, s.peer.ID)

@freddygv (Contributor) commented Jul 19, 2019

It seems that items in recoveringPeers should also be able to expire some other way. If a recovering peer drops off before catching up and never comes back, log truncation will stay pinned at that peer's last known index, so the leader retains logs indefinitely.
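
To make the failure mode concrete, here is a minimal sketch of the compaction floor; the helper name and the assumption that recoveringPeers maps each ServerID to that peer's last known index are hypothetical, not taken from this PR:

package raftutil

import "github.com/hashicorp/raft"

// compactionFloor returns the lowest log index the leader must keep around.
// Because it is a minimum over all recovering peers, a single stale entry that
// never expires holds log truncation back forever. Illustrative only: the
// function name and the map shape are assumptions, not the PR's actual code.
func compactionFloor(defaultFloor uint64, recoveringPeers map[raft.ServerID]uint64) uint64 {
	floor := defaultFloor
	for _, lastIndex := range recoveringPeers {
		if lastIndex < floor {
			floor = lastIndex
		}
	}
	return floor
}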

@banks (Member, Author) replied:

Yeah, great point. I wonder how to resolve this. If it's only time based then it comes back to "how long is it reasonable to expect a snapshot to take", which is essentially the same problem we are trying to solve here. That is, any timeout, even a configurable one, is just another upper bound on how big snapshots can get relative to restore speed before you end up in the same unrecoverable state.

That said, it might still make sense to cap it somewhere: it keeps the common-case logs smaller while allowing recovery to increase storage, only when needed, up to some much higher limit. But what is that limit? Maybe an hour? That's a lot more than the recovery time of the one case we've seen in prod, but that cluster is already operating outside of our design envelope, so it's hard to come up with a non-arbitrary limit on how far outside we want to support. And that's only Consul's use case. Other DBs might use this raft implementation for workloads where much larger datasets are expected.

For example, I used to have to recover 200GB+ MongoDB replica sets that had exactly this problem since they have a fixed-size OpLog. In some cases they took several hours to recover from snapshot. I guess if it's configurable then people can configure it up for that kind of use case (although that doesn't sound like a great idea to me!).

@mkeeler (Member) commented Jul 22, 2019

Couldn't we just remove the entry from recoveringPeers when that peer is removed from the Raft Configuration?

I guess that isn't 100% bulletproof either.

What happens when a peer restarts while restoring the snapshot from the leader? Obviously it needs the full snapshot sent again, so at that point we can update the recovering peers entry (probably with a larger min index).

Basically we need a way to know that a Raft peer is in a recovering state vs just dead. If we can track that information then we can make intelligent decisions about what logs need to be kept around.
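
A minimal sketch of that configuration-driven cleanup, under the same hypothetical assumption about the shape of recoveringPeers:

package raftutil

import "github.com/hashicorp/raft"

// pruneRecoveringPeers is a hypothetical sketch of the cleanup described above:
// whenever the cluster configuration changes, drop recoveringPeers entries for
// servers that are no longer members, so a departed peer stops pinning log
// truncation. The map shape (ServerID -> last known index) is an assumption.
func pruneRecoveringPeers(recoveringPeers map[raft.ServerID]uint64, latest raft.Configuration) {
	members := make(map[raft.ServerID]struct{}, len(latest.Servers))
	for _, s := range latest.Servers {
		members[s.ID] = struct{}{}
	}
	for id := range recoveringPeers {
		if _, ok := members[id]; !ok {
			delete(recoveringPeers, id)
		}
	}
}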

A contributor replied:

Maybe we could kick off some arbitrary timer when the peer is removed from the Raft config. That way we are less dependent on the time it takes to transfer snapshots, and it would account for the situation where the peer restarts.

When the timer elapses, if the peer is still there we update the index; if it's not, we delete the entry from recoveringPeers.
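
A rough sketch of that timer, again with hypothetical names and map shape:

package raftutil

import (
	"sync"
	"time"

	"github.com/hashicorp/raft"
)

// onPeerRemovedFromConfig sketches the timer idea above (names are
// illustrative, not from the PR): when a peer leaves the Raft configuration,
// start a timer; when it fires, refresh the entry if the peer has come back,
// otherwise drop it so it stops holding logs.
func onPeerRemovedFromConfig(id raft.ServerID, grace time.Duration, mu *sync.Mutex,
	recoveringPeers map[raft.ServerID]uint64, currentIndex func(raft.ServerID) (uint64, bool)) {
	time.AfterFunc(grace, func() {
		mu.Lock()
		defer mu.Unlock()
		if idx, ok := currentIndex(id); ok {
			recoveringPeers[id] = idx // peer came back: track its newer index
		} else {
			delete(recoveringPeers, id) // gone for good: stop retaining logs for it
		}
	})
}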

@banks (Member, Author) replied:

Removing from raft config is a good call.

I guess there is still the case, though, where a follower gets stuck doing something else and doesn't cleanly crash or get removed from the quorum.

After thinking about this some, I decided we can probably just be more conservative than this PR and instead (rough sketch below):

  • if installSnapshot fails, remove the peer from recovering
  • if it succeeds, set a timer to remove it in X minutes
  • don't bother waiting for TrailingLogs/2 appends: by the time installSnapshot has succeeded, the follower has not just downloaded but completely restored the snapshot, so we should attempt appendEntries pretty much immediately if we can. A buffer of a few minutes should therefore be plenty in all but the most pathological hanging cases, and those are exactly the cases we want to protect against here, I think.
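
A rough sketch of that flow, with hypothetical names (recoveryGrace stands in for the "X minutes" buffer):

package raftutil

import (
	"sync"
	"time"

	"github.com/hashicorp/raft"
)

// onInstallSnapshotDone sketches the more conservative flow described above:
// a failed installSnapshot drops the peer from recovering immediately, while a
// success starts a short grace timer instead of waiting for TrailingLogs/2
// appends. Names and map shape are illustrative, not the PR's actual code.
func onInstallSnapshotDone(id raft.ServerID, installErr error, mu *sync.Mutex,
	recoveringPeers map[raft.ServerID]uint64, recoveryGrace time.Duration) {
	if installErr != nil {
		mu.Lock()
		delete(recoveringPeers, id)
		mu.Unlock()
		return
	}
	// Snapshot installed and restored: give the follower a short window to
	// start receiving appendEntries, then stop retaining extra logs for it.
	time.AfterFunc(recoveryGrace, func() {
		mu.Lock()
		delete(recoveringPeers, id)
		mu.Unlock()
	})
}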

What do you both think?

A member replied:

I think the timer approach should work. Should we expose X as a configuration?

@freddygv (Contributor) commented Jul 22, 2019

Maybe, since any fixed X might not be enough depending on the workload.

@freddygv (Contributor) commented Jul 22, 2019

On the other hand, if someone is repeatedly running into X not being high enough, then TrailingLogs should probably be bumped up instead. In what case would someone prefer to relax X over TrailingLogs?

stale bot commented Oct 18, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or its documentation, please let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

stale bot added the waiting-reply label Oct 18, 2019
stale bot commented Nov 17, 2019

Hey there, This issue has been automatically closed because there hasn't been any activity for a while. If you are still experiencing problems, or still have questions, feel free to open a new one :+1

stale bot closed this Nov 17, 2019