
[CATCHUP] Repro Restart Bug + Fix #3686

Merged: 8 commits merged into main on Sep 19, 2024
Conversation


@bfish713 bfish713 commented Sep 18, 2024

Closes #3681, #3682

This PR:

Does two main things:

  1. Adds a test which repros the restart issue we saw in production
  2. Fixes the issue by storing a Vote action the same for TimeoutVotes as for QuorumVotes

The scenario the test repros, and the bug we had, is as follows:

  • There is a DA committee small enough that a full quorum can still be formed while the DA committee cannot reach its own quorum. E.g. there are 10 nodes total and 4 in the DA committee: if we take 3 DA committee nodes offline, 7/10 total nodes is still a quorum, but 1/4 of the DA committee is not.
  • After some time passes some nodes are taken down
  • All nodes are then brought back online and are restarted
  • The restarted nodes start from their last_actioned_view
    • The last actioned view is the view the node last proposed in or sent a QuorumVote in (no DA actions are counted, nor were TimeoutVotes or view sync related messages); see the sketch after this list.
  • The nodes come back online and there is no view that they can sync to. The DA nodes that went down start in the last view they saw, and the remaining nodes start in seemingly random future views (it's not really random, as you'll see below).
  • Progress is stalled with all nodes timing out within view sync repeatedly.
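
For reference, a rough sketch of what "last actioned view" means here (a hypothetical helper, not the real storage API):

    // Hypothetical sketch, not the real storage API: on restart a node resumes
    // from the highest view in which it recorded an action (a proposal or a
    // quorum vote, and, after this PR, a timeout vote as well).
    fn last_actioned_view(action_views: &[u64]) -> u64 {
        action_views.iter().copied().max().unwrap_or(0)
    }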

What went wrong:

  • In the example above, where 3 DA nodes are down, here is what happened before this PR:
    • Let's say the DA nodes are taken offline right after view n, so view n succeeds
    • In view n+1 a proposal is formed with the QC for view n
    • View n+1 times out because no DAC can be formed (only 1/4 of the DA committee members are voting)
    • All nodes send a timeout vote to the leader of view n+2
    • View n+2 has a valid proposal with a TC
    • View n+2 times out because of the lack of a DAC and we enter view sync
    • We sync to the next view and start the cycle again, but when the next node is taken down a quorum can no longer be formed and the cycle is broken

So what view does each node store as the last actioned view it will restart from? There are 3 possibilities:
  1. If the node is one of the first 3 nodes taken down (or, more generally, the first f), it'll just be the last regular view it voted and proposed in. Typically this will be much lower than the rest, as progress could still be made at that point.
  2. If the node proposed in a view after the DA committee was taken down, it will save the view it proposed in.
  3. If the node was not a leader and did not propose after the DA committee went offline, it will have the last view which had a DAC (i.e. the last fully successful view) stored.

After restart, the nodes in bucket 1 will be able to join view sync for any view of the others (since they restart in a lower view). The nodes in bucket 2 will restart into a higher view than the nodes in bucket 3, and each bucket 2 node will be in a unique view (there is only one leader per view, so only one node actions per view once nobody is sending QuorumVotes). The nodes in bucket 3 will all have the same view stored. Exactly 2f+1 nodes are in either bucket 2 or bucket 3.

The issue is that if more than f nodes are in bucket 2 (they proposed, but their proposal failed due to lack of a QC), then after restart consensus will stall. Even with the help of the bucket 1 nodes, there will be fewer than 2f+1 nodes in any one view, so no view sync commit certificate can be formed. And even with f+1 nodes agreeing on a view, it is still possible the nodes in bucket 2 will not join that view sync because it is for a lower view than the one they are in. Concretely, with 10 nodes and f = 3: if 4 nodes land in bucket 2 (each in its own view) and 3 in bucket 3, then at most 3 (bucket 3) + 3 (bucket 1) = 6 < 7 = 2f+1 nodes can ever agree on a single view.

What happens if we count timeout votes as actions in this scenario? Again there are 3 possibilities, but the outcome is much different. Consider what happens when the (f+1)th node goes down: no more TCs can be formed and no view sync certs can be formed, which means there can be no more valid proposals and everyone stalls in view sync. Let's call the last view with a quorum of nodes participating e. A TC will be formed for e (if it were not, then e-1 would be the last view with a quorum and the same logic would apply; i.e. there must be some last view with a certificate, and we are calling it e). A proposal for e+1 may or may not get sent out, and nodes may or may not see it and may or may not time out.

  1. Same as before: the first f nodes taken down may be in any view equal to or earlier than the last successful view.
  2. Nodes that don't time out in view e+1 and are not the leader make up this bucket. These nodes may either not see a proposal for e+1, or they may be taken offline before timing out that view. They will have e as their last voted view.
  3. Nodes that see a proposal for e+1 and send a TimeoutVote. They will restart in e+1.

So on restart, in the worst case there are f nodes in a view <= e (bucket 1) and 2f+1 nodes in either view e or e+1. This means at least f+1 nodes must be in one of bucket 2 or bucket 3. Since the bucket 1 nodes are in a lesser or equal view, they will join the nodes in whichever bucket has the f+1 nodes when those form a precommit certificate, meaning at least 2f+1 nodes will sync to the same view. A small worked check of this counting argument is sketched below.
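
A small worked check of that counting argument, using the same toy numbers as above (10 nodes, f = 3, so a quorum is 2f+1 = 7). This is just an illustration, not code from the PR:

    // Toy check: bucket 1 holds the first f nodes taken down (views <= e); the
    // remaining 2f+1 nodes split between bucket 2 (restart in view e) and
    // bucket 3 (restart in view e+1) in some way we don't control.
    fn main() {
        let f: u32 = 3;
        let quorum = 2 * f + 1;

        // Try every possible split of the 2f+1 "late" nodes across buckets 2 and 3.
        for bucket2 in 0..=quorum {
            let bucket3 = quorum - bucket2;
            // By pigeonhole, one of the two buckets holds at least f+1 nodes.
            let largest = bucket2.max(bucket3);
            assert!(largest >= f + 1);
            // The f bucket-1 nodes are in an earlier (or equal) view, so they can
            // join view sync for whichever view the largest bucket is in, giving
            // at least 2f+1 nodes that sync to the same view.
            assert!(largest + f >= quorum);
        }
    }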

In other words by counting timeout votes as an action we make this exactly the same as the case for quorum votes.

The test I added fails without the change to store the timeout view, and the logs look very much like the real logs we saw in Datadog.

What are the changes:

  • Added the test
  • Changed the spinning task. This was another interesting change. It involved a few parts:
    • I added a way to delay restarting for a number of views. This works by pushing a new action for a future view to restart the node and storing all the relevant info to do so. We can't just use the up action because we want to reuse the storage and external channel of the node when it comes back up. We need the external channel because other test tasks need to read from it to verify the decide events etc., and it's not trivial to add new streams to e.g. the safety task.
    • I had to make a change to not reuse the internal channels of the node. This was a very tricky issue: when we kill the first instance of the node, its timeout tasks remain running, and because we used the same internal channel the new instance of the node would actually get a timeout event from the old instance and jump ahead to whatever view that node was on last! This made the test pass when it shouldn't (see the sketch after this list).
  • Removed the unused file view_change.rs; it looks like it was an abandoned refactor.
  • Fixed the code to store an Action when we send a TimeoutVote
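
As a rough illustration of the internal-channel issue described in the spinning-task changes above (hypothetical types and field names, not the actual spinning-task code; the channel calls use the async-broadcast crate):

    // Sketch only: when a node is restarted, its internal event channel is
    // rebuilt; otherwise timeout tasks left over from the killed instance can
    // still publish into it and drag the new instance ahead to the old view.
    // The external channel and storage are deliberately reused so the other
    // test tasks (e.g. the safety task) keep seeing this node's decide events.
    use async_broadcast::{broadcast, Receiver, Sender};

    #[derive(Clone)]
    struct Event; // stand-in for the real event type

    #[derive(Clone)]
    struct Storage; // stand-in for the node's persistent storage

    struct NodeContext {
        internal: (Sender<Event>, Receiver<Event>),
        external: (Sender<Event>, Receiver<Event>),
        storage: Storage,
    }

    fn restart(old: &NodeContext) -> NodeContext {
        NodeContext {
            // Fresh internal channel: stale tasks from the old instance still
            // hold the old sender, so nothing they emit reaches the new one.
            internal: broadcast(1024),
            // Reused external channel and storage, same as the original node.
            external: (old.external.0.clone(), old.external.1.clone()),
            storage: old.storage.clone(),
        }
    }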

This PR does not:

Key places to review:

  • Make sure the test I added simulates the scenario I am describing
  • Make sure the change to network.rs does what I'm saying
  • Run the new integration test

@bfish713 bfish713 marked this pull request as ready for review September 19, 2024 02:58
@bfish713 bfish713 assigned jparr721 and unassigned shenkeyao Sep 19, 2024

@rob-maron rob-maron (Collaborator) left a comment

Just one question


Comment on lines +491 to +494:

    HotShotEvent::TimeoutVoteSend(vote) => {
        *maybe_action = Some(HotShotAction::Vote);
        Some((
            vote.signing_key(),

@rob-maron rob-maron Sep 19, 2024

I checked over the update_action logic and this looks good 👍

@@ -217,7 +228,7 @@ where
             self.last_decided_leaf.clone(),
             TestInstanceState::new(self.async_delay_config.clone()),
             None,
-            view_number,
+            read_storage.last_actioned_view().await,

Collaborator:

Is this equivalent to what we do in the sequencer? Do we use the last actioned view or something else?


Collaborator Author (@bfish713):

This should be equivalent; the sequencer only has access to storage and has a very similar function to get the view number.

@bfish713 bfish713 merged commit 4016292 into main Sep 19, 2024
36 checks passed
@bfish713 bfish713 deleted the bf/restart-repro branch September 19, 2024 18:32
Successfully merging this pull request may close these issues:

[CATCHUP] - Fix Issue where Some nodes Start in Higher view