
[CATCHUP] Repro Restart Bug + Fix #3686

Merged: 8 commits merged into main on Sep 19, 2024
Conversation


@bfish713 bfish713 commented Sep 18, 2024

Closes #3681, #3682

This PR:

Does two main things:

  1. Adds a test which repros the restart issue we saw in production
  2. Fixes the issue by storing a Vote action the same for TimeoutVotes as for QuorumVotes

The scenario the test repros, and the bug we had, is as follows:

  • There is a DA committee small enough that a full quorum can still be formed while the DA committee cannot reach its own quorum. E.g. there are 10 nodes total and 4 in the DA committee: if we take 3 DA committee nodes offline, 7/10 total nodes is still a quorum, but 1/4 of the DA committee is not.
  • After some time passes some nodes are taken down
  • All nodes are then brought back online and are restarted
  • The restarted nodes start from their last_actioned_view
    • The last actioned view is the view the node last proposed in or sent a QuorumVote in (no DA actions are counted, nor were TimeoutVotes or view sync related messages); see the sketch after this list.
  • The nodes come back online and there is no view that they can sync to. The DA nodes that went down start in the last view they saw, and the remaining nodes start in seemingly random future views (it's not really random, as you'll see below).
  • Progress is stalled with all nodes timing out within view sync repeatedly.
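
For reference, a rough sketch of what "last actioned view" means here (a hypothetical helper, not the real storage API):

    // Hypothetical sketch, not the real storage API: on restart a node resumes
    // from the highest view in which it recorded an action (a proposal or a
    // quorum vote, and, after this PR, a timeout vote as well).
    fn last_actioned_view(action_views: &[u64]) -> u64 {
        action_views.iter().copied().max().unwrap_or(0)
    }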

What went wrong:

  • In the example above, where 3 DA nodes are down, here is what happened before this PR:
    • Let's say the DA nodes are taken offline right after view n, so view n succeeds
    • In view n+1 a proposal is formed with the QC for view n
    • View n+1 times out because no DAC can be formed (only 1/4 of the DA committee members are voting)
    • All nodes send a timeout vote to the leader of view n+2
    • View n+2 has a valid proposal with a TC
    • View n+2 times out because of the lack of a DAC and we enter view sync
    • We sync to the next view and start the cycle again, but when the next node is taken down a quorum can no longer be formed and the cycle is broken

So what view does each node store as the last actioned view it will restart from? There are 3 possibilities:
  1. If the node is one of the first 3 nodes taken down (or, more generally, the first f), it'll just be the last regular view it voted and proposed in. Typically this will be much lower than the rest, as progress could still be made at that point.
  2. If the node proposed in a view after the DA committee was taken down, it will save the view it proposed in.
  3. If the node was not a leader and did not propose after the DA committee went offline, it will have the last view which had a DAC (i.e. the last fully successful view) stored.

After restart, the nodes in bucket 1 will be able to join view sync for any view of the others (since they restart in a lower view). The nodes in bucket 2 will restart into a higher view than the nodes in bucket 3, and each bucket 2 node will be in a unique view (there is only one leader per view, so only one node actions per view once nobody is sending QuorumVotes). The nodes in bucket 3 will all have the same view stored. Exactly 2f+1 nodes are in either bucket 2 or bucket 3.

The issue is that if more than f nodes are in bucket 2 (they proposed, but their proposal failed due to lack of a QC), then after restart consensus will stall. Even with the help of the bucket 1 nodes, there will be fewer than 2f+1 nodes in any one view, so no view sync commit certificate can be formed. And even with f+1 nodes agreeing on a view, it is still possible the nodes in bucket 2 will not join that view sync because it is for a lower view than the one they are in. Concretely, with 10 nodes and f = 3: if 4 nodes land in bucket 2 (each in its own view) and 3 in bucket 3, then at most 3 (bucket 3) + 3 (bucket 1) = 6 < 7 = 2f+1 nodes can ever agree on a single view.

What happens if we count timeout votes as actions in this scenario? Again there are 3 possibilities, but the outcome is much different. Consider what happens when the (f+1)th node goes down: no more TCs can be formed and no view sync certs can be formed, which means there can be no more valid proposals and everyone stalls in view sync. Let's call the last view with a quorum of nodes participating e. A TC will be formed for e (if it were not, then e-1 would be the last view with a quorum and the same logic would apply; i.e. there must be some last view with a certificate, and we are calling it e). A proposal for e+1 may or may not get sent out, and nodes may or may not see it and may or may not time out.

  1. Same as before: the first f nodes taken down may be in any view equal to or earlier than the last successful view.
  2. Nodes that don't time out in view e+1 and are not the leader make up this bucket. These nodes may either not see a proposal for e+1, or they may be taken offline before timing out that view. They will have e as their last voted view.
  3. Nodes that see a proposal for e+1 and send a TimeoutVote. They will restart in e+1.

So on restart, in the worst case there are f nodes in a view <= e (bucket 1) and 2f+1 nodes in either view e or e+1. This means at least f+1 nodes must be in one of bucket 2 or bucket 3. Since the bucket 1 nodes are in a lesser or equal view, they will join the nodes in whichever bucket has the f+1 nodes when those form a precommit certificate, meaning at least 2f+1 nodes will sync to the same view. A small worked check of this counting argument is sketched below.
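
A small worked check of that counting argument, using the same toy numbers as above (10 nodes, f = 3, so a quorum is 2f+1 = 7). This is just an illustration, not code from the PR:

    // Toy check: bucket 1 holds the first f nodes taken down (views <= e); the
    // remaining 2f+1 nodes split between bucket 2 (restart in view e) and
    // bucket 3 (restart in view e+1) in some way we don't control.
    fn main() {
        let f: u32 = 3;
        let quorum = 2 * f + 1;

        // Try every possible split of the 2f+1 "late" nodes across buckets 2 and 3.
        for bucket2 in 0..=quorum {
            let bucket3 = quorum - bucket2;
            // By pigeonhole, one of the two buckets holds at least f+1 nodes.
            let largest = bucket2.max(bucket3);
            assert!(largest >= f + 1);
            // The f bucket-1 nodes are in an earlier (or equal) view, so they can
            // join view sync for whichever view the largest bucket is in, giving
            // at least 2f+1 nodes that sync to the same view.
            assert!(largest + f >= quorum);
        }
    }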

In other words by counting timeout votes as an action we make this exactly the same as the case for quorum votes.

The test I added fails without the change to store the timeout view, and the logs look very much like the real logs we saw in Datadog.

What are the changes:

  • Added the test
  • Changed the spinning task. This was another interesting change. It involved a few parts:
    • I added a way to delay restarting for a number of views. This works by pushing a new action for a future view to restart the node and storing all the relevant info to do so. We can't just use the up action because we want to reuse the storage and external channel of the node when it comes back up. We need the external channel because other test tasks need to read from it to verify the decide events etc., and it's not trivial to add new streams to e.g. the safety task.
    • I had to make a change to not reuse the internal channels of the node. This was a very tricky issue: when we kill the first instance of the node, its timeout tasks remain running, and because we used the same internal channel the new instance of the node would actually get a timeout event from the old instance and jump ahead to whatever view that node was on last! This made the test pass when it shouldn't (see the sketch after this list).
  • Removed the unused file view_change.rs; it looks like it was an abandoned refactor.
  • Fixed the code to store an Action when we send a TimeoutVote
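
As a rough illustration of the internal-channel issue described in the spinning-task changes above (hypothetical types and field names, not the actual spinning-task code; the channel calls use the async-broadcast crate):

    // Sketch only: when a node is restarted, its internal event channel is
    // rebuilt; otherwise timeout tasks left over from the killed instance can
    // still publish into it and drag the new instance ahead to the old view.
    // The external channel and storage are deliberately reused so the other
    // test tasks (e.g. the safety task) keep seeing this node's decide events.
    use async_broadcast::{broadcast, Receiver, Sender};

    #[derive(Clone)]
    struct Event; // stand-in for the real event type

    #[derive(Clone)]
    struct Storage; // stand-in for the node's persistent storage

    struct NodeContext {
        internal: (Sender<Event>, Receiver<Event>),
        external: (Sender<Event>, Receiver<Event>),
        storage: Storage,
    }

    fn restart(old: &NodeContext) -> NodeContext {
        NodeContext {
            // Fresh internal channel: stale tasks from the old instance still
            // hold the old sender, so nothing they emit reaches the new one.
            internal: broadcast(1024),
            // Reused external channel and storage, same as the original node.
            external: (old.external.0.clone(), old.external.1.clone()),
            storage: old.storage.clone(),
        }
    }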

This PR does not:

Key places to review:

  • Make sure the test I added simulates the scenario I am describing
  • Make sure the change to network.rs does what I'm saying
  • Run the new integration test

@bfish713 bfish713 marked this pull request as ready for review September 19, 2024 02:58
@bfish713 bfish713 assigned jparr721 and unassigned shenkeyao Sep 19, 2024

@rob-maron rob-maron (Collaborator) left a comment

Just one question


Comment on lines +491 to +494:

    HotShotEvent::TimeoutVoteSend(vote) => {
        *maybe_action = Some(HotShotAction::Vote);
        Some((
            vote.signing_key(),

@rob-maron rob-maron Sep 19, 2024

I checked over the update_action logic and this looks good 👍

@@ -217,7 +228,7 @@ where
             self.last_decided_leaf.clone(),
             TestInstanceState::new(self.async_delay_config.clone()),
             None,
-            view_number,
+            read_storage.last_actioned_view().await,

Collaborator:

Is this equivalent to what we do in the sequencer? Do we use the last actioned view or something else?


Collaborator Author (@bfish713):

This should be equivalent; the sequencer only has access to storage and has a very similar function to get the view number.

@bfish713 bfish713 merged commit 4016292 into main Sep 19, 2024
36 checks passed
@bfish713 bfish713 deleted the bf/restart-repro branch September 19, 2024 18:32
Successfully merging this pull request may close these issues:

[CATCHUP] - Fix Issue where Some nodes Start in Higher view