Description
Problem
If a node for some reason does not receive status change events from its peers, consensus can get stuck and no new blocks will be created.
For example, say we have 5 validator nodes building block 100. 4 nodes (a quorum) send commit messages to each other, but only 3 of them receive enough commit messages to insert the block into their state. Because of network issues, the other two, even though they sent their commit messages, did not receive enough of them to insert the block through consensus, so they have to rely on the `syncer` to insert the new block. But because the network was in a degraded state in which some messages were lost, the `syncer` on the two problematic nodes also did not receive status change events from its connected peers, so those nodes missed the block through the `syncer` as well.
Solution
The `syncer` has a field called `block timeout` (basically `block time * 3`), used to stop syncing from a peer that does not respond in time. This PR uses the same field to check whether we have gone that long without receiving anything from any peer. If so, the `syncer` manually pings the best peer instead of waiting for its status change. All peers are kept in a peer map that holds the latest block of each peer, so the algorithm always chooses a peer that is responsive and has the highest block number.
Changes include
Checklist
Testing