Fix test timeout related to min connected peers #1443
Merged
The test timeouts while waiting for cluster start-up are caused by nodes not reaching the minimum number of connected peers because of a dialing problem. Before starting the polybft consensus protocol, a node checks that at least 2 peers are connected before it proceeds with initialization. Dialing a peer can fail for several reasons, e.g.
"unable to create new identity client connection, protocols not supported: [/id/0.1]"
"failed to negotiate security protocol: context deadline exceeded"
If this happens, a redial mechanism should handle the failure so that the peers are eventually connected. However, repeated test runs during the analysis of this problem raise the suspicion that there is a bug in the process of discovering peers and adding them to the dial queue.
In the logs from one such case (3 nodes in total), a node connected to 1 peer but could not connect to the other. That other peer was never dialed again because the dial queue was empty, even though peer discovery should have added it to the queue. It seems the peer was present in the routing table but disconnected. In regular cases, a failed dial is followed by logs for removing the peer info ("Attempted removing missing peer"), which probably allows the peer to be added to the dial queue again. In this problematic scenario these logs are missing.
The problem needs a more detailed analysis (if possible with higher reproducibility), and there is already a task for implementing parallel peer dialing, which should improve peer connection time. This problematic scenario will therefore be described in that task, with the logs attached.
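To make the expected redial behaviour concrete, here is a minimal sketch, not the project's actual implementation. All names (`dialQueue`, `dialAll`, `errDialFailed`, `maxAttempts`) are hypothetical and chosen for illustration only. It assumes a simple FIFO queue in which a peer whose dial fails is put back on the queue until an attempt limit is reached.

```go
package main

import (
	"errors"
	"fmt"
)

// errDialFailed is a hypothetical stand-in for dial errors such as
// "failed to negotiate security protocol: context deadline exceeded".
var errDialFailed = errors.New("dial failed")

// dialQueue is a hypothetical FIFO of peer IDs waiting to be dialed.
type dialQueue struct {
	pending []string
	queued  map[string]bool
}

func newDialQueue() *dialQueue {
	return &dialQueue{queued: make(map[string]bool)}
}

// add enqueues a peer unless it is already waiting in the queue.
func (q *dialQueue) add(id string) {
	if q.queued[id] {
		return
	}
	q.queued[id] = true
	q.pending = append(q.pending, id)
}

// pop returns the next peer to dial, or false if the queue is empty.
func (q *dialQueue) pop() (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	id := q.pending[0]
	q.pending = q.pending[1:]
	delete(q.queued, id)
	return id, true
}

// dialAll drains the queue. A peer whose dial fails is re-queued until it
// has been tried maxAttempts times. It returns the set of connected peers.
func dialAll(q *dialQueue, dial func(id string) error, maxAttempts int) map[string]bool {
	connected := make(map[string]bool)
	attempts := make(map[string]int)
	for {
		id, ok := q.pop()
		if !ok {
			return connected
		}
		attempts[id]++
		if err := dial(id); err != nil {
			if attempts[id] < maxAttempts {
				q.add(id) // redial later instead of dropping the peer
			}
			continue
		}
		connected[id] = true
	}
}

func main() {
	q := newDialQueue()
	q.add("peerA")
	q.add("peerB")

	// peerB fails on its first dial and succeeds on the retry.
	failedOnce := false
	dial := func(id string) error {
		if id == "peerB" && !failedOnce {
			failedOnce = true
			return errDialFailed
		}
		return nil
	}

	connected := dialAll(q, dial, 3)
	fmt.Println(len(connected)) // prints 2
}
```

In this sketch a failed peer always returns to the queue. The scenario described above corresponds to the case where the failed peer is never re-added, so the queue stays empty and the node remains one connection short.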
Through this PR the following is done:
We should probably review the minimum number of peers required before starting the consensus protocol (minSyncPeers = 2 in polybft.go) and possibly compare it with the minimum number of peer connections used for keepAliveMinimumPeerConnections (MinimumPeerConnections = 1 in server.go).
Changes include
Checklist
Testing