Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: guard addPeer from adding peers who are closing #5151

Merged
merged 5 commits into from
Feb 24, 2023

Conversation

AlgoAxel
Copy link
Contributor

@AlgoAxel AlgoAxel commented Feb 21, 2023

On a closer read of the test failure here, it appears possible to add a peer to the wsNetwork which has already closed:

If the peer remotely closes, it will call the wsNetwork.removePeer function to remove itself. However, if this remote close happens before addPeer has added it, there is no net effect. Then, when addPeer is finally called, the peer is added without being checked that the peer is closed or not. A peer is added with a closed read/write loop, and it is not clear if/when the peer is ever discovered and cleaned up, besides connection performance monitoring.

This adds a simple guard clause from the closing channel on the peer to prevent adding peers which have already closed.

There is an opportunity for this closing channel to close after this guard check, but before addPeer does its work, so this does not eliminate this situation 100%. Since the peer and wsNetwork aren't synchronized to one another, we would have to put in some recurring checks (or save some state from removePeer) to see if any of the peers in the peers list have closed without us noticing.

V2

now this guard uses the peer's didSignalClose atomic Int32. This flag gets set prior to the removePeer ever being called, so out of order calls should not have race potential.

Testing:
unit tests, still not reproducing locally (running now to attempt it more)

@AlgoAxel AlgoAxel changed the title guard addPeer from adding peers who are closing Fix: guard addPeer from adding peers who are closing Feb 21, 2023
@AlgoAxel AlgoAxel requested review from cce and algorandskiy February 21, 2023 19:57
@codecov
Copy link

codecov bot commented Feb 21, 2023

Codecov Report

Merging #5151 (4a1e27b) into master (b84f980) will increase coverage by 0.00%.
The diff coverage is 0.00%.

@@           Coverage Diff           @@
##           master    #5151   +/-   ##
=======================================
  Coverage   53.56%   53.57%           
=======================================
  Files         433      433           
  Lines       54736    54740    +4     
=======================================
+ Hits        29321    29326    +5     
+ Misses      23136    23131    -5     
- Partials     2279     2283    +4     
Impacted Files Coverage Δ
network/wsNetwork.go 68.74% <0.00%> (-0.07%) ⬇️
ledger/tracker.go 72.91% <0.00%> (-1.67%) ⬇️
ledger/catchpointtracker.go 56.66% <0.00%> (-0.77%) ⬇️
ledger/acctonline.go 76.90% <0.00%> (-0.51%) ⬇️
ledger/acctupdates.go 69.01% <0.00%> (+0.11%) ⬆️
network/wsPeer.go 71.72% <0.00%> (+0.45%) ⬆️
catchup/service.go 71.05% <0.00%> (+0.70%) ⬆️
ledger/testing/randomAccounts.go 56.88% <0.00%> (+1.22%) ⬆️
ledger/blockqueue.go 84.94% <0.00%> (+2.68%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

network/wsNetwork.go Outdated Show resolved Hide resolved
network/wsNetwork.go Outdated Show resolved Hide resolved
Copy link
Contributor

@cce cce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the right place. Like you said it just shortens the race and doesn't solve the real problem, which is that a peer can end up getting through addPeer and adding a peer to the peers heap/list even though the other end closed the connection, the readLoop() and/or writeLoop() have both terminated, likely already called wp.Close() and set wp.didSignalClose, wp.didInnerClose, etc but the network isn't checking for this anywhere and removing them from the peers list/heap?

@AlgoAxel AlgoAxel requested a review from cce February 22, 2023 21:14
@excalq
Copy link
Contributor

excalq commented Feb 23, 2023

Telemetry gets a very high quantity of WARNING level entries having the message read err: websocket: close 1006 (abnormal closure): unexpected EOF. (About 5500/min). Would this guard against this condition, thereby reducing log noise?

algorandskiy
algorandskiy previously approved these changes Feb 24, 2023
@cce
Copy link
Contributor

cce commented Feb 24, 2023

Is this mostly a test-only concern — I think my understanding is you are narrowing a race to make the test less flaky, but if this happened there is a periodic checker in the network that will notice the peer was closed, and remove it right?

@AlgoAxel
Copy link
Contributor Author

@cce not sure how to directly "reply" to your comment,

Is this mostly a test-only concern — I think my understanding is you are narrowing a race to make the test less flaky, but if this happened there is a periodic checker in the network that will notice the peer was closed, and remove it right?

I don’t think this is just a test concern. The guard I’ve implemented should prevent this out of order add, which will prevent this from happening, but to your point about “is there a periodic checker” — it appears not, but I could use your confirmation.
If the read or write loops close, then the closing channel is closed, and the didSignalClose flag(s) are also set, closing down all the peer’s threading.
However, when wsNetwork interacts with a peer, it does so by calling things like peer.writeNonBlockMsg() . this message gets dumped onto a prio/bulk outgoing channel which the loops would be processing, but they aren’t. Errors from trying to write to the closed connection would only ever happen again if the write loop was doing its job on the messages. Instead, writeNonBlockMsg returns a bool if it was enqueued or not, but even that boolean isn’t well used in the places its collected. So the write loop being closed isn’t currently well detected. And since the read loop is no longer running, it doesn’t seem like the read loop would cause wsNetwork to notice any errors at this point either.

https://github.dev/algorand/go-algorand/blob/3b197f48b185741c9e88cfa5f3df39fd4c123c93/network/wsPeer.go#L702-L703

@cce
Copy link
Contributor

cce commented Feb 24, 2023

What about ConnPerfMon or checkPeersConnectivity?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants