Fix: guard addPeer from adding peers who are closing #5151
Codecov Report
@@ Coverage Diff @@
## master #5151 +/- ##
=======================================
Coverage 53.56% 53.57%
=======================================
Files 433 433
Lines 54736 54740 +4
=======================================
+ Hits 29321 29326 +5
+ Misses 23136 23131 -5
- Partials 2279 2283 +4
I'm not sure this is the right place. As you said, it only shortens the race and doesn't solve the real problem: a peer can get through addPeer and be added to the peers heap/list even though the other end has closed the connection. By that point readLoop() and/or writeLoop() have terminated and have likely already called wp.Close() and set wp.didSignalClose, wp.didInnerClose, etc., but the network isn't checking for this anywhere and removing such peers from the peers list/heap?
Telemetry gets a very high quantity of WARNING level entries having the message |
Is this mostly a test-only concern? My understanding is that you are narrowing a race to make the test less flaky, but if this did happen, there is a periodic checker in the network that will notice the peer was closed and remove it, right?
@cce not sure how to directly "reply" to your comment,
I don’t think this is just a test concern. The guard I’ve implemented should prevent this out-of-order add. As for your question about a periodic checker: it appears there isn't one, but I could use your confirmation.
What about ConnPerfMon or checkPeersConnectivity?
On a closer read of the test failure here, it appears possible to add a peer to the wsNetwork that has already closed:
If the peer closes remotely, it calls the `wsNetwork.removePeer` function to remove itself. However, if this remote close happens before `addPeer` has added the peer, the removal has no net effect. Then, when `addPeer` is finally called, the peer is added without any check of whether it has already closed. The result is a peer in the list whose read/write loops have already exited, and it is not clear if or when that peer is ever discovered and cleaned up, other than by connection performance monitoring.
This adds a simple guard clause on the peer's closing channel to prevent adding peers which have already closed.
The closing channel can still close after this guard check but before addPeer does its work, so this does not eliminate the situation entirely. Since the peer and wsNetwork aren't synchronized with one another, we would have to add some recurring checks (or save some state from `removePeer`) to see whether any of the peers in the peers list have closed without us noticing.
V2
The guard now uses the peer's `didSignalClose` atomic Int32. This flag is set before `removePeer` is ever called, so out-of-order calls should not have race potential.
Testing:
Unit tests; still not reproducing locally (running more now to attempt it).
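For readers following the thread, here is a minimal sketch of the kind of guard described above. It is not the actual diff: `sketchPeer`, `sketchNetwork`, `signalClose`, and `peerCount` are hypothetical names invented for illustration, and only the `didSignalClose` flag and the `addPeer`/`removePeer` roles come from the description. The sketch assumes the flag is checked under the same mutex that `removePeer` takes, which may or may not match the real code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sketchPeer is a hypothetical stand-in for the real peer type.
type sketchPeer struct {
	name string
	// didSignalClose is set to 1 once the peer has started closing.
	didSignalClose int32
}

// signalClose marks the peer as closing. It returns true only for the
// first caller, so close handling runs at most once.
func (p *sketchPeer) signalClose() bool {
	return atomic.CompareAndSwapInt32(&p.didSignalClose, 0, 1)
}

// sketchNetwork is a hypothetical stand-in for the network's peer list.
type sketchNetwork struct {
	mu    sync.Mutex
	peers []*sketchPeer
}

// addPeer refuses peers that have already signalled close and reports
// whether the peer was added.
func (n *sketchNetwork) addPeer(p *sketchPeer) bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	if atomic.LoadInt32(&p.didSignalClose) == 1 {
		return false // peer closed before we got here; do not add it
	}
	n.peers = append(n.peers, p)
	return true
}

// removePeer deletes p from the list; it is a no-op if p was never added.
func (n *sketchNetwork) removePeer(p *sketchPeer) {
	n.mu.Lock()
	defer n.mu.Unlock()
	for i, q := range n.peers {
		if q == p {
			n.peers = append(n.peers[:i], n.peers[i+1:]...)
			return
		}
	}
}

// peerCount returns the number of peers currently in the list.
func (n *sketchNetwork) peerCount() int {
	n.mu.Lock()
	defer n.mu.Unlock()
	return len(n.peers)
}

func main() {
	wn := &sketchNetwork{}

	// Out-of-order case: the peer closes and removes itself before addPeer runs.
	closed := &sketchPeer{name: "closed"}
	closed.signalClose()
	wn.removePeer(closed)           // no net effect: not in the list yet
	fmt.Println(wn.addPeer(closed)) // false

	live := &sketchPeer{name: "live"}
	fmt.Println(wn.addPeer(live)) // true
	fmt.Println(wn.peerCount())   // 1
}
```

In this sketch the flag is set before `removePeer` is called and `addPeer` checks it while holding the lock, so either `addPeer` sees the flag and refuses the peer, or the later `removePeer` finds the peer in the list and removes it. Whether the real `addPeer` has the same property depends on where its check sits relative to the network's own locking.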