-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concurrent upgrades #489
Concurrent upgrades #489
Conversation
# A nil connection means that we might have hit a | ||
# file-handle limit (or another non-fatal error), | ||
# we can get one on the next try, but we should | ||
# be careful to not end up in a thigh loop that | ||
# will starve the main event loop, thus we sleep | ||
# here before retrying. | ||
trace "Unable to get a connection, sleeping" | ||
await sleepAsync(100.millis) # TODO: should be configurable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try/catch because may be cancelled, in which case semaphore must be released - no finally in loop though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, isn't canceling only possible on the entire accept
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, but you don't want to leave the semaphore hanging, ever, same as you don't want to leave other unfinished futures around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are canceled as well tho? There is (should be?) teardown process where all enclosed futures get cancelled as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that the semaphore will not be released - it may be harmless in this particular case but it's a ticking bomb not to release the semaphore because the code will get copy-pasted and refactored to places where it matters - the point is that with semaphores and locks, you never ever want to leave them in an acquired state no matter the flow - I mentioned the futures because they're similar from that point of view
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I guess it's a good idea to release on cancellation either way
libp2p/switch.nim
Outdated
## monitor connection for upgrades | ||
## | ||
try: | ||
await conn.upgraded.wait(30.seconds) # wait for connection to by upgraded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this implicitly becomes an upgrade timeout? like this overlaps with the general connection timeout monitor, creating two competing timeout mechanisms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no upgrade timeout right now, so yes this is an implicit upgrade timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it looks like a feature that should at least be shared between incoming and outgoing upgrades?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this is actually more about who initiates the flow. In the case of a dial/connect
, "we" are the initiator and we can make decisions when to stop it - either due to a timeout or some other reason. In other words "we" control the flow of the upgrade. For incoming connections, we're not the initiator, the remote is, we only respond to it's upgrade requests, so this is a guard to prevent it from hijacking a connection in case it "hangs".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough, deserves a comment
libp2p/switch.nim
Outdated
trace "Connection upgrade succeeded" | ||
except CatchableError as exc: | ||
trace "Exception awaiting connection upgrade", exc = exc.msg, conn | ||
if not(isNil(conn)) and not(conn.closed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conn
can't be nil here and doesn't need closed check really (close does that)
8e856c0
to
9e38247
Compare
libp2p/switch.nim
Outdated
await conn.upgraded.wait(30.seconds) # wait for connection to be upgraded | ||
trace "Connection upgrade succeeded" | ||
except CatchableError as exc: | ||
if not isNil(conn): # for some reason, this can be nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This nil
is puzzling me and it also happens almost immediately, so quite reproducible, and yet there isn't anything that would nillify
the connection explicitly and the closure scope should still be holding onto it - possible GC bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eeh... right, nim has some pretty weird closure rules - wonder for example if a new instance is allocated on every iteration - @zah?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the more reliable way to write things is to not use a closure and pass the variable in explicitly, then you don't get hit by gotchas
libp2p/connmanager.nim
Outdated
try: | ||
trace "Triggering connect events", conn | ||
# NOTE: make sure the upgrade event | ||
# happens *before* any other events | ||
# are triggered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this important? ie anything that's waiting on this future may run either before or after the peer event trigger below
Because it makes it less likely for this event to interleave with a
possible disconnect triggered from the event handlers.
…On Sat, Dec 19, 2020, 08:47 Jacek Sieka ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In libp2p/connmanager.nim
<#489 (comment)>:
> try:
trace "Triggering connect events", conn
+ # NOTE: make sure the upgrade event
+ # happens *before* any other events
+ # are triggered
why is this important? ie anything that's waiting on this future may run
either before or after the peer event trigger below
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#489 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAILFRPZ7PU325HK3R2R4WTSVS4HFANCNFSM4U6SEREA>
.
|
less likely doesn't really matter though - as in, it just makes the bugs less frequent / harder to detect - the code must be written in such a way that the order doesn't matter and the comment would do well to point that out, rather than advertising a false sense of order |
It definitely doesn't guarantee anything but it reduces the likelihood of triggering error paths, but other than that and being cautious, it doesn't have an effect. |
try: | ||
trace "Triggering connect events", conn | ||
conn.upgraded.complete() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this need a nil check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets initialized in Connection
, but we can add one for good measure.
747f832
to
0949bb0
Compare
@arnetheduck anything else you think needs to be done here or you're fine merging it now? |
libp2p/switch.nim
Outdated
@@ -478,6 +510,14 @@ proc stop*(s: Switch) {.async.} = | |||
except CatchableError as exc: | |||
warn "error cleaning up transports", msg = exc.msg | |||
|
|||
let stopped = await allFuturesThrowing(s.acceptFuts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so.. the hope is that none of the accept futures raises? because if any of them did raise, we'll not finish the stop logic here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it raises it will propagate up to the caller in which case the caller will have to decide what can be done about it, but we can attempt to handle it here as well given that there probably isn't much else to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the point is more that the caller cannot reason about what to do if stop
fails, typically - you can't describe what actions are appropriate because the information is internal to the switch module - specially after stop partially has succeeded (some of the futures may have completed and some not) - this is often the case with these kinds of exceptions that randomly bubble up from composite operations - this is also why raising an exception in a for loop rarely makes sense: there's no way for the caller to reason about the partial success - this is something we keep coming back to, and that has been the source of numerous bugs in libp2p: aborting operations mid-way often means some sort of cleanup must happen - this is specially true of composite operations and it is exceedingly rare for these not to require local error handling and cleanup - to the point that it should be motivated explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool
libp2p/switch.nim
Outdated
@@ -510,8 +510,12 @@ proc stop*(s: Switch) {.async.} = | |||
except CatchableError as exc: | |||
warn "error cleaning up transports", msg = exc.msg | |||
|
|||
let stopped = await allFuturesThrowing(s.acceptFuts) | |||
.withTimeout(1.seconds) | |||
var stopped: bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let stopped = try:
await all...
except ...:
trace "..."
false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this style makes the compiler ensure that all branches get an explicit value for stopped
This PR adds inflight connection throttling for incoming connections.
It includes:
upgraded
event to connections to signal when it has been upgraded/failed