Concurrent upgrades #489

Merged
merged 24 commits into master from concurrent-upgrades on Jan 4, 2021

Conversation


@dryajov dryajov commented Dec 16, 2020

This PR adds inflight connection throttling for incoming connections.

It includes:

  • An upgraded event on connections that signals when the connection has been upgraded or the upgrade has failed
    • This is needed for incoming connections because the initiator controls the upgrade flow
  • Throttling of incoming connection upgrades, to match how dial works and to prevent incoming connections from taking up all the available slots (a rough sketch of the idea follows below)
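
A rough sketch of how the two pieces fit together (illustrative only, not the PR's actual code: IncomingConn, monitorUpgrade and acceptLoop are made-up names, and an AsyncSemaphore with acquire/release, as in libp2p's utils/semaphore, is assumed):

import chronos, chronicles
import libp2p/utils/semaphore # assumed: AsyncSemaphore with acquire()/release()

type
  IncomingConn = ref object
    upgraded: Future[void] # completed once the remote-driven upgrade finishes (or fails)

proc monitorUpgrade(sem: AsyncSemaphore, conn: IncomingConn) {.async.} =
  try:
    # doubles as an implicit upgrade timeout for incoming connections
    await conn.upgraded.wait(30.seconds)
    trace "Connection upgrade succeeded"
  except CatchableError as exc:
    trace "Exception awaiting connection upgrade", msg = exc.msg
  finally:
    sem.release() # always give the slot back, even on timeout or cancellation

proc acceptLoop(sem: AsyncSemaphore) {.async.} =
  while true:
    await sem.acquire() # throttle the number of inflight incoming upgrades
    # (a real accept loop would take the raw connection from the transport here)
    let conn = IncomingConn(upgraded: newFuture[void]("incoming.upgraded"))
    asyncSpawn monitorUpgrade(sem, conn) # the remote drives the actual upgrade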

@dryajov dryajov mentioned this pull request Dec 16, 2020
@dryajov dryajov marked this pull request as draft December 16, 2020 21:16
@dryajov dryajov marked this pull request as ready for review December 17, 2020 15:49
# A nil connection means that we might have hit a
# file-handle limit (or another non-fatal error),
# so we may get one on the next try, but we should
# be careful not to end up in a tight loop that
# starves the main event loop, thus we sleep
# here before retrying.
trace "Unable to get a connection, sleeping"
await sleepAsync(100.millis) # TODO: should be configurable?
Contributor

try/catch because may be cancelled, in which case semaphore must be released - no finally in loop though

Contributor Author

Hmm, isn't canceling only possible on the entire accept?

Contributor

yeah, but you don't want to leave the semaphore hanging, ever, same as you don't want to leave other unfinished futures around

Contributor Author

Those are canceled as well tho? There is (or should be?) a teardown process where all enclosed futures get cancelled as well.

Contributor

I meant that the semaphore will not be released. It may be harmless in this particular case, but not releasing the semaphore is a ticking bomb, because the code will get copy-pasted and refactored into places where it matters. The point is that with semaphores and locks you never, ever want to leave them in an acquired state, no matter the flow. I mentioned futures because they're similar from that point of view.

Contributor Author

yeah, I guess it's a good idea to release on cancellation either way
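
Something along these lines (an illustrative sketch only; sem, acceptRaw and upgradeIncoming are assumed names rather than the PR's actual API):

proc accept(sem: AsyncSemaphore) {.async.} =
  while true:
    await sem.acquire()
    try:
      let conn = await acceptRaw() # assumed stand-in for the transport accept
      if isNil(conn):
        # non-fatal failure (e.g. file-handle limit): back off and retry
        sem.release()
        await sleepAsync(100.millis)
        continue
      asyncSpawn upgradeIncoming(sem, conn) # releases the slot once the upgrade is done
    except CancelledError as exc:
      # cancelled while still holding the slot: give it back before
      # propagating, otherwise the semaphore leaks a permit
      sem.release()
      raise exc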

## monitor connection for upgrades
##
try:
  await conn.upgraded.wait(30.seconds) # wait for connection to be upgraded
Contributor

so this implicitly becomes an upgrade timeout? like this overlaps with the general connection timeout monitor, creating two competing timeout mechanisms

Contributor Author

There is no upgrade timeout right now, so yes this is an implicit upgrade timeout.

Contributor

so it looks like a feature that should at least be shared between incoming and outgoing upgrades?

Contributor Author

@dryajov dryajov Dec 18, 2020

Well, this is actually more about who initiates the flow. In the case of a dial/connect, "we" are the initiator and we can decide when to stop it - either due to a timeout or some other reason. In other words, "we" control the flow of the upgrade. For incoming connections we're not the initiator, the remote is; we only respond to its upgrade requests, so this is a guard to prevent the remote from hijacking a connection in case it "hangs".

Contributor

fair enough, deserves a comment
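
For example, something along these lines next to the wait (the wording here is only a suggestion):

# For incoming connections the remote initiates and drives the upgrade;
# we only respond to its requests. This wait therefore doubles as an
# implicit upgrade timeout, so a hung (or malicious) initiator can't
# hold on to the slot and the connection indefinitely.
await conn.upgraded.wait(30.seconds)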

trace "Connection upgrade succeeded"
except CatchableError as exc:
trace "Exception awaiting connection upgrade", exc = exc.msg, conn
if not(isNil(conn)) and not(conn.closed):
Contributor

conn can't be nil here and doesn't need closed check really (close does that)

  await conn.upgraded.wait(30.seconds) # wait for connection to be upgraded
  trace "Connection upgrade succeeded"
except CatchableError as exc:
  if not isNil(conn): # for some reason, this can be nil
Contributor Author

@dryajov dryajov Dec 18, 2020

This nil is puzzling me and it also happens almost immediately, so quite reproducible, and yet there isn't anything that would nillify the connection explicitly and the closure scope should still be holding onto it - possible GC bug?

Contributor

eeh... right, nim has some pretty weird closure rules - wonder for example if a new instance is allocated on every iteration - @zah?

Contributor

the more reliable way to write things is to not use a closure and pass the variable in explicitly, then you don't get hit by gotchas
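
Roughly the difference (a before/after contrast sketch with assumed names; Connection stands in for whatever type the loop handles, and these procs are not the PR's code):

# closure style: conn is captured from the enclosing scope, which is
# where the puzzling nil was observed
proc monitorClosure() {.async.} =
  await conn.upgraded.wait(30.seconds)
asyncSpawn monitorClosure()

# explicit style: pass the connection in as a parameter, so the spawned
# future holds its own reference and no closure environment is involved
proc monitorExplicit(conn: Connection) {.async.} =
  await conn.upgraded.wait(30.seconds)
asyncSpawn monitorExplicit(conn)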

try:
  trace "Triggering connect events", conn
  # NOTE: make sure the upgrade event
  # happens *before* any other events
  # are triggered
Contributor

why is this important? ie anything that's waiting on this future may run either before or after the peer event trigger below

@dryajov
Contributor Author

dryajov commented Dec 19, 2020 via email

@arnetheduck
Contributor

> Because it makes it less likely for this event to interleave with a possible disconnect triggered from the event handlers.

less likely doesn't really matter though - as in, it just makes the bugs less frequent / harder to detect - the code must be written in such a way that the order doesn't matter and the comment would do well to point that out, rather than advertising a false sense of order

@dryajov
Contributor Author

dryajov commented Dec 21, 2020

> Because it makes it less likely for this event to interleave with a possible disconnect triggered from the event handlers.

> less likely doesn't really matter though - as in, it just makes the bugs less frequent / harder to detect - the code must be written in such a way that the order doesn't matter and the comment would do well to point that out, rather than advertising a false sense of order

It definitely doesn't guarantee anything, but it does reduce the likelihood of triggering error paths; other than that and being cautious, it doesn't have an effect.

try:
  trace "Triggering connect events", conn
  conn.upgraded.complete()
Contributor

doesn't this need a nil check?

Contributor Author

@dryajov dryajov Dec 22, 2020

It gets initialized in Connection, but we can add one for good measure.
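
For good measure, something like this (a defensive sketch; the finished check is an extra precaution on top of the nil check discussed here, since completing an already-completed future is an error in chronos):

# upgraded is normally initialised in Connection's init, so this is purely defensive
if not isNil(conn.upgraded) and not conn.upgraded.finished:
  conn.upgraded.complete()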

@dryajov
Contributor Author

dryajov commented Dec 23, 2020

@arnetheduck anything else you think needs to be done here or you're fine merging it now?

@@ -478,6 +510,14 @@ proc stop*(s: Switch) {.async.} =
    except CatchableError as exc:
      warn "error cleaning up transports", msg = exc.msg

  let stopped = await allFuturesThrowing(s.acceptFuts)
Contributor

so.. the hope is that none of the accept futures raises? because if any of them did raise, we'll not finish the stop logic here?

Contributor Author

@dryajov dryajov Dec 24, 2020

If it raises, it will propagate up to the caller, and the caller will have to decide what can be done about it; but we can attempt to handle it here as well, given that there probably isn't much else to do.

Contributor

The point is more that the caller typically cannot reason about what to do if stop fails: you can't describe what actions are appropriate, because the information is internal to the switch module, especially after stop has partially succeeded (some of the futures may have completed and some not). This is often the case with these kinds of exceptions that randomly bubble up from composite operations, and it's also why raising an exception in a for loop rarely makes sense: there's no way for the caller to reason about the partial success. This is something we keep coming back to, and it has been the source of numerous bugs in libp2p: aborting operations mid-way often means some sort of cleanup must happen. That is especially true of composite operations, and it is exceedingly rare for these not to require local error handling and cleanup, to the point that skipping it should be motivated explicitly.

Contributor Author

Cool

@@ -510,8 +510,12 @@ proc stop*(s: Switch) {.async.} =
    except CatchableError as exc:
      warn "error cleaning up transports", msg = exc.msg

  let stopped = await allFuturesThrowing(s.acceptFuts)
    .withTimeout(1.seconds)
  var stopped: bool
Contributor

@arnetheduck arnetheduck Dec 25, 2020

let stopped = try: 
    await all... 
  except ...: 
    trace "..."
    false

Contributor

this style makes the compiler ensure that all branches get an explicit value for stopped
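
Spelled out against the snippet above (a sketch of the suggested shape; the trace message is illustrative rather than the exact code that landed):

let stopped =
  try:
    await allFuturesThrowing(s.acceptFuts).withTimeout(1.seconds)
  except CatchableError as exc:
    trace "Exception while waiting for accept loops to stop", exc = exc.msg
    false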

@dryajov dryajov merged commit b2ea5a3 into master Jan 4, 2021
@dryajov dryajov deleted the concurrent-upgrades branch January 4, 2021 18:59