Concurrent upgrades #489

Merged
merged 24 commits into master from concurrent-upgrades on Jan 4, 2021

Conversation


@dryajov dryajov commented Dec 16, 2020

This PR adds inflight connection throttling for incoming connections.

It includes:

  • An upgraded event on connections that signals when the connection has been upgraded or the upgrade has failed
    • This is needed for incoming connections because the initiator controls the upgrade flow
  • Throttling of incoming connection upgrades, to match how dial works and to prevent incoming connections from taking up all the available slots (a rough sketch of the idea follows below)
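
A rough sketch of how the two pieces fit together (illustrative only, not the PR's actual code: IncomingConn, monitorUpgrade and acceptLoop are made-up names, and an AsyncSemaphore with acquire/release, as in libp2p's utils/semaphore, is assumed):

import chronos, chronicles
import libp2p/utils/semaphore # assumed: AsyncSemaphore with acquire()/release()

type
  IncomingConn = ref object
    upgraded: Future[void] # completed once the remote-driven upgrade finishes (or fails)

proc monitorUpgrade(sem: AsyncSemaphore, conn: IncomingConn) {.async.} =
  try:
    # doubles as an implicit upgrade timeout for incoming connections
    await conn.upgraded.wait(30.seconds)
    trace "Connection upgrade succeeded"
  except CatchableError as exc:
    trace "Exception awaiting connection upgrade", msg = exc.msg
  finally:
    sem.release() # always give the slot back, even on timeout or cancellation

proc acceptLoop(sem: AsyncSemaphore) {.async.} =
  while true:
    await sem.acquire() # throttle the number of inflight incoming upgrades
    # (a real accept loop would take the raw connection from the transport here)
    let conn = IncomingConn(upgraded: newFuture[void]("incoming.upgraded"))
    asyncSpawn monitorUpgrade(sem, conn) # the remote drives the actual upgrade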

@dryajov dryajov mentioned this pull request Dec 16, 2020
@dryajov dryajov marked this pull request as draft December 16, 2020 21:16
@dryajov dryajov marked this pull request as ready for review December 17, 2020 15:49
# A nil connection means that we might have hit a
# file-handle limit (or another non-fatal error),
# so we may get one on the next try, but we should
# be careful not to end up in a tight loop that
# starves the main event loop, thus we sleep
# here before retrying.
trace "Unable to get a connection, sleeping"
await sleepAsync(100.millis) # TODO: should be configurable?
Contributor

try/catch because may be cancelled, in which case semaphore must be released - no finally in loop though

Contributor Author

Hmm, isn't canceling only possible on the entire accept?

Contributor

yeah, but you don't want to leave the semaphore hanging, ever, same as you don't want to leave other unfinished futures around

Contributor Author

Those are canceled as well tho? There is (or should be?) a teardown process where all enclosed futures get cancelled as well.

Contributor

I meant that the semaphore will not be released. It may be harmless in this particular case, but not releasing the semaphore is a ticking bomb, because the code will get copy-pasted and refactored into places where it matters. The point is that with semaphores and locks you never, ever want to leave them in an acquired state, no matter the flow. I mentioned futures because they're similar from that point of view.

Contributor Author

yeah, I guess it's a good idea to release on cancellation either way
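
Something along these lines (an illustrative sketch only; sem, acceptRaw and upgradeIncoming are assumed names rather than the PR's actual API):

proc accept(sem: AsyncSemaphore) {.async.} =
  while true:
    await sem.acquire()
    try:
      let conn = await acceptRaw() # assumed stand-in for the transport accept
      if isNil(conn):
        # non-fatal failure (e.g. file-handle limit): back off and retry
        sem.release()
        await sleepAsync(100.millis)
        continue
      asyncSpawn upgradeIncoming(sem, conn) # releases the slot once the upgrade is done
    except CancelledError as exc:
      # cancelled while still holding the slot: give it back before
      # propagating, otherwise the semaphore leaks a permit
      sem.release()
      raise exc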

## monitor connection for upgrades
##
try:
  await conn.upgraded.wait(30.seconds) # wait for connection to be upgraded
Contributor

so this implicitly becomes an upgrade timeout? like this overlaps with the general connection timeout monitor, creating two competing timeout mechanisms

Contributor Author

There is no upgrade timeout right now, so yes this is an implicit upgrade timeout.

Contributor

so it looks like a feature that should at least be shared between incoming and outgoing upgrades?

Contributor Author

@dryajov dryajov Dec 18, 2020

Well, this is actually more about who initiates the flow. In the case of a dial/connect, "we" are the initiator and we can decide when to stop it - either due to a timeout or some other reason. In other words, "we" control the flow of the upgrade. For incoming connections we're not the initiator, the remote is; we only respond to its upgrade requests, so this is a guard to prevent the remote from hijacking a connection in case it "hangs".

Contributor

fair enough, deserves a comment
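
For example, something along these lines next to the wait (the wording here is only a suggestion):

# For incoming connections the remote initiates and drives the upgrade;
# we only respond to its requests. This wait therefore doubles as an
# implicit upgrade timeout, so a hung (or malicious) initiator can't
# hold on to the slot and the connection indefinitely.
await conn.upgraded.wait(30.seconds)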

trace "Connection upgrade succeeded"
except CatchableError as exc:
trace "Exception awaiting connection upgrade", exc = exc.msg, conn
if not(isNil(conn)) and not(conn.closed):
Contributor

conn can't be nil here and doesn't need closed check really (close does that)

  await conn.upgraded.wait(30.seconds) # wait for connection to be upgraded
  trace "Connection upgrade succeeded"
except CatchableError as exc:
  if not isNil(conn): # for some reason, this can be nil
Contributor Author

@dryajov dryajov Dec 18, 2020

This nil is puzzling me and it also happens almost immediately, so quite reproducible, and yet there isn't anything that would nillify the connection explicitly and the closure scope should still be holding onto it - possible GC bug?

Contributor

eeh... right, nim has some pretty weird closure rules - wonder for example if a new instance is allocated on every iteration - @zah?

Contributor

the more reliable way to write things is to not use a closure and pass the variable in explicitly, then you don't get hit by gotchas
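
Roughly the difference (a before/after contrast sketch with assumed names; Connection stands in for whatever type the loop handles, and these procs are not the PR's code):

# closure style: conn is captured from the enclosing scope, which is
# where the puzzling nil was observed
proc monitorClosure() {.async.} =
  await conn.upgraded.wait(30.seconds)
asyncSpawn monitorClosure()

# explicit style: pass the connection in as a parameter, so the spawned
# future holds its own reference and no closure environment is involved
proc monitorExplicit(conn: Connection) {.async.} =
  await conn.upgraded.wait(30.seconds)
asyncSpawn monitorExplicit(conn)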

try:
  trace "Triggering connect events", conn
  # NOTE: make sure the upgrade event
  # happens *before* any other events
  # are triggered
Contributor

why is this important? ie anything that's waiting on this future may run either before or after the peer event trigger below

@dryajov
Contributor Author

dryajov commented Dec 19, 2020 via email

@arnetheduck
Contributor

> Because it makes it less likely for this event to interleave with a possible disconnect triggered from the event handlers.

less likely doesn't really matter though - as in, it just makes the bugs less frequent / harder to detect - the code must be written in such a way that the order doesn't matter and the comment would do well to point that out, rather than advertising a false sense of order

@dryajov
Contributor Author

dryajov commented Dec 21, 2020

> Because it makes it less likely for this event to interleave with a possible disconnect triggered from the event handlers.

> less likely doesn't really matter though - as in, it just makes the bugs less frequent / harder to detect - the code must be written in such a way that the order doesn't matter and the comment would do well to point that out, rather than advertising a false sense of order

It definitely doesn't guarantee anything, but it does reduce the likelihood of triggering error paths; other than that and being cautious, it doesn't have an effect.

try:
  trace "Triggering connect events", conn
  conn.upgraded.complete()
Contributor

doesn't this need a nil check?

Contributor Author

@dryajov dryajov Dec 22, 2020

It gets initialized in Connection, but we can add one for good measure.
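
For good measure, something like this (a defensive sketch; the finished check is an extra precaution on top of the nil check discussed here, since completing an already-completed future is an error in chronos):

# upgraded is normally initialised in Connection's init, so this is purely defensive
if not isNil(conn.upgraded) and not conn.upgraded.finished:
  conn.upgraded.complete()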

@dryajov
Contributor Author

dryajov commented Dec 23, 2020

@arnetheduck anything else you think needs to be done here or you're fine merging it now?

@@ -478,6 +510,14 @@ proc stop*(s: Switch) {.async.} =
    except CatchableError as exc:
      warn "error cleaning up transports", msg = exc.msg

  let stopped = await allFuturesThrowing(s.acceptFuts)
Contributor

so.. the hope is that none of the accept futures raises? because if any of them did raise, we'll not finish the stop logic here?

Contributor Author

@dryajov dryajov Dec 24, 2020

If it raises, it will propagate up to the caller, and the caller will have to decide what can be done about it; but we can attempt to handle it here as well, given that there probably isn't much else to do.

Contributor

The point is more that the caller typically cannot reason about what to do if stop fails: you can't describe what actions are appropriate, because the information is internal to the switch module, especially after stop has partially succeeded (some of the futures may have completed and some not). This is often the case with these kinds of exceptions that randomly bubble up from composite operations, and it's also why raising an exception in a for loop rarely makes sense: there's no way for the caller to reason about the partial success. This is something we keep coming back to, and it has been the source of numerous bugs in libp2p: aborting operations mid-way often means some sort of cleanup must happen. That is especially true of composite operations, and it is exceedingly rare for these not to require local error handling and cleanup, to the point that skipping it should be motivated explicitly.

Contributor Author

Cool

@@ -510,8 +510,12 @@ proc stop*(s: Switch) {.async.} =
    except CatchableError as exc:
      warn "error cleaning up transports", msg = exc.msg

  let stopped = await allFuturesThrowing(s.acceptFuts)
    .withTimeout(1.seconds)
  var stopped: bool
Contributor

@arnetheduck arnetheduck Dec 25, 2020

let stopped = try: 
    await all... 
  except ...: 
    trace "..."
    false

Contributor

this style makes the compiler ensure that all branches get an explicit value for stopped
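
Spelled out against the snippet above (a sketch of the suggested shape; the trace message is illustrative rather than the exact code that landed):

let stopped =
  try:
    await allFuturesThrowing(s.acceptFuts).withTimeout(1.seconds)
  except CatchableError as exc:
    trace "Exception while waiting for accept loops to stop", exc = exc.msg
    false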

@dryajov dryajov merged commit b2ea5a3 into master Jan 4, 2021
@dryajov dryajov deleted the concurrent-upgrades branch January 4, 2021 18:59