Peer dialing/kicking system overhaul #3346

Merged · 18 commits · Mar 11, 2022

Conversation

@Menduist (Contributor) commented on Jan 31, 2022:

This PR is an ongoing effort to improve peering.

The most complex part of this PR is the new force-dial mechanism.
Previously, we tried to dial a peer and, if we didn't have room for it, we would kick some peers to make room for the new one. The major downside of this system is that it kicks peers before actually dialing the new ones, so it may kick good peers and then fail to dial the replacements.

Instead, this PR relies on a small modification in libp2p that allows going over maxPeers when dialing. A regular trimmer then kicks the excess connections.
The tricky part is that we need to balance peer health against incoming starvation. Indeed, by being above maxPeers most of the time, incoming peers would never be allowed, so over time we'd end up with outgoing connections only, which is not good for the network as a whole.

To avoid that, this PR lets a few incoming peers in even while we are force-dialing.
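For illustration, here is a minimal sketch of that admission logic in Nim; all names (canForceDial, canAcceptIncoming, the incoming threshold) are hypothetical stand-ins, not the actual nimbus-eth2 code:

# Illustrative sketch only: outgoing force-dials may exceed maxPeers (a
# trimmer removes the excess later), while a small quota of incoming peers
# is still admitted so inbound connections are not starved.
proc canForceDial(connectedPeers, hardLimit: int): bool =
  # outgoing dials are allowed up to the hard limit (max-conn * 2 in this PR)
  connectedPeers < hardLimit

proc canAcceptIncoming(connectedPeers, maxPeers, incomingPeers: int): bool =
  if connectedPeers < maxPeers:
    result = true                       # under the soft limit: always accept
  else:
    # above the soft limit: let only a few incoming peers through; the
    # threshold here is made up for the example
    result = incomingPeers < maxPeers div 2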

A few metrics from running it in a few scenarios, with all subnets enabled:

Mainnet, max-peers=300

[metrics screenshot]
It's really hard to get a healthy set of peers for every subnet on mainnet, even with 300 peers. This node is never able to stabilize; instead it hovers around 310 peers and spends all its time kicking and dialing new peers.

Prater, max-peers=60

[metrics screenshot]
On prater, getting healthy peers is easier (bigger nodes on average), so we're stable, hovering around 60 peers, with roughly half incoming and half outgoing, which is good.

Mainnet, max-peers=60

[metrics screenshot]
If getting good health with 300 peers is hard, with 60 it's impossible. But we still get a good incoming/outgoing balance.

Astute readers will have noticed the weird spikes in the last two screenshots. The hard limit on connections is now max-conn * 2, and sometimes we find a lot of peers in a discovery round and dial them all at once, until we hit that limit. The kicker then comes in and trims the excess fairly quickly.
EDIT: the spikes were actually caused by the trimmer getting stuck.

@github-actions bot commented on Jan 31, 2022:

Unit Test Results

12 files · 829 suites · 30m 6s ⏱️
1 674 tests: 1 626 ✔️, 48 💤, 0 failed
9 775 runs: 9 663 ✔️, 112 💤, 0 failed

Results for commit 97a5977.

♻️ This comment has been updated with latest results.

@arnetheduck (Member) commented:

> Then the kicker will come in and trim the excess fairly quickly.

I don't see the corresponding spike in the kicked peers metric under discovery and dialling - is that a display issue in grafana or something else?

# between kicks, so we must not kick too many peers at
# once
for _ in 0..<excessPeers div 2:
  await node.trimConnections(2)
Member:

Intuitively, this feels like it's on the aggressive end: consider startup - we just connected to a bunch of peers, and they have not necessarily had time to even establish a score / connection / handshake to meaningfully differentiate them, yet here we kick them out very aggressively.

Contributor (author):

The metadata pinger is used as a grace period, i.e. we won't kick a peer if we don't have its metadata.
This may seem a bit weird, but the grace period is a tricky subject: if it's too short, we'll kick perfectly good peers that haven't had time to get established correctly; if it's too long, we may kick a useful peer because a useless peer is still in its grace period.

The idea is that if we had time to get its metadata, the connection should be nicely set up, and if a peer doesn't respond to metadata requests, it'll eventually get kicked anyway.
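As a rough sketch of that rule (the type, fields, and score cutoff below are hypothetical, not the actual nimbus-eth2 names):

type PeerEntry = object
  score: float
  hasMetadata: bool   # set once the metadata pinger has received a response

# A peer only becomes a kick candidate once we have its metadata; until then
# it is considered to be in its grace period.
proc kickCandidates(peers: seq[PeerEntry], cutoff: float): seq[PeerEntry] =
  for p in peers:
    if p.hasMetadata and p.score < cutoff:
      result.add p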

I did try a smarter trimmer that would only kick peers when we have to, in this spirit:
ed3b2fd#diff-84fc3d73d852761d07624e61621b0784b6f144dbeeb57f001990bd8f7bc5d7a7R1650-R1659

But we ended up stabilizing at 75% of the hard capacity (i.e. at max-peers * 1.5).
I later tried a softer limit, e.g. max-score = lerp(minScore=300, maxScore=10000, a=0, b=maxPeers, value=excessPeers), but then we just stabilized at a random point between max-peers and max-peers * 2, which I find more confusing than anything. Plus, it relies on magic values (300 & 10000).
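For reference, a minimal sketch of that lerp-based soft limit (the proc name and signature are made up for the example; 300 and 10000 are the magic values mentioned above):

proc kickScoreThreshold(excessPeers, maxPeers: int,
                        minScore = 300.0, maxScore = 10_000.0): float =
  # map the overshoot (0 .. maxPeers excess peers) linearly onto a
  # kick-score threshold between minScore and maxScore
  let t = clamp(excessPeers / maxPeers, 0.0, 1.0)
  result = minScore + t * (maxScore - minScore)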

So anyway, the smart approach proved difficult; I switched back to a dumb one, which is pretty effective, imo.

Member:

I'm thinking this loop could be: while len(node.peerPool) > node.wantedPeers: kick(worst peer); sleep(1) or something like this - basically, we fetch new peers from discovery at a certain rate - we should not kick peers at a higher rate than that (but perhaps there's logic somewhere that I'm missing that keeps the two in balance?)

Contributor (author):

Actually, we don't know the discovery rate; for instance, on mainnet we discover a lot more peers than on prater (because the DHT is shared and there are more nodes on mainnet).
But the result is the same: since only dialing can go over max-peers, if we kick everything over max-peers, we'll kick as much as we discover (and successfully dial).

I've thought about moving this to the connect worker, eg:

if canConnect:
  let success = await connect
  if success and len(node.peerPool) > node.wantedPeers:
    kick(worst peer)

But I think the current approach is more "bullet-proof"

Member:

agree, kicking should stay in this separate task - what would the downsides of "rate-limiting" kicking be?

Contributor (author):

rate-limiting being your snippet?

Since we don't know how many peers we discover in each round, it may be too slow. To be safe, it would have to sleep very little, which would be equivalent to the current code.

Some numbers on mainnet:

2022-02-01 11:32:45.011+00:00 discovered_nodes:56
2022-02-01 11:33:06.336+00:00 discovered_nodes:8
2022-02-01 11:33:13.220+00:00 discovered_nodes:30
2022-02-01 11:33:22.452+00:00 discovered_nodes:24

As you can see, both the number of discovered nodes and the duration of the discovery loop vary wildly.

Member:

> To be safe, it would have to sleep very little, which would be equivalent to the current code

except that it allows the chronos loop to run between each kick as well, which is a good thing in general - how about my loop with some cap on the number of peers kicked in each round - maybe something like max(1, sqrt(peers_in_excess)), which will kick more aggressively when there's lots of overshoot and slow down later...

Contributor (author):

The last commit switches to:

while true:
  let excessPeers = connectedPeers - node.wantedPeers
  if excessPeers > 0: kick(2 worst peers)
  sleep(1.seconds div max(1, excessPeers))

which works quite well.
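For a sense of the resulting rate: with 1 excess peer the loop sleeps 1 second and removes at most 2 peers per second, while with 50 excess peers it sleeps 20 ms and removes up to roughly 100 peers per second (ignoring the time the kicks themselves take), so the trim rate scales with the overshoot.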

except CancelledError as exc:
  raise exc
except CatchableError as exc:
  debug "failed to kick peer", peerId, err=exc.msg
Contributor:

Is this something that can really occur? Is the dec toKick on the line below then OK?

Menduist (author), Feb 1, 2022:

Not sure if it can actually happen; worst-case scenario, the next iteration (5 seconds later) will pick up the slack.

Contributor (author):

And I would rather stay safe and over-estimate the number of kicked peers than under-estimate it.


scores[peer.peerId] = scores[peer.peerId] + thisPeersScore
for peer in node.pubsub.gossipsub.getOrDefault(topic):
  if peer.peerId notin scores: continue
Member:

why do we skip non-stability peers here?

Member:

ok, I read the code wrong - we're looking at peers that nimbus knows about, and filtering out peers that are in gossipsub and not in nimbus - if this happens, it smells of a bug, but moving on:

we don't give score to syncnet peers in the initial loop - this needs addressing as well - also, we sort of prefer "large" peers with lots of subnets over smaller peers with fewer subnets - I suspect this ends up being wrong over time, because it gives an unfair advantage to large peers, which are in no way guaranteed to be "better" - it would be good to strive for a random mix of both kinds - perhaps "has_some_interesting_attnet" and "has_some_interesting_syncnet" would be better conditions?

Contributor (author):

Another idea I had, instead of scoring on SUM(topicsPoints), we could score on MAX(topicPoints), wdyt?

The first loop is only here to score peers during sync at this point; ideally I would like to get rid of it, but I'm not sure how atm.

Contributor (author):

> and filtering out peers that are in gossipsub and not in nimbus - if this happens, it smells of a bug, but moving on

It can happen if we don't have metadata for them yet (i.e. they are still in their grace period).

Contributor (author):

The last commit switches to AVG(nonGlobalTopicsPoints).
Global topics are not counted because that would be unfair towards small nodes: they always have low value and would bring the average down.
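A minimal sketch of that scoring rule, with illustrative types rather than the actual nimbus-eth2 ones:

type TopicScore = tuple[points: float, isGlobal: bool]

# AVG(nonGlobalTopicsPoints): global topics are skipped so that small nodes,
# which are subscribed to few subnets, are not dragged down by the uniformly
# low-value global topics.
proc peerTopicScore(topics: seq[TopicScore]): float =
  var total = 0.0
  var count = 0
  for t in topics:
    if t.isGlobal:
      continue
    total += t.points
    inc count
  if count > 0:
    result = total / count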

@Menduist mentioned this pull request on Feb 1, 2022
@Menduist changed the base branch from unstable to tunepeer1 on February 1, 2022 12:28
@Menduist changed the base branch from tunepeer1 to unstable on February 1, 2022 12:30
@Menduist changed the base branch from unstable to tunepeer1 on February 1, 2022 12:49
@Menduist (author) commented on Feb 1, 2022:

> I don't see the corresponding spike in the kicked peers metric under discovery and dialling - is that a display issue in grafana or something else?

Indeed, I was wrong: the issue was that the peer trimmer got stuck while disconnecting a peer. I switched to an asyncSpawn to avoid it happening again; seems good so far.

Base automatically changed from tunepeer1 to unstable February 1, 2022 17:20
@arnetheduck (Member) commented:

> Indeed, I was wrong, the issue was that the peer trimmer got stuck while disconnecting a peer

when does it get... unstuck?

@Menduist (author) commented on Feb 4, 2022:

It goes back to the discussion we had a while ago: should await switch.disconnect(peer) wait until the peer is actually disconnected, or not?

Currently it does wait, so if you are disconnecting someone and don't want to wait, you have to asyncSpawn. I fell into the trap of not doing that, which caused the trimmer to be stuck for a few minutes whenever a slow peer was being kicked.

And I can confirm that it has now been running without spikes for 3 days, so that's fixed.
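In other words, the trimmer now fires the disconnect without awaiting it. A sketch, assuming the chronos and nim-libp2p APIs discussed above (the wrapper proc itself is illustrative):

import chronos
import libp2p

proc kickPeer(switch: Switch, peerId: PeerId) =
  # fire-and-forget: a slow peer can no longer stall the trim loop while
  # its connection is being torn down
  asyncSpawn switch.disconnect(peerId)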

Base automatically changed from unstable to stable February 15, 2022 20:57
@zah changed the base branch from stable to unstable on February 15, 2022 22:03
@Menduist mentioned this pull request on Feb 25, 2022
@Menduist marked this pull request as ready for review on February 25, 2022 12:54
@Menduist changed the title from "Tune peering" to "Peer dialing/kicking system overhaul" on Feb 25, 2022
Co-authored-by: Jacek Sieka <jacek@status.im>