Low Peer Numbers #6805
Some examples of current peers on some mainnet nodes: 25 peers on dev-elc-besu-teku-mainnet-dev-stefan-BWS-2:
prd-elc-besu-teku-mainnet-bonsai-snap
prd-elc-besu-nimbus-mainnet-nightly-bonsai-snap
|
I wonder if this is a misreported version, but if it's not then this is the commit hash... From Jan 2022.
That lends more weight to this being a misreported version I guess |
Some discussion on #execution-dev ... |
Updated description with more reports. We are still actively looking into this, hopefully have an update soon. |
We have identified an issue related to blob transaction gossip resulting in Geth peers sometimes disconnecting Besu. We are currently testing out a fix. This should only be an issue once the node is in sync (since transaction gossip is paused during sync). There may still be a separate issue related to peer discovery, which we are still investigating. |
The blob transaction gossip fix has been merged into main; more info in #6777 |
FWIW observed the same on Holesky with 24.3.0. I raised the peer limit to 40 from 25, but during snap sync it was essentially 0-2 peers that I actually synced with. Towards the very end I started to get more peers. Never hit the 40 limit in the 12 hours it's been running. I tried making sense of debug logs but don't have much context (new to the staking scene). It was clearly trying though, lots of disconnects. I presume the "Disconnected USEFULL peer" might be most important from those DCs. I'll try again with yesterday's |
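For anyone wanting to put rough numbers on this, below is a minimal sketch (assuming the JSON-RPC HTTP endpoint is enabled with --rpc-http-enabled, on the default port 8545) that samples the peer count via the standard net_peerCount method once a minute during sync:

```python
import json
import time
import urllib.request

RPC_URL = "http://127.0.0.1:8545"  # assumes --rpc-http-enabled; adjust host/port to your setup

def peer_count() -> int:
    # net_peerCount returns a hex-encoded quantity, e.g. "0x19" for 25 peers
    payload = json.dumps({"jsonrpc": "2.0", "method": "net_peerCount", "params": [], "id": 1}).encode()
    req = urllib.request.Request(RPC_URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return int(json.load(resp)["result"], 16)

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), peer_count())
        time.sleep(60)  # one sample per minute while syncing
```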
With
Deleting all data and reverting to |
Thanks for the analysis @Beanow! Will take a closer look at your logs. What you're describing is similar to what I've seen in my analysis so far: on Holesky at least, it can sync with the bootnodes but sometimes disconnects these due to empty responses, leading us to treat them as useless peers (correctly). This can happen if the peer is busy, as I expect a bootnode would be. The problem during Holesky sync is that it isn't finding new peers, or finds and disconnects them quickly maybe. This is intermittent...sometimes restarting Besu finds a new non-bootnode peer, but peer numbers still remain low until after syncing. Mainnet is a different story: maybe lower numbers during sync too but >> 0. Primary issue on mainnet seems to be the post-sync blob tx gossip disconnects which 24.3.1 (24.4.0) should have resolved.
I think it might just be intermittent/peering luck, as I've been investigating the same problem on 24.3.0 and 24.1.2. But I will try to confirm. |
Also worth noting that until 24.3.1/24.4.0, our Holesky config relied entirely on the bootnodes for finding new peers; we added the DNS discovery server during this investigation: 63a53aa. Despite this, Holesky found peers fine until recently.
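For context on what that commit enables: EIP-1459 DNS discovery publishes a signed list of node records under a DNS domain, rooted at a TXT record. A rough sketch of fetching that root is below; the domain is an assumption (the EF-operated list for Holesky), and the exact enrtree URL added in 63a53aa is not reproduced here:

```python
import dns.resolver  # pip install dnspython

# Assumed domain; the enrtree URL Besu actually uses may point elsewhere.
DOMAIN = "all.holesky.ethdisco.net"

# The root of an EIP-1459 node list is a TXT record of the form
# "enrtree-root:v1 e=<enr subtree root> l=<link subtree root> seq=<n> sig=<signature>"
for record in dns.resolver.resolve(DOMAIN, "TXT"):
    txt = b"".join(record.strings).decode()
    if txt.startswith("enrtree-root:v1"):
        print(txt)
```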
It did seem more consistent than luck. I've tried to wipe the data and resync on |
@Beanow When you wipe the data do you also wipe the node key? Retaining the node key may cause us to remain on other peers' good or bad lists and produce more consistent results. |
I believe the node key is in |
Hopefully that's just a metric/dashboard being misleading...we should be sending the correct head information to the peers over the rlpx layer https://github.com/ethereum/devp2p/blob/master/rlpx.md |
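For reference, the "head information" in question is advertised in the eth wire-protocol Status handshake exchanged right after the RLPx handshake. A descriptive sketch of its fields is below; the names are illustrative, not Besu's actual classes:

```python
from dataclasses import dataclass

@dataclass
class StatusMessage:
    """Fields of the eth/64+ Status handshake, where a node tells peers its head."""
    protocol_version: int   # e.g. 68 for eth/68
    network_id: int         # 1 for mainnet, 17000 for Holesky
    total_difficulty: int   # total difficulty of our best (head) block
    best_hash: bytes        # hash of our current head block
    genesis_hash: bytes     # peers disconnect us if this does not match theirs
    fork_id: tuple          # (fork_hash, fork_next) per EIP-2124
```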
I've been analysing the logs of a holesky node that dropped all peers during sync and has been sitting on 0 peers for days.
About 45% of these are "Timed out waiting to establish connection with peer", which means we've found the peer to be on the correct network + forkId ("ForkId OK..."), but have waited 10 seconds between "Prepared ECIES handshake..." and "Failed to connect to peer".
Two issues I want to confirm with this: Line 76 in deaea9b
b. Why do we end up going through this with the same peers repeatedly...on average 50 times in 4 hours. The peer in the logs above is an outlier, we got "Failed to connect to peer" for enode://c13e9d36... 702 times in 4 hours :(
|
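To reproduce the counting above against another node's logs, here is a rough sketch; it assumes plain-text debug logs and that the enode appears on the same line as the failure message, which may not match the actual log layout:

```python
import re
from collections import Counter

# Group failures by a short enode prefix; tweak the pattern to the real log format.
FAILED = re.compile(r"Failed to connect to peer.*?(enode://[0-9a-f]{8})", re.IGNORECASE)

def count_failures(log_path: str) -> Counter:
    counts = Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            match = FAILED.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for enode, n in count_failures("besu-debug.log").most_common(10):
        print(f"{n:6d}  {enode}...")
```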
Update: we have included several fixes in a release candidate, 24.4.0-RC2, which is currently undergoing testing and thus not yet ready for production; look out for announcements on Discord. Notably #6777, which hopefully prevents in-sync nodes from being disconnected by their peers. We are still investigating other peering issues, especially during Holesky sync. In our testing, Sepolia and mainnet peers are performing reasonably during sync using 24.4.0-RC2. |
Update: We're not yet confident enough in 24.4.0-RC2 to release it: it may have some issues unrelated to peering, which we are still investigating. At the same time, we are preparing a hotfix release which contains the fix for the known mainnet peering issue related to blobs.
Update: the https://github.com/hyperledger/besu/releases/tag/24.3.3 hotfix was released, which should help Besu keep hold of Geth peers. Note: there still appear to be some issues finding peers, particularly on the testnets. If you still experience low peers on mainnet for a prolonged period, please let us know. We would regard taking 30-60 minutes to find 25 peers on mainnet as normal; from our testing, we are aware that some nodes are taking as long as 15 hours though. We are still working on the next major release, which has more peering improvements but was found to introduce an unrelated syncing bug, so it has been delayed. |
Peering on testnets is always a problem, regardless of client; I've experienced issues with Geth and Reth as well. The nodes tend to be more static and fewer in number, resulting in more "cliques".
@shemnon Indeed. There did seem to be a marked difference shortly following Dencun (if not the fork itself, then perhaps following client updates around Dencun). Mixed reports about success on Holesky especially. We have significant issues getting beyond the bootnodes on a lot of our new Holesky nodes. I'm sure some of it is noticing pre-existing issues now that we're looking more closely. |
Multiple reports of peering issues, this is a placeholder to track and gather info.
Anecdotally seems like issues began after Dencun fork on March 13th and are worse during initial sync.
@garyschulte suspected a Dencun fork id issue on March 22nd related to geth during snap server testing...it ultimately turned out to be an old version of geth, but there was still a lack of geth peers on our canary nodes (a sketch of the fork-id computation follows this list).
Other devs internally have noticed issues connecting to geth peers while trying to sync Holesky.
Issue reported on geth: "No Inbound peers after upgrade" ethereum/go-ethereum#29312
User reports from EthStaker #besu-helpdesk Discord
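Since fork-id mismatches came up during this investigation, here is a minimal sketch of the EIP-2124 fork hash that peers compare at connection time. The fork schedule is passed in by the caller (the actual mainnet/Holesky values are not reproduced here), and per EIP-6122 fork timestamps follow block numbers:

```python
import binascii
import struct

def fork_hash(genesis_hash: bytes, activated_forks: list[int]) -> str:
    """EIP-2124 FORK_HASH: CRC32 over the genesis hash followed by each activated
    fork block number (and, per EIP-6122, fork timestamp) as an 8-byte big-endian
    integer, in ascending order, with zeros and duplicates excluded by the caller.
    Peers also exchange FORK_NEXT (the next scheduled fork, or 0) alongside this."""
    checksum = binascii.crc32(genesis_hash)
    for fork in activated_forks:
        checksum = binascii.crc32(struct.pack(">Q", fork), checksum)
    return f"0x{checksum:08x}"
```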