IP Address change causes missed attestations #2123
Comments
There are a few things here. Firstly, the "Address Updated" log is mostly irrelevant to the currently connected peers and to how quickly you can resume submitting attestations. That log refers only to discovery, and it's slow to update because it keeps a longer history. I've made an issue to make it more reactive (#2131), but it will have little impact on what you are seeing in this issue.

It is strange that it took 10 minutes for your node to re-establish connections. If you suddenly lose connectivity, you should kick all your peers fairly quickly (without debug-level logs I can't see why you didn't). The way it works is that when peers are non-responsive (i.e., they can't communicate back because you changed your IP and the connections are no longer valid), we record this against the peer. We then have to decide whether to kick the peer and try to find new ones, or give it another chance to respond. We could be very harsh and kick peers as soon as they fail to respond correctly to a single message, but then your peer list would lose stability: you'd likely kick many slow peers, or peers suffering missed packets and poor connectivity. On the other hand, it would be very responsive when you lose your internet connection: you'd immediately drop all your peers.

I think the solution here is to decrease the TCP ping interval we use to check for liveness. That should make your node more responsive to these events without strongly affecting peer-count stability.
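The trade-off described above can be sketched in a few lines. This is a hypothetical illustration, not Lighthouse's actual code: `PeerLiveness`, `on_ping`, and the specific intervals are made-up names and values chosen only to show how the ping interval and the miss tolerance together bound detection time.

```rust
use std::time::Duration;

/// Hypothetical sketch (not Lighthouse's real implementation): track
/// consecutive failed liveness pings per peer and kick only after a
/// tolerance threshold, trading responsiveness for peer-list stability.
struct PeerLiveness {
    ping_interval: Duration, // how often we probe the peer
    failed_pings: u32,       // consecutive unanswered pings
    max_failed_pings: u32,   // misses tolerated before kicking
}

impl PeerLiveness {
    fn new(ping_interval: Duration, max_failed_pings: u32) -> Self {
        Self { ping_interval, failed_pings: 0, max_failed_pings }
    }

    /// Record a ping result; returns true if the peer should be kicked.
    fn on_ping(&mut self, responded: bool) -> bool {
        if responded {
            self.failed_pings = 0;
            return false;
        }
        self.failed_pings += 1;
        self.failed_pings >= self.max_failed_pings
    }

    /// Worst-case time to notice a dead peer: interval * tolerance.
    fn worst_case_detection(&self) -> Duration {
        self.ping_interval * self.max_failed_pings
    }
}

fn main() {
    // With a 30s ping interval and 2 tolerated misses, a dead peer is
    // noticed within ~60s; halving the interval halves that bound.
    let mut peer = PeerLiveness::new(Duration::from_secs(30), 2);
    assert!(!peer.on_ping(false)); // first miss: give it another chance
    assert!(peer.on_ping(false));  // second miss: kick
    println!("worst case: {:?}", peer.worst_case_detection());
}
```

Setting `max_failed_pings = 1` gives the "very harsh" behaviour described above: instant reaction to an outage, at the cost of kicking slow-but-healthy peers.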
I've tunneled it to a VPS with a stable IP now, which basically solved the issue: WireGuard took the address changes like a champ, there were no missed attestations, and it adds some privacy. On that occasion I discovered geth was generating insane amounts of traffic, so I suspect the ISP was kicking me offline every 100 GB or so, or it didn't like me exchanging UDP packets with some 100 random IPs from the internet. The VPS only has 1 TB of traffic included, so I ended up switching geth to light-client mode. I'll try to set up a testnet validator when I have time, and come back with debug logs.
Wooh, 100 GB of data? Were you syncing?
Nope, funnily enough I had no issues while syncing (the validator wasn't active at that point). Based on geth's own metrics it was doing 600 kB/s constantly (p2p ingress + egress), but the Linux ethernet metrics showed ~1.5 MB/s, which works out to about 126 GB/day (and this is doubled for the VPN endpoint). I'd still like to run a full eth1 node eventually, but probably in a datacenter somewhere: VPS storage is outrageously expensive, so I may splurge and get a 30-ish EUR Hetzner auction machine with enough SSD at some point next year.
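The traffic figures above can be sanity-checked with back-of-the-envelope arithmetic (the exact 126 GB/day figure implies the real rate was closer to 1.46 MB/s than the rounded 1.5 MB/s):

```rust
/// Convert a sustained byte rate into daily traffic in decimal GB.
fn gb_per_day(bytes_per_sec: f64) -> f64 {
    bytes_per_sec * 86_400.0 / 1e9 // seconds per day / bytes per GB
}

fn main() {
    // geth's own p2p metrics: ~600 kB/s -> ~52 GB/day
    println!("{:.0} GB/day", gb_per_day(600e3));
    // OS-level ethernet counters: ~1.5 MB/s -> ~130 GB/day
    println!("{:.0} GB/day", gb_per_day(1.5e6));
}
```

The gap between the two rates suggests the extra ~0.9 MB/s came from traffic geth's p2p counters don't cover (or count differently), such as discovery UDP and protocol overhead.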
Hmm. I'm running a synced geth node and it seems to sit around 130-300 kB/s: mostly in the low 100s, jumping to 300 kB/s every now and then. Strange that yours was at 600 kB/s constantly.
## Issue Addressed

#2123

## Description

Reduces the TCP ping interval to increase our responsiveness to peer liveness changes.
I have a very similar issue. I bonded 2 network interfaces in "active backup" mode, each connected to a different ISP. Once the primary interface goes down, it takes from 5 to 15 minutes for the Lighthouse beacon node to react.
I've reduced the ping interval, which should make this more responsive. However, the current settings shouldn't take 5 minutes to drop and reconnect 100 peers. I've made a separate issue to investigate and correct this: #2146
Lighthouse should now be significantly more responsive to disconnects in the next release. There was a scoring bug which allowed peers to linger around longer than they should have. I'll consider this resolved in #2147 |
Description
When my ISP sees fit to reset the connection, causing an IP address change, libp2p takes 5 minutes to even notice the change, but the beacon stops talking to the network for a much longer time.
Version
released Lighthouse v1.0.5-9ed65a6
Present Behaviour
Context: the router detected a connection drop and reconnected at 08:02:52.
The validator kept thinking everything was fine and kept submitting attestations, but the network thought differently: attestations for epochs 5569, 5570 and 5571 appear as missed. (I'm not sure what happened with the subsequent ones; I see libp2p complaining about insufficient peers while the logged peer count is still around 50.)
Expected Behaviour
The beacon resumes work... faster? I'm assuming there must be a cache of peers to reconnect to; 13 minutes is rather excessive.
Steps to resolve
No idea what can be done application-wise (shorter IP-change detection intervals?); I'll tunnel to a stable IP in the meantime.