
IP Address change causes missed attestations #2123

Closed

m0n5t3r opened this issue Dec 26, 2020 · 8 comments

Comments

@m0n5t3r

m0n5t3r commented Dec 26, 2020

Description

When my ISP sees fit to reset the connection, causing an IP address change, libp2p takes around 5 minutes to even notice the new address, but the beacon node stops talking to the network for much longer.

Version

released Lighthouse v1.0.5-9ed65a6

Present Behaviour

Context: the router detected a connection drop and reconnected at 08:02:52

Dec 26 08:02:35.276 INFO New block received                      hash: 0x6f22…0688, slot: 178211
Dec 26 08:02:41.000 INFO Synced                                  slot: 178211, block: 0x6f22…0688, epoch: 5569, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 55, service: slot_notifier
Dec 26 08:02:52.948 INFO New block received                      hash: 0x94a2…84b2, slot: 178212
Dec 26 08:02:53.000 INFO Synced                                  slot: 178212, block: 0x94a2…84b2, epoch: 5569, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 51, service: slot_notifier
Dec 26 08:03:05.000 INFO Synced                                  slot: 178213, block:    …  empty, epoch: 5569, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
[...]
Dec 26 08:08:17.000 INFO Synced                                  slot: 178239, block:    …  empty, epoch: 5569, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
Dec 26 08:08:21.895 INFO Imported deposit log(s)                 new: 1, total: 63333, latest_block: 11526025, service: eth1_rpc
Dec 26 08:08:29.000 INFO Synced                                  slot: 178240, block:    …  empty, epoch: 5570, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
Dec 26 08:08:41.000 INFO Synced                                  slot: 178241, block:    …  empty, epoch: 5570, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
Dec 26 08:08:43.948 INFO Address updated                         udp_port: 9000, ip: <censored>, service: libp2p
Dec 26 08:08:53.001 INFO Synced                                  slot: 178242, block:    …  empty, epoch: 5570, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
[...]
Dec 26 08:15:17.000 INFO Synced                                  slot: 178274, block:    …  empty, epoch: 5571, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 49, service: slot_notifier
Dec 26 08:15:17.301 INFO Sync state updated                      new_state: Syncing Head Chain, old_state: Synced, service: sync
Dec 26 08:15:19.497 INFO Sync state updated                      new_state: Synced, old_state: Syncing Head Chain, service: sync
Dec 26 08:15:23.367 INFO New block received                      hash: 0x702b…dcc4, slot: 178275
Dec 26 08:15:29.000 INFO Synced                                  slot: 178275, block: 0x702b…dcc4, epoch: 5571, finalized_epoch: 5569, finalized_root: 0xa5ce…fc04, peers: 51, service: slot_notifier
Dec 26 08:15:35.246 INFO New block received                      hash: 0x0a7e…38bf, slot: 178276
Dec 26 08:15:41.000 INFO Synced                                  slot: 178276, block: 0x0a7e…38bf, epoch: 5571, finalized_epoch: 5569, finalized_root: 0xa5ce…fc04, peers: 51, service: slot_notifier

The validator client kept thinking everything was fine and kept submitting attestations, but the network saw things differently: attestations for epochs 5569, 5570 and 5571 appear as missed (I'm not sure what happened with the subsequent ones; I see libp2p complaining about insufficient peers even though the logged peer count is still around 50):

[Screenshot from 2020-12-26 10-07-00]

Expected Behaviour

The beacon node resumes work ... faster? I'm assuming there must be a cache of peers to reconnect to; 13 minutes is rather excessive.

Steps to resolve

No idea what can be done application-wise (shorter IP-change detection intervals?); I'll tunnel to a stable IP in the meantime.

m0n5t3r changed the title from "IP Address change causes missing attestations" to "IP Address change causes missed attestations" Dec 26, 2020
@AgeManning
Member

AgeManning commented Dec 29, 2020

There are a few things here.

Firstly, the "Address updated" log is mostly irrelevant to the currently connected peers and to how quickly you can resume submitting attestations. That log refers only to discovery, and it's slow to update because it keeps a longer history. I've made an issue to make it more reactive (#2131), but it will have little impact on what you are seeing in this issue.

It is strange that it took 10 minutes for your node to re-establish connections. If you suddenly lose connectivity, you should kick all your peers fairly quickly (I can't tell why you didn't without debug-level logs). The way it works is that if a peer is non-responsive (i.e. it can't communicate back because your IP changed and the connection is no longer valid), we record this against the peer. We then have to decide whether to kick the peer and try to find new ones, or give it another chance to respond.

We could be very harsh and kick peers as soon as they fail to respond correctly to a single message, but your peer list would lose stability, since you'd likely kick many peers that are merely slow or suffering dropped packets or poor connectivity. On the other hand, it would be very responsive when you lose your internet connection: you'd immediately drop all your peers.

I think the solution here is to decrease our TCP ping interval for liveness checks. That should make your node more responsive to these events without strongly affecting peer-count stability.
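
For illustration, a minimal sketch of what this looks like with rust-libp2p's ping behaviour. The interval, timeout and failure-count values here are made-up examples, not our actual settings:

use std::num::NonZeroU32;
use std::time::Duration;

use libp2p::ping::{Ping, PingConfig};

// Sketch only -- the numbers are illustrative, not Lighthouse's real settings.
fn liveness_ping() -> Ping {
    Ping::new(
        PingConfig::new()
            // Ping more frequently so a dead connection is noticed sooner.
            .with_interval(Duration::from_secs(15))
            // How long to wait for a pong before counting a failure.
            .with_timeout(Duration::from_secs(10))
            // Tolerate one missed pong before the connection is considered
            // dead, trading responsiveness against peer-list churn.
            .with_max_failures(NonZeroU32::new(2).expect("2 is non-zero")),
    )
}

With settings like these, a node whose connections were all invalidated by an IP change would notice within roughly interval × max_failures, i.e. about half a minute, rather than many minutes.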

@m0n5t3r
Author

m0n5t3r commented Dec 29, 2020

I've tunneled it to a VPS with a stable IP now, which basically solved the issue: WireGuard took the address changes like a champ, there were no missed attestations, and it adds some privacy. On that occasion I discovered geth was generating insane amounts of traffic, so I suspect the ISP was kicking me offline every 100 GB or so, or it didn't like me exchanging UDP packets with ~100 random IPs from the internet. The VPS only has 1 TB of traffic included, so I ended up switching geth to light-client mode.
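
In case it's useful to others: the tunnel is a plain WireGuard point-to-point setup, with the VPS forwarding the beacon's p2p port (9000 by default) back through the tunnel, so peers always see the VPS's stable address. A minimal sketch of the home-node side, with placeholder keys, addresses and endpoint:

# /etc/wireguard/wg0.conf on the home node -- keys, addresses and
# endpoint below are placeholders
[Interface]
PrivateKey = <home-node-private-key>
Address = 10.0.0.2/24

[Peer]
PublicKey = <vps-public-key>
Endpoint = vps.example.com:51820
AllowedIPs = 10.0.0.0/24
# Keep re-handshaking through NAT so the tunnel survives ISP reconnects
PersistentKeepalive = 25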

I'll try to set up a testnet validator when I have time, and come back with debug logs.

@AgeManning
Member

Wooh, 100 GB of data? Were you syncing?

@m0n5t3r
Author

m0n5t3r commented Dec 30, 2020

Nope, funnily enough I had no issues while syncing (the validator wasn't active at that point). Based on geth's own metrics it was doing 600 kB/s constantly (p2p ingress + egress), but the Linux ethernet metrics showed ~1.5 MB/s, which works out to ~126 GB/day (1.5 MB/s × 86,400 s); this is also doubled for the VPN endpoint. I'd still like to run a full eth1 node eventually, but probably in a datacenter somewhere. VPS storage is outrageously expensive, so I may splurge and get a 30-ish EUR Hetzner auction machine with enough SSD at some point next year.

@AgeManning
Member

Hmm. I'm running a synced geth node and it seems to sit around 130-300 kB/s. Mostly in the low 100s, but it jumps to 300 kB/s every now and then.

Strange that yours was at 600 kB/s constantly.

bors bot pushed a commit that referenced this issue Jan 6, 2021
## Issue Addressed

#2123

## Description

Reduces the TCP ping interval to increase our responsiveness to peer liveness changes.
@greg-flexidao

I have a very similar issue. I bonded two network interfaces in "active backup" mode, with each interface connected to a different ISP. Once the primary interface goes down, it takes the Lighthouse beacon node 5 to 15 minutes to react: I'm connected to 100 peers, and only 5-15 minutes later does the count start gradually decreasing from 100, while the node is still not syncing with the network.
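
For reference, the bond is a standard active-backup setup, roughly like this with NetworkManager (interface names are placeholders):

# Hypothetical sketch of the bond; eth0/eth1 are placeholders
nmcli con add type bond ifname bond0 con-name bond0 \
    bond.options "mode=active-backup,miimon=100,primary=eth0"
nmcli con add type bond-slave ifname eth0 master bond0
nmcli con add type bond-slave ifname eth1 master bond0

Note that failing over to the second ISP still changes the node's public IP, which presumably puts the beacon in the same situation as the original report.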

@AgeManning
Member

I've reduced the ping interval, which should make this more responsive. However, even the current settings shouldn't take 5 minutes to drop and reconnect 100 peers.

I've made a separate issue to investigate and correct this: #2146

@AgeManning
Member

AgeManning commented Jan 8, 2021

Lighthouse should now be significantly more responsive to disconnects in the next release. There was a scoring bug that allowed peers to linger around longer than they should have. I'll consider this resolved in #2147.
