
IP Address change causes missed attestations #2123

Closed

m0n5t3r opened this issue Dec 26, 2020 · 8 comments

Comments

@m0n5t3r

m0n5t3r commented Dec 26, 2020

Description

When my ISP sees fit to reset the connection, causing an IP address change, libp2p takes around 5 minutes to even notice the new address, but the beacon node stops talking to the network for much longer.

Version

released Lighthouse v1.0.5-9ed65a6

Present Behaviour

Context: the router detected a connection drop and reconnected at 08:02:52

Dec 26 08:02:35.276 INFO New block received                      hash: 0x6f22…0688, slot: 178211
Dec 26 08:02:41.000 INFO Synced                                  slot: 178211, block: 0x6f22…0688, epoch: 5569, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 55, service: slot_notifier
Dec 26 08:02:52.948 INFO New block received                      hash: 0x94a2…84b2, slot: 178212
Dec 26 08:02:53.000 INFO Synced                                  slot: 178212, block: 0x94a2…84b2, epoch: 5569, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 51, service: slot_notifier
Dec 26 08:03:05.000 INFO Synced                                  slot: 178213, block:    …  empty, epoch: 5569, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
[...]
Dec 26 08:08:17.000 INFO Synced                                  slot: 178239, block:    …  empty, epoch: 5569, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
Dec 26 08:08:21.895 INFO Imported deposit log(s)                 new: 1, total: 63333, latest_block: 11526025, service: eth1_rpc
Dec 26 08:08:29.000 INFO Synced                                  slot: 178240, block:    …  empty, epoch: 5570, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
Dec 26 08:08:41.000 INFO Synced                                  slot: 178241, block:    …  empty, epoch: 5570, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
Dec 26 08:08:43.948 INFO Address updated                         udp_port: 9000, ip: <censored>, service: libp2p
Dec 26 08:08:53.001 INFO Synced                                  slot: 178242, block:    …  empty, epoch: 5570, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 50, service: slot_notifier
[...]
Dec 26 08:15:17.000 INFO Synced                                  slot: 178274, block:    …  empty, epoch: 5571, finalized_epoch: 5567, finalized_root: 0x92b9…01e8, peers: 49, service: slot_notifier
Dec 26 08:15:17.301 INFO Sync state updated                      new_state: Syncing Head Chain, old_state: Synced, service: sync
Dec 26 08:15:19.497 INFO Sync state updated                      new_state: Synced, old_state: Syncing Head Chain, service: sync
Dec 26 08:15:23.367 INFO New block received                      hash: 0x702b…dcc4, slot: 178275
Dec 26 08:15:29.000 INFO Synced                                  slot: 178275, block: 0x702b…dcc4, epoch: 5571, finalized_epoch: 5569, finalized_root: 0xa5ce…fc04, peers: 51, service: slot_notifier
Dec 26 08:15:35.246 INFO New block received                      hash: 0x0a7e…38bf, slot: 178276
Dec 26 08:15:41.000 INFO Synced                                  slot: 178276, block: 0x0a7e…38bf, epoch: 5571, finalized_epoch: 5569, finalized_root: 0xa5ce…fc04, peers: 51, service: slot_notifier

The validator client kept thinking everything was fine and kept submitting attestations, but the network saw things differently: attestations for epochs 5569, 5570 and 5571 appear as missed (I'm not sure what happened with the subsequent ones; I see libp2p complaining about insufficient peers even though the logged peer count is still around 50):

[Screenshot from 2020-12-26 10-07-00]

Expected Behaviour

The beacon node resumes work ... faster? I'm assuming there must be a cache of peers to reconnect to; 13 minutes is rather excessive.

Steps to resolve

No idea what can be done application-wise (shorter IP-change detection intervals?); I'll tunnel to a stable IP in the meantime.

m0n5t3r changed the title from "IP Address change causes missing attestations" to "IP Address change causes missed attestations" Dec 26, 2020
@AgeManning
Member

AgeManning commented Dec 29, 2020

There are a few things here.

Firstly, the "Address updated" log is mostly irrelevant to the currently connected peers and to how quickly you can resume submitting attestations. That log refers only to discovery, and it's slow to update because it keeps a longer history. I've made an issue to make it more reactive (#2131), but it will have little impact on what you are seeing in this issue.

It is strange that it took 10 minutes for your node to re-establish connections. If you suddenly lose connectivity, you should kick all your peers fairly quickly (I can't tell why you didn't without debug-level logs). The way it works is that if a peer is non-responsive (i.e. it can't communicate back because your IP changed and the connection is no longer valid), we record this against the peer. We then have to decide whether to kick the peer and try to find new ones, or give it another chance to respond.

We could be very harsh and kick peers as soon as they fail to respond correctly to a single message, but your peer list would lose stability, since you'd likely kick many peers that are merely slow or suffering dropped packets or poor connectivity. On the other hand, it would be very responsive when you lose your internet connection: you'd immediately drop all your peers.

I think the solution here is to decrease our TCP ping interval for liveness checks. That should make your node more responsive to these events without strongly affecting peer-count stability.
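
For illustration, a minimal sketch of what this looks like with rust-libp2p's ping behaviour. The interval, timeout and failure-count values here are made-up examples, not our actual settings:

use std::num::NonZeroU32;
use std::time::Duration;

use libp2p::ping::{Ping, PingConfig};

// Sketch only -- the numbers are illustrative, not Lighthouse's real settings.
fn liveness_ping() -> Ping {
    Ping::new(
        PingConfig::new()
            // Ping more frequently so a dead connection is noticed sooner.
            .with_interval(Duration::from_secs(15))
            // How long to wait for a pong before counting a failure.
            .with_timeout(Duration::from_secs(10))
            // Tolerate one missed pong before the connection is considered
            // dead, trading responsiveness against peer-list churn.
            .with_max_failures(NonZeroU32::new(2).expect("2 is non-zero")),
    )
}

With settings like these, a node whose connections were all invalidated by an IP change would notice within roughly interval × max_failures, i.e. about half a minute, rather than many minutes.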

@m0n5t3r
Author

m0n5t3r commented Dec 29, 2020

I've tunneled it to a VPS with a stable IP now, which basically solved the issue: WireGuard took the address changes like a champ, there were no missed attestations, and it adds some privacy. On that occasion I discovered geth was generating insane amounts of traffic, so I suspect the ISP was kicking me offline every 100 GB or so, or it didn't like me exchanging UDP packets with ~100 random IPs from the internet. The VPS only has 1 TB of traffic included, so I ended up switching geth to light-client mode.
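
In case it's useful to others: the tunnel is a plain WireGuard point-to-point setup, with the VPS forwarding the beacon's p2p port (9000 by default) back through the tunnel, so peers always see the VPS's stable address. A minimal sketch of the home-node side, with placeholder keys, addresses and endpoint:

# /etc/wireguard/wg0.conf on the home node -- keys, addresses and
# endpoint below are placeholders
[Interface]
PrivateKey = <home-node-private-key>
Address = 10.0.0.2/24

[Peer]
PublicKey = <vps-public-key>
Endpoint = vps.example.com:51820
AllowedIPs = 10.0.0.0/24
# Keep re-handshaking through NAT so the tunnel survives ISP reconnects
PersistentKeepalive = 25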

I'll try to set up a testnet validator when I have time, and come back with debug logs.

@AgeManning
Member

Wooh, 100 GB of data? Were you syncing?

@m0n5t3r
Author

m0n5t3r commented Dec 30, 2020

Nope, funnily enough I had no issues while syncing (the validator wasn't active at that point). Based on geth's own metrics it was doing 600 kB/s constantly (p2p ingress + egress), but the Linux ethernet metrics showed ~1.5 MB/s, which works out to ~126 GB/day (1.5 MB/s × 86,400 s); this is also doubled for the VPN endpoint. I'd still like to run a full eth1 node eventually, but probably in a datacenter somewhere. VPS storage is outrageously expensive, so I may splurge and get a 30-ish EUR Hetzner auction machine with enough SSD at some point next year.

@AgeManning
Member

Hmm. I'm running a synced geth node and it seems to sit around 130-300 kB/s. Mostly in the low 100s, but it jumps to 300 kB/s every now and then.

Strange that yours was at 600 kB/s constantly.

bors bot pushed a commit that referenced this issue Jan 6, 2021
## Issue Addressed

#2123

## Description

Reduces the TCP ping interval to increase our responsiveness to peer liveness changes.
@greg-flexidao

I have a very similar issue. I bonded two network interfaces in "active backup" mode, with each interface connected to a different ISP. Once the primary interface goes down, it takes the Lighthouse beacon node 5 to 15 minutes to react: I'm connected to 100 peers, and only 5-15 minutes later does the count start gradually decreasing from 100, while the node is still not syncing with the network.
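
For reference, the bond is a standard active-backup setup, roughly like this with NetworkManager (interface names are placeholders):

# Hypothetical sketch of the bond; eth0/eth1 are placeholders
nmcli con add type bond ifname bond0 con-name bond0 \
    bond.options "mode=active-backup,miimon=100,primary=eth0"
nmcli con add type bond-slave ifname eth0 master bond0
nmcli con add type bond-slave ifname eth1 master bond0

Note that failing over to the second ISP still changes the node's public IP, which presumably puts the beacon in the same situation as the original report.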

@AgeManning
Member

I've reduced the ping interval, which should make this more responsive. However, even the current settings shouldn't take 5 minutes to drop and reconnect 100 peers.

I've made a separate issue to investigate and correct this: #2146

@AgeManning
Member

AgeManning commented Jan 8, 2021

Lighthouse should now be significantly more responsive to disconnects in the next release. There was a scoring bug that allowed peers to linger around longer than they should have. I'll consider this resolved in #2147.
