Improve lighthouse's disconnect responsiveness #2146

Closed
AgeManning opened this issue Jan 7, 2021 · 3 comments

Comments

@AgeManning
Member

Description

Users are reporting that Lighthouse takes 5 to 15 minutes to drop their peers once a connection drops/changes (#2123).

It is known that discv5 will take at least 5 minutes to update the ENR, but that is unrelated and addressed in a separate issue (#2131).

In current stable we ping every 30 seconds. It should take 2 failed pings to disconnect a peer, so I'd expect all peers to be dropped in under a minute. In #2132 I reduced the ping interval to 15 seconds, so on the unstable branch I'd expect all peers to be dropped in under 30 seconds.

This doesn't seem to be the case. We should investigate why peers are taking longer to drop and reconnect when connectivity drops.
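
For reference, the expectation above is just the ping interval multiplied by the number of failed pings needed for a disconnect. A minimal sketch of that arithmetic follows; the function name `expected_drop_time` is hypothetical and not part of Lighthouse, and the time spent waiting for each ping to time out is ignored.

```rust
use std::time::Duration;

/// Rough estimate of how long an unreachable peer stays connected,
/// assuming it is dropped after `failed_pings` consecutive failed pings
/// and one ping is sent per `ping_interval`.
/// Hypothetical helper for illustration; it ignores the ping timeout itself.
fn expected_drop_time(ping_interval: Duration, failed_pings: u32) -> Duration {
    ping_interval * failed_pings
}
```

With a 30 second interval and 2 failed pings this gives 60 seconds (current stable); with the 15 second interval from #2132 it gives 30 seconds.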

@pawanjay176
Member

Looks like this happens because a ping timeout is scored as a HighToleranceError, which only leads to a disconnection after ~10 minutes.
Changing the ping timeout to LowTolerance kicks out all peers in ~1 minute.
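
To illustrate why the tolerance level matters so much, here is a rough sketch of a penalty-based peer score. The penalty values and the disconnect threshold are assumptions chosen so that the arithmetic reproduces the timings reported in this thread with a 30 second ping interval; they are not confirmed from the Lighthouse source, and the type and function names are hypothetical. Score decay over time is ignored.

```rust
use std::time::Duration;

/// Hypothetical tolerance levels, named after the ones mentioned in this thread.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tolerance {
    Low,
    High,
}

/// ASSUMED score at or below which a peer is disconnected (illustrative value).
const DISCONNECT_THRESHOLD: f64 = -20.0;

impl Tolerance {
    /// ASSUMED score penalty applied per error (illustrative values).
    fn penalty(self) -> f64 {
        match self {
            Tolerance::Low => -10.0,
            Tolerance::High => -1.0,
        }
    }
}

/// Number of errors needed for the score to reach the disconnect threshold,
/// starting from a score of zero and ignoring any score decay.
fn errors_until_disconnect(tolerance: Tolerance) -> u32 {
    (DISCONNECT_THRESHOLD / tolerance.penalty()).ceil() as u32
}

/// Approximate time until disconnect if every ping fails.
fn time_until_disconnect(tolerance: Tolerance, ping_interval: Duration) -> Duration {
    ping_interval * errors_until_disconnect(tolerance)
}
```

Under these assumed numbers, a high-tolerance ping timeout needs 20 failed pings (about 10 minutes at 30 seconds each), while a low-tolerance one needs only 2 (about 1 minute).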

@bwtimmermans

My similar experience (version: Lighthouse/v1.0.5-9ed65a6) is that very short connection outages, like <30s, often result in one or more missed attestations by Lighthouse, but not Teku, on the same network. The situation is I have a faulty telephone line awaiting repair and periodically the ADSL drops for about 30s before reconnecting. This afternoon I missed 4 attestations in a row on a LH validator but zero missed attestations on a Teku validator on the same network (different machine).

If peers are not disconnected for ~10 minutes, then why would a 30s Internet outage result in 15 minutes' worth of missed attestations? One possible answer is that the IP address changed, although I cannot confirm whether that is the case, nor would I expect it to happen every time the connection goes down for 30s.

bors bot pushed a commit that referenced this issue Jan 8, 2021
## Issue Addressed

Fixes #2146 

## Proposed Changes

Change ping timeout errors to return `LowToleranceErrors` so that we disconnect faster on internet failures/changes.

@AgeManning
Member Author

@pawanjay176 nice.

We were not scoring the negotiation timeouts as I had expected. #2147 should make lighthouse significantly more responsive to disconnects.

bors bot pushed a commit that referenced this issue Jan 10, 2021
bors bot pushed a commit that referenced this issue Jan 13, 2021