Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server.go: replace call to removePeer with Disconnect in DisconnectPeer #6655

Merged
merged 3 commits into from
Jun 30, 2022

Conversation

Crypt-iQ
Copy link
Collaborator

Without this, calls to DisconnectPeer would bypass the peerTerminationWatcher and allow subsequent connect requests to go through before the peer's links were fully shut down. This could lead to force closes.

One way it could trigger a force close that made it look like the user lost state is:

  • The existing link has sent a CommitSig and is waiting for RevokeAndAck. The database state is persisted under the commit diff key
  • The node's peer.Brontide object receives the RevokeAndAck and will attempt to pass it to the link.
  • The node issues a DisconnectPeer call. removePeer is called which removes the peer.Brontide object from the map without waiting for the associated ChannelLinks to stop. This bypasses the peerTerminationWatcher!
  • The node hands the RevokeAndAck to the ChannelLink - it is now in the link's mailbox.
  • The node reconnects with the counter-party. It can do so even though the ChannelLink isn't stopped. It will call FetchOpenChannels when creating the ChannelLinks in loadActiveChannels. FetchOpenChannels does NOT include the pending remote commitment.
  • The node's existing ChannelLink processes the RevokeAndAck and removes the pending remote commitment all before the Init handshake completes.
  • After the Init handshake, NewLightningChannel is called for each channel that will result in a ChannelLink. Here NLC will fetch the pending remote commitment, which no longer exists since the existing ChannelLink removed it. The local and remote commitments are not updated here. So the remote commitment is behind by 1.
  • Existing ChannelLink goes down eventually.
  • New ChannelLink performs ChannelReestablish handshake and triggers DLP. No data is actually lost as the erroneous state is all in-memory and the correct data is on-disk.

Fixes #6639

This could also trigger #6593 (though there are other ways besides using DisconnectPeer) and probably #6617

@Crypt-iQ Crypt-iQ added server P1 MUST be fixed or reviewed bug fix force closes labels Jun 21, 2022
@Roasbeef Roasbeef added this to the v0.15.1 milestone Jun 21, 2022
@Crypt-iQ Crypt-iQ marked this pull request as ready for review June 22, 2022 17:55
@yyforyongyu yyforyongyu self-requested a review June 23, 2022 06:03
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🦄

@yyforyongyu
Copy link
Member

yyforyongyu commented Jun 27, 2022

The node issues a DisconnectPeer call. removePeer is called which removes the peer.Brontide object from the map without waiting for the associated ChannelLinks to stop. This bypasses the peerTerminationWatcher!

What triggers the DisconnectPeer call?

The node reconnects with the counter-party. It can do so even though the ChannelLink isn't stopped. It will call FetchOpenChannels when creating the ChannelLinks in loadActiveChannels. FetchOpenChannels does NOT include the pending remote commitment.

So all these happen in a short time while the ChannelLink is stopping? Is this connect/reconnect triggered by the internal logic in our permanent peer watcher?

@yyforyongyu
Copy link
Member

Also pending a release note.

server.go Show resolved Hide resolved
@Crypt-iQ
Copy link
Collaborator Author

Crypt-iQ commented Jun 27, 2022

What triggers the DisconnectPeer call?

There are some call-sites (like autopilot), but probably the rpc call

So all these happen in a short time while the ChannelLink is stopping?

Yes like a rapid reconnect. Depending on the load of ChannelLink it could take a while to come to a stop

So all these happen in a short time while the ChannelLink is stopping? Is this connect/reconnect triggered by the internal logic in our permanent peer watcher?

Not sure, but if there was a running connection request, then the reconnect could be triggered by it

@yyforyongyu
Copy link
Member

Cool pending the release note otherwise LGTM🎉

@Roasbeef
Copy link
Member

Just needs a release notes entry and this can land!

Crypt-iQ added 3 commits June 29, 2022 13:50
Without this, calls to DisconnectPeer would bypass the
peerTerminationWatcher and allow subsequent connect requests to
go through before the peer's links were fully shut down. This could
lead to force closes.
@Crypt-iQ Crypt-iQ force-pushed the disconnectpeerfix branch from 0aec361 to ff39f1e Compare June 29, 2022 17:50
@Crypt-iQ
Copy link
Collaborator Author

added note

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ChanStatusBorked, closure with ZFR, "ghost" balance sent to unknown addresses
3 participants