
fix chanArb deadlock #9253

Open · wants to merge 2 commits into master from fix-chanArb-deadlock

Conversation

@ziggie1984 (Collaborator) commented Nov 9, 2024

This PR does two things:

  1. Fixes [bug]: ChannelArbitrator does not cleanly stop #8149: it now starts a chainArb-level goroutine that is responsible for stopping the channel arbitrators' goroutines once they are fully resolved.

  2. Starts the individual channel arbitrators during startup in an errGroup and collects the results concurrently. This makes sure LND starts up correctly: the channel arbitrators sometimes depend on other subsystems (for example taproot assets), so we need to make sure we do not block here forever. A minimal sketch of this pattern is shown below.
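
A minimal sketch of the errGroup startup pattern described in point 2; the arbitrator type and startAll helper are hypothetical stand-ins, not lnd's actual API:

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

type arbitrator struct{ id int }

// Start stands in for ChannelArbitrator.Start; the real one may block
// on other subsystems, which is why all arbitrators start concurrently.
func (a *arbitrator) Start(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

func startAll(ctx context.Context, arbs []*arbitrator) error {
	eg, egCtx := errgroup.WithContext(ctx)
	for _, a := range arbs {
		a := a // capture loop variable (pre-Go 1.22)
		eg.Go(func() error {
			return a.Start(egCtx)
		})
	}

	// Wait returns the first non-nil error; egCtx is canceled as soon
	// as any Start fails, letting the remaining ones abort early.
	return eg.Wait()
}

func main() {
	arbs := []*arbitrator{{id: 1}, {id: 2}}
	if err := startAll(context.Background(), arbs); err != nil {
		fmt.Println("startup failed:", err)
	}
}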

EDIT:

Changed the approach; the new version prevents the deadlock from happening.

A separate PR will be created to start the arbitrators concurrently.


@ziggie1984 self-assigned this Nov 9, 2024
@ziggie1984 force-pushed the fix-chanArb-deadlock branch 2 times, most recently from 7a5003b to b6296d2 on November 9, 2024 at 14:24
@@ -1192,9 +1232,6 @@ func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint) (*wire.Msg
// channel has finished its final funding flow, it should be registered with
// the ChainArbitrator so we can properly react to any on-chain events.
func (c *ChainArbitrator) WatchNewChannel(newChan *channeldb.OpenChannel) error {
c.Lock()
@ziggie1984 (Collaborator, Author):

I'm not sure, but I think we cannot hold the chainArb lock here while also starting the ChannelArbitrator: the ChannelArbitrator might call ResolveContract, which needs the chainArb lock as well. This deadlock was probably never seen in the wild, but I think we need to unlock the chainArb before starting the ChannelArb?

@ziggie1984 force-pushed the fix-chanArb-deadlock branch 3 times, most recently from ce71936 to 1d76335 on November 9, 2024 at 15:47
@ziggie1984 marked this pull request as ready for review on November 9, 2024 at 15:49
@ziggie1984 force-pushed the fix-chanArb-deadlock branch 4 times, most recently from c46f6b9 to b91eee3 on November 9, 2024 at 20:46
@yyforyongyu (Member)

I would re-assess this issue after blockbeat, as it greatly refactors the resolvers and the issue will likely be gone; this also reduces rebase conflicts from either side.

@ziggie1984 (Collaborator, Author)

> I would re-assess this issue after blockbeat, as it greatly refactors the resolvers and the issue will likely be gone; this also reduces rebase conflicts from either side.

That would be cool. The main reason is that some external services might depend on LND's successful startup while also having dependencies of their own when starting the ChannelArbitrator, so let's see then.

@guggero mentioned this pull request Nov 11, 2024
@guggero (Collaborator) left a comment:

Thanks a lot for the quick fixes! I think we need to handle a couple of edge cases a bit more, but the general approach looks good!

@@ -1192,9 +1232,6 @@ func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint) (*wire.Msg
// channel has finished its final funding flow, it should be registered with
// the ChainArbitrator so we can properly react to any on-chain events.
func (c *ChainArbitrator) WatchNewChannel(newChan *channeldb.OpenChannel) error {
c.Lock()
defer c.Unlock()

@guggero (Collaborator):

We now no longer guard the read access to c.activeChannels below (L1205). Perhaps we need to introduce an RWMutex instead?
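
A minimal sketch of that suggestion; chainArb and channelArb are simplified stand-ins for the real ChainArbitrator and ChannelArbitrator:

package contractcourt

import (
	"sync"

	"github.com/btcsuite/btcd/wire"
)

type channelArb struct{} // stand-in for ChannelArbitrator

type chainArb struct {
	sync.RWMutex
	activeChannels map[wire.OutPoint]*channelArb
}

// lookup only takes the read lock, so concurrent readers no longer
// serialize on the full mutex.
func (c *chainArb) lookup(chanPoint wire.OutPoint) (*channelArb, bool) {
	c.RLock()
	defer c.RUnlock()

	arb, ok := c.activeChannels[chanPoint]
	return arb, ok
}

// insert still takes the exclusive write lock.
func (c *chainArb) insert(chanPoint wire.OutPoint, arb *channelArb) {
	c.Lock()
	defer c.Unlock()

	c.activeChannels[chanPoint] = arb
}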

chainArb := c.activeChannels[chanPoint]
c.Unlock()
if chainArb != nil {
arbLog := chainArb.log
@guggero (Collaborator):

Could arbLog still be nil at this point? Perhaps the above condition should be if chainArb != nil && chainArb.log != nil?

c.wg.Add(1)
go c.channelAttendant(bestHeight)
return nil
err = c.wg.Go(func(ctx context.Context) {
@guggero (Collaborator):

nit: could return directly here.

Member:

So what does this do exactly? No error is returned from channelAttendant.

Was this issue actually introduced by adding a goroutine in the prior commit?

// timeouts for itests and normal operations.
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)

// Create an errgroup with the context
@guggero (Collaborator):

nit: missing full stop at end of sentence, here and a couple of places below.

for _, arbitrator := range c.activeChannels {
startState, ok := startStates[arbitrator.cfg.ChanPoint]
if !ok {
stopAndLog()
// In case we encounter an error we need to cancel the
@guggero (Collaborator):

nit: add empty line before comment if it isn't at the start of a block.


// Start arbitrator in a separate goroutine
go func() {
errChan <- arbitrator.Start(startState)
@guggero (Collaborator):

Isn't the whole goal of the errGroup that we have a context that is canceled when an error occurs?
But now we're starting a new goroutine inside the errGroup goroutine just so we can abort Start()?
This will work, I think. But perhaps another possible approach would be to pass the cancellable context into Start() and abort on context cancellation there?
Otherwise we'd kind of kill/abandon the goroutines spawned in Start() on shutdown.
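
A rough sketch of passing the cancellable context into Start; the ready channel is a hypothetical stand-in for whatever dependency Start blocks on:

package main

import (
	"context"
	"fmt"
	"time"
)

type arbitrator struct {
	// ready is closed once the arbitrator's dependencies are up.
	ready chan struct{}
}

// Start aborts cleanly when the errGroup's context is canceled instead
// of leaving an abandoned goroutine blocking on a dependency.
func (a *arbitrator) Start(ctx context.Context) error {
	select {
	case <-a.ready:
		// Dependencies are up; continue normal startup here.
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	arb := &arbitrator{ready: make(chan struct{})}

	ctx, cancel := context.WithTimeout(
		context.Background(), time.Second,
	)
	defer cancel()

	// ready is never closed here, so Start returns once ctx expires
	// instead of blocking forever.
	fmt.Println(arb.Start(ctx))
}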

@ziggie1984 (Collaborator, Author):

very good idea

Member:

Also, why isn't arb.Start() returned directly as the error here? That would actually use the errGroup features.

Zooming out, what we want here is that chanArb.Start() actually doesn't block, but AFAICT it'll wait for all the goroutines to start below, which can still enter the deadlock scenario we were trying to resolve.

@ziggie1984 (Collaborator, Author):

Not sure if I understand your question, but I introduced the errGroup and this goroutine because otherwise I cannot fail the goroutines as soon as one of them fails with an error. The normal errGroup waits until all goroutines are done, which would bring us into the deadlock again. But yeah, this is not really necessary with the new approach.

select {
// As soon as the context cancels we can be sure the
// errGroup has finished waiting.
case <-ctx.Done():
@guggero (Collaborator):

I'm not sure we can rely on the context being canceled here. The Godoc for the errGroup says "the first call to return a non-nil error cancels the group's context". But if there is no failure, there won't be a cancel.

Perhaps instead of using an errGroup we just create an error channel sized for the number of arbitrators we start, start them all in goroutines, and then wait for them to complete here by reading as many errors (or nils) from the channel as there are arbitrators.
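
A sketch of that buffered-error-channel alternative, again with stand-in types:

package contractcourt

type arbitrator struct{}

// Start stands in for ChannelArbitrator.Start.
func (a *arbitrator) Start() error { return nil }

// startAll buffers the channel to the number of arbitrators, so no
// sender ever blocks, then reads exactly one result per arbitrator.
func startAll(arbs []*arbitrator) error {
	errChan := make(chan error, len(arbs))
	for _, arb := range arbs {
		arb := arb // capture loop variable (pre-Go 1.22)
		go func() {
			errChan <- arb.Start()
		}()
	}

	var firstErr error
	for range arbs {
		if err := <-errChan; err != nil && firstErr == nil {
			firstErr = err
		}
	}

	return firstErr
}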

@@ -258,6 +258,10 @@ type ChainArbitrator struct {
// methods and interface it needs to operate.
cfg ChainArbitratorConfig

// resolveContract is a channel which is used to signal the cleanup of
Member:

Re the commit comment: do we have a demonstration of the supposed deadlock?

@@ -509,44 +517,24 @@ func (c *ChainArbitrator) ResolveContract(chanPoint wire.OutPoint) error {
return err
}

// Now that the channel has been marked as fully closed, we'll stop
Member:

So what's the deadlock scenario here? That the channel arb calls this function while the chain arb is trying to stop it?

I think that can alternatively be handled with an async call from the chan arb. At that point it's shutting down and can't really do much with any error returned here, as all the contracts have been resolved (the channel is fully closed).
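
A hedged sketch of that async call; resolveAsync and the resolve callback are hypothetical, standing in for however the chan arb would reach ResolveContract:

package contractcourt

import (
	"log"

	"github.com/btcsuite/btcd/wire"
)

// resolveAsync fires the resolution in a detached goroutine during
// shutdown, so the channel arb never blocks on the chain arb's lock.
// Any error is only logged: the contracts are all resolved at this
// point, so there is nothing useful left to do with it.
func resolveAsync(chanPoint wire.OutPoint,
	resolve func(wire.OutPoint) error) {

	go func() {
		if err := resolve(chanPoint); err != nil {
			log.Printf("unable to resolve contract %v: %v",
				chanPoint, err)
		}
	}()
}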


c.wg.Add(1)
go c.resolveContract(contract, immediate)
err := c.wg.Go(func(ctx context.Context) {
c.resolveContract(contract, immediate)
Member:

Same here re not returning an err at all.



@Roasbeef (Member)

Seeing this laid out a bit, I wonder if we should entertain the other idea that @ziggie1984 had: modify ForceClose to only conditionally try to make the chan close summary.

In terms of breaking changes, we can sidestep that by using a new set of functional options for the main arg. This way we only need to update callers at the site of the new unit tests, and then also the ChainArb.
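
A sketch of the functional-options idea; the option name and the skipResolutions flag are hypothetical, not the final API:

package lnwallet

type forceCloseOpts struct {
	skipResolutions bool
}

// ForceCloseOpt is a functional option for ForceClose, so new behavior
// can be added without breaking existing callers.
type ForceCloseOpt func(*forceCloseOpts)

// WithSkipResolutions (hypothetical) tells ForceClose not to build the
// resolutions for the close summary.
func WithSkipResolutions() ForceCloseOpt {
	return func(o *forceCloseOpts) {
		o.skipResolutions = true
	}
}

type channel struct{} // stand-in for LightningChannel

func (lc *channel) ForceClose(opts ...ForceCloseOpt) error {
	o := &forceCloseOpts{}
	for _, opt := range opts {
		opt(o)
	}

	if !o.skipResolutions {
		// ... build the resolutions for the close summary ...
	}

	// ... sign and broadcast the commitment as before ...
	return nil
}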

@ziggie1984 (Collaborator, Author) commented Nov 12, 2024

@Roasbeef OK, changed the approach for now; just going for the optional-resolution approach.

This will definitely solve the deadlock issue, but we should still add the async arbitrator startup feature. Will create a separate PR for that.


// Resolutions contains all the data required for resolving the
// different output types of a commitment transaction.
Resolutions fn.Option[Resolutions]
@ziggie1984 (Collaborator, Author):

I went for this approach here; it would be too much of a change to introduce options for all the separate types, like for example:

AnchorResolution fn.Option[AnchorResolution] ...

because we use the nil case a lot and also pass it into other structures. Maybe a deeper refactor in the long run, but not now.
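
A sketch of how callers might consume the new field, assuming the semantics of lnd's fn package (fn.Some, fn.None, WhenSome) and heavily simplified types:

package lnwallet

import "github.com/lightningnetwork/lnd/fn"

type Resolutions struct{} // simplified

type LocalForceCloseSummary struct {
	// Resolutions is only populated when the caller asked for it.
	Resolutions fn.Option[Resolutions]
}

func buildSummary(withResolutions bool,
	res Resolutions) LocalForceCloseSummary {

	summary := LocalForceCloseSummary{
		Resolutions: fn.None[Resolutions](),
	}
	if withResolutions {
		summary.Resolutions = fn.Some(res)
	}

	// Former nil-checks become an explicit unwrap:
	summary.Resolutions.WhenSome(func(r Resolutions) {
		// ... hand the resolutions to the chain arbitrator ...
	})

	return summary
}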

@guggero (Collaborator) left a comment:

Thanks a lot for the fix! This approach makes a lot of sense to me. Just a couple of minor suggestions, otherwise LGTM 🎉

Outdated review threads on lnwallet/channel.go and contractcourt/chain_arbitrator.go were resolved.
@guggero (Collaborator) commented Nov 13, 2024

Some itests now seem to fail. Perhaps we need to increase some timeouts or wait for a different signal since things are now a bit more async?

    harness.go:353: Finished the setup, now running tests...
    --- FAIL: TestLightningNetworkDaemon/tranche00/05-of-174/btcd/channel_backup_restore_basic (55.79s)
        --- FAIL: TestLightningNetworkDaemon/tranche00/05-of-174/btcd/channel_backup_restore_basic/restore_from_RPC_backup (52.25s)
            harness_rpc.go:100: 
                	Error Trace:	/home/runner/work/lnd/lnd/lntest/rpc/harness_rpc.go:100
                	            				/home/runner/work/lnd/lnd/lntest/rpc/lnd.go:46
                	            				/home/runner/work/lnd/lnd/lntest/harness_assertion.go:90
                	            				/home/runner/work/lnd/lnd/lntest/wait/wait.go:51
                	            				/home/runner/work/lnd/lnd/lntest/wait/wait.go:27
                	            				/opt/hostedtoolcache/go/1.22.6/x64/src/runtime/asm_amd64.s:1695
                	Error:      	Received unexpected error:
                	            	rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:10630: connect: connection refused"
                	Messages:   	carol: failed to call ListPeers
            harness_assertion.go:105: 
                	Error Trace:	/home/runner/work/lnd/lnd/lntest/harness_assertion.go:105
                	            				/home/runner/work/lnd/lnd/lntest/harness_assertion.go:239
                	            				/home/runner/work/lnd/lnd/itest/lnd_channel_backup_test.go:1564
                	            				/home/runner/work/lnd/lnd/itest/lnd_channel_backup_test.go:231
                	            				/home/runner/work/lnd/lnd/itest/lnd_channel_backup_test.go:459
                	            				/home/runner/work/lnd/lnd/itest/lnd_channel_backup_test.go:427
                	Error:      	Received unexpected error:
                	            	method did not return within the timeout
                	Test:       	TestLightningNetworkDaemon/tranche00/05-of-174/btcd/channel_backup_restore_basic/restore_from_RPC_backup
                	Messages:   	unable to connect carol to dave, got error: peers not connected within 30s seconds

@ziggie1984 (Collaborator, Author)

Hmm, strange; as far as I can tell this PR did not introduce any new timeouts. I'll take a look.

We don't always need the resolutions in the local force close summary, so we make it an option.
Development

Successfully merging this pull request may close these issues.

[bug]: ChannelArbitrator does not cleanly stop