fix chanArb deadlock #9253
base: master
Conversation
Force-pushed from 7a5003b to b6296d2
contractcourt/chain_arbitrator.go
Outdated
```diff
@@ -1192,9 +1232,6 @@ func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint) (*wire.Msg
 // channel has finished its final funding flow, it should be registered with
 // the ChainArbitrator so we can properly react to any on-chain events.
 func (c *ChainArbitrator) WatchNewChannel(newChan *channeldb.OpenChannel) error {
-	c.Lock()
```
Not sure, but I think we cannot lock the chainArb here and also start the ChannelArbitrator, because the ChannelArbitrator might call `ResolveContract`, which needs the ChainArb lock as well. This deadlock was probably never seen in the wild, but I think we need to unlock the ChainArb before starting the ChannelArb?
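To make the scenario concrete, here is a minimal, self-contained sketch of the re-entrant locking pattern described above. All names are illustrative stand-ins rather than lnd's actual code; running it trips Go's run-time deadlock detector:

```go
package main

import "sync"

// chainArb stands in for the ChainArbitrator: a non-reentrant mutex
// guarding a map of active channel arbitrators.
type chainArb struct {
	sync.Mutex
	activeChannels map[string]bool
}

// ResolveContract mirrors the real ResolveContract in that it needs the
// chainArb lock to remove the channel from the active set.
func (c *chainArb) ResolveContract(chanPoint string) {
	c.Lock()
	defer c.Unlock()
	delete(c.activeChannels, chanPoint)
}

// WatchNewChannel mirrors the problematic pattern: it holds the lock while
// starting the channel arbitrator, whose start path may call back into
// ResolveContract.
func (c *chainArb) WatchNewChannel(chanPoint string) {
	c.Lock()
	defer c.Unlock()
	c.activeChannels[chanPoint] = true

	// If startup resolves the contract synchronously, this re-acquires
	// the same sync.Mutex, which is not reentrant: deadlock.
	c.ResolveContract(chanPoint)
}

func main() {
	c := &chainArb{activeChannels: make(map[string]bool)}
	c.WatchNewChannel("chan-1") // Blocks forever on the second Lock().
}
```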
Force-pushed from ce71936 to 1d76335
Force-pushed from c46f6b9 to b91eee3
I would re-assess this issue after …
That would be cool. The main reason is that some external services might depend on the successful startup of LND, but they also have dependencies when starting the ChannelArbitrator, so let's see then.
Thanks a lot for the quick fixes! I think we need to handle a couple of edge cases a bit more, but the general approach looks good!
contractcourt/chain_arbitrator.go
Outdated
```diff
@@ -1192,9 +1232,6 @@ func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint) (*wire.Msg
 // channel has finished its final funding flow, it should be registered with
 // the ChainArbitrator so we can properly react to any on-chain events.
 func (c *ChainArbitrator) WatchNewChannel(newChan *channeldb.OpenChannel) error {
-	c.Lock()
-	defer c.Unlock()
```
We now no longer guard the read access to `c.activeChannels` below (L1205). Perhaps we need to introduce an `RWMutex` instead?
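A minimal sketch of what the suggested `RWMutex` guard could look like, with illustrative stand-in types rather than lnd's actual ones: reads of `activeChannels` take the shared lock, writes take the exclusive one.

```go
package main

import "sync"

// channelArb is an illustrative stand-in for a *ChannelArbitrator.
type channelArb struct{ id string }

// chainArb guards the arbitrator map with an RWMutex so that lookups can
// proceed concurrently while mutations remain exclusive.
type chainArb struct {
	sync.RWMutex
	activeChannels map[string]*channelArb
}

// lookup guards the read access to activeChannels with the read lock.
func (c *chainArb) lookup(chanPoint string) *channelArb {
	c.RLock()
	defer c.RUnlock()
	return c.activeChannels[chanPoint]
}

// register takes the exclusive lock for the map mutation.
func (c *chainArb) register(arb *channelArb) {
	c.Lock()
	defer c.Unlock()
	c.activeChannels[arb.id] = arb
}

func main() {
	c := &chainArb{activeChannels: make(map[string]*channelArb)}
	c.register(&channelArb{id: "chan-1"})
	_ = c.lookup("chan-1")
}
```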
contractcourt/chain_arbitrator.go
Outdated
```go
	chainArb := c.activeChannels[chanPoint]
	c.Unlock()
	if chainArb != nil {
		arbLog := chainArb.log
```
Could `arbLog` still be `nil` at this point? Perhaps the above condition should be `if chainArb != nil && chainArb.log != nil`?
contractcourt/channel_arbitrator.go
Outdated
```diff
-	c.wg.Add(1)
-	go c.channelAttendant(bestHeight)
-	return nil
+	err = c.wg.Go(func(ctx context.Context) {
```
nit: could return directly here.
So what does this do exactly? No error is returned from `channelAttendant`.

Was this issue actually introduced by adding the goroutine in the prior commit?
contractcourt/chain_arbitrator.go
Outdated
```go
	// timeouts for itests and normal operations.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)

	// Create an errgroup with the context
```
nit: missing full stop at end of sentence, here and a couple of places below.
contractcourt/chain_arbitrator.go
Outdated
```go
	for _, arbitrator := range c.activeChannels {
		startState, ok := startStates[arbitrator.cfg.ChanPoint]
		if !ok {
			stopAndLog()
			// In case we encounter an error we need to cancel the
```
nit: add empty line before comment if it isn't at the start of a block.
contractcourt/chain_arbitrator.go
Outdated
```go
	// Start arbitrator in a separate goroutine
	go func() {
		errChan <- arbitrator.Start(startState)
```
Isn't the whole goal of the `errGroup` that we have a context that is canceled when an error occurs? But now we're starting a new goroutine inside the err group goroutine just so we can abort `Start()`?

This will work, I think. But perhaps another possible approach would be to pass the cancellable context into `Start()` and abort on context cancel there? Otherwise we'd kind of kill/abandon the goroutines spawned in `Start()` on shutdown.
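A sketch of that alternative, assuming a hypothetical `Start(ctx)` signature (not lnd's current API): the errgroup's context is passed into `Start`, so the first failure cancels every other arbitrator's startup instead of abandoning their goroutines.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// arbitrator is a stand-in for a ChannelArbitrator whose Start accepts a
// context, per the suggestion above.
type arbitrator struct{ id int }

func (a *arbitrator) Start(ctx context.Context) error {
	select {
	case <-time.After(time.Duration(a.id) * 50 * time.Millisecond):
		return nil // Startup work completed.
	case <-ctx.Done():
		// Another arbitrator failed, so abort early instead of
		// being killed/abandoned on shutdown.
		return ctx.Err()
	}
}

func main() {
	g, gctx := errgroup.WithContext(context.Background())
	for _, arb := range []*arbitrator{{1}, {2}, {3}} {
		arb := arb
		g.Go(func() error {
			// Return Start's error directly so the first failure
			// cancels gctx for all other arbitrators.
			return arb.Start(gctx)
		})
	}
	fmt.Println("startup result:", g.Wait())
}
```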
very good idea
Also, why isn't `arb.Start()` returned directly as the error here? That would actually use the `errgroup` features.

Zooming out, what we want here is that `chanArb.Start()` actually doesn't block, but AFAICT it'll wait for all the goroutines to start below, which can still enter the deadlock scenario we were trying to resolve.
Not sure if I understand your question, but I introduced the errGroup and this goroutine because otherwise I cannot fail the goroutines as soon as one goroutine fails with an error. The normal errGroup waits until all goroutines are done, which would bring us into the deadlock again. But yeah, not really necessary with the new approach.
contractcourt/chain_arbitrator.go
Outdated
```go
	select {
	// As soon as the context cancels we can be sure the
	// errGroup has finished waiting.
	case <-ctx.Done():
```
I'm not sure we can rely on the context being canceled here. The Godoc for the err group says "the first call to return a non-nil error cancels the group's context". But if there is no failure, there won't be a cancel.

Perhaps instead of using an err group we just create an error channel sized for the number of arbitrators we start, start them all in goroutines, then wait for them to complete here by reading as many `err` (or `nil`) values from the channel as there are arbitrators.
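A sketch of that error-channel pattern with illustrative stand-ins: the channel is buffered to the number of arbitrators, and the caller drains exactly that many results, so completion is awaited without relying on the errgroup's cancel-on-error context.

```go
package main

import "fmt"

// start is a stand-in for arbitrator.Start.
func start(id int) error { return nil }

func main() {
	arbs := []int{1, 2, 3}

	// Buffered so the started goroutines never block on send, even if
	// the reader bails out early.
	errChan := make(chan error, len(arbs))
	for _, id := range arbs {
		id := id
		go func() { errChan <- start(id) }()
	}

	// Wait by reading exactly as many results (err or nil) from the
	// channel as there are arbitrators.
	var firstErr error
	for range arbs {
		if err := <-errChan; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	fmt.Println("startup result:", firstErr)
}
```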
contractcourt/chain_arbitrator.go
Outdated
```diff
@@ -258,6 +258,10 @@ type ChainArbitrator struct {
 	// methods and interface it needs to operate.
 	cfg ChainArbitratorConfig

+	// resolveContract is a channel which is used to signal the cleanup of
```
Re the commit comment: do we have a demonstration of the supposed deadlock?
contractcourt/chain_arbitrator.go
Outdated
```diff
@@ -509,44 +517,24 @@ func (c *ChainArbitrator) ResolveContract(chanPoint wire.OutPoint) error {
 		return err
 	}

 	// Now that the channel has been marked as fully closed, we'll stop
```
So what's the deadlock scenario here? That the channel arb calls this function while the chain arb is trying to stop it?

I think that can alternatively be handled with an async call from the chan arb. At that point it's shutting down and can't really do much with any error returned here, as all the contracts have been resolved (the channel is fully closed).
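A sketch of that async hand-off with illustrative stand-ins: the channel arbitrator fires the resolution from its shutdown path without blocking on it, logging rather than returning any error.

```go
package main

import (
	"fmt"
	"time"
)

// channelArb holds a callback standing in for
// ChainArbitrator.ResolveContract.
type channelArb struct {
	resolveContract func(chanPoint string) error
}

// shutdown fires the resolution asynchronously: since the channel is fully
// closed and the arbitrator is stopping anyway, it cannot act on the error,
// so it avoids blocking (or deadlocking) on the chain arbitrator's lock.
func (c *channelArb) shutdown(chanPoint string) {
	go func() {
		// Best-effort: just log a failure instead of holding up the
		// shutdown path.
		if err := c.resolveContract(chanPoint); err != nil {
			fmt.Printf("unable to resolve %s: %v\n", chanPoint, err)
		}
	}()
}

func main() {
	arb := &channelArb{resolveContract: func(string) error { return nil }}
	arb.shutdown("chan-1")
	time.Sleep(50 * time.Millisecond) // Let the goroutine finish.
}
```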
contractcourt/channel_arbitrator.go
Outdated
```diff
-	c.wg.Add(1)
-	go c.resolveContract(contract, immediate)
+	err := c.wg.Go(func(ctx context.Context) {
+		c.resolveContract(contract, immediate)
```
Same here re not returning an err at all.
Seeing this laid out a bit, I wonder if we should entertain the other idea that @ziggie1984 had: modify … In terms of breaking changes, we can sidestep that by using a new set of functional options for the main arb. This way we only need to update callers at the site of the new unit tests.
Force-pushed from 9414e08 to 0c10a74
@Roasbeef ok, changed the approach for now, just going for the Optional Resolution approach. I will create a separate PR to make the startup async. This will definitely solve the deadlock issue, but we should still add the async arbitrator feature (will create a separate PR).
```go
	// Resolutions contains all the data required for resolving the
	// different output types of a commitment transaction.
	Resolutions fn.Option[Resolutions]
```
I went for this approach here; it would be too much of a change to introduce options for all the separate types, like for example `AnchorResolution fn.Option[AnchorResolution]`, because we use the `nil` case a lot and also pass it into other structures. Maybe a deeper refactor in the long run, but not now.
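For illustration, a small sketch of how an `fn.Option` field replaces the `nil` sentinel. The types and call sites here are made up for the example; it assumes lnd's `fn` package API (`fn.Some`, `fn.None`, `WhenSome`).

```go
package main

import (
	"fmt"

	"github.com/lightningnetwork/lnd/fn"
)

// Resolutions stands in for the commitment-resolution data.
type Resolutions struct{ NumHTLCs int }

type LocalForceCloseSummary struct {
	Resolutions fn.Option[Resolutions]
}

func main() {
	with := LocalForceCloseSummary{
		Resolutions: fn.Some(Resolutions{NumHTLCs: 2}),
	}
	// The "no resolutions" case is an explicit None, not a nil pointer.
	without := LocalForceCloseSummary{
		Resolutions: fn.None[Resolutions](),
	}

	for _, s := range []LocalForceCloseSummary{with, without} {
		// WhenSome only runs the closure when a value is present,
		// so callers never nil-check.
		s.Resolutions.WhenSome(func(r Resolutions) {
			fmt.Println("resolving", r.NumHTLCs, "HTLCs")
		})
	}
}
```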
Force-pushed from 0c10a74 to e7d50f9
Thanks a lot for the fix! This approach makes a lot of sense to me. Just a couple of minor suggestions, otherwise LGTM 🎉
Force-pushed from e7d50f9 to 2e314a4
Force-pushed from 2e314a4 to ff0622e
Some itests now seem to fail. Perhaps we need to increase some timeouts or wait for a different signal since things are now a bit more async?

Hmm, strange, this PR did not introduce any new timeout issues. I'll take a look.
We don't always need the resolutions in the local force close summary, so we make it an option.
Force-pushed from ff0622e to df295ea
This PR does two things:

1. Fixes [bug]: ChannelArbitrator does not cleanly stop #8149: it now starts a chainArb-level goroutine which is responsible for stopping the channel arbitrators' goroutines when they are fully resolved.
2. Starts the different channel arbitrators during startup in an errgroup and collects the results concurrently. This makes sure LND starts up correctly: sometimes the channelArbs depend on other subsystems (for example, taproot assets), so we need to make sure we do not block here forever.

EDIT: Changed the approach. This will prevent the deadlock from happening. A separate PR will be created to start the arbitrators concurrently.