Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EFM] Recoverable Random Beacon State Machine #6771

Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
100a92b
Changed structure of interfaces and corresponding implementation for …
durkmurder Nov 20, 2024
719b6a6
Removed 'DKG started' from storage.
durkmurder Nov 22, 2024
01bbdcf
Updated DKG states to have extra states. They no more represent the e…
durkmurder Nov 22, 2024
1d3c6a4
Updated usages of DKG storage in reactor engine
durkmurder Nov 22, 2024
6c34886
Added back GetDKGStarted for easier usage in reactor engine. Updated …
durkmurder Nov 22, 2024
8aac526
Implemented allowed state transitions for recoverable random beacon s…
durkmurder Nov 22, 2024
064b651
Fixed unit test compilation. Updated allowed state transitions
durkmurder Nov 22, 2024
6bcaf38
Renamed interface methods
durkmurder Nov 22, 2024
c13c7f6
Updated mocks
durkmurder Nov 22, 2024
489c871
Fixed tests for reactor engine
durkmurder Nov 22, 2024
10aab93
Updated godoc and reduced number of states for Recoverable state machine
durkmurder Nov 25, 2024
fb28249
Updated usages of DKG state. Updated naming, godocs
durkmurder Nov 25, 2024
fb161f7
Removed flow.RandomBeaconKeyRecovered state. Cleanup
durkmurder Nov 28, 2024
650b8b8
Updated how recovery happens in terms of inserting values
durkmurder Nov 28, 2024
2386623
Implemented test for enforcing invariants of the uninitialized state
durkmurder Nov 29, 2024
bfe95b0
Added additional test cases.
durkmurder Nov 29, 2024
79aac04
Updated logic for state transitions
durkmurder Nov 29, 2024
1119c8c
Added additional test for Completed state
durkmurder Nov 29, 2024
c8dfca6
Added tests for failure state
durkmurder Nov 29, 2024
c9182c5
Added extra tests for Random Beacon Key Committed state
durkmurder Nov 29, 2024
577712a
Updated godoc for DKG tests
durkmurder Dec 2, 2024
7e728aa
Godoc updates
durkmurder Dec 2, 2024
1fa11c6
Updated mocks
durkmurder Dec 2, 2024
a92d8b7
Naming updates
durkmurder Dec 2, 2024
f0be4ba
Fixed broken tests
durkmurder Dec 2, 2024
62d399d
Linted
durkmurder Dec 2, 2024
bea9c1a
Fixed broken integration tests for DKG
durkmurder Dec 2, 2024
89005a8
Merge branch 'feature/efm-recovery' into yurii/6725-recoverable-rando…
durkmurder Dec 3, 2024
0b9f732
Fixed initialization of beacon private key state machine
durkmurder Dec 3, 2024
5a98f76
Updated godoc. Added specific sentinel for error handling
durkmurder Dec 3, 2024
b639742
Updated assertions with the account for sentinel errors
durkmurder Dec 3, 2024
be18c57
Updated logging
durkmurder Dec 3, 2024
d0df4fb
Linted
durkmurder Dec 3, 2024
0f98cf7
Fixed invalid exit logic in DKG reactor engine
durkmurder Dec 3, 2024
6ea64d6
Fixed broken test
durkmurder Dec 4, 2024
6f7017a
Apply suggestions from code review
durkmurder Dec 5, 2024
ea0f412
Apply suggestions from PR review
durkmurder Dec 5, 2024
b9bc92e
Apply suggestions from PR review
durkmurder Dec 5, 2024
8f15c29
Apply suggestions from PR review
durkmurder Dec 5, 2024
43d6a63
Update engine/consensus/dkg/reactor_engine.go
durkmurder Dec 5, 2024
9302497
Allowed self transition to DKGStateFailure
durkmurder Dec 5, 2024
550fd3f
Updated reactor engine to handle invalid state transition at dkg end
durkmurder Dec 5, 2024
7053c88
Apply suggestions from code review
durkmurder Dec 10, 2024
0687e30
Apply suggestions from code review
durkmurder Dec 10, 2024
8226642
Apply suggestions from PR review
durkmurder Dec 11, 2024
66f67d4
Apply suggestions from code review
durkmurder Dec 11, 2024
56b1a63
Apply suggestions from PR review
durkmurder Dec 11, 2024
c0c6b7e
Apply suggestions from PR review
durkmurder Dec 12, 2024
00f952a
Added docs
durkmurder Dec 12, 2024
21ff3ef
Merge branch 'feature/efm-recovery' into yurii/6725-recoverable-rando…
durkmurder Dec 12, 2024
12a633a
Updated mocks
durkmurder Dec 12, 2024
aa377c2
Linted
durkmurder Dec 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 21 additions & 15 deletions engine/consensus/dkg/reactor_engine.go
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,7 @@ func (e *ReactorEngine) Ready() <-chan struct{} {
if phase == flow.EpochPhaseSetup {
e.startDKGForEpoch(currentCounter, first)
} else if phase == flow.EpochPhaseCommitted {
// If we start up in EpochCommitted phase, ensure the DKG end state is set correctly.
// If we start up in EpochCommitted phase, ensure the DKG current state is set correctly.
e.handleEpochCommittedPhaseStarted(currentCounter, first)
}
})
Expand Down Expand Up @@ -249,10 +249,10 @@ func (e *ReactorEngine) startDKGForEpoch(currentEpochCounter uint64, first *flow
//
durkmurder marked this conversation as resolved.
Show resolved Hide resolved
// This function checks that the local DKG completed and that our locally computed
// key share is consistent with the canonical key vector. When this function returns,
// an end state for the just-completed DKG is guaranteed to be stored (if not, the
// an current state for the just-completed DKG is guaranteed to be stored (if not, the
durkmurder marked this conversation as resolved.
Show resolved Hide resolved
// program will crash). Since this function is invoked synchronously before the end
// of the current epoch, this guarantees that when we reach the end of the current epoch
// we will either have a usable beacon key (successful DKG) or a DKG failure end state
// we will either have a usable beacon key (successful DKG) or a DKG failure current state
// stored, so we can safely fall back to using our staking key.
durkmurder marked this conversation as resolved.
Show resolved Hide resolved
//
// CAUTION: This function is not safe for concurrent use. This is not enforced within
Expand All @@ -267,13 +267,18 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin
Uint64("next_epoch", nextEpochCounter). // the epoch the just-finished DKG was preparing for
Logger()

// Check whether we have already set the end state for this DKG.
// Check whether we have already set the current state for this DKG.
// This can happen if the DKG failed locally, if we failed to generate
// a local private beacon key, or if we crashed while performing this
// check previously.
currentState, err := e.dkgState.GetDKGState(nextEpochCounter)
if err != nil {
log.Fatal().Err(err).Msg("failed to get dkg state, by this point it should have been set")
if errors.Is(err, storage.ErrNotFound) {
log.Warn().Msg("failed to get dkg state, assuming this node has skipped epoch setup phase")
} else {
log.Fatal().Err(err).Msg("failed to get dkg state")
}

return
}
if currentState != flow.DKGStateCompleted {
Expand All @@ -299,7 +304,7 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin
err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
if err != nil {
// TODO use irrecoverable context
log.Fatal().Err(err).Msg("failed to set dkg end state")
log.Fatal().Err(err).Msg("failed to set dkg state")
}
return
} else if err != nil {
Expand All @@ -316,7 +321,7 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin
}
localPubKey := myBeaconPrivKey.PublicKey()

// we computed a local beacon key but it is inconsistent with our canonical
// we computed a local beacon key, but it is inconsistent with our canonical
// public key - therefore it is unsafe for use
if !nextDKGPubKey.Equals(localPubKey) {
log.Warn().
Expand All @@ -326,15 +331,15 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin
err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️
I am worried that this code might be concurrently running with for example the recovery (if the node rebooted at a really unfortunate time). I think we should in each step assume that some other process might have concurrently advanced the state. So in each step, it might be possible that the flow.DKGState could have changed compared to what we just read from the data base.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively it seems to be a too relaxed behavior. For failure states we allow self-transitions, for everything else where we deviate from happy path and get an unexpected error I would be inclined to return an error so we can figure out what was wrong. For your particular scenario I am not very worried, it means that operator has to try again.

if err != nil {
// TODO use irrecoverable context
log.Fatal().Err(err).Msg("failed to set dkg end state")
log.Fatal().Err(err).Msg("failed to set dkg current state")
}
return
}

err = e.dkgState.SetDKGState(nextEpochCounter, flow.RandomBeaconKeyCommitted)
if err != nil {
// TODO use irrecoverable context
e.log.Fatal().Err(err).Msg("failed to set dkg end state")
e.log.Fatal().Err(err).Msg("failed to set dkg current state")
}
log.Info().Msgf("successfully ended DKG, my beacon pub key for epoch %d is %s", nextEpochCounter, localPubKey)
}
Expand Down Expand Up @@ -427,16 +432,17 @@ func (e *ReactorEngine) end(nextEpochCounter uint64) func() error {
if crypto.IsDKGFailureError(err) {
// Failing to complete the DKG protocol is a rare but expected scenario, which we must handle.
// By convention, if we are leaving the happy path, we want to persist the _first_ failure symptom
// in the `dkgState`. If the write yields a `storage.ErrAlreadyExists`, we know the overall protocol
// has already abandoned the happy path, because on the happy path the ReactorEngine is the only writer.
// Then this function just stops and returns without error.
// in the `dkgState`. If the write yields a [storage.InvalidDKGStateTransitionError], it means that the state machine
// is in the terminal state([flow.RandomBeaconKeyCommitted]) as all other transitions(even to [flow.DKGStateFailure] -> [flow.DKGStateFailure])
// are allowed. If the protocol is in terminal state, and we have a failure symptom, then it means that recovery has happened
// before ending the DKG. In this case, we want to ignore the error and return without error.
e.log.Warn().Err(err).Msgf("node %s with index %d failed DKG locally", e.me.NodeID(), e.controller.GetIndex())
err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
if err != nil {
if errors.Is(err, storage.ErrAlreadyExists) {
return nil // DKGState already being set is expected in case of epoch recovery
if storage.IsInvalidDKGStateTransitionError(err) {
return nil
}
return fmt.Errorf("failed to set dkg end state following dkg end error: %w", err)
return fmt.Errorf("failed to set dkg current state following dkg end error: %w", err)
}
return nil // local DKG protocol has failed (the expected scenario)
} else if err != nil {
Expand Down
2 changes: 1 addition & 1 deletion storage/badger/dkg_state.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ var allowedStateTransitions = map[flow.DKGState][]flow.DKGState{
flow.DKGStateStarted: {flow.DKGStateCompleted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
flow.DKGStateCompleted: {flow.RandomBeaconKeyCommitted, flow.DKGStateFailure},
flow.RandomBeaconKeyCommitted: {flow.RandomBeaconKeyCommitted},
durkmurder marked this conversation as resolved.
Show resolved Hide resolved
flow.DKGStateFailure: {flow.RandomBeaconKeyCommitted},
flow.DKGStateFailure: {flow.RandomBeaconKeyCommitted, flow.DKGStateFailure},
flow.DKGStateUninitialized: {flow.DKGStateStarted, flow.DKGStateFailure, flow.RandomBeaconKeyCommitted},
}

Expand Down
5 changes: 2 additions & 3 deletions storage/badger/dkg_state_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -275,10 +275,9 @@ func TestDKGState_FailureState(t *testing.T) {
require.True(t, storage.IsInvalidDKGStateTransitionError(err))
})

t.Run("-> flow.DKGStateFailure, not allowed", func(t *testing.T) {
t.Run("-> flow.DKGStateFailure, should be allowed", func(t *testing.T) {
err = store.SetDKGState(setupState(), flow.DKGStateFailure)
require.Error(t, err)
require.True(t, storage.IsInvalidDKGStateTransitionError(err))
require.NoError(t, err)
})

t.Run("-> flow.DKGStateCompleted, not allowed", func(t *testing.T) {
Expand Down
Loading