[EFM] Recoverable Random Beacon State Machine #6771

durkmurder · 2024-12-02T16:41:44Z

Context

The goal of this PR is to implement a state machine that covers all cases that can occur when performing the DKG. Previously, our implementation enforced some of the rules but not all of them. What made things worse is that the logic and interfaces weren't explicit enough for an engineer to easily understand them. This PR attempts a robust approach to handling the possible states and state transitions for both the happy path and the recovery path.

Some key points:

badger.RecoverablePrivateBeaconKeyStateMachine implements the state machine itself.
State machine transitions are initiated from dkg.ReactorEngine and dkg.BeaconKeyRecovery.
The state machine doesn't care about the origin of the key. It could be obtained from different sources (a successful DKG or manual injection by an operator); all that matters is that the key is committed. The caller needs to ensure that the respective public key is part of the EpochCommit for the respective epoch.
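
As a rough illustration of the key points above, a transition table can make the allowed moves explicit. This is a minimal sketch, not the code from this PR: the state names `Uninitialized` and `Started`, the `allowed` table, and `CanTransition` are assumptions made for illustration. Only the completed, failure, and key-committed states are named in this conversation.

```go
package main

import "fmt"

// DKGState is a hypothetical stand-in for flow.DKGState.
type DKGState int

const (
	Uninitialized DKGState = iota // hypothetical name for the initial state
	Started                       // hypothetical name
	Completed                     // mirrors flow.DKGStateCompleted
	Failure                       // mirrors flow.DKGStateFailure
	KeyCommitted                  // mirrors flow.RandomBeaconKeyCommitted
)

// allowed is an ASSUMED transition table, not the one implemented in the PR.
// The committed state is reachable from every other state because the
// origin of the key does not matter, only that it gets committed.
var allowed = map[DKGState][]DKGState{
	Uninitialized: {Started, Failure, KeyCommitted},
	Started:       {Completed, Failure, KeyCommitted},
	Completed:     {Failure, KeyCommitted},
	Failure:       {KeyCommitted},
	KeyCommitted:  {KeyCommitted}, // self-transition allowed for the committed state
}

// CanTransition reports whether the table permits moving from one state to another.
func CanTransition(from, to DKGState) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Started, Completed))    // happy path
	fmt.Println(CanTransition(KeyCommitted, Started)) // committed keys are never rolled back
}
```

Keeping the rules in one table means a reviewer can check them against the diagram without reading every call site.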

Link to the diagram which describes the state machine:

codecov-commenter · 2024-12-02T16:44:54Z

Codecov Report

Attention: Patch coverage is 20.96774% with 294 lines in your changes missing coverage. Please review.

Project coverage is 41.68%. Comparing base (af44135) to head (aa377c2).

Files with missing lines	Patch %	Lines
storage/mock/dkg_state_reader.go	0.00%	98 Missing ⚠️
storage/mock/epoch_recovery_my_beacon_key.go	0.00%	64 Missing ⚠️
storage/mock/dkg_state.go	0.00%	54 Missing ⚠️
cmd/consensus/main.go	0.00%	38 Missing ⚠️
engine/consensus/dkg/reactor_engine.go	37.50%	19 Missing and 1 partial ⚠️
model/flow/dkg.go	0.00%	10 Missing ⚠️
cmd/util/cmd/common/node_info.go	0.00%	3 Missing ⚠️
storage/badger/dkg_state.go	94.11%	2 Missing and 1 partial ⚠️
storage/errors.go	81.81%	2 Missing ⚠️
engine/common/grpc/forwarder/forwarder.go	0.00%	1 Missing ⚠️
... and 1 more

Additional details and impacted files

@@                   Coverage Diff                    @@
##           feature/efm-recovery    #6771      +/-   ##
========================================================
- Coverage                 41.72%   41.68%   -0.05%     
========================================================
  Files                      2031     2033       +2     
  Lines                    180552   180748     +196     
========================================================
  Hits                      75341    75341              
- Misses                    99017    99202     +185     
- Partials                   6194     6205      +11

Flag	Coverage Δ
unittests	`41.68% <20.96%> (-0.05%)`	⬇️

jordanschalm · 2024-12-04T03:27:55Z

engine/consensus/dkg/reactor_engine.go

@@ -427,10 +431,10 @@ func (e *ReactorEngine) end(nextEpochCounter uint64) func() error {
 			// has already abandoned the happy path, because on the happy path the ReactorEngine is the only writer.
 			// Then this function just stops and returns without error.
 			e.log.Warn().Err(err).Msgf("node %s with index %d failed DKG locally", e.me.NodeID(), e.controller.GetIndex())
-			err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateDKGFailure)
+			err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
 			if err != nil {
 				if errors.Is(err, storage.ErrAlreadyExists) {


SetDKGState doesn't return this error type; it returns InvalidTransitionRandomBeaconStateMachineErr.

The comment above is now outdated, but it implies that we might already have persisted a failure state:

By convention, if we are leaving the happy path, we want to persist the first failure symptom

The state machine does not allow transitions from a state to itself (except for RandomBeaconKeyCommitted). If, as the current comment suggests, a failure state is already set at this point, we will throw InvalidTransitionRandomBeaconStateMachineErr, since DKGStateFailure -> DKGStateFailure is not a valid transition. I don't think this is the case, but I'll return to this after reading further through the PR.

Suggestions:

update comment on lines 428-432

remove ErrAlreadyExists check

I have allowed the failure -> failure transition to handle such situations, since the timing might be tricky, and I have also updated the comment regarding the possible error return. Let me know what you think: https://github.com/onflow/flow-go/pull/6771/files/43d6a63349788084c057a42134326dcb4e721ad5..550fd3f739237b9fb241f13a69b19de5b7ce56b5

AlexHentschel · 2024-12-10T01:40:15Z

I really appreciated this summary, which explains why the change is important 🙇 . I find that particularly helpful to clarify the focus of a PR.

The goal of this PR is to implement a state machine that covers all cases that can occur when performing the DKG. Previously, our implementation enforced some of the rules but not all of them. What made things worse is that the logic and interfaces weren't explicit enough for an engineer to easily understand them. This PR attempts a robust approach to handling the possible states and state transitions for both the happy path and the recovery path.

AlexHentschel

Very nice. The code is a lot more expressive and rigorous about what is allowed. While I looked at the core logic, I didn't quite get through the entire PR, but I wanted to share my batch of comments so far.

AlexHentschel · 2024-12-10T05:21:25Z

engine/consensus/dkg/reactor_engine.go

 	// public key - therefore it is unsafe for use
 	if !nextDKGPubKey.Equals(localPubKey) {
 		log.Warn().
 			Str("computed_beacon_pub_key", localPubKey.String()).
 			Str("canonical_beacon_pub_key", nextDKGPubKey.String()).
 			Msg("checking beacon key consistency: locally computed beacon public key does not match beacon public key for next epoch")
-		err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateInconsistentKey)
+		err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)


⚠️ ❓
I am worried that this code might run concurrently with, for example, the recovery (if the node rebooted at a really unfortunate time). I think in each step we should assume that some other process might have concurrently advanced the state. So in each step, the flow.DKGState could have changed compared to what we just read from the database.

Intuitively, that seems too relaxed. For failure states we allow self-transitions; for everything else where we deviate from the happy path and get an unexpected error, I would be inclined to return an error so we can figure out what went wrong. I am not very worried about your particular scenario: it means that the operator has to try again.

AlexHentschel

The following batch of comments is on the tests for RecoverableRandomBeaconStateMachine.

AlexHentschel · 2024-12-10T20:44:55Z

storage/badger/dkg_state_test.go

+			epochCounter := setupState()
+			err = store.SetDKGState(epochCounter, flow.RandomBeaconKeyCommitted)
+			require.NoError(t, err, "should be possible since we have a stored private key")
+			err = store.UpsertMyBeaconPrivateKey(epochCounter, unittest.RandomBeaconPriv())


⚠️

In my opinion this test should fail, unless we are using the same key that was initially committed in line 320:

flow-go/storage/badger/dkg_state_test.go

Line 320 in 0687e30

err = store.UpsertMyBeaconPrivateKey(epochCounter, unittest.RandomBeaconPriv())

I am not sure whether unittest.RandomBeaconPriv() returns different keys on each call (?) but if it does, this test should fail.

Can we please test both cases? Thanks

Will take care of this in a follow-up PR.

AlexHentschel · 2024-12-10T22:00:55Z

cmd/consensus/main.go

+			// perform this only if state machine is in initial state
+			if !started {
+				// store my beacon key for the first epoch post-spork
+				err = myBeaconKeyStateMachine.UpsertMyBeaconPrivateKey(epochCounter, beaconPrivateKey.PrivateKey)
+				if err != nil {
+					return fmt.Errorf("could not upsert my beacon private key for root epoch %d: %w", epochCounter, err)
+				}


If that is easy to do, I'd prefer that we check that the random beacon key matches the information in the Epoch Commit event ... just to be safe against human configuration errors

Will take care of this in a follow-up PR.

AlexHentschel

Thank you for the great work. The state transitions for the DKG are so much clearer now (at least for me). Only remaining request would be to include your state machine diagram in the code base:

you could add a short Readme in the folder docs ... maybe call it RecoverableRandomBeaconStateMachine or something similar.

To simplify your work, I have drafted a brief explanation, which you could include in the readme (if you like it):

The RecoverableRandomBeaconStateMachine formalizes the life-cycle of the Random Beacon keys for each epoch. On the happy path, each consensus participant for the next epoch takes part in a DKG to obtain a threshold key to participate in Flow's Random Beacon. After successfully finishing the DKG protocol, the node obtains a random beacon private key, which is stored in the database along with the current DKG state flow.DKGStateCompleted. If the DKG fails for any reason, the private key will be nil and the current DKG state is set to flow.DKGStateFailure.
If the epoch switchover fails, the network goes into Epoch Fallback Mode [EFM]. The governance committee can recover the network via a special EpochRecover transaction. In this case, the set of threshold keys is specified by the governance committee.
The current implementation focuses on the scenario where the governance committee re-uses the threshold key set from the last successful epoch transition. While injecting other threshold keys into the nodes is conceptually possible and supported, the utilities for this recovery path are not yet implemented.

diagram

durkmurder added 23 commits November 20, 2024 20:08

Changed structure of interfaces and corresponding implementation for DKG storage

100a92b

Removed 'DKG started' from storage.

719b6a6

Updated DKG states to have extra states. They no more represent the end state but rather current

01bbdcf

Updated usages of DKG storage in reactor engine

1d3c6a4

Added back GetDKGStarted for easier usage in reactor engine. Updated internals of state machine

6c34886

Implemented allowed state transitions for recoverable random beacon state machine

8aac526

Fixed unit test compilation. Updated allowed state transitions

064b651

Renamed interface methods

6bcaf38

Updated mocks

c13c7f6

Fixed tests for reactor engine

489c871

Updated godoc and reduced number of states for Recoverable state machine

10aab93

Updated usages of DKG state. Updated naming, godocs

fb28249

Removed flow.RandomBeaconKeyRecovered state. Cleanup

fb161f7

Updated how recovery happens in terms of inserting values

650b8b8

Implemented test for enforcing invariants of the uninitialized state

2386623

Added additional test cases.

bfe95b0

Updated logic for state transitions

79aac04

Added additional test for Completed state

1119c8c

Added tests for failure state

c8dfca6

Added extra tests for Random Beacon Key Committed state

c9182c5

Updated godoc for DKG tests

577712a

Godoc updates

7e728aa

Updated mocks

1fa11c6

durkmurder assigned jordanschalm and AlexHentschel Dec 2, 2024

durkmurder added 4 commits December 2, 2024 18:45

Naming updates

a92d8b7

Fixed broken tests

f0be4ba

Linted

62d399d

Fixed broken integration tests for DKG

bea9c1a

durkmurder requested a review from jordanschalm as a code owner December 3, 2024 18:43

durkmurder added 2 commits December 3, 2024 21:06

Fixed invalid exit logic in DKG reactor engine

0f98cf7

Fixed broken test

6ea64d6

jordanschalm reviewed Dec 4, 2024

durkmurder and others added 7 commits December 5, 2024 11:34

Apply suggestions from code review

6f7017a

Co-authored-by: Jordan Schalm <jordan.schalm@flowfoundation.org>

Apply suggestions from PR review

ea0f412

Apply suggestions from PR review

b9bc92e

Apply suggestions from PR review

8f15c29

Update engine/consensus/dkg/reactor_engine.go

43d6a63

Co-authored-by: Jordan Schalm <jordan.schalm@flowfoundation.org>

Allowed self transition to DKGStateFailure

9302497

Updated reactor engine to handle invalid state transition at dkg end

550fd3f

jordanschalm mentioned this pull request Dec 6, 2024

[EFM Recovery] Backward Compatibility for DKGEndState field #6786

Open

jordanschalm approved these changes Dec 9, 2024

AlexHentschel reviewed Dec 10, 2024

durkmurder and others added 2 commits December 10, 2024 18:14

Apply suggestions from code review

7053c88

Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>

Apply suggestions from code review

0687e30

Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>

AlexHentschel reviewed Dec 10, 2024

AlexHentschel approved these changes Dec 10, 2024

durkmurder and others added 8 commits December 11, 2024 14:46

Apply suggestions from PR review

8226642

Apply suggestions from code review

66f67d4

Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>

Apply suggestions from PR review

56b1a63

Apply suggestions from PR review

c0c6b7e

Added docs

00f952a

Merge branch 'feature/efm-recovery' into yurii/6725-recoverable-random-beacon-state-machine

21ff3ef

Updated mocks

12a633a

Linted

aa377c2

durkmurder merged commit 6c251a3 into feature/efm-recovery Dec 12, 2024
55 checks passed

durkmurder deleted the yurii/6725-recoverable-random-beacon-state-machine branch December 12, 2024 13:48
