
EFM Recovery Service Event and Transaction #440

Open
wants to merge 65 commits into base: feature/efm-recovery

Conversation

@kc1116 kc1116 commented Jul 30, 2024

This PR updates the FlowEpoch smart contract to support recovering the network while in Epoch Fallback Mode (EFM). It adds a new service event, EpochRecover, which contains the metadata for the recovery epoch. This metadata is generated out of band using the bootstrap utility (util epoch efm-recover-tx-args, onflow/flow-go#5576) and submitted to the contract with the recovery_epoch.cdc transaction. The FlowEpoch contract will end the current epoch, start the recovery epoch, and store the metadata for the recovery epoch in storage. The metadata will then be emitted to the network during the next heartbeat interval.

Reopening original PR: #420
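
For readers unfamiliar with the submission flow, here is a minimal Go sketch of assembling the transaction arguments. This is illustrative only: the authoritative argument list is produced by the bootstrap utility, and the helper name and parameter set below are assumptions based on the test helpers discussed later in this thread.

    package main

    import (
        "fmt"

        "github.com/onflow/cadence"
    )

    // buildRecoverEpochArgs assembles illustrative Cadence arguments for a
    // recover_epoch transaction. Names and ordering are assumptions; the
    // authoritative output comes from the util epoch efm-recover-tx-args utility.
    func buildRecoverEpochArgs(
        recoveryEpochCounter, startView, stakingEndView, endView uint64,
        targetDuration, targetEndTime uint64,
        unsafeAllowOverwrite bool,
    ) []cadence.Value {
        return []cadence.Value{
            cadence.NewUInt64(recoveryEpochCounter),
            cadence.NewUInt64(startView),
            cadence.NewUInt64(stakingEndView),
            cadence.NewUInt64(endView),
            cadence.NewUInt64(targetDuration),
            cadence.NewUInt64(targetEndTime),
            cadence.NewBool(unsafeAllowOverwrite),
        }
    }

    func main() {
        args := buildRecoverEpochArgs(2, 100, 150, 160, 3000, 1727000000, false)
        fmt.Printf("prepared %d transaction arguments\n", len(args))
    }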

@jordanschalm jordanschalm left a comment

Copying over the main comment from the previous review: #420 (comment)

The second conditional case of the recover_epoch transaction (when unsafeAllowOverwrite is false) doesn't use the recoveryEpochCounter value at all. But if we go down that code path and FlowEpoch.currentEpochCounter != recoveryEpochCounter, we know the recovery process will fail.

So I think we should use recoveryEpochCounter in the second codepath as well. We can explicitly check that FlowEpoch.currentEpochCounter == recoveryEpochCounter, for example as a precondition, and panic if this doesn't hold.

/// Create new EpochMetadata for the recovery epoch with the new values
let newEpochMetadata = EpochMetadata(
/// Increment the epoch counter when recovering with a new epoch
counter: FlowEpoch.proposedEpochCounter(),
Member:

Copying comment from previous PR: #420 (comment).

Line 624 is starting a multi-line expression, constructing an instance of EpochMetadata.

My suggestion is to indent the lines within that expression:

        let newEpochMetadata = EpochMetadata(
            /// Increment the epoch counter when recovering with a new epoch
            counter: FlowEpoch.proposedEpochCounter(),
            seed: randomSource,
            startView: startView,
            endView: endView,
            stakingEndView: stakingEndView,
            // The following fields will be overwritten in `calculateAndSetRewards` below
            totalRewards: 0.0,
            collectorClusters: [],
            clusterQCs: [],
            dkgKeys: dkgPubKeys)

Similar to what we do on, for example, lines 646-658.

// Specifically, we execute an epoch recover transaction and confirm both scenarios are true;
// - epoch recover that specifies unsafeAllowOverwrite = false increments the epoch counter effectively starting a new epoch.
// - epoch recover that specifies unsafeAllowOverwrite = true overwrites the current epoch and does not increment the counter.
func TestEpochRecover(t *testing.T) {
Member:

Copying comment from previous PR: #420 (comment).

Once we are passing in the recoveryEpochCounter, we must use that value as the epoch counter for the recovery epoch 100% of the time, otherwise we will produce an incompatible recovery epoch state. recoveryEpochCounter is the source of truth.

In the commit which was merged to the feature branch, the logic is:

  • if unsafeAllowOverwrite is set, then use recoveryEpochCounter as the counter for the recovery epoch
  • otherwise, ignore the recoveryEpochCounter input and assume that the smart contract and protocol state epoch counters are synchronized

I think conditional branch 2 is unsafe and we should respect the recoveryEpochCounter codepath in all cases. Sorry if I didn't explain this properly in the previous review. The purpose of including recoveryEpochCounter at all is to make sure it is used as the epoch counter for the recovery epoch being constructed in the transaction, in all circumstances.
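
Expressed as code, this invariant could look like the following Go sketch (a client-side restatement for illustration, not the contract's implementation):

    package main

    import "fmt"

    // validRecoveryCounter restates the expectation above: recoveryEpochCounter
    // is the source of truth, and exactly one value is valid depending on
    // whether the current epoch is being overwritten.
    func validRecoveryCounter(recoveryEpochCounter, currentEpochCounter uint64, unsafeAllowOverwrite bool) bool {
        if unsafeAllowOverwrite {
            // overwriting the current epoch: counters must match exactly
            return recoveryEpochCounter == currentEpochCounter
        }
        // starting a new recovery epoch: counter must be exactly current + 1
        return recoveryEpochCounter == currentEpochCounter+1
    }

    func main() {
        fmt.Println(validRecoveryCounter(5, 5, true))  // true: overwrite current epoch
        fmt.Println(validRecoveryCounter(6, 5, false)) // true: transition into new epoch
        fmt.Println(validRecoveryCounter(7, 5, false)) // false: counter out of range
    }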

@jordanschalm jordanschalm left a comment

This is coming along nicely!

My main suggestion in this review is to expand the test coverage to cover more edge cases (suggestions enumerated here). The existing tests are quite verbose, so I think it would be worthwhile to invest time in factoring out some of the common test logic when adding test cases. After we get Josh's input on the implementation changes, I'd be OK with implementing additional test coverage in a separate PR. If you'd like to do that, let me know.

/// Create new EpochMetadata for the recovery epoch with the new values
let newEpochMetadata = EpochMetadata(
/// Increment the epoch counter when recovering with a new epoch
counter: FlowEpoch.proposedEpochCounter(),
Member:

Suggested change
counter: FlowEpoch.proposedEpochCounter(),
counter: recoveryEpochCounter,

self.stopEpochComponents()
let randomSource = FlowEpoch.generateRandomSource()

let numViewsInStakingAuction = FlowEpoch.configurableMetadata.numViewsInStakingAuction
Member:

We're accepting stakingEndView as an input (implicitly specifying a desired staking auction length), so we should not also use this potentially conflicting representation for the same information. Otherwise we could persist a different staking end view in the smart contract storage compared to what we emit in the service event.
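
To make the conflict concrete, here is a small arithmetic illustration (the values and the derivation formula are made up for illustration; exact boundary conventions aside):

    package main

    import "fmt"

    func main() {
        // Two representations of the same information can disagree:
        startView := uint64(100)
        stakingEndView := uint64(150)          // explicit input (the source of truth)
        numViewsInStakingAuction := uint64(60) // independently configured value

        // If the contract derives the staking end from the configured length,
        // it can persist a different value than the one it emits in the event.
        derivedEndView := startView + numViewsInStakingAuction // 160, not 150
        fmt.Println(derivedEndView == stakingEndView)          // false: conflict
    }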

newEndView := endView + 1
args[3] = CadenceUInt64(newEndView)

// avoid initializing a new epoch, set unsafeAllowOverwrite to true
Member:

It would be great to explicitly separate this out into a separate test case.

@@ -1516,3 +1517,182 @@ func TestEpochReset(t *testing.T) {
assertEqual(t, CadenceUFix64("249999.99300000"), result)
})
}

Member:

I think we should test a broader variety of cases (one possible skeleton is sketched after this list):

  • Test that we can execute Epoch Recovery while the smart contract is in each epoch phase (staking phase, setup phase, committed phase)
    • In particular, I'm thinking of the setup and committed phases, where a proposed EpochMetadata entry for the next epoch is already stored in that state.
  • Test different inputs of recoveryEpochCounter and FlowEpoch.currentEpochCounter.
    • Validate that overwrite attempts panic if unsafeAllowOverwrite is false
    • Validate that recoveryEpochCounter outside [currentCounter, proposedCounter] panics, regardless of unsafeAllowOverwrite
  • Test that staking rewards are paid out exactly once per epoch.
    • If we are overwriting the current epoch, then staking rewards should not be paid out.
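
One possible shape for that expanded coverage is a table-driven Go test, sketched below. The phase names and the runRecovery helper are hypothetical placeholders, not the existing test helpers:

    package test

    import "testing"

    func TestEpochRecoverEdgeCases(t *testing.T) {
        cases := []struct {
            name            string
            phase           string // phase to advance into before recovering (placeholder names)
            counterOffset   int64  // recoveryEpochCounter relative to currentEpochCounter
            unsafeOverwrite bool
            expectPanic     bool
        }{
            {"new epoch from staking phase", "STAKING", +1, false, false},
            {"new epoch from setup phase", "EPOCHSETUP", +1, false, false},
            {"new epoch from committed phase", "EPOCHCOMMIT", +1, false, false},
            {"overwrite current epoch", "EPOCHSETUP", 0, true, false},
            {"overwrite without flag panics", "EPOCHSETUP", 0, false, true},
            {"counter above proposed panics", "EPOCHSETUP", +2, true, true},
        }
        for _, tc := range cases {
            t.Run(tc.name, func(t *testing.T) {
                // runRecovery would advance the emulator into tc.phase, submit
                // the recovery transaction, and assert panic/no-panic plus a
                // single rewards payout per epoch. It is a hypothetical helper:
                // runRecovery(t, tc.phase, tc.counterOffset, tc.unsafeOverwrite, tc.expectPanic)
            })
        }
    }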

message: "recovery epoch counter should equal current epoch counter"
)
self.stopEpochComponents()
let numViewsInStakingAuction = FlowEpoch.configurableMetadata.numViewsInStakingAuction
Member:

Same comment here about numViewsInStakingAuction potentially conflicting with stakingEndView-startView.

@@ -432,6 +497,242 @@ access(all) contract FlowEpoch {
FlowEpoch.account.storage.load<Bool>(from: /storage/flowAutomaticRewardsEnabled)
FlowEpoch.account.storage.save(enabled, to: /storage/flowAutomaticRewardsEnabled)
}

access(self) fun emitEpochRecoverEvent(epochCounter: UInt64,
Member:

Could you add short documentation for this function? In particular, I would note that inputs are not validated (the caller must validate with recoverEpochPreChecks).

Member:

related to Jordan's comment:

In the spirit of defensive programming, I would suggest to call recoverEpochPreChecks in the body of emitEpochRecoverEvent. Thereby, we reduce the risks of accidentally forgetting the pre-checks in a future refactoring.

Comment on lines 505 to 506
numViewsInStakingAuction: UInt64,
numViewsInDKGPhase: UInt64,
Member:

I would suggest omitting both these parameters:

  • numViewsInDKGPhase, because it isn't used in the caller and can just be read from the global config here
  • numViewsInStakingAuction because it could potentially conflict with stakingEndView-startView, causing us to compute incorrect values for DKG phase views below. Instead, we should just use stakingEndView in the computation to avoid the potential conflict

@AlexHentschel AlexHentschel left a comment

Thank you for the nice work. Appreciate the multitude of smaller refactorings, where you have moved auxiliary code into little service methods -- that certainly improves readability of the code.

I have added various suggestions for extending the documentation. However, given my very limited knowledge of Cadence and the epoch smart contracts, I don't feel sufficiently confident in my abilities to spot potential problems/errors to approve this PR.

⚠️ There is one possibly significant challenge [update: it's not a big risk; see Jordan's comment below] that I noticed:

  • the FlowEpoch smart contract offers two entry points for recovery:
    1. recoverNewEpoch, which requires that the counter for the recovery epoch is one bigger than the smart contract's current epoch
    2. recoverCurrentEpoch, which enforces that the counter for the recovery epoch matches the smart contract's current epoch
  • I think this places strong limitations on the scenarios we can successfully recover from (specifically the time frame in which a recovery must be successful). Lets unravel this a bit:
    • so initially we assume that the Protocol State and the Epoch Smart Contract are on the happy path: both counters are (largely) in sync
    • then there is a problem and the Protocol State goes into EFM. That means for the running network, where the protocol state is the source of truth that determines operation, the network remains on the epoch counter N
    • However, while the protocol state stays on epoch N (extending it until successful recovery), the smart contract can continue to progress through its speculative epochs.
    • I think it is very likely that failures will occur relatively close to the desired epoch switchover, because the Epoch Setup phase only ends a few hours before the target transition and that is where problems typically occur. Lets say its 3 hours before the target transition and the protocol state goes into EFM and stays at epoch N.
    • The protocol state continues its work and enters epoch N+1. Everyone is stressed because the network is in EFM, some people might be OOO, the engineers are doing the best they can. The engineers trying to recover the epoch lifecycle know that they have to specify the next epoch: They query the smart contract, which tells them the system is currently in epoch N+1. So the engineers specify epoch N+2 and call recoverNewEpoch. The smart contract is happy, and emits a recovery event for epoch N+2 and enters epoch N+2 ... but the protocol state rejects the recovery because it is still in epoch N and expects epoch N+1 to be specified. And then we are screwed: the protocol state must receive a recovery epoch N+1 but the smart contract is already at N+2, it only accepts recovery data for epochs with counter ≥ N+2! ☠️
    • different scenario: due to typos, stress and unfamiliarity with the recovery process the first two calls to recoverNewEpoch emit an event (each increasing the counter) which are both rejected. We end up in a similar scenario: the smart contract's epoch counter has already progressed beyond the expected value for the dynamic protocol state.
    • Other scenario: too many partner nodes are offline and we would like to get them back online before attempting an epoch recovery ... reaching out and helping the partners might take some time. The network is running fine (just saying in its current EFM epoch). We decide to leave the system in EFM for more than a week (presumably nothing bad will happen), but forget to call epochReset ... so after a week the smart contract is now in epoch N+2 while the Protocol State s still in N.

Essentially our current smart contract implementation makes the very limiting assumption that the Protocol State's Epoch counter can be at most one behind the smart contract. Otherwise, we have no means for recovery.
Let's keep in mind that we are implementing a disaster prevention mechanism here: it's very rare, so no one really has much experience with it; occurrences of disasters cannot be planned for; people are stressed and engineers with the deep background might be unavailable; the first EFM might happen in a year, when we have already forgotten some of the critical but subtle limitations.

Hence, I am strongly of the opinion that this process should be as fault-proof as possible:

  • multiple/many failed recovery attempts should be possible
  • the system should provide ample time for successful recovery (certainly more than a week)
  • it should be nearly impossible for failed recovery attempts to break anything (no matter how broken the inputs are)

I think we are pretty close but have two main hurdles:

  1. We should prepare for the scenario where the protocol state is in EFM epoch N but the smart contract believes the system is in epoch N+k for any integer k. That would be something to solve as part of this PR (or a subsequent smart contract PR).

  2. Ideally, the fallback state machine guarantees that a successful RecoverEpoch event always is a valid epoch configuration. The recovery parameters might be manually set, so the risk of human error should be mitigated. What is missing is checking:

    • that the cluster QCs are valid QCs for each collector cluster
    • DKG committee has sufficient intersection with the consensus committee to allow for live consensus

    This is out of scope of this PR.

As usual, we should weigh how much engineering time this would actually take to implement. Nevertheless, it deeply worries me that we have a bunch of subtle footguns in our implementation, in that we might irreparably break mainnet if we violate one of the several subtle constraints (either by human error, or even worse, by not acting for only a week).

Also cc @durkmurder @jordanschalm for visibility, comments and thoughts.

)
}

/// Stops epoch components. If the configuration is a valid configuration the staking auction,
Member:

I think this sentence is broken:

If the configuration is a valid configuration the staking auction,


access(self) fun emitEpochRecoverEvent(epochCounter: UInt64,
startView: UInt64,
stakingEndView: UInt64,
Member:

Value not used. Related to Jordan's comment.

dkgPhase3FinalView uint64
targetDuration uint64
targetEndTime uint64
clusterQCVoteDataLength int
Member:

I am not entirely sure, but based on the implementation that would be the number of cluster QCs (or more generally, the number of clusters). Would suggest to use a more specific field name:

Suggested change
clusterQCVoteDataLength int
numberClusterQCs int

}
}

func convertClusterQcsCdc(env templates.Environment, clusters []cadence.Value) []cadence.Value {
Member:

Minimal documentation would be great.

@jordanschalm jordanschalm commented Aug 9, 2024

Responding to Alex's comment here 👇

We should prepare for the scenario where the protocol state is in EFM epoch N but the smart contract believes the system is in epoch N+k for any integer k

You outlined a few scenarios in your comment, but each of them relies on the smart contract continuing to transition through speculative epochs without the Protocol State following suit.

In practice the smart contract transition process provides a strong guarantee that $k \in \{0, 1\}$ before any recover_epoch attempt happens.

Smart Contract Transition Logic (restated as a compact predicate after this list)

  • The smart contract transitions to the next epoch when (1) it is executed in the context of a block with view >= currentEpoch.FinalView and (2) it is in the EpochCommitted phase.
  • The smart contract enters the EpochCommitted phase after the DKG and cluster QC vote generation are successfully completed.
  • So, in order to transition epochs, the smart contract requires Protocol participation in the corresponding DKG and cluster QC voting processes.
  • In EFM, Protocol participants don't participate in the DKG or cluster QC voting.
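
Restating those two transition conditions as a compact predicate (illustrative Go, not the contract code):

    package epochs

    // CanTransitionEpoch captures the two conditions above: the contract only
    // advances epochs once the EpochCommitted phase is reached and the current
    // epoch's final view has passed. In EFM, participants skip the DKG and QC
    // voting, so EpochCommitted is never reached and the counter stays pinned.
    func CanTransitionEpoch(blockView, currentEpochFinalView uint64, inEpochCommittedPhase bool) bool {
        return blockView >= currentEpochFinalView && inEpochCommittedPhase
    }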

Outstanding Problems

Invoking recoverNewEpoch with inconsistent inputs

due to typos, stress and unfamiliarity with the recovery process the first two calls to recoverNewEpoch emit an event

This is a very good point and why we added the recoveryEpochCounter as an argument. But, if that parameter is not set properly, then we can end up in a situation where $k > 1$.

Like with reset_epoch, we have an automated tool which reads the current Protocol State and writes the recover_epoch transaction arguments. The intention of this is to minimize the impact of incorrect manual inputs, but of course it is always possible.

Let's consider the alternative. If we want to be able to recover from cases where $k > 1$, we need to implement support in the recovery process for deleting or overwriting potentially multiple historical epoch entries from the smart contract state before injecting the recovery epoch. This increases implementation complexity and introduces additional surface area for human error ("Oops! I deleted the last 10 epochs"). I'm not convinced this is better.

Extra input validation

  1. Ideally, the fallback state machine guarantees that a successful RecoverEpoch event always is a valid epoch configuration. [...]

Agree with this, just adding that any configuration validation we add to the FallbackStateMachine should also be added in the utility generating the recover_epoch transaction arguments (if possible). That way problems are caught earlier. The cluster QC validation is already done in GenerateClusterRootQC, but we can add the DKG committee size sanity check.
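
For the DKG committee sanity check, here is a sketch of what the argument-generating utility could verify (the node-ID type and threshold are assumptions; the real liveness requirement depends on the protocol's fault-tolerance parameters):

    package main

    import "fmt"

    // sufficientDKGIntersection checks that at least minOverlap members of the
    // DKG committee are also members of the consensus committee. The threshold
    // is a placeholder for the protocol's actual liveness requirement.
    func sufficientDKGIntersection(dkgCommittee, consensusCommittee []string, minOverlap int) bool {
        consensus := make(map[string]struct{}, len(consensusCommittee))
        for _, id := range consensusCommittee {
            consensus[id] = struct{}{}
        }
        overlap := 0
        for _, id := range dkgCommittee {
            if _, ok := consensus[id]; ok {
                overlap++
            }
        }
        return overlap >= minOverlap
    }

    func main() {
        dkg := []string{"node-a", "node-b", "node-c"}
        consensus := []string{"node-a", "node-b", "node-d"}
        fmt.Println(sufficientDKGIntersection(dkg, consensus, 2)) // true
    }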

kc1116 and others added 10 commits August 14, 2024 10:49
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
- replace numViewsInStakingAuction with stakingEndView - startView
- don't accept numViewsInDKGPhase as a parameter read it from configurable epoch metadata
@jordanschalm jordanschalm left a comment

Really nice expansion of the test coverage -- thank you.

Summary of feedback:

  • If I'm understanding correctly, we don't have a test case that executes recovery during the staking phase -- I think we should add this before merging
  • I added some questions about the last test case (we're doing two recoveries back-to-back and I'm not sure why)

Comment on lines 544 to 548
let numViewsInStakingAuction = stakingEndView - startView
let numViewsInDKGPhase = FlowEpoch.configurableMetadata.numViewsInDKGPhase
let dkgPhase1FinalView = startView + numViewsInStakingAuction + numViewsInDKGPhase - 1
let dkgPhase2FinalView = startView + numViewsInStakingAuction + (2 * numViewsInDKGPhase) - 1
let dkgPhase3FinalView = startView + numViewsInStakingAuction + (3 * numViewsInDKGPhase) - 1
Member:

Suggested change
let numViewsInStakingAuction = stakingEndView - startView
let numViewsInDKGPhase = FlowEpoch.configurableMetadata.numViewsInDKGPhase
let dkgPhase1FinalView = startView + numViewsInStakingAuction + numViewsInDKGPhase - 1
let dkgPhase2FinalView = startView + numViewsInStakingAuction + (2 * numViewsInDKGPhase) - 1
let dkgPhase3FinalView = startView + numViewsInStakingAuction + (3 * numViewsInDKGPhase) - 1
let numViewsInDKGPhase = FlowEpoch.configurableMetadata.numViewsInDKGPhase
let dkgPhase1FinalView = stakingEndView + numViewsInDKGPhase
let dkgPhase2FinalView = dkgPhase1FinalView + numViewsInDKGPhase
let dkgPhase3FinalView = dkgPhase2FinalView + numViewsInDKGPhase

We can simplify the calculation here by adding to the staking end view directly.

Comment on lines +1032 to +1033
let dkgPhase2FinalView = dkgPhase1FinalView + self.configurableMetadata.numViewsInDKGPhase
let dkgPhase3FinalView = dkgPhase2FinalView + self.configurableMetadata.numViewsInDKGPhase
Member:

Suggested change
let dkgPhase2FinalView = dkgPhase1FinalView + self.configurableMetadata.numViewsInDKGPhase
let dkgPhase3FinalView = dkgPhase2FinalView + self.configurableMetadata.numViewsInDKGPhase
let dkgPhase2FinalView = dkgPhase1FinalView + self.configurableMetadata.numViewsInDKGPhase
let dkgPhase3FinalView = dkgPhase2FinalView + self.configurableMetadata.numViewsInDKGPhase

numStakingViews uint64 // num views for staking auction
numDKGViews uint64 // num views for DKG phase
numClusters uint64 // num collector clusters
numEpochAccounts int // num collector clusters
Member:

Suggested change
numEpochAccounts int // num collector clusters
numEpochAccounts int // num accounts to setup for staking

Comment is duplicated from line above.

)
args := getRecoveryTxArgs(env, ids, startView, stakingEndView, endView, targetDuration, targetEndTime, epochCounter)
// avoid using the recover epoch transaction template which has sanity checks that would prevent submitting the invalid epoch counter
code := `
Member:

We could stick this in a .cdc file and use go:embed instead of putting the script inline. For example this variable embeds this file in flow-go.
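
For reference, the suggested go:embed pattern looks roughly like this (the file name is illustrative):

    package test

    import (
        _ "embed"
    )

    // Keeping the script in its own .cdc file makes it easier to read and lint
    // than an inline Go string. The path below is illustrative.
    //
    //go:embed recover_epoch_unchecked.cdc
    var recoverEpochUncheckedScript string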

Comment on lines +2006 to +2007
cadence.NewArray(collectorClusters), // collectorClusters
cadence.NewArray(clusterQcVoteData), // clusterQCVoteData
Member:

Suggested change
cadence.NewArray(collectorClusters), // collectorClusters
cadence.NewArray(clusterQcVoteData), // clusterQCVoteData
cadence.NewArray(collectorClusters),
cadence.NewArray(clusterQcVoteData),

Suggesting to remove the comments because they're identical to the variable names

Comment on lines +1527 to +1528
// Perform epoch recovery with a new epoch and epoch recover overwriting the current epoch.
t.Run("Can recover the epoch and have everything return to normal", func(t *testing.T) {
Member:

Suggested change
// Perform epoch recovery with a new epoch and epoch recover overwriting the current epoch.
t.Run("Can recover the epoch and have everything return to normal", func(t *testing.T) {
// Perform epoch recovery by transitioning into a new epoch (counter incremented by one)
t.Run("Can recover the epoch with a new epoch", func(t *testing.T) {

endView uint64 = 160
targetDuration uint64 = numEpochViews
// invalid epoch counter when recovering the current epoch the counter should equal the current epoch counter
epochCounter uint64 = startEpochCounter + 100
Member:

Suggested change
epochCounter uint64 = startEpochCounter + 100
epochCounter uint64 = startEpochCounter + 1

Suggesting to test the more likely problematic input (the one that does work with the other version of recovery).

endView uint64 = 160
targetDuration uint64 = numEpochViews
// invalid epoch counter when recovering the current epoch the counter should equal the current epoch counter
epochCounter uint64 = startEpochCounter + 100
Member:

Suggested change
epochCounter uint64 = startEpochCounter + 100
epochCounter uint64 = startEpochCounter

Likewise, let's test the case that's most likely to cause problems.

})
})

t.Run("Recover epoch transaction panics when recovery epoch counter is less than currentCounter and unsafeAllowOverwrite is false", func(t *testing.T) {
Member:

Regarding the set of test cases about when we expect the transaction to panic, I feel like they can be structured to more clearly communicate our expectations.

The expectations are simple:

  • if you are recovering by overwriting the current epoch, you must pass in exactly recoveryEpochCounter = currentEpochCounter and unsafeAllowOverwrite = true
  • if you are recovering by transitioning into a new epoch, you must pass in exactly recoveryEpochCounter = currentEpochCounter+1 and unsafeAllowOverwrite = false
  • every other possible combination will panic

These expectations are communicated well by the descriptions of the first two test cases:

(1)

t.Run("Panics when recovering a new epoch and recover epoch counter is not equal to the current epoch counter + 1", func(t *testing.T) {

(2)

t.Run("Panics when recovering the current epoch and recover epoch counter is not equal to the current epoch counter", func(t *testing.T) {

The subsequent two test cases add more coverage by testing more input values (👍), but their descriptions describe subsets of the expectations described by (1) and (2). For me at least this made it harder to understand as a whole. I think the tests would be clearer if we added the invalid inputs we want to test within those two first test cases, where we fully define the range of valid input values that we are testing.

Suppose $C$ is the currentEpochCounter value. Then, for example, (1) could test recoveryEpochCounter inputs C, C-1, C+2. (2) could test recoveryEpochCounter inputs C-1, C+1.

For each of (1) and (2), we can extract the test functionality into a closure and then do something like:

invalidEpochCounters := []uint64{C, C-1, C+2} // C+1 is the only valid input
for _, epochCounter := range invalidEpochCounters {
  // run the test case
}

}
runWithDefaultContracts(t, epochConfig, func(b emulator.Emulator, env templates.Environment, ids []string, idTableAddress flow.Address, IDTableSigner sdkcrypto.Signer, adapter *adapters.SDKAdapter) {
// Advance to epoch Setup and make sure that the epoch cannot be ended
advanceView(t, b, env, idTableAddress, IDTableSigner, 1, "EPOCHSETUP", false)
Member:

Unless I'm missing one, all of the test cases transition into the EpochSetup phase prior to executing the recovery. I would like to have at least one test case where we execute the recovery while in the Staking phase.

@joshuahannan

@jordanschalm It looks like there are a lot of unresolved comments from you on this PR. Since Khalil is leaving on Friday, I'm just wondering who is going to do that and when it will get done. I also don't really know if I want to review all of this again until the comments have all been actioned.

@joshuahannan joshuahannan left a comment

I'm also a little confused about the branches here. This is targeting the feature/efm-recovery branch, but it looks like this contains all the upgrades for efm recovery. Should this be targeting master instead?

access(all) let voterIDs: [String]

init(aggregatedSignature: String, voterIDs: [String]) {
self.aggregatedSignature = aggregatedSignature
Member:

Are there any pre-conditions we could add here to verify anything?

@kc1116 kc1116 commented Oct 1, 2024

@jordanschalm It looks like there are a lot of unresolved comments from you on this PR. Since Khalil is leaving on Friday, I'm just wondering who is going to do that and when it will get done. I also don't really know if I want to review all of this again until the comments have all been actioned.

@joshuahannan I will address the feedback at the end of this week and leave a handover comment for @jordanschalm. This took longer because there is an extremely long feedback loop for Cadence PRs. We expect an audit-type review from you ("will this break anything, can this be exploited?"); that type of review can be done while feedback is addressed. It's nice to have feedback from everyone so it can all be addressed at once; that makes the feedback loop faster.

cc: @AlexHentschel

@kc1116 kc1116 commented Oct 1, 2024

I'm also a little confused about the branches here. This is targeting the feature/efm-recovery branch, but it looks like this contains all the upgrades for efm recovery. Should this be targeting master instead?

No. feature/efm-recovery is the branch we use, both here in this repo and on flow-go, to contain all EFM-recovery-related changes. When this PR is approved it will be merged to feature/efm-recovery. We may also make additional PRs and changes that will go into feature/efm-recovery. When all issues are covered, feature/efm-recovery will be merged to master, the same feature-branch strategy we use throughout Flow.

@joshuahannan

Thanks for the clarifications! The business-logic feedback could affect the contract's security, so a security-audit review won't mean much if business-logic changes land after I review. I'll probably still wait until all the changes from Jordan's comments have been actioned. Thank you for staying on top of it, and I apologize for the slow review times during the lead-up to Crescendo.

since it is specifically enforcing the named Invariant (1), and checking
both empty and non-empty submissions.
@jordanschalm jordanschalm changed the base branch from feature/efm-recovery to jord/6213-dkg-mapping October 9, 2024 17:09
@jordanschalm

Merged changes from #441 into this PR.

@durkmurder

@jordanschalm I have made some changes to submit the group key separately so that the API is uniform with the data structures of the EpochCommit and EpochRecover events. Can you update the tests when you are working on the PR? 8dae842 (#440)

Base automatically changed from jord/6213-dkg-mapping to feature/efm-recovery October 15, 2024 18:48
…nsaction

(also regenerate)

Conflicts:
	contracts/epochs/FlowDKG.cdc
	contracts/epochs/FlowEpoch.cdc
	lib/go/contracts/internal/assets/assets.go
	lib/go/test/epoch_test_helpers.go
	lib/go/test/test.go