
Khalil/6959 Recover Epoch transaction args script #5576

Merged: 57 commits merged into master from khalil/6959-efm-recvery-epoch-data-generation on Apr 17, 2024

Conversation

@kc1116 (Contributor) commented on Mar 22, 2024:

This PR adds a new cmd, recover-epoch-tx-args, to the epochs CLI command; it generates the transaction args for the RecoverEpoch tx. The command reuses logic from the sporking utility tool to generate the following transaction args:

  • startView: start view of the recovery epoch
  • stakingStartView: start view of the staking phase of the recovery epoch
  • endView: end view of the recovery epoch
  • dkgPubKeys: list of DKG key shares for each consensus node, from the DKG of the last successful epoch
  • nodeIds: initial identity list of the last successful epoch
  • clusters: node clusters
  • clusterQcs: cluster QCs generated from the clusters

Note: the significant changes are in cmd/util/cmd/epochs/cmd/recover.go; the other changes are due to relocating reusable funcs.
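
As a rough illustration only, the argument set above could be grouped as in the following Go sketch; the struct, field names, and types here are hypothetical, since the real command emits these values as Cadence transaction arguments rather than a Go struct.

package main

import "fmt"

// recoverEpochTxArgs is an illustrative container for the arguments listed above.
type recoverEpochTxArgs struct {
	StartView        uint64     // start view of the recovery epoch
	StakingStartView uint64     // start view of the staking phase
	EndView          uint64     // end view of the recovery epoch
	DKGPubKeys       []string   // DKG key shares, one per consensus node
	NodeIDs          []string   // initial identity list of the last successful epoch
	Clusters         [][]string // collector node IDs partitioned into clusters
	ClusterQCs       []string   // serialized root QCs, one per cluster
}

func main() {
	args := recoverEpochTxArgs{StartView: 1000, StakingStartView: 1000, EndView: 5000}
	fmt.Printf("%+v\n", args)
}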

@jordanschalm (Member) left a comment:
First round of comments

cmd/bootstrap/cmd/keys.go (resolved review thread)
Comment on lines 124 to 127
_, clusters, err := common.ConstructClusterAssignment(log, partnerNodes, internalNodes, flagCollectionClusters)
if err != nil {
	log.Fatal().Err(err).Msg("unable to generate cluster assignment")
}
@jordanschalm (Member) commented:

This generates the cluster assignment based on the partner and internal node config files from the spork bootstrapping process. These files should accurately represent the identity table state at the beginning of the spork, but may not be accurate at the time a RecoveryEpoch needs to be generated.

In order to guarantee we generate valid clusters (and don't re-insert an old node which has unstaked, or omit a new node which has joined after the spork), we will need to use the identity table from the snapshot we retrieve from the network as the data source for constructing cluster assignments.

We will still need to retrieve the internal node info from disk (step ReadInternalNodeInfos) in order to get the private keys necessary to produce the QCs. However, I don't think it is useful to retrieve the partner node info from disk, because it will be completely replaced by the snapshot. I think we should:

  • Exit with an error if we observe a node in InternalNodeInfos that doesn't exist in the snapshot
  • Populate internalNodes based on what we read from disk
  • Populate partnerNodes based on snapshot.Identities - internalNodes
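
A minimal Go sketch of the approach suggested above, using simplified stand-in types (the real code works with bootstrap.NodeInfo and the protocol snapshot API, which are not shown here).

package main

import (
	"fmt"
	"log"
)

type nodeID string

// splitParticipants takes the node IDs read from disk (ReadInternalNodeInfos)
// and the identity table from the retrieved snapshot, and returns the
// internal and partner participant sets.
func splitParticipants(internalNodeIDs, snapshotIDs []nodeID) (internal, partners []nodeID, err error) {
	inSnapshot := make(map[nodeID]bool, len(snapshotIDs))
	for _, id := range snapshotIDs {
		inSnapshot[id] = true
	}
	internalSet := make(map[nodeID]bool, len(internalNodeIDs))
	for _, id := range internalNodeIDs {
		// exit with an error if an internal node is not in the snapshot
		if !inSnapshot[id] {
			return nil, nil, fmt.Errorf("internal node %s not found in snapshot identities", id)
		}
		internalSet[id] = true
		internal = append(internal, id)
	}
	// partnerNodes = snapshot.Identities - internalNodes
	for _, id := range snapshotIDs {
		if !internalSet[id] {
			partners = append(partners, id)
		}
	}
	return internal, partners, nil
}

func main() {
	internal, partners, err := splitParticipants([]nodeID{"a", "b"}, []nodeID{"a", "b", "c"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(internal, partners)
}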

@AlexHentschel (Member) commented on Mar 26, 2024:

  • Exit with an error if we observe a node in InternalNodeInfos that doesn't exist in the snapshot
  • Populate internalNodes based on what we read from disk
  • Populate partnerNodes based on snapshot.Identities - internalNodes

It is very much possible that we are suggesting the same thing, but I am not sure. I generally agree with Jordan's comment, but I would frame it slightly differently:

  • The snapshot is the core object for this logic here:
    • It specifies the set of nodes $S$ that are currently allowed to participate in the network. Note that ejected nodes are not part of $S$. So you cannot use InitialIdentities ⚠️ as you do here! Instead, you should use the Identities method from the snapshot to retrieve the set of participating nodes as of the snapshot's block.
    • If you use a selector that drops ejected nodes and nodes with zero weight, you receive the set $S$ of all participating nodes as of this block.
  • The nodes in the recovery epoch should be a subset of $S$:
    • So we select the subset of all collectors $s_c$ from $S$. These are the collectors that can participate in the recovery epoch, and we can partition them across the different clusters (method ConstructClusterAssignment).

    • For a cluster to work, we need a root QC so the cluster can start its local HotStuff. Hence, in each cluster, more than 2/3 of the weight must belong to collectors whose private staking keys we know.

    • Therefore, we split $s_c$ into two subsets:

      • $s^{(i)}_c$ are the nodes where we know the private staking key;
      • $s^{(p)}_c$ are the nodes where we do not know the private staking key.

      We can proceed as follows:

      1. We load the old bootstrapping data for our internal nodes info from disk (step ReadInternalNodeInfos). Then we iterate over all nodes $n \in s_c$. If collector node $n$ is listed in the old bootstrapping data, we have a private staking key for it and it goes into $s^{(i)}_c$. Otherwise, the collector is added to $s^{(p)}_c$ (no private key).
      2. We can then call (similar to line 124) common.ConstructClusterAssignment( log, $s^{(p)}_c$, $s^{(i)}_c$, flagCollectionClusters ).

To the best of my understanding, this is the most general algorithm and works for the broadest possible range of situations. In particular:

  • It is possible for us to add or remove internal collectors after the original spork.

    In comparison, removing one of our collectors would yield an error in Jordan's suggestion:

    Exit with an error if we observe a node in InternalNodeInfos that doesn't exist in the snapshot
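
A minimal Go sketch of the partitioning step described above, again with simplified stand-in types; the real code would operate on bootstrap.NodeInfo values and then call common.ConstructClusterAssignment, which is only referenced in a comment here.

package main

import "fmt"

type collector struct {
	NodeID string
}

// partitionCollectors splits the collectors from the snapshot (s_c) into
// those whose private staking key we hold, i.e. those listed in the old
// internal bootstrapping data (s_c^(i)), and the remainder (s_c^(p)).
func partitionCollectors(snapshotCollectors []collector, internalBootstrapIDs map[string]bool) (internal, partner []collector) {
	for _, c := range snapshotCollectors {
		if internalBootstrapIDs[c.NodeID] {
			internal = append(internal, c)
		} else {
			partner = append(partner, c)
		}
	}
	return internal, partner
}

func main() {
	sc := []collector{{"c1"}, {"c2"}, {"c3"}}
	internal, partner := partitionCollectors(sc, map[string]bool{"c1": true, "c3": true})
	fmt.Println(len(internal), len(partner))
	// the real flow would then call, roughly:
	//   common.ConstructClusterAssignment(log, partner, internal, flagCollectionClusters)
}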

@kc1116 (Contributor, PR author) replied:

@AlexHentschel thanks for the in-depth clarification of the algorithm.

2b9c152

cmd/util/cmd/epochs/cmd/recover.go (four resolved review threads)
Comment on lines 61 to 63
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStartView, "start-view", 0, "start view of the recovery epoch")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStakingEndView, "staking-end-view", 0, "end view of the staking phase of the recovery epoch")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagEndView, "end-view", 0, "end view of the recovery epoch")
Member commented:

I think it would be easier from an ops perspective if we used the same representation for these values as we do in bootstrapping, in particular using the lengths as input rather than the end views. I suggest we use the same flag names as the rootblock command:

  • --epoch-length
  • --epoch-staking-phase-length

Then we can compute endView and stakingEndView as we do in the bootstrapping.
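
A hedged Go sketch of how the view boundaries could be derived from the length flags; the exact +1/-1 conventions are an assumption here and would need to match what the rootblock/bootstrapping code actually does.

package main

import "fmt"

// recoveryEpochViews derives the recovery epoch's view boundaries from the
// current epoch's final view and the two length flags.
func recoveryEpochViews(curEpochFinalView, epochLength, stakingPhaseLength uint64) (startView, stakingEndView, endView uint64) {
	startView = curEpochFinalView + 1                   // recovery epoch starts right after the current epoch
	stakingEndView = startView + stakingPhaseLength - 1 // assumed inclusive end of the staking phase
	endView = startView + epochLength - 1               // assumed inclusive end of the recovery epoch
	return startView, stakingEndView, endView
}

func main() {
	s, se, e := recoveryEpochViews(100_000, 4000, 100)
	fmt.Println(s, se, e) // 100001 100100 104000
}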

kc1116 (PR author) replied.

" in the JSON file: Role, Address, NodeID, NetworkPubKey, StakingPubKey)")
generateRecoverEpochTxArgsCmd.Flags().StringVar(&flagPartnerWeights, "partner-weights", "", "path to a JSON file containing "+
"a map from partner node's NodeID to their stake")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStartView, "start-view", 0, "start view of the recovery epoch")
Member commented:

Start view sensitivity

The startView parameter is very sensitive: for an EpochRecover service event to be valid, it must be exactly one view after the last EpochExtension. Otherwise we would have conflicting or missing definitions of who the leader is for a given view.

Suggestions:

  • Retrieve the "extended final view" from the snapshot (FinalView from the latest extension of the current epoch)
  • Validate that --start-view is equal to extendedFinalView+1

Alternatively we can revisit the strictness of the requirement, but would need to carefully think through the safety implications.

Race condition

In addition, there is a race condition here to be aware of. The design specifies that we will add a new extension periodically. If, in between retrieving the snapshot while executing this command and submitting the corresponding admin transaction, we add a new epoch extension, the RecoverEpoch event would be discarded as invalid.

This isn't the end of the world, because we can just do it again, but it might be worth having the tooling help avoid this situation, for example by reporting the number of views remaining before another extension would be added. To be honest, I'm somewhat conflicted: on the one hand it would be useful, but on the other hand, this is already an infrequent, heavily involved process where I'd expect humans to be double-checking these things.
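
A hedged Go sketch of the start-view validation suggested above; here extendedFinalView is a plain number standing in for reading the FinalView of the latest extension of the current epoch from the snapshot, which is assumed rather than shown.

package main

import (
	"fmt"
	"log"
)

// validateStartView enforces the rule above: the recovery epoch must start
// exactly one view after the extended final view of the current epoch.
func validateStartView(flagStartView, extendedFinalView uint64) error {
	if flagStartView != extendedFinalView+1 {
		return fmt.Errorf("--start-view must be exactly one view after the extended final view: got %d, want %d",
			flagStartView, extendedFinalView+1)
	}
	return nil
}

func main() {
	if err := validateStartView(104_001, 104_000); err != nil {
		log.Fatal(err)
	}
	fmt.Println("start view is consistent with the snapshot")
}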

kc1116 (PR author) replied:

This comment is related to the comment above it, and the same commit (884fdbf) applies. As for the race condition, it's unlikely to happen, and we would only be able to log a useful message if we could somehow get the current view from the network; then we could log how many views are left until the start-view.


dkgPubKeys := make([]cadence.Value, 0)
nodeIds := make([]cadence.Value, 0)
ids.Map(func(skeleton flow.IdentitySkeleton) flow.IdentitySkeleton {
Member commented:

I'd suggest just using a for loop here. Map is used to translate a list through a function into a new list, but here we're discarding the output list, which defeats the purpose.
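
A minimal sketch of the plain for loop suggested above, using a simplified stand-in type; the real code iterates flow.IdentitySkeleton values and builds cadence values, which is elided here.

package main

import "fmt"

// identitySkeleton is a stand-in for flow.IdentitySkeleton.
type identitySkeleton struct {
	NodeID string
}

func main() {
	ids := []identitySkeleton{{NodeID: "node-1"}, {NodeID: "node-2"}}

	// plain loop: we only want the side effect of collecting node IDs,
	// so there is no Map result to discard
	nodeIDs := make([]string, 0, len(ids))
	for _, id := range ids {
		nodeIDs = append(nodeIDs, id.NodeID)
	}
	fmt.Println(nodeIDs)
}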

kc1116 (PR author) replied.

log.Fatal().Err(err).Msg("failed to get random source cadence string")
}

dkgPubKeys := make([]cadence.Value, 0)
Member commented:

The first element here will need to be the group public key (see https://github.com/onflow/flow-go/blob/master/model/convert/service_event.go#L360-L362).
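
A minimal sketch of the ordering requirement: the DKG group public key goes first, followed by the per-node key shares. Plain strings stand in for the cadence values used in the real code.

package main

import "fmt"

// buildDKGPubKeys places the DKG group public key first, followed by the
// individual consensus nodes' key shares.
func buildDKGPubKeys(groupKey string, keyShares []string) []string {
	dkgPubKeys := make([]string, 0, len(keyShares)+1)
	dkgPubKeys = append(dkgPubKeys, groupKey) // group public key must be element 0
	dkgPubKeys = append(dkgPubKeys, keyShares...)
	return dkgPubKeys
}

func main() {
	fmt.Println(buildDKGPubKeys("group-key", []string{"share-1", "share-2"}))
}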

kc1116 (PR author) replied.

cmd/util/cmd/common/clusters.go (resolved review thread)
Comment on lines 64 to 65
identifierLists[i%len(identifierLists)] = append(identifierLists[i%len(identifierLists)], node.NodeID)
constraint[i%nClusters] -= 2
constraint[i%numCollectionClusters] -= 2
Member commented:

By construction, the length of identifierLists is numCollectionClusters:

identifierLists := make([]flow.IdentifierList, numCollectionClusters)

Above, we have already used this relation:

// first, round-robin internal nodes into each cluster
for i, node := range internals {
	identifierLists[i%numCollectionClusters] = append(identifierLists[i%numCollectionClusters], node.NodeID)
	constraint[i%numCollectionClusters] += 1
}

Algorithmically, we are doing exactly the same thing here: distributing collector nodes amongst the clusters. So the code should have the same structure; otherwise, people reading this will get confused.

Suggested change:
- identifierLists[i%len(identifierLists)] = append(identifierLists[i%len(identifierLists)], node.NodeID)
- constraint[i%nClusters] -= 2
- constraint[i%numCollectionClusters] -= 2
+ clusterIndex := i % numCollectionClusters
+ identifierLists[clusterIndex] = append(identifierLists[clusterIndex], node.NodeID)
+ constraint[clusterIndex] -= 2

cmd/util/cmd/common/clusters.go (resolved review thread)
Comment on lines 30 to 36
networkPubKey := cmd.ValidateNetworkPubKey(partner.NetworkPubKey)
stakingPubKey := cmd.ValidateStakingPubKey(partner.StakingPubKey)
weight, valid := cmd.ValidateWeight(weights[partner.NodeID])
if !valid {
	log.Error().Msgf("weights: %v", weights)
	log.Fatal().Msgf("partner node id %x has no weight", nodeID)
}
Member commented:

We are mixing panics and boolean returns for indicating invalid node information. Ideally, it would all be the same: sentinel errors.

If it is a huge refactoring, feel free to skip. But if we can do it with a moderate amount of work, it would be nice to have consistent code here.
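
A hedged sketch of the consistent sentinel-error style being suggested; the validator below and its signature are hypothetical stand-ins, not the existing cmd.Validate* helpers.

package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrInvalidWeight is a sentinel error; callers use errors.Is to detect it.
var ErrInvalidWeight = errors.New("node has no weight")

func validateWeight(weights map[string]uint64, nodeID string) (uint64, error) {
	w, ok := weights[nodeID]
	if !ok || w == 0 {
		return 0, fmt.Errorf("partner node %s: %w", nodeID, ErrInvalidWeight)
	}
	return w, nil
}

func main() {
	if _, err := validateWeight(map[string]uint64{"a": 100}, "b"); errors.Is(err, ErrInvalidWeight) {
		log.Println(err) // the caller decides whether this is fatal
	}
}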

kc1116 (PR author) replied.

cmd/util/cmd/common/node_info.go (resolved review thread)
Comment on lines 43 to 46
// GetSnapshotAtEpochAndPhase will get the latest finalized protocol snapshot and check the current epoch and epoch phase.
// If we are past the target epoch and epoch phase we exit the retry mechanism immediately.
// If not check the snapshot at the specified interval until we reach the target epoch and phase.
func GetSnapshotAtEpochAndPhase(ctx context.Context, log zerolog.Logger, startupEpoch uint64, startupEpochPhase flow.EpochPhase, retryInterval time.Duration, getSnapshot GetProtocolSnapshot) (protocol.Snapshot, error) {
Member commented:

It would be helpful to document the meaning of some of the inputs: startupEpoch, startupEpochPhase, and getSnapshot.
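
One possible doc comment spelling out the inputs; the wording below is an editor's suggestion inferred from the surrounding code, not the final documentation.

// GetSnapshotAtEpochAndPhase retrieves protocol snapshots via getSnapshot
// until the returned snapshot has reached (or passed) the target epoch and
// epoch phase.
//   - startupEpoch: counter of the epoch to wait for
//   - startupEpochPhase: phase within startupEpoch to wait for
//   - retryInterval: how long to wait between successive snapshot queries
//   - getSnapshot: callback used to fetch the latest finalized snapshot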

kc1116 (PR author) replied.

kc1116 and others added 2 commits March 28, 2024 21:16
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
@codecov-commenter commented on Apr 1, 2024:

Codecov Report

Attention: Patch coverage is 0%, with 82 lines in your changes missing coverage. Please review.

Project coverage is 56.55%. Comparing base (00e4924) to head (2eb6b11).
Report is 346 commits behind head on master.

Files                                       Patch %   Lines
utils/unittest/service_events_fixtures.go   0.00%     82 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5576      +/-   ##
==========================================
+ Coverage   55.65%   56.55%   +0.89%     
==========================================
  Files        1036      629     -407     
  Lines      101131    60238   -40893     
==========================================
- Hits        56287    34065   -22222     
+ Misses      40518    23675   -16843     
+ Partials     4326     2498    -1828     
Flag        Coverage Δ
unittests   56.55% <0.00%> (+0.89%) ⬆️

Flags with carried forward coverage won't be shown.


kc1116 and others added 20 commits April 1, 2024 10:14
- add filter for valid protocol participant
- emit fatal log if identity is present in internal node info from disk but missing from snapshot identities list
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
- infer start view, staking phase end view, and epoch end view from curr epoch final view
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
@kc1116 kc1116 requested a review from jordanschalm April 9, 2024 15:27
@jordanschalm (Member) left a comment:

A few final small suggestions, but otherwise looks good!

integration/tests/epochs/base_suite.go (resolved review thread)
integration/tests/epochs/recover_epoch/suite.go (resolved review thread)
Comment on lines +18 to +19
s.DKGPhaseLen = 50
s.EpochLen = 250
Member commented:

Just a note: when we get to writing this integration test case, we'll likely be able to set these more aggressively (shorter). We can discuss later.

@AlexHentschel (Member) left a comment:

Great work; largely cosmetic suggestions.

Also wanted to say thanks for the cleanup and documentation improvements to the adjacent code. Continuous efforts on a smaller scale really add up to a big difference over time for clarity and low tech debt in our code base. Thanks, I really appreciate your contributions. 💚

cmd/bootstrap/cmd/final_list.go (two resolved review threads)
cmd/util/cmd/common/clusters.go (two resolved review threads)
cmd/util/cmd/common/node_info.go (three resolved review threads)
cmd/util/cmd/epochs/cmd/recover.go (two resolved review threads)
Comment on lines 67 to 69
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagNumViewsInEpoch, "epoch-length", 4000, "length of each epoch measured in views")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagNumViewsInStakingAuction, "epoch-staking-phase-length", 100, "length of the epoch staking phase measured in views")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagEpochCounter, "epoch-counter", 0, "the epoch counter used to generate the root cluster block")
Member commented:

Based on my understanding, I would think these default values are drastically too short. I would also like to get @jordanschalm's thoughts...

My thoughts:

  • Default values should provide a sensible setting for mainnet.

  • We are talking about an epoch here, with the limitation that it is mainly for recovery, but aren't we planning to run a regular DKG in this recovery epoch? I think a day would be reasonable.

    • For reference, we decided on 3000 views for each DKG phase (3 in total) for the mainnet consensus timing, plus EpochCommitSafetyThreshold (which should also be minimally 1000 blocks), plus the staking and epoch commit phases that also belong to an epoch on the happy path.
      With the new consensus timing [👉 reference], one day would correspond to 108,000 views for mainnet.

Member replied:

In practice the CLI will require all of these to be explicitly specified as flags when run, because of the MarkFlagRequired calls below. I think it is preferable to force operators to provide specific values, and to fail otherwise, rather than falling back to a default value (the current behaviour).

To prevent confusion, we could set the "default" value parameter to 0 when defining these flags.
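
A sketch of the zero-default plus required-flag pattern discussed above, using cobra; the command variable and surrounding wiring are stand-ins, while the two flag names match the snippet above.

package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

var (
	flagNumViewsInEpoch          uint64
	flagNumViewsInStakingAuction uint64
)

func main() {
	cmd := &cobra.Command{
		Use: "recover-epoch-tx-args",
		Run: func(_ *cobra.Command, _ []string) {
			fmt.Println(flagNumViewsInEpoch, flagNumViewsInStakingAuction)
		},
	}
	// default 0 makes it obvious that the value must be supplied explicitly
	cmd.Flags().Uint64Var(&flagNumViewsInEpoch, "epoch-length", 0, "length of each epoch measured in views")
	cmd.Flags().Uint64Var(&flagNumViewsInStakingAuction, "epoch-staking-phase-length", 0, "length of the epoch staking phase measured in views")
	// the flags are required, so cobra rejects an invocation that omits them
	_ = cmd.MarkFlagRequired("epoch-length")
	_ = cmd.MarkFlagRequired("epoch-staking-phase-length")
	_ = cmd.Execute()
}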

kc1116 and others added 15 commits April 15, 2024 18:13
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
@kc1116 kc1116 added this pull request to the merge queue Apr 17, 2024
Merged via the queue into master with commit 9c4c4a3 Apr 17, 2024
55 checks passed
@kc1116 kc1116 deleted the khalil/6959-efm-recvery-epoch-data-generation branch April 17, 2024 15:46
log.Fatal().Err(cdcErr).Msg("failed to get dkg group key cadence string")
}
dkgPubKeys = append(dkgPubKeys, dkgGroupKeyCdc)
for _, id := range currentEpochIdentities {
@jordanschalm (Member) commented on May 2, 2024:

Sorry to comment on a closed PR. I noticed some sanity checks we should add here while reviewing onflow/flow-core-contracts#420.

We should check that currentEpochDKG.Size() == len(currentEpochIdentities.Filter(filter.HasRole(flow.RoleConsensus)))

We already check that there is a DKG key for every consensus node, but not that there is a consensus node for every DKG key.

Added a reminder to the design doc for this.
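
A minimal sketch of the suggested sanity check, with simplified stand-in types; the real code would compare currentEpochDKG.Size() against the number of consensus-role identities from the snapshot.

package main

import (
	"fmt"
	"log"
)

type identity struct {
	NodeID string
	Role   string
}

// checkDKGMatchesConsensusSet verifies that there is exactly one DKG
// participant per consensus node (and vice versa).
func checkDKGMatchesConsensusSet(dkgSize int, identities []identity) error {
	consensusCount := 0
	for _, id := range identities {
		if id.Role == "consensus" {
			consensusCount++
		}
	}
	if dkgSize != consensusCount {
		return fmt.Errorf("DKG participant count (%d) does not match consensus node count (%d)", dkgSize, consensusCount)
	}
	return nil
}

func main() {
	ids := []identity{{"a", "consensus"}, {"b", "collection"}, {"c", "consensus"}}
	if err := checkDKGMatchesConsensusSet(2, ids); err != nil {
		log.Fatal(err)
	}
	fmt.Println("DKG participants match the consensus committee")
}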
