
Khalil/6959 Recover Epoch transaction args script #5576

Merged: 57 commits merged into master from khalil/6959-efm-recvery-epoch-data-generation on Apr 17, 2024

Conversation

@kc1116 (Contributor) commented on Mar 22, 2024:

This PR adds a new cmd, recover-epoch-tx-args, to the epochs CLI command; it generates the transaction args for the RecoverEpoch tx. The command reuses logic from the sporking utility tool to generate the following transaction args:

  • startView: start view of the recovery epoch
  • stakingStartView: start view of the staking phase of the recovery epoch
  • endView: end view of the recovery epoch
  • dkgPubKeys: list of DKG key shares for each consensus node, from the DKG of the last successful epoch
  • nodeIds: initial identity list of the last successful epoch
  • clusters: node clusters
  • clusterQcs: cluster QCs generated from the clusters

Note: the significant changes are in cmd/util/cmd/epochs/cmd/recover.go; the other changes are due to relocating reusable funcs.
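
As a rough illustration only, the argument set above could be grouped as in the following Go sketch; the struct, field names, and types here are hypothetical, since the real command emits these values as Cadence transaction arguments rather than a Go struct.

package main

import "fmt"

// recoverEpochTxArgs is an illustrative container for the arguments listed above.
type recoverEpochTxArgs struct {
	StartView        uint64     // start view of the recovery epoch
	StakingStartView uint64     // start view of the staking phase
	EndView          uint64     // end view of the recovery epoch
	DKGPubKeys       []string   // DKG key shares, one per consensus node
	NodeIDs          []string   // initial identity list of the last successful epoch
	Clusters         [][]string // collector node IDs partitioned into clusters
	ClusterQCs       []string   // serialized root QCs, one per cluster
}

func main() {
	args := recoverEpochTxArgs{StartView: 1000, StakingStartView: 1000, EndView: 5000}
	fmt.Printf("%+v\n", args)
}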

@jordanschalm (Member) left a comment:
First round of comments

cmd/bootstrap/cmd/keys.go (resolved review thread)
Comment on lines 124 to 127
_, clusters, err := common.ConstructClusterAssignment(log, partnerNodes, internalNodes, flagCollectionClusters)
if err != nil {
	log.Fatal().Err(err).Msg("unable to generate cluster assignment")
}
@jordanschalm (Member) commented:

This generates the cluster assignment based on the partner and internal node config files from the spork bootstrapping process. These files should accurately represent the identity table state at the beginning of the spork, but may not be accurate at the time a RecoveryEpoch needs to be generated.

In order to guarantee we generate valid clusters (and don't re-insert an old node which has unstaked, or omit a new node which has joined after the spork), we will need to use the identity table from the snapshot we retrieve from the network as the data source for constructing cluster assignments.

We will still need to retrieve the internal node info from disk (step ReadInternalNodeInfos) in order to get the private keys necessary to produce the QCs. However, I don't think it is useful to retrieve the partner node info from disk, because it will be completely replaced by the snapshot. I think we should:

  • Exit with an error if we observe a node in InternalNodeInfos that doesn't exist in the snapshot
  • Populate internalNodes based on what we read from disk
  • Populate partnerNodes based on snapshot.Identities - internalNodes
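
A minimal Go sketch of the approach suggested above, using simplified stand-in types (the real code works with bootstrap.NodeInfo and the protocol snapshot API, which are not shown here).

package main

import (
	"fmt"
	"log"
)

type nodeID string

// splitParticipants takes the node IDs read from disk (ReadInternalNodeInfos)
// and the identity table from the retrieved snapshot, and returns the
// internal and partner participant sets.
func splitParticipants(internalNodeIDs, snapshotIDs []nodeID) (internal, partners []nodeID, err error) {
	inSnapshot := make(map[nodeID]bool, len(snapshotIDs))
	for _, id := range snapshotIDs {
		inSnapshot[id] = true
	}
	internalSet := make(map[nodeID]bool, len(internalNodeIDs))
	for _, id := range internalNodeIDs {
		// exit with an error if an internal node is not in the snapshot
		if !inSnapshot[id] {
			return nil, nil, fmt.Errorf("internal node %s not found in snapshot identities", id)
		}
		internalSet[id] = true
		internal = append(internal, id)
	}
	// partnerNodes = snapshot.Identities - internalNodes
	for _, id := range snapshotIDs {
		if !internalSet[id] {
			partners = append(partners, id)
		}
	}
	return internal, partners, nil
}

func main() {
	internal, partners, err := splitParticipants([]nodeID{"a", "b"}, []nodeID{"a", "b", "c"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(internal, partners)
}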

@AlexHentschel (Member) commented on Mar 26, 2024:

  • Exit with an error if we observe a node in InternalNodeInfos that doesn't exist in the snapshot
  • Populate internalNodes based on what we read from disk
  • Populate partnerNodes based on snapshot.Identities - internalNodes

It is very much possible that we are suggesting the same thing, but I am not sure. I generally agree with Jordan's comment, but I would frame it slightly differently:

  • The snapshot is the core object for this logic here:
    • It specifies the set of nodes $S$ that are currently allowed to participate in the network. Note that ejected nodes are not part of $S$. So you cannot use InitialIdentities ⚠️ as you do here! Instead, you should use the Identities method from the snapshot to retrieve the set of participating nodes as of the snapshot's block.
    • If you use a selector that drops ejected nodes and nodes with zero weight, you receive the set $S$ of all participating nodes as of this block.
  • The nodes in the recovery epoch should be a subset of $S$:
    • So we select the subset of all collectors $s_c$ from $S$. These are the collectors that can participate in the recovery epoch, and we can partition them across the different clusters (method ConstructClusterAssignment).

    • For a cluster to work, we need a root QC so the cluster can start its local HotStuff. Hence, in each cluster, more than 2/3 of the weight must belong to collectors whose private staking keys we know.

    • Therefore, we split $s_c$ into two subsets:

      • $s^{(i)}_c$ are the nodes where we know the private staking key;
      • $s^{(p)}_c$ are the nodes where we do not know the private staking key.

      We can proceed as follows:

      1. We load the old bootstrapping data for our internal nodes info from disk (step ReadInternalNodeInfos). Then we iterate over all nodes $n \in s_c$. If collector node $n$ is listed in the old bootstrapping data, we have a private staking key for it and it goes into $s^{(i)}_c$. Otherwise, the collector is added to $s^{(p)}_c$ (no private key).
      2. We can then call (similar to line 124) common.ConstructClusterAssignment( log, $s^{(p)}_c$, $s^{(i)}_c$, flagCollectionClusters ).

To the best of my understanding, this is the most general algorithm and works for the broadest possible range of situations. In particular:

  • It is possible for us to add or remove internal collectors after the original spork.

    In comparison, removing one of our collectors would yield an error in Jordan's suggestion:

    Exit with an error if we observe a node in InternalNodeInfos that doesn't exist in the snapshot
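
A minimal Go sketch of the partitioning step described above, again with simplified stand-in types; the real code would operate on bootstrap.NodeInfo values and then call common.ConstructClusterAssignment, which is only referenced in a comment here.

package main

import "fmt"

type collector struct {
	NodeID string
}

// partitionCollectors splits the collectors from the snapshot (s_c) into
// those whose private staking key we hold, i.e. those listed in the old
// internal bootstrapping data (s_c^(i)), and the remainder (s_c^(p)).
func partitionCollectors(snapshotCollectors []collector, internalBootstrapIDs map[string]bool) (internal, partner []collector) {
	for _, c := range snapshotCollectors {
		if internalBootstrapIDs[c.NodeID] {
			internal = append(internal, c)
		} else {
			partner = append(partner, c)
		}
	}
	return internal, partner
}

func main() {
	sc := []collector{{"c1"}, {"c2"}, {"c3"}}
	internal, partner := partitionCollectors(sc, map[string]bool{"c1": true, "c3": true})
	fmt.Println(len(internal), len(partner))
	// the real flow would then call, roughly:
	//   common.ConstructClusterAssignment(log, partner, internal, flagCollectionClusters)
}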

@kc1116 (Contributor, PR author) replied:

@AlexHentschel thanks for the in-depth clarification of the algorithm.

2b9c152

cmd/util/cmd/epochs/cmd/recover.go (four resolved review threads)
Comment on lines 61 to 63
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStartView, "start-view", 0, "start view of the recovery epoch")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStakingEndView, "staking-end-view", 0, "end view of the staking phase of the recovery epoch")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagEndView, "end-view", 0, "end view of the recovery epoch")
Member commented:

I think it would be easier from an ops perspective if we used the same representation for these values as we do in bootstrapping, in particular using the lengths as input rather than the end views. I suggest we use the same flag names as the rootblock command:

  • --epoch-length
  • --epoch-staking-phase-length

Then we can compute endView and stakingEndView as we do in the bootstrapping.
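
A hedged Go sketch of how the view boundaries could be derived from the length flags; the exact +1/-1 conventions are an assumption here and would need to match what the rootblock/bootstrapping code actually does.

package main

import "fmt"

// recoveryEpochViews derives the recovery epoch's view boundaries from the
// current epoch's final view and the two length flags.
func recoveryEpochViews(curEpochFinalView, epochLength, stakingPhaseLength uint64) (startView, stakingEndView, endView uint64) {
	startView = curEpochFinalView + 1                   // recovery epoch starts right after the current epoch
	stakingEndView = startView + stakingPhaseLength - 1 // assumed inclusive end of the staking phase
	endView = startView + epochLength - 1               // assumed inclusive end of the recovery epoch
	return startView, stakingEndView, endView
}

func main() {
	s, se, e := recoveryEpochViews(100_000, 4000, 100)
	fmt.Println(s, se, e) // 100001 100100 104000
}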

kc1116 (PR author) replied.

" in the JSON file: Role, Address, NodeID, NetworkPubKey, StakingPubKey)")
generateRecoverEpochTxArgsCmd.Flags().StringVar(&flagPartnerWeights, "partner-weights", "", "path to a JSON file containing "+
"a map from partner node's NodeID to their stake")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStartView, "start-view", 0, "start view of the recovery epoch")
Member commented:

Start view sensitivity

The startView parameter is very sensitive: for an EpochRecover service event to be valid, it must be exactly one view after the last EpochExtension. Otherwise we would have conflicting or missing definitions of who the leader is for a given view.

Suggestions:

  • Retrieve the "extended final view" from the snapshot (FinalView from the latest extension of the current epoch)
  • Validate that --start-view is equal to extendedFinalView+1

Alternatively we can revisit the strictness of the requirement, but would need to carefully think through the safety implications.

Race condition

In addition, there is a race condition here to be aware of. The design specifies that we will add a new extension periodically. If, in between retrieving the snapshot while executing this command and submitting the corresponding admin transaction, we add a new epoch extension, the RecoverEpoch event would be discarded as invalid.

This isn't the end of the world, because we can just do it again, but it might be worth having the tooling help avoid this situation, for example by reporting the number of views remaining before another extension would be added. To be honest, I'm somewhat conflicted: on the one hand it would be useful, but on the other hand, this is already an infrequent, heavily involved process where I'd expect humans to be double-checking these things.
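
A hedged Go sketch of the start-view validation suggested above; here extendedFinalView is a plain number standing in for reading the FinalView of the latest extension of the current epoch from the snapshot, which is assumed rather than shown.

package main

import (
	"fmt"
	"log"
)

// validateStartView enforces the rule above: the recovery epoch must start
// exactly one view after the extended final view of the current epoch.
func validateStartView(flagStartView, extendedFinalView uint64) error {
	if flagStartView != extendedFinalView+1 {
		return fmt.Errorf("--start-view must be exactly one view after the extended final view: got %d, want %d",
			flagStartView, extendedFinalView+1)
	}
	return nil
}

func main() {
	if err := validateStartView(104_001, 104_000); err != nil {
		log.Fatal(err)
	}
	fmt.Println("start view is consistent with the snapshot")
}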

kc1116 (PR author) replied:

This comment is related to the comment above it, and the same commit (884fdbf) applies. As for the race condition, it's unlikely to happen, and we would only be able to log a useful message if we could somehow get the current view from the network; then we could log how many views are left until the start-view.


dkgPubKeys := make([]cadence.Value, 0)
nodeIds := make([]cadence.Value, 0)
ids.Map(func(skeleton flow.IdentitySkeleton) flow.IdentitySkeleton {
Member commented:

I'd suggest just using a for loop here. Map is used to translate a list through a function into a new list, but here we're discarding the output list, which defeats the purpose.
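
A minimal sketch of the plain for loop suggested above, using a simplified stand-in type; the real code iterates flow.IdentitySkeleton values and builds cadence values, which is elided here.

package main

import "fmt"

// identitySkeleton is a stand-in for flow.IdentitySkeleton.
type identitySkeleton struct {
	NodeID string
}

func main() {
	ids := []identitySkeleton{{NodeID: "node-1"}, {NodeID: "node-2"}}

	// plain loop: we only want the side effect of collecting node IDs,
	// so there is no Map result to discard
	nodeIDs := make([]string, 0, len(ids))
	for _, id := range ids {
		nodeIDs = append(nodeIDs, id.NodeID)
	}
	fmt.Println(nodeIDs)
}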

kc1116 (PR author) replied.

log.Fatal().Err(err).Msg("failed to get random source cadence string")
}

dkgPubKeys := make([]cadence.Value, 0)
Member commented:

The first element here will need to be the group public key (see https://github.com/onflow/flow-go/blob/master/model/convert/service_event.go#L360-L362).
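
A minimal sketch of the ordering requirement: the DKG group public key goes first, followed by the per-node key shares. Plain strings stand in for the cadence values used in the real code.

package main

import "fmt"

// buildDKGPubKeys places the DKG group public key first, followed by the
// individual consensus nodes' key shares.
func buildDKGPubKeys(groupKey string, keyShares []string) []string {
	dkgPubKeys := make([]string, 0, len(keyShares)+1)
	dkgPubKeys = append(dkgPubKeys, groupKey) // group public key must be element 0
	dkgPubKeys = append(dkgPubKeys, keyShares...)
	return dkgPubKeys
}

func main() {
	fmt.Println(buildDKGPubKeys("group-key", []string{"share-1", "share-2"}))
}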

kc1116 (PR author) replied.

cmd/util/cmd/common/clusters.go (resolved review thread)
Comment on lines 64 to 65
identifierLists[i%len(identifierLists)] = append(identifierLists[i%len(identifierLists)], node.NodeID)
constraint[i%nClusters] -= 2
constraint[i%numCollectionClusters] -= 2
Member commented:

By construction, the length of identifierLists is numCollectionClusters:

identifierLists := make([]flow.IdentifierList, numCollectionClusters)

Above, we have already used this relation:

// first, round-robin internal nodes into each cluster
for i, node := range internals {
	identifierLists[i%numCollectionClusters] = append(identifierLists[i%numCollectionClusters], node.NodeID)
	constraint[i%numCollectionClusters] += 1
}

Algorithmically, we are doing exactly the same thing here: distributing collector nodes amongst the clusters. So the code should have the same structure; otherwise, people reading this will get confused.

Suggested change:
- identifierLists[i%len(identifierLists)] = append(identifierLists[i%len(identifierLists)], node.NodeID)
- constraint[i%nClusters] -= 2
- constraint[i%numCollectionClusters] -= 2
+ clusterIndex := i % numCollectionClusters
+ identifierLists[clusterIndex] = append(identifierLists[clusterIndex], node.NodeID)
+ constraint[clusterIndex] -= 2

cmd/util/cmd/common/clusters.go (resolved review thread)
Comment on lines 30 to 36
networkPubKey := cmd.ValidateNetworkPubKey(partner.NetworkPubKey)
stakingPubKey := cmd.ValidateStakingPubKey(partner.StakingPubKey)
weight, valid := cmd.ValidateWeight(weights[partner.NodeID])
if !valid {
	log.Error().Msgf("weights: %v", weights)
	log.Fatal().Msgf("partner node id %x has no weight", nodeID)
}
Member commented:

We are mixing panics and boolean returns for indicating invalid node information. Ideally, it would all be the same: sentinel errors.

If it is a huge refactoring, feel free to skip. But if we can do it with a moderate amount of work, it would be nice to have consistent code here.
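
A hedged sketch of the consistent sentinel-error style being suggested; the validator below and its signature are hypothetical stand-ins, not the existing cmd.Validate* helpers.

package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrInvalidWeight is a sentinel error; callers use errors.Is to detect it.
var ErrInvalidWeight = errors.New("node has no weight")

func validateWeight(weights map[string]uint64, nodeID string) (uint64, error) {
	w, ok := weights[nodeID]
	if !ok || w == 0 {
		return 0, fmt.Errorf("partner node %s: %w", nodeID, ErrInvalidWeight)
	}
	return w, nil
}

func main() {
	if _, err := validateWeight(map[string]uint64{"a": 100}, "b"); errors.Is(err, ErrInvalidWeight) {
		log.Println(err) // the caller decides whether this is fatal
	}
}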

kc1116 (PR author) replied.

cmd/util/cmd/common/node_info.go (resolved review thread)
Comment on lines 43 to 46
// GetSnapshotAtEpochAndPhase will get the latest finalized protocol snapshot and check the current epoch and epoch phase.
// If we are past the target epoch and epoch phase we exit the retry mechanism immediately.
// If not check the snapshot at the specified interval until we reach the target epoch and phase.
func GetSnapshotAtEpochAndPhase(ctx context.Context, log zerolog.Logger, startupEpoch uint64, startupEpochPhase flow.EpochPhase, retryInterval time.Duration, getSnapshot GetProtocolSnapshot) (protocol.Snapshot, error) {
Member commented:

It would be helpful to document the meaning of some of the inputs: startupEpoch, startupEpochPhase, and getSnapshot.
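
One possible doc comment spelling out the inputs; the wording below is an editor's suggestion inferred from the surrounding code, not the final documentation.

// GetSnapshotAtEpochAndPhase retrieves protocol snapshots via getSnapshot
// until the returned snapshot has reached (or passed) the target epoch and
// epoch phase.
//   - startupEpoch: counter of the epoch to wait for
//   - startupEpochPhase: phase within startupEpoch to wait for
//   - retryInterval: how long to wait between successive snapshot queries
//   - getSnapshot: callback used to fetch the latest finalized snapshot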

kc1116 (PR author) replied.

kc1116 and others added 2 commits March 28, 2024 21:16
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
@codecov-commenter commented on Apr 1, 2024:

Codecov Report

Attention: Patch coverage is 0%, with 82 lines in your changes missing coverage. Please review.

Project coverage is 56.55%. Comparing base (00e4924) to head (2eb6b11).
Report is 346 commits behind head on master.

Files                                       Patch %   Lines
utils/unittest/service_events_fixtures.go   0.00%     82 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5576      +/-   ##
==========================================
+ Coverage   55.65%   56.55%   +0.89%     
==========================================
  Files        1036      629     -407     
  Lines      101131    60238   -40893     
==========================================
- Hits        56287    34065   -22222     
+ Misses      40518    23675   -16843     
+ Partials     4326     2498    -1828     
Flag        Coverage Δ
unittests   56.55% <0.00%> (+0.89%) ⬆️

Flags with carried forward coverage won't be shown.


kc1116 and others added 20 commits April 1, 2024 10:14
- add filter for valid protocol participant
- emit fatal log if identity is present in internal node info from disk but missing from snapshot identities list
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
- infer start view, staking phase end view, and epoch end view from curr epoch final view
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
@kc1116 kc1116 requested a review from jordanschalm April 9, 2024 15:27
@jordanschalm (Member) left a comment:

A few final small suggestions, but otherwise looks good!

integration/tests/epochs/base_suite.go (resolved review thread)
integration/tests/epochs/recover_epoch/suite.go (resolved review thread)
Comment on lines +18 to +19
s.DKGPhaseLen = 50
s.EpochLen = 250
Member commented:

Just a note: when we get to writing this integration test case, we'll likely be able to set these more aggressively (shorter). We can discuss later.

@AlexHentschel (Member) left a comment:

Great work; largely cosmetic suggestions.

Also wanted to say thanks for the cleanup and documentation improvements to the adjacent code. Continuous efforts on a smaller scale really add up to a big difference over time for clarity and low tech debt in our code base. Thanks, I really appreciate your contributions. 💚

cmd/bootstrap/cmd/final_list.go (two resolved review threads)
cmd/util/cmd/common/clusters.go (two resolved review threads)
cmd/util/cmd/common/node_info.go (three resolved review threads)
cmd/util/cmd/epochs/cmd/recover.go (two resolved review threads)
Comment on lines 67 to 69
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagNumViewsInEpoch, "epoch-length", 4000, "length of each epoch measured in views")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagNumViewsInStakingAuction, "epoch-staking-phase-length", 100, "length of the epoch staking phase measured in views")
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagEpochCounter, "epoch-counter", 0, "the epoch counter used to generate the root cluster block")
Member commented:

Based on my understanding, I would think these default values are drastically too short. I would also like to get @jordanschalm's thoughts...

My thoughts:

  • Default values should provide a sensible setting for mainnet.

  • We are talking about an epoch here, with the limitation that it is mainly for recovery, but aren't we planning to run a regular DKG in this recovery epoch? I think a day would be reasonable.

    • For reference, we decided on 3000 views for each DKG phase (3 in total) for the mainnet consensus timing, plus EpochCommitSafetyThreshold (which should also be minimally 1000 blocks), plus the staking and epoch commit phases that also belong to an epoch on the happy path.
      With the new consensus timing [👉 reference], one day would correspond to 108,000 views for mainnet.

Member replied:

In practice the CLI will require all of these to be explicitly specified as flags when run, because of the MarkFlagRequired calls below. I think it is preferable to force operators to provide specific values, and to fail otherwise, rather than falling back to a default value (the current behaviour).

To prevent confusion, we could set the "default" value parameter to 0 when defining these flags.
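
A sketch of the zero-default plus required-flag pattern discussed above, using cobra; the command variable and surrounding wiring are stand-ins, while the two flag names match the snippet above.

package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

var (
	flagNumViewsInEpoch          uint64
	flagNumViewsInStakingAuction uint64
)

func main() {
	cmd := &cobra.Command{
		Use: "recover-epoch-tx-args",
		Run: func(_ *cobra.Command, _ []string) {
			fmt.Println(flagNumViewsInEpoch, flagNumViewsInStakingAuction)
		},
	}
	// default 0 makes it obvious that the value must be supplied explicitly
	cmd.Flags().Uint64Var(&flagNumViewsInEpoch, "epoch-length", 0, "length of each epoch measured in views")
	cmd.Flags().Uint64Var(&flagNumViewsInStakingAuction, "epoch-staking-phase-length", 0, "length of the epoch staking phase measured in views")
	// the flags are required, so cobra rejects an invocation that omits them
	_ = cmd.MarkFlagRequired("epoch-length")
	_ = cmd.MarkFlagRequired("epoch-staking-phase-length")
	_ = cmd.Execute()
}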

kc1116 and others added 15 commits April 15, 2024 18:13
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
@kc1116 kc1116 added this pull request to the merge queue Apr 17, 2024
Merged via the queue into master with commit 9c4c4a3 Apr 17, 2024
55 checks passed
@kc1116 kc1116 deleted the khalil/6959-efm-recvery-epoch-data-generation branch April 17, 2024 15:46
log.Fatal().Err(cdcErr).Msg("failed to get dkg group key cadence string")
}
dkgPubKeys = append(dkgPubKeys, dkgGroupKeyCdc)
for _, id := range currentEpochIdentities {
@jordanschalm (Member) commented on May 2, 2024:

Sorry to comment on a closed PR. I noticed some sanity checks we should add here while reviewing onflow/flow-core-contracts#420.

We should check that currentEpochDKG.Size() == len(currentEpochIdentities.Filter(filter.HasRole(flow.RoleConsensus)))

We already check that there is a DKG key for every consensus node, but not that there is a consensus node for every DKG key.

Added a reminder to the design doc for this.
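
A minimal sketch of the suggested sanity check, with simplified stand-in types; the real code would compare currentEpochDKG.Size() against the number of consensus-role identities from the snapshot.

package main

import (
	"fmt"
	"log"
)

type identity struct {
	NodeID string
	Role   string
}

// checkDKGMatchesConsensusSet verifies that there is exactly one DKG
// participant per consensus node (and vice versa).
func checkDKGMatchesConsensusSet(dkgSize int, identities []identity) error {
	consensusCount := 0
	for _, id := range identities {
		if id.Role == "consensus" {
			consensusCount++
		}
	}
	if dkgSize != consensusCount {
		return fmt.Errorf("DKG participant count (%d) does not match consensus node count (%d)", dkgSize, consensusCount)
	}
	return nil
}

func main() {
	ids := []identity{{"a", "consensus"}, {"b", "collection"}, {"c", "consensus"}}
	if err := checkDKGMatchesConsensusSet(2, ids); err != nil {
		log.Fatal(err)
	}
	fmt.Println("DKG participants match the consensus committee")
}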
