
implement v1.2.0 optimistic sync tests #4174

Merged · 4 commits · Sep 27, 2022

Conversation

@tersec (Contributor) commented Sep 25, 2022

This also requires

  • implementing latest valid hash support, hitherto treated as a noncritical optimization (the payload-status field involved is sketched after this list)
  • adding merge (not just Bellatrix) support to the test block infrastructure
  • covering some additional LVH cases the EF consensus spec tests don't
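
For reference, the latest valid hash arrives as part of the engine API's payload status. A rough Nim sketch of that shape, based on the PayloadStatusV1 definition in the execution-apis specification quoted later in this thread; the Nim types and Option usage are illustrative stand-ins, not Nimbus's actual engine API bindings:

import std/options

type
  PayloadExecutionStatus = enum
    valid, invalid, syncing, accepted, invalid_block_hash

  PayloadStatusV1 = object
    status: PayloadExecutionStatus
    # Hash of the most recent VALID block in the branch defined by the payload
    # and its ancestors; absent/null when the EL cannot determine it.
    latestValidHash: Option[array[32, byte]]
    validationError: Option[string]

In the discussion below, "LVH" refers to this latestValidHash field.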

github-actions bot commented Sep 25, 2022

Unit Test Results

    9 files  ±0    675 suites  +9    19m 16s ⏱️ -5m 59s
1 993 tests  +5   1 844 ✔️ +3   149 💤 +2   0 ±0
8 105 runs   +9   7 932 ✔️ +7   173 💤 +2   0 ±0

Results for commit 840d097. Comparison against base commit 7ac95c6.

♻️ This comment has been updated with latest results.

@etan-status (Contributor) left a comment

Thanks for the added tests and for merge simulation in testblockutil!

beacon_chain/consensus_object_pools/blockchain_dag.nim (outdated, resolved)
@@ -1898,6 +1898,45 @@ proc updateHead*(
        dag.finalizedHead.blck.root, stateRoot, dag.finalizedHead.slot.epoch)
      dag.onFinHappened(dag, data)

proc getEarliestInvalidRoot*(
    dag: ChainDAGRef, initialSearchRoot: Eth2Digest, lvh: Eth2Digest,
Contributor:

lvh --> latestValidHash maybe? Could be confusing if revisited in a couple months.

@@ -1898,6 +1898,45 @@ proc updateHead*(
        dag.finalizedHead.blck.root, stateRoot, dag.finalizedHead.slot.epoch)
      dag.onFinHappened(dag, data)

proc getEarliestInvalidRoot*(
Contributor:

getEarliestInvalidHashBlockRoot or getDescendantOfLatestValidHash would be clearer. Or, at the very least, getEarliestInvalidBlockRoot to remove ambiguity with other roots e.g. state root

Contributor Author:

I'd want to use Root in some capacity here because, while it's dealing with the LVH (a hash, from the EL), what it returns is firmly in CL-land: an SSZ root. HashBlockRoot is a bit redundant in that sense, and confuses the boundary between the EL "hash" and the CL "root".

But yes, state root vs block root is a useful distinction to make, so getEarliestInvalidBlockRoot is a good change.

840d097

  var curBlck = dag.getBlockRef(initialSearchRoot).valueOr:
    # Being asked to traverse a chain which the DAG doesn't know about -- but
    # that'd imply the block's otherwise invalid for CL as well as EL.
    return static(default(Eth2Digest))
Contributor:

Hmm, would it be correct to return defaultEarliestInvalidRoot here as well?

If yes, the return type could be changed to Opt[Eth2Digest], and the caller could then do dag.getEarliestInvalidRoot(initialSearchRoot, lvh).get(defaultEarliestInvalidRoot). This would avoid polluting this function with defaultEarliestInvalidRoot.
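
To make the suggested calling pattern concrete, here is a minimal self-contained sketch, using std/options' Option as a stand-in for the Opt[Eth2Digest] return type mentioned above; the proc name, its boolean parameter, and the Eth2Digest stand-in are illustrative only:

import std/options

type Eth2Digest = object          # stand-in for the real SSZ digest type
  data: array[32, byte]

# Hypothetical signature: return none when the DAG doesn't know the chain,
# instead of baking a default value into the proc itself.
proc getEarliestInvalidBlockRoot(knownChain: bool): Option[Eth2Digest] =
  if knownChain:
    some(Eth2Digest())            # the backwards walk produced a root
  else:
    none(Eth2Digest)              # unknown chain: let the caller decide

# Caller side: fall back to a default without the proc knowing about it.
let defaultEarliestInvalidRoot = Eth2Digest()
let root = getEarliestInvalidBlockRoot(false).get(defaultEarliestInvalidRoot)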

Comment on lines 250 to 251
    self.attestationPool[].forkChoice.mark_root_invalid(
      earliestKnownInvalidRoot)
Contributor:

Hmm. Not sure whether this is correct in situations where the EL is restarted or changed to a different EL while Nimbus is running, or in multi-EL scenarios where the ELs may not have the same sync progress.

The previous mechanism of only marking the specific block invalid, then re-selecting a new head, doing fcu to it, and marking it invalid/valid one block at a time seems more robust in general, while not losing correctness (just a bit of latency), especially when considering how persistent an addUnviable can be (requiring a restart if done incorrectly due to an intermittent EL bug, for example).

We could shortcut the recovery mechanism, though, by using the getEarliestInvalidRoot for the next fcu instead. If that is indeed reported as invalid, then we don't need all the extra fcu calls for the intermediate blocks. However, if it is reported as valid, it just means that the EL was not synced as far as the DAG at the time it was asked for a verdict.

Contributor Author:

I'm fine with keeping the previous logic in place (plus mark_root_invalid), yes, but then it starts diverging from the optimistic sync tests. As much as feasible, I want to keep them aligned, so what's tested is what's running. The non-earliestInvalid versions in this PR are mostly kept as a fallback.

The previous mechanism still exists, too, and it's necessary in initial optimistic sync, where there can't be an LVH because there hasn't yet been a VALID block (here, LVH only reports actual-EL-VALID, not just not-INVALIDATED). So that recovery mechanism is still necessary.

Intermittent EL INVALIDATED bugs should be fixed. It seems unwise to contort Nimbus excessively to handle that "well", beyond what it already does: not persisting them, so that at least restarts clear them.

I'm not sure exactly what the multi-EL scenario will look like, but if they disagree with each other on this, well, that's not going to work out well.

But, I'm fine with keeping this unchanged in this PR and exploring this in a future PR, while the LVH handling remains in place to pass the EF optimistic sync tests.

Contributor Author:

313a35d reverts the LVH parts, while still adding mark_root_invalid.

@@ -442,6 +442,7 @@ func mark_root_invalid*(self: var ForkChoice, root: Eth2Digest) =
      self.backend.proto_array.nodes.offset
  if nodePhysicalIdx < self.backend.proto_array.nodes.buf.len:
    self.backend.proto_array.nodes.buf[nodePhysicalIdx].invalid = true
    self.backend.proto_array.propagateInvalidity(nodePhysicalIdx)
Contributor:

Shouldn't this already be covered for the purposes of scoring / selecting heads? nodeLeadsToViableHead should already be false if the current block is invalid, treating the entire branch as invalid?

Contributor Author:

The updating is done in maybeUpdateBestChildAndDescendant:

func maybeUpdateBestChildAndDescendant(self: var ProtoArray,
                                       parentIdx: Index,
                                       childIdx: Index): FcResult[void] =
  ## Observe the parent at `parentIdx` with respect to the child at `childIdx` and
  ## potentially modify the `parent.bestChild` and `parent.bestDescendant` values
  ##
  ## There are four scenarios:
  ##
  ## 1. The child is already the best child
  ##    but it's now invalid due to a FFG change and should be removed.
  ## 2. The child is already the best child
  ##    and the parent is updated with the new best descendant
  ## 3. The child is not the best child but becomes the best child
  ## 4. The child is not the best child and does not become the best child
  let child = self.nodes[childIdx]
  if child.isNone():
    return err ForkChoiceError(
      kind: fcInvalidNodeIndex,
      index: childIdx)
  let parent = self.nodes[parentIdx]
  if parent.isNone():
    return err ForkChoiceError(
      kind: fcInvalidNodeIndex,
      index: parentIdx)
  let childLeadsToViableHead = ? self.nodeLeadsToViableHead(child.get())

This is called by applyScoreChanges, which iterates from descendants to ancestors:

  for nodePhysicalIdx in countdown(self.nodes.len - 1, 0):
    if node.root.isZero:
      continue
    if node.parent.isSome():
      let parentLogicalIdx = node.parent.unsafeGet()
      let parentPhysicalIdx = parentLogicalIdx - self.nodes.offset
      if parentPhysicalIdx < 0:
        # Orphan
        continue
      let nodeLogicalIdx = nodePhysicalIdx + self.nodes.offset
      ? self.maybeUpdateBestChildAndDescendant(parentLogicalIdx, nodeLogicalIdx)

That is correct for the scores as such -- it allows quick updates. However, invalidity propagates in the other direction, from ancestors to descendants, unlike the scores, and there's no existing mechanism in fork choice to propagate information that way.
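
To illustrate the ancestor-to-descendant direction, here is a self-contained sketch of the kind of forward pass propagateInvalidity performs, using a simplified node type rather than the real ProtoArray (field names and the flat seq are illustrative, and the offset bookkeeping is omitted):

type
  Node = object
    parent: int      # index of the parent node, or -1 for the root
    invalid: bool

proc propagateInvalidity(nodes: var seq[Node], startIdx: int) =
  # Nodes are appended after their parents, so every descendant of `startIdx`
  # sits at a higher index; a single forward scan reaches them all.
  for i in (startIdx + 1) ..< nodes.len:
    let p = nodes[i].parent
    if p >= 0 and nodes[p].invalid:
      nodes[i].invalid = true

var chain = @[Node(parent: -1), Node(parent: 0), Node(parent: 1), Node(parent: 2)]
chain[1].invalid = true            # e.g. mark_root_invalid hit node 1
chain.propagateInvalidity(1)       # its descendants (2 and 3) become invalid too
doAssert chain[2].invalid and chain[3].invalid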

Comment on lines +580 to +581
      if self.nodes.buf[parentPhysicalIdx].invalid:
        self.nodes.buf[nodePhysicalIdx].invalid = true
Contributor:

What about new descendants of an invalid block that are being added after propagateInvalidity was originally called?

Contributor Author:

In theory, they shouldn't be added in the first place, as they'd be rejected as unviable by the DAG/quarantine.

Contributor Author:

The other fallback here is that they'll ask the EL again if someone does try to build on them, and get that block instead. Since by then things should have mostly caught up to head, it's more likely to be the only remaining set of blocks on which building is still happening, and therefore should iteratively converge.

      break
    curBlck = curBlck.parent

  curBlck.root
Contributor:

This is not necessarily correct.

When someone switches from a different CL to Nimbus but retains their EL, the EL's lvh may be far ahead of the CL's DAG. This means that none of our DAG blocks corresponds to lvh. This implementation would then return the first descendant of dag.finalizedHead (or dag.finalizedHead itself if no descendants were yet added).

If lvh is not found, it may be better to return defaultEarliestInvalidRoot or ZERO_HASH instead.

Contributor:

Also, in extended periods of non-finality, the linear scan may become expensive if performed repeatedly. One way to accelerate it could be to stop the scan at the latest known-VALID block instead of going all the way back to finalized, or to impose a constant maximum search depth.
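
A minimal sketch of the suggestions in the two comments above: walk back from the search root, stop at the LVH, and bail out to a fallback when the LVH isn't found within a bounded depth. The Digest and Blck types, the maxSearchDepth constant, and the proc name are illustrative stand-ins, not the actual ChainDAGRef walk:

type
  Digest = int                     # stand-in for Eth2Digest / execution block hash
  Blck = ref object                # stand-in for a BlockRef-like node
    root, executionHash: Digest
    parent: Blck

const maxSearchDepth = 64          # illustrative bound, not a real constant

proc earliestInvalidBlockRoot(start: Blck, lvh, fallback: Digest): Digest =
  ## Walk back from `start`; the first block whose parent carries `lvh` is the
  ## earliest invalid one. Give up with `fallback` if `lvh` isn't found within
  ## `maxSearchDepth` blocks or before running out of ancestors.
  var cur = start
  for _ in 0 ..< maxSearchDepth:
    if cur.parent == nil:
      return fallback              # oldest known block reached without seeing lvh
    if cur.parent.executionHash == lvh:
      return cur.root              # parent is the latest valid block
    cur = cur.parent
  fallback                         # depth limit hit without finding lvh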

Contributor Author:

#4205

In theory, the latest known-VALID block is exactly where this will stop. If it doesn't, that's an EL bug. EL bugs shouldn't crash Nimbus or result in durably incorrect behavior, but they can trigger suboptimally slow behavior. It's meant to be a mutually trusted system, and I'm wary of adding complication to Nimbus to handle buggy ELs, when that complication can carry its own risks.

In particular, https://github.com/ethereum/execution-apis/blob/main/src/engine/specification.md#payloadstatusv1 defines:

latestValidHash: DATA|null, 32 Bytes - the hash of the most recent valid block in the branch defined by payload and its ancestors

This also addresses the concern about the EL's lvh being far ahead in the new-CL-database/existing-EL-database case: the lvh is relative to the payload provided by the CL. Here, the failure mode for a buggy EL is that Nimbus needs to be restarted, which seems reasonable.

As for cost in general, this should be a roughly one-time thing per invalid branch -- once a branch is invalidated, the backwards search shouldn't run again for it, so it's potentially better to do it once than multiple times, along the lines of how pruning huge swaths of finalized blocks after a long-unfinalizing network finalizes is expensive, but one-time. Doing halfway versions introduces less-well-defined state and doesn't necessarily save time overall, in a throughput sense.

There is an issue here where the quarantine unviables are sometimes checked only as direct parents rather than ancestors, which is an argument for not using the LVH-based search for adding to quarantine.

The other aspect is that it should rediscover all of this by just asking the EL, if it misses something. So everything here should be considered a cache.

For the moment, this is all only used in tests, so another approach is to move this out of blockchain_dag and into tests/, to ensure that it doesn't accidentally get used in non-test code.

While the initial intent, and still one that seems broadly ideal, is to maximally match test behavior with non-test behavior (otherwise, what is the test testing?), the status quo here is that this LVH infrastructure is test-only, so the code can/should reflect that.
