implement v1.2.0 optimistic sync tests #4174
Conversation
Thanks for the added tests and for the merge simulation in testblockutil!
@@ -1898,6 +1898,45 @@ proc updateHead*(
    dag.finalizedHead.blck.root, stateRoot, dag.finalizedHead.slot.epoch)
  dag.onFinHappened(dag, data)

proc getEarliestInvalidRoot*(
    dag: ChainDAGRef, initialSearchRoot: Eth2Digest, lvh: Eth2Digest,
lvh --> latestValidHash, maybe? Could be confusing if revisited in a couple of months.
@@ -1898,6 +1898,45 @@ proc updateHead*(
    dag.finalizedHead.blck.root, stateRoot, dag.finalizedHead.slot.epoch)
  dag.onFinHappened(dag, data)

proc getEarliestInvalidRoot*(
getEarliestInvalidHashBlockRoot or getDescendantOfLatestValidHash would be clearer. Or, at the very least, getEarliestInvalidBlockRoot, to remove ambiguity with other roots, e.g. state root.
I'd want to use Root in some capacity here, because while it's dealing with the LVH (a hash, from the EL), what it's returning is firmly in CL-land: an SSZ root. HashBlockRoot is a bit redundant in that sense, and confuses the boundary between EL "hash" and CL "root". But yes, state vs block root is a useful distinction to make, so getEarliestInvalidBlockRoot is a good change.
var curBlck = dag.getBlockRef(initialSearchRoot).valueOr:
  # Being asked to traverse a chain which the DAG doesn't know about -- but
  # that'd imply the block's otherwise invalid for CL as well as EL.
  return static(default(Eth2Digest))
Hmm, would it be correct to return defaultEarliestInvalidRoot here as well? If yes, the return type could be changed to Opt[Eth2Digest], and the caller could then do dag.getEarliestInvalidRoot(initialSearchRoot, lvh).get(defaultEarliestInvalidRoot). This would avoid polluting this function with defaultEarliestInvalidRoot.
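For illustration, a minimal sketch of that shape; the traversal itself stays as-is, and only the unknown-root path and the return type change (anything not already named in the PR is an assumption for the example):

proc getEarliestInvalidRoot*(
    dag: ChainDAGRef, initialSearchRoot: Eth2Digest,
    lvh: Eth2Digest): Opt[Eth2Digest] =
  var curBlck = dag.getBlockRef(initialSearchRoot).valueOr:
    # Unknown chain: leave the choice of fallback value to the caller.
    return Opt.none(Eth2Digest)
  # ... existing ancestor walk toward lvh, unchanged ...
  Opt.some(curBlck.root)

The caller then keeps the default out of the DAG-level helper:

let earliestInvalid = dag.getEarliestInvalidRoot(initialSearchRoot, lvh).get(
  defaultEarliestInvalidRoot)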
self.attestationPool[].forkChoice.mark_root_invalid(
  earliestKnownInvalidRoot)
Hmm. Not sure whether this is correct in situations where the EL is restarted or changed to a different EL while Nimbus is running, or in multi-EL scenarios which may not have the same sync progress.
The previous mechanism of only marking the specific block invalid, then re-selecting a new head, doing fcu to it, and marking it invalid/valid one at a time seems more robust in general, while not losing correctness (just a bit of latency), especially considering how persistent an addUnviable can be (requiring a restart if done incorrectly due to an intermittent EL bug, for example).
We could shortcut the recovery mechanism, though, by using getEarliestInvalidRoot for the next fcu instead. If that is indeed reported as invalid, then we don't need all the extra fcu calls for the intermediate blocks. However, if it is reported as valid, it just means that the EL was not synced as far as the DAG at the time it was asked for a verdict.
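Roughly, that shortcut could look like the following control flow; runForkchoiceUpdated and markBranchInvalid are hypothetical stand-ins for whatever the consensus manager actually exposes, so this is a sketch of the idea rather than working Nimbus code:

proc tryShortcutRecovery(dag: ChainDAGRef, initialSearchRoot: Eth2Digest,
                         lvh: Eth2Digest) {.async.} =
  # Ask the EL directly about the earliest-invalid candidate.
  let candidate = dag.getEarliestInvalidRoot(initialSearchRoot, lvh)
  let status = await runForkchoiceUpdated(candidate)      # assumed helper
  if status == PayloadExecutionStatus.invalid:
    # EL confirms the verdict: everything from `candidate` onward is invalid,
    # so the per-block fcu walk over the intermediate blocks can be skipped.
    markBranchInvalid(candidate)                           # assumed helper
  else:
    # VALID or SYNCING: the EL simply wasn't synced as far as the DAG when it
    # was asked; fall back to the existing one-block-at-a-time recovery.
    discard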
I'm fine with keeping the previous logic in place (plus mark_root_invalid), yes, but then it starts diverging from the optimistic sync tests. As much as feasible, I want to keep them aligned, so what's tested is what's running. The non-earliestInvalid versions are mostly kept in this PR as a fallback.
The previous mechanism still exists, too, and it's necessary in initial optimistic sync, where there can't be an LVH because there hasn't yet been a VALID block (here, LVH only reports actual-EL-VALID, not just not-INVALIDATED). So that recovery mechanism is still necessary.
Intermittent EL INVALIDATED bugs should be fixed. It seems unwise to contort Nimbus excessively to handle that "well", beyond, as it does, not persisting them so that at least restarts clear them.
I'm not sure exactly what the multi-EL scenario will look like, but if the ELs disagree with each other on this, well, that's not going to work out well.
But I'm fine with keeping this unchanged in this PR and exploring it in a future PR, while the LVH handling remains in place to pass the EF optimistic sync tests.
313a35d reverts the LVH parts, while still adding mark_root_invalid.
@@ -442,6 +442,7 @@ func mark_root_invalid*(self: var ForkChoice, root: Eth2Digest) =
    self.backend.proto_array.nodes.offset
  if nodePhysicalIdx < self.backend.proto_array.nodes.buf.len:
    self.backend.proto_array.nodes.buf[nodePhysicalIdx].invalid = true
    self.backend.proto_array.propagateInvalidity(nodePhysicalIdx)
Shouldn't this already be covered for the purpose of scoring / selecting heads? nodeLeadsToViableHead should already be false if the current block is invalid, treating the entire branch as invalid?
The updating is done in maybeUpdateBestChildAndDescendant (nimbus-eth2/beacon_chain/fork_choice/proto_array.nim, lines 424 to 451 in ad286b9):
func maybeUpdateBestChildAndDescendant(self: var ProtoArray,
                                       parentIdx: Index,
                                       childIdx: Index): FcResult[void] =
  ## Observe the parent at `parentIdx` with respect to the child at `childIdx` and
  ## potentially modify the `parent.bestChild` and `parent.bestDescendant` values
  ##
  ## There are four scenarios:
  ##
  ## 1. The child is already the best child
  ##    but it's now invalid due to a FFG change and should be removed.
  ## 2. The child is already the best child
  ##    and the parent is updated with the new best descendant
  ## 3. The child is not the best child but becomes the best child
  ## 4. The child is not the best child and does not become the best child

  let child = self.nodes[childIdx]
  if child.isNone():
    return err ForkChoiceError(
      kind: fcInvalidNodeIndex,
      index: childIdx)

  let parent = self.nodes[parentIdx]
  if parent.isNone():
    return err ForkChoiceError(
      kind: fcInvalidNodeIndex,
      index: parentIdx)

  let childLeadsToViableHead = ? self.nodeLeadsToViableHead(child.get())
Which is called by applyScoreChanges, which iterates from descendants to ancestors (nimbus-eth2/beacon_chain/fork_choice/proto_array.nim, lines 270 to 282 in ad286b9):
for nodePhysicalIdx in countdown(self.nodes.len - 1, 0):
  if node.root.isZero:
    continue

  if node.parent.isSome():
    let parentLogicalIdx = node.parent.unsafeGet()
    let parentPhysicalIdx = parentLogicalIdx - self.nodes.offset
    if parentPhysicalIdx < 0:
      # Orphan
      continue

    let nodeLogicalIdx = nodePhysicalIdx + self.nodes.offset
    ? self.maybeUpdateBestChildAndDescendant(parentLogicalIdx, nodeLogicalIdx)
This is correct for the scores as such -- it allows quick updates. However, invalidity propagates in the other direction, from ancestors to descendants, unlike the scores, and there's no existing mechanism in fork choice to propagate information from ancestors to descendants.
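Concretely, the missing direction is a forward pass over the physical indices, so that each node picks up its parent's final invalid flag. A sketch of that shape, assuming (as elsewhere in proto_array) that parents always precede their children in the buffer:

func propagateInvalidity*(self: var ProtoArray, startPhysicalIdx: Index) =
  # Ancestor-to-descendant sweep: because nodes are stored in insertion order,
  # a single forward pass is enough to copy `invalid` from parents to children.
  for nodePhysicalIdx in startPhysicalIdx + 1 ..< self.nodes.buf.len:
    let nodeParent = self.nodes.buf[nodePhysicalIdx].parent
    if nodeParent.isNone():
      continue
    let
      parentLogicalIdx = nodeParent.unsafeGet()
      parentPhysicalIdx = parentLogicalIdx - self.nodes.offset
    if parentPhysicalIdx < 0 or parentPhysicalIdx >= self.nodes.buf.len:
      # Orphan, or parent not in the (pruned) buffer.
      continue
    if self.nodes.buf[parentPhysicalIdx].invalid:
      self.nodes.buf[nodePhysicalIdx].invalid = true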
if self.nodes.buf[parentPhysicalIdx].invalid:
  self.nodes.buf[nodePhysicalIdx].invalid = true
What about new descendants of an invalid block that are being added after propagateInvalidity was originally called?
In theory, they shouldn't be added in the first place, as they'd be rejected as unviable by the DAG/quarantine.
The other fallback here is that they'll ask the EL again if someone does try to build on them, and get that block instead. Since by then things should have mostly caught up to head, it's more likely to be the only remaining set of blocks on which building is still happening, and therefore should iteratively converge.
    break
  curBlck = curBlck.parent

curBlck.root
This is not necessarily correct.
When someone switches from a different CL to Nimbus but retains their EL, the EL's lvh may be far in the future. This means that none of our DAG blocks contains lvh. This implementation would then return the first descendant of dag.finalizedHead (or dag.finalizedHead itself if no descendants were yet added).
If lvh is not found, maybe better to return defaultEarliestInvalidRoot or ZERO_HASH instead.
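A sketch of that guard, again with an assumed executionBlockRoot field on BlockRef: only return the matched block's child when lvh was actually found, and otherwise the sentinel:

proc getEarliestInvalidRoot*(
    dag: ChainDAGRef, initialSearchRoot: Eth2Digest,
    lvh: Eth2Digest): Eth2Digest =
  var curBlck = dag.getBlockRef(initialSearchRoot).valueOr:
    return defaultEarliestInvalidRoot
  var foundLvh = false
  while curBlck.parent != nil:
    if curBlck.parent.executionBlockRoot == lvh:   # assumed field name
      foundLvh = true
      break
    curBlck = curBlck.parent
  if not foundLvh:
    # lvh is nowhere in our DAG (e.g. an EL database ahead of a freshly
    # synced CL); don't blame an arbitrary descendant of finalizedHead.
    return defaultEarliestInvalidRoot
  curBlck.root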
Also, in extended periods of non-finality, the linear scan may become expensive if performed repeatedly.
One way to accelerate it could be to stop the scan at the latest known-VALID block, instead of going all the way back to finalized, or to impose a constant maximum search depth.
In theory, the latest known-VALID block is exactly where this will stop. If it doesn't, that's an EL bug. EL bugs shouldn't crash Nimbus, or result in durably incorrect behavior, but EL bugs can trigger suboptimally slow behavior. It's meant to be a mutually trusted system, and I'm wary of adding complication to Nimbus to handle buggy ELs, when that complication can carry its own risks.
In particular, https://github.com/ethereum/execution-apis/blob/main/src/engine/specification.md#payloadstatusv1 defines:
latestValidHash: DATA|null, 32 Bytes - the hash of the most recent valid block in the branch defined by payload and its ancestors
This also addresses the concern about the EL's lvh being far in the future in the new-CL-database/existing-EL-database case: the lvh is relative to the payload provided by the CL. Here, the failure mode for a buggy EL is that Nimbus needs to be restarted, which seems reasonable.
As far as cost in general, this should be a relatively one-time thing per invalid branch -- once invalidated, it shouldn't do that backwards search again, so it's potentially better to do it once than multiple times, along the lines of how pruning huge swaths of finalized blocks after a long-unfinalizing network finalizes is expensive, but one-time. Doing halfway versions introduces less-well-defined state and doesn't necessarily save time overall, in a throughput sense.
There is an issue here where the quarantine unviables are sometimes checked only as direct parents rather than ancestors, which is an argument for not using the LVH-based search for adding to quarantine.
The other aspect is that it should rediscover all of this by just asking the EL, if it misses something. So everything here should be considered a cache.
For the moment, this is all only used in tests, so another approach is to move it out of blockchain_dag and into tests/, to ensure that it doesn't accidentally get used in blockchain_dag.
While the initial intent, and still one that seems broadly ideal, is to maximally match test behavior with non-test behavior (otherwise, what is the test testing?), the status quo here is that this LVH infrastructure is test-only, so it can/should reflect that.
Co-authored-by: Etan Kissling <etan@status.im>
…rliestInvalidRoot` -> `defaultEarliestInvalidBlockRoot`
This also requires