implement v1.2.0 optimistic sync tests #4174
Conversation
Thanks for the added tests and for the merge simulation in testblockutil!
@@ -1898,6 +1898,45 @@ proc updateHead*(
    dag.finalizedHead.blck.root, stateRoot, dag.finalizedHead.slot.epoch)
  dag.onFinHappened(dag, data)

proc getEarliestInvalidRoot*(
    dag: ChainDAGRef, initialSearchRoot: Eth2Digest, lvh: Eth2Digest,
lvh --> latestValidHash, maybe? Could be confusing if revisited in a couple of months.
@@ -1898,6 +1898,45 @@ proc updateHead*(
    dag.finalizedHead.blck.root, stateRoot, dag.finalizedHead.slot.epoch)
  dag.onFinHappened(dag, data)

proc getEarliestInvalidRoot*(
getEarliestInvalidHashBlockRoot or getDescendantOfLatestValidHash would be clearer. Or, at the very least, getEarliestInvalidBlockRoot, to remove ambiguity with other roots, e.g. state root.
I'd want to use Root in some capacity here, because while it's dealing with the LVH (a hash, from the EL), what it's returning is firmly in CL-land: an SSZ root. HashBlockRoot is a bit redundant in that sense, and confuses the boundary between EL "hash" and CL "root". But yes, state vs block root is a useful distinction to make, so getEarliestInvalidBlockRoot is a good change.
var curBlck = dag.getBlockRef(initialSearchRoot).valueOr:
  # Being asked to traverse a chain which the DAG doesn't know about -- but
  # that'd imply the block's otherwise invalid for CL as well as EL.
  return static(default(Eth2Digest))
Hmm, would it be correct to return defaultEarliestInvalidRoot here as well? If yes, the return type could be changed to Opt[Eth2Digest], and the caller could then do dag.getEarliestInvalidRoot(initialSearchRoot, lvh).get(defaultEarliestInvalidRoot). This would avoid polluting this function with defaultEarliestInvalidRoot.
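For illustration, a minimal sketch of that shape; the traversal itself stays as-is, and only the unknown-root path and the return type change (anything not already named in the PR is an assumption for the example):

proc getEarliestInvalidRoot*(
    dag: ChainDAGRef, initialSearchRoot: Eth2Digest,
    lvh: Eth2Digest): Opt[Eth2Digest] =
  var curBlck = dag.getBlockRef(initialSearchRoot).valueOr:
    # Unknown chain: leave the choice of fallback value to the caller.
    return Opt.none(Eth2Digest)
  # ... existing ancestor walk toward lvh, unchanged ...
  Opt.some(curBlck.root)

The caller then keeps the default out of the DAG-level helper:

let earliestInvalid = dag.getEarliestInvalidRoot(initialSearchRoot, lvh).get(
  defaultEarliestInvalidRoot)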
self.attestationPool[].forkChoice.mark_root_invalid(
  earliestKnownInvalidRoot)
Hmm. Not sure whether this is correct in situations where the EL is restarted or changed to a different EL while Nimbus is running, or in multi-EL scenarios which may not have the same sync progress.
The previous mechanism of only marking the specific block invalid, then re-selecting a new head, doing fcu to it, and marking it invalid/valid one at a time seems more robust in general, while not losing correctness (just a bit of latency), especially considering how persistent an addUnviable can be (requiring a restart if done incorrectly due to an intermittent EL bug, for example).
We could shortcut the recovery mechanism, though, by using getEarliestInvalidRoot for the next fcu instead. If that is indeed reported as invalid, then we don't need all the extra fcu calls for the intermediate blocks. However, if it is reported as valid, it just means that the EL was not synced as far as the DAG at the time it was asked for a verdict.
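Roughly, that shortcut could look like the following control flow; runForkchoiceUpdated and markBranchInvalid are hypothetical stand-ins for whatever the consensus manager actually exposes, so this is a sketch of the idea rather than working Nimbus code:

proc tryShortcutRecovery(dag: ChainDAGRef, initialSearchRoot: Eth2Digest,
                         lvh: Eth2Digest) {.async.} =
  # Ask the EL directly about the earliest-invalid candidate.
  let candidate = dag.getEarliestInvalidRoot(initialSearchRoot, lvh)
  let status = await runForkchoiceUpdated(candidate)      # assumed helper
  if status == PayloadExecutionStatus.invalid:
    # EL confirms the verdict: everything from `candidate` onward is invalid,
    # so the per-block fcu walk over the intermediate blocks can be skipped.
    markBranchInvalid(candidate)                           # assumed helper
  else:
    # VALID or SYNCING: the EL simply wasn't synced as far as the DAG when it
    # was asked; fall back to the existing one-block-at-a-time recovery.
    discard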
I'm fine with keeping the previous logic in place (plus mark_root_invalid), yes, but then it starts diverging from the optimistic sync tests. As much as feasible, I want to keep them aligned, so what's tested is what's running. The non-earliestInvalid versions are mostly kept in this PR as a fallback.
The previous mechanism still exists, too, and it's necessary in initial optimistic sync, where there can't be an LVH because there hasn't yet been a VALID block (here, LVH only reports actual-EL-VALID, not just not-INVALIDATED). So that recovery mechanism is still necessary.
Intermittent EL INVALIDATED bugs should be fixed. It seems unwise to contort Nimbus excessively to handle that "well", beyond, as it does, not persisting them so that at least restarts clear them.
I'm not sure exactly what the multi-EL scenario will look like, but if the ELs disagree with each other on this, well, that's not going to work out well.
But I'm fine with keeping this unchanged in this PR and exploring it in a future PR, while the LVH handling remains in place to pass the EF optimistic sync tests.
313a35d reverts the LVH parts, while still adding mark_root_invalid.
@@ -442,6 +442,7 @@ func mark_root_invalid*(self: var ForkChoice, root: Eth2Digest) =
    self.backend.proto_array.nodes.offset
  if nodePhysicalIdx < self.backend.proto_array.nodes.buf.len:
    self.backend.proto_array.nodes.buf[nodePhysicalIdx].invalid = true
    self.backend.proto_array.propagateInvalidity(nodePhysicalIdx)
Shouldn't this already be covered for the purpose of scoring / selecting heads? nodeLeadsToViableHead should already be false if the current block is invalid, treating the entire branch as invalid?
The updating is done in maybeUpdateBestChildAndDescendant (nimbus-eth2/beacon_chain/fork_choice/proto_array.nim, lines 424 to 451 in ad286b9):
func maybeUpdateBestChildAndDescendant(self: var ProtoArray,
                                       parentIdx: Index,
                                       childIdx: Index): FcResult[void] =
  ## Observe the parent at `parentIdx` with respect to the child at `childIdx` and
  ## potentially modify the `parent.bestChild` and `parent.bestDescendant` values
  ##
  ## There are four scenarios:
  ##
  ## 1. The child is already the best child
  ##    but it's now invalid due to a FFG change and should be removed.
  ## 2. The child is already the best child
  ##    and the parent is updated with the new best descendant
  ## 3. The child is not the best child but becomes the best child
  ## 4. The child is not the best child and does not become the best child

  let child = self.nodes[childIdx]
  if child.isNone():
    return err ForkChoiceError(
      kind: fcInvalidNodeIndex,
      index: childIdx)

  let parent = self.nodes[parentIdx]
  if parent.isNone():
    return err ForkChoiceError(
      kind: fcInvalidNodeIndex,
      index: parentIdx)

  let childLeadsToViableHead = ? self.nodeLeadsToViableHead(child.get())
Which is called by applyScoreChanges, which iterates from descendants to ancestors (nimbus-eth2/beacon_chain/fork_choice/proto_array.nim, lines 270 to 282 in ad286b9):
for nodePhysicalIdx in countdown(self.nodes.len - 1, 0):
  if node.root.isZero:
    continue

  if node.parent.isSome():
    let parentLogicalIdx = node.parent.unsafeGet()
    let parentPhysicalIdx = parentLogicalIdx - self.nodes.offset
    if parentPhysicalIdx < 0:
      # Orphan
      continue

    let nodeLogicalIdx = nodePhysicalIdx + self.nodes.offset
    ? self.maybeUpdateBestChildAndDescendant(parentLogicalIdx, nodeLogicalIdx)
This is correct for the scores as such -- it allows quick updates. However, invalidity propagates in the other direction, from ancestors to descendants, unlike the scores, and there's no existing mechanism in fork choice to propagate information from ancestors to descendants.
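Concretely, the missing direction is a forward pass over the physical indices, so that each node picks up its parent's final invalid flag. A sketch of that shape, assuming (as elsewhere in proto_array) that parents always precede their children in the buffer:

func propagateInvalidity*(self: var ProtoArray, startPhysicalIdx: Index) =
  # Ancestor-to-descendant sweep: because nodes are stored in insertion order,
  # a single forward pass is enough to copy `invalid` from parents to children.
  for nodePhysicalIdx in startPhysicalIdx + 1 ..< self.nodes.buf.len:
    let nodeParent = self.nodes.buf[nodePhysicalIdx].parent
    if nodeParent.isNone():
      continue
    let
      parentLogicalIdx = nodeParent.unsafeGet()
      parentPhysicalIdx = parentLogicalIdx - self.nodes.offset
    if parentPhysicalIdx < 0 or parentPhysicalIdx >= self.nodes.buf.len:
      # Orphan, or parent not in the (pruned) buffer.
      continue
    if self.nodes.buf[parentPhysicalIdx].invalid:
      self.nodes.buf[nodePhysicalIdx].invalid = true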
if self.nodes.buf[parentPhysicalIdx].invalid:
  self.nodes.buf[nodePhysicalIdx].invalid = true
What about new descendants of an invalid block that are being added after propagateInvalidity was originally called?
In theory, they shouldn't be added in the first place, as they'd be rejected as unviable by the DAG/quarantine.
The other fallback here is that they'll ask the EL again if someone does try to build on them, and get that block instead. Since by then things should have mostly caught up to head, it's more likely to be the only remaining set of blocks on which building is still happening, and therefore should iteratively converge.
    break
  curBlck = curBlck.parent

curBlck.root
This is not necessarily correct.
When someone switches from a different CL to Nimbus but retains their EL, the EL's lvh may be far in the future. This means that none of our DAG blocks contains lvh. This implementation would then return the first descendant of dag.finalizedHead (or dag.finalizedHead itself if no descendants were yet added).
If lvh is not found, maybe better to return defaultEarliestInvalidRoot or ZERO_HASH instead.
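A sketch of that guard, again with an assumed executionBlockRoot field on BlockRef: only return the matched block's child when lvh was actually found, and otherwise the sentinel:

proc getEarliestInvalidRoot*(
    dag: ChainDAGRef, initialSearchRoot: Eth2Digest,
    lvh: Eth2Digest): Eth2Digest =
  var curBlck = dag.getBlockRef(initialSearchRoot).valueOr:
    return defaultEarliestInvalidRoot
  var foundLvh = false
  while curBlck.parent != nil:
    if curBlck.parent.executionBlockRoot == lvh:   # assumed field name
      foundLvh = true
      break
    curBlck = curBlck.parent
  if not foundLvh:
    # lvh is nowhere in our DAG (e.g. an EL database ahead of a freshly
    # synced CL); don't blame an arbitrary descendant of finalizedHead.
    return defaultEarliestInvalidRoot
  curBlck.root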
Also, in extended periods of non-finality, the linear scan may become expensive if performed repeatedly.
One way to accelerate it could be to stop the scan at the latest known-VALID block, instead of going all the way back to finalized, or to impose a constant maximum search depth.
In theory, the latest known-VALID block is exactly where this will stop. If it doesn't, that's an EL bug. EL bugs shouldn't crash Nimbus, or result in durably incorrect behavior, but EL bugs can trigger suboptimally slow behavior. It's meant to be a mutually trusted system, and I'm wary of adding complication to Nimbus to handle buggy ELs, when that complication can carry its own risks.
In particular, https://github.com/ethereum/execution-apis/blob/main/src/engine/specification.md#payloadstatusv1 defines:
latestValidHash: DATA|null, 32 Bytes - the hash of the most recent valid block in the branch defined by payload and its ancestors
This also addresses the concern about the EL's lvh being far in the future in the new-CL-database/existing-EL-database case: the lvh is relative to the payload provided by the CL. Here, the failure mode for a buggy EL is that Nimbus needs to be restarted, which seems reasonable.
As far as cost in general, this should be a relatively one-time thing per invalid branch -- once invalidated, it shouldn't do that backwards search again, so it's potentially better to do it once than multiple times, along the lines of how pruning huge swaths of finalized blocks after a long-unfinalizing network finalizes is expensive, but one-time. Doing halfway versions introduces less-well-defined state and doesn't necessarily save time overall, in a throughput sense.
There is an issue here where the quarantine unviables are sometimes checked only as direct parents rather than ancestors, which is an argument for not using the LVH-based search for adding to quarantine.
The other aspect is that it should rediscover all of this by just asking the EL, if it misses something. So everything here should be considered a cache.
For the moment, this is all only used in tests, so another approach is to move it out of blockchain_dag and into tests/, to ensure that it doesn't accidentally get used in blockchain_dag.
While the initial intent, and still one that seems broadly ideal, is to maximally match test behavior with non-test behavior (otherwise, what is the test testing?), the status quo here is that this LVH infrastructure is test-only, so it can/should reflect that.
Co-authored-by: Etan Kissling <etan@status.im>
…rliestInvalidRoot` -> `defaultEarliestInvalidBlockRoot`
This also requires