simulators/ethereum/engine: Re-org to Chain with Missing Invalid Payload #526

marioevz · 2022-04-25T16:37:38Z

This PR implements test case where the CL attempts to re-org into a side chain with unknown payloads to the EL.
The test first generates a side chain with N payloads, which at some point contains an invalid payload, but the side chain remains unknown to the EL:

CommonAncestor◄─▲── P1 ◄── P2  ◄─ P3  ◄─ ... ◄─ Pn
                │
                └── P1' ◄─ P2' ◄─ P3' ◄─ ... ◄─ Pn'

After Pn, the CL performs an fcU(Pn') to the head of the side chain.
Then the CL proceeds to reveal the remaining payloads in the following order:

newPayload(Pn')
...
newPayload(P2')
newPayload(P1')

Currently two scenarios are implemented:

P1' is invalid, and since CommonAncestor is known to the EL, the response to newPayload(P1') should be INVALID with latestValidHash==CommonAncestor.Hash.
P3' is invalid and the response to newPayload(P3') should be ACCEPTED with latestValidHash==nil because when the payload is received there is not enough information to invalidate the payload.

Current outcome:

Geth: Correctly returns INVALID on the first scenario, but returns an incorrect LVH with latestValidHash==Pn.Hash
Nethermind: Pass

The PR is open for discussion and suggestions.

cc @mkalinin, @MarekM25, @MariusVanDerWijden

mkalinin · 2022-04-27T15:43:12Z

IMO, the following order of message should better reflect a real CL flow:

newPayload(P1')
...
newPayload(Pn')
forkchoiceUpdated(Pn')

In this case if newPayload(P1') can't be processed by EL because parent state is unavailable the response will be ACCEPTED, then forkchoiceUpdated(Pn') will be responded with SYNCING which eventually turns CL into optimistic sync mode.

Inducing this scenario requires CommonAncestor state prune, i.e. Pn chain (canonical chain) should be long enough to induce pruning on EL side while Pn' chain may be shorter. How difficult would it be to re-produce this scenario in tests?

We may also skip newPayload(P1') and then make EL to pull P1' from the network after forkchoiceUpdated(Pn') call but there will have to be a node that serves invalid execution blocks if P1' is invalid, but we may make P2' invalid instead to avoid this latter requirement.

marioevz · 2022-04-29T18:22:09Z

Thanks for the feedback @mkalinin , I think I can do some experiments on how many blocks do we need to advance on the canonical chain before CommonAncestor's state is pruned, because I assume this is implementation dependent.

I also like the idea of the client having to pull P1' from the network.

I will start working on this update.

marioevz · 2022-05-03T00:42:58Z

Hi @mkalinin, I ran the experiment of doing a long canonical chain to attempt to induce a state prune of the commonAncestor with no success. This was the procedure:

Canonical chain with a length of 2000 payloads
Each payload filled to target gas limit with contract creation transactions (big zero-filled contracts)
Side chain contains invalid payload P2'
Send newPayload(P1'), and wait for ACCEPTED response

Geth response to newPayload(P1') was VALID after multiple attempts, so it doesn't seem feasible in this test environment to try to induce state pruning, at least with default configuration.

Let me know what you think about this scenario, I think the original test case is still valuable even without state pruning (latestValidHash verification was valuable to find an issue in geth).

mkalinin · 2022-05-03T07:47:02Z

I think that there need to be a contract deployment in CommonAncestor with initialization of one of its slots with non-zero value. Then in P1 there should be a transaction that updates the value of the slot, this would induce removal of data bit of a trie that represents an initial value of the slot 128 blocks after CommonAncestor. Though, it needs to be double checked with client devs. Also, worth checking if pruning is enabled in Hive environment.

marioevz · 2022-05-03T18:04:19Z

I think that there need to be a contract deployment in CommonAncestor with initialization of one of its slots with non-zero value. Then in P1 there should be a transaction that updates the value of the slot, this would induce removal of data bit of a trie that represents an initial value of the slot 128 blocks after CommonAncestor. Though, it needs to be double checked with client devs. Also, worth checking if pruning is enabled in Hive environment.

I checked with @MariusVanDerWijden and it seems like state pruning only occurs after 90,000 blocks, at least for Geth, which I think would make this test take extremely long for a regression test.

Would sending P2' first instead of P1' have a similar effect in regards to not having the prerequisite state in memory to validate ?

marioevz · 2022-05-06T00:35:57Z

Hi @mkalinin, @MariusVanDerWijden, @djrtwo, I've added the Invalid Payload Sync test cases that have the Geth error.

The test case is structured as follows:

Have a canonical chain of length N (10 in this case), while in parallel construct an alternate chain which at some point contains an invalid payload (The height of the invalid payload is customizable).
Send all payloads of the alternate chain up to the point before the invalid payload to the secondary client.
Send all remaining payloads, starting from the invalid payload, to the primary client.
The primary client returns SYNCING for the newPayload and fcU calls with the invalid payloads, because it needs to sync with the secondary client to obtain the missing blocks.
The secondary client feeds the blocks because the chain, as far as it knows, is completely valid
We keep polling the primary client with fcU calls to the head of the alternate chain, until it returns INVALID.

At the moment none of the clients passes these tests because either:

The client detects the invalid payload, discards it, but the CL never knows about it because the EL keeps returning SYNCING.
The client returns VALID due to a bug.

I have added a test case per each possible header field that can be manipulated in order to produce an invalid payload, with a total of 14 test cases. If this is too exhaustive please let me know and we can reduce the test cases to the most significant ones.

mkalinin · 2022-05-06T08:05:38Z

I have added a test case per each possible header field that can be manipulated in order to produce an invalid payload, with a total of 14 test cases. If this is too exhaustive please let me know and we can reduce the test cases to the most significant ones.

I always think the more variants of the test the better. Especially, if they run on CI. If tests consume a lot of time and this becomes an issue then we should think about the way to reduce this number, unless that I'd say run as many as we can.

MariusVanDerWijden · 2022-05-06T09:55:33Z

@marioevz Does the geth bug happen with a particular test or on all of them?

marioevz · 2022-05-06T15:12:35Z

@marioevz Does the geth bug happen with a particular test or on all of them?

I've only found one combination where the issue reproduces (Modified StateRoot + No Tx) out of all the combinations.

marioevz · 2022-05-06T18:03:53Z

This PR should be ready for review/merge.

One consideration to take into account is that the new tests are currently failing for all clients due to the reasons explained in a previous comment.

marioevz · 2022-05-09T15:58:05Z

Hi @mkalinin, I've added the check on the fcU LVH on the invalid ancestor chain.

holiman · 2022-05-10T06:08:34Z

I checked with @MariusVanDerWijden and it seems like state pruning only occurs after 90,000 blocks, at least for Geth, which I think would make this test take extremely long for a regression test.

That's not quite the case. State pruning occurs after 128 blocks -- that's when we start removing the trie roots.
After 90K blocks, that's when we start removing sidechain blocks, so we only keep the canonical block data.

marioevz · 2022-05-10T15:12:49Z

I checked with @MariusVanDerWijden and it seems like state pruning only occurs after 90,000 blocks, at least for Geth, which I think would make this test take extremely long for a regression test.

That's not quite the case. State pruning occurs after 128 blocks -- that's when we start removing the trie roots. After 90K blocks, that's when we start removing sidechain blocks, so we only keep the canonical block data.

Ah I see, the original idea for the test was that at some point the newPayload with payload.Parent equal to a very old block would have to return ACCEPTED, meaning that the payload could not be instantly validated due to state pruning. However I tested up to 2000 old block as parent and could not get the desired outcome.

I think the test scenario is correctly achieved by the sync test case, but it's open for discussion if there could be a way we could trigger the ACCEPTED response on newPayload on an old block somehow.

marioevz · 2022-05-10T19:28:12Z

I have updated the test cases to be more in line with the implementation description here:
https://hackmd.io/GDc0maGsQeKfP8o2C7L52w#Implementation

The test cases now invalidate the second to last payload on the side chain, therefore, tip.Parent of the invalid ancestor chain points to the invalid payload, and newPayload(tip) should return invalid.

The test is still failing for all clients.

LMK if this change looks ok @mkalinin @MariusVanDerWijden

fjl · 2022-05-16T21:08:02Z

simulators/ethereum/engine/helper.go

+
+const (
+	InvalidParentHash           InvalidPayloadField = "ParentHash"
+	InvalidStateRoot                                = "StateRoot"


BTW, the constants here will not have the type you expect. While InvalidParentHash has type InvalidPayloadField, the other constants are of type 'untyped string'. See https://go.dev/play/p/tyAGBY8nkuS for an example. The code compiles anyway because a value of type 'untyped string' can be assigned to a variables of type InvalidPayloadField (since the underlying type is string).

You can fix this by listing the type in every line. We usually do it like this:

const ( InvalidParentHash = InvalidPayloadField("ParentHash") ... )

marioevz changed the title ~~simulators/engine: Re-org to Chain with Missing Invalid Payload~~ simulators/ethereum/engine: Re-org to Chain with Missing Invalid Payload Apr 26, 2022

MariusVanDerWijden mentioned this pull request May 2, 2022

eth/catalyst: update implementation to spec ethereum/go-ethereum#24802

Merged

marioevz added 3 commits May 5, 2022 15:56

simulators/engine: reorg to missing invalid payload chain tests

3428cd5

simulators/ethereum/engine: add contract transaction helper

977c203

simulators/ethereum/engine: Add p2p sync to invalid payloads test

c04a549

marioevz force-pushed the reorg-missing-invalid-payload branch from e02e8fc to c04a549 Compare May 6, 2022 00:17

marioevz added 2 commits May 6, 2022 16:54

simulators/ethereum/engine: add readme info

91293a5

simulators/ethereum/engine: minor change to invalid ancestor test

8cd9dfb

marioevz marked this pull request as ready for review May 6, 2022 18:01

simulators/ethereum/engine: check fcu on invalid ancestor too

a10e944

simulators/ethereum/engine: invalid ancestor tests change

eb66056

fjl reviewed May 16, 2022

View reviewed changes

fjl merged commit b87c710 into ethereum:master May 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

simulators/ethereum/engine: Re-org to Chain with Missing Invalid Payload #526

simulators/ethereum/engine: Re-org to Chain with Missing Invalid Payload #526

marioevz commented Apr 25, 2022 •

edited

Loading

mkalinin commented Apr 27, 2022

marioevz commented Apr 29, 2022

marioevz commented May 3, 2022

mkalinin commented May 3, 2022

marioevz commented May 3, 2022

marioevz commented May 6, 2022

mkalinin commented May 6, 2022

MariusVanDerWijden commented May 6, 2022

marioevz commented May 6, 2022

marioevz commented May 6, 2022

marioevz commented May 9, 2022

holiman commented May 10, 2022

marioevz commented May 10, 2022

marioevz commented May 10, 2022

fjl May 16, 2022 •

edited

Loading

simulators/ethereum/engine: Re-org to Chain with Missing Invalid Payload #526

simulators/ethereum/engine: Re-org to Chain with Missing Invalid Payload #526

Conversation

marioevz commented Apr 25, 2022 • edited Loading

mkalinin commented Apr 27, 2022

marioevz commented Apr 29, 2022

marioevz commented May 3, 2022

mkalinin commented May 3, 2022

marioevz commented May 3, 2022

marioevz commented May 6, 2022

mkalinin commented May 6, 2022

MariusVanDerWijden commented May 6, 2022

marioevz commented May 6, 2022

marioevz commented May 6, 2022

marioevz commented May 9, 2022

holiman commented May 10, 2022

marioevz commented May 10, 2022

marioevz commented May 10, 2022

fjl May 16, 2022 • edited Loading

Choose a reason for hiding this comment

marioevz commented Apr 25, 2022 •

edited

Loading

fjl May 16, 2022 •

edited

Loading