Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simulators/ethereum/engine: Re-org to Chain with Missing Invalid Payload #526

Merged
merged 7 commits into from
May 16, 2022

Conversation

marioevz
Copy link
Member

@marioevz marioevz commented Apr 25, 2022

This PR implements test case where the CL attempts to re-org into a side chain with unknown payloads to the EL.
The test first generates a side chain with N payloads, which at some point contains an invalid payload, but the side chain remains unknown to the EL:

CommonAncestor◄─▲── P1 ◄── P2  ◄─ P3  ◄─ ... ◄─ Pn
                │
                └── P1' ◄─ P2' ◄─ P3' ◄─ ... ◄─ Pn'

After Pn, the CL performs an fcU(Pn') to the head of the side chain.
Then the CL proceeds to reveal the remaining payloads in the following order:

newPayload(Pn')
...
newPayload(P2')
newPayload(P1')

Currently two scenarios are implemented:

  • P1' is invalid, and since CommonAncestor is known to the EL, the response to newPayload(P1') should be INVALID with latestValidHash==CommonAncestor.Hash.
  • P3' is invalid and the response to newPayload(P3') should be ACCEPTED with latestValidHash==nil because when the payload is received there is not enough information to invalidate the payload.

Current outcome:

  • Geth: Correctly returns INVALID on the first scenario, but returns an incorrect LVH with latestValidHash==Pn.Hash
  • Nethermind: Pass

The PR is open for discussion and suggestions.

cc @mkalinin, @MarekM25, @MariusVanDerWijden

@marioevz marioevz changed the title simulators/engine: Re-org to Chain with Missing Invalid Payload simulators/ethereum/engine: Re-org to Chain with Missing Invalid Payload Apr 26, 2022
@mkalinin
Copy link

IMO, the following order of message should better reflect a real CL flow:

  1. newPayload(P1')
  2. ...
  3. newPayload(Pn')
  4. forkchoiceUpdated(Pn')

In this case if newPayload(P1') can't be processed by EL because parent state is unavailable the response will be ACCEPTED, then forkchoiceUpdated(Pn') will be responded with SYNCING which eventually turns CL into optimistic sync mode.

Inducing this scenario requires CommonAncestor state prune, i.e. Pn chain (canonical chain) should be long enough to induce pruning on EL side while Pn' chain may be shorter. How difficult would it be to re-produce this scenario in tests?

We may also skip newPayload(P1') and then make EL to pull P1' from the network after forkchoiceUpdated(Pn') call but there will have to be a node that serves invalid execution blocks if P1' is invalid, but we may make P2' invalid instead to avoid this latter requirement.

@marioevz
Copy link
Member Author

Thanks for the feedback @mkalinin , I think I can do some experiments on how many blocks do we need to advance on the canonical chain before CommonAncestor's state is pruned, because I assume this is implementation dependent.

I also like the idea of the client having to pull P1' from the network.

I will start working on this update.

@marioevz
Copy link
Member Author

marioevz commented May 3, 2022

Hi @mkalinin, I ran the experiment of doing a long canonical chain to attempt to induce a state prune of the commonAncestor with no success. This was the procedure:

  • Canonical chain with a length of 2000 payloads
  • Each payload filled to target gas limit with contract creation transactions (big zero-filled contracts)
  • Side chain contains invalid payload P2'
  • Send newPayload(P1'), and wait for ACCEPTED response

Geth response to newPayload(P1') was VALID after multiple attempts, so it doesn't seem feasible in this test environment to try to induce state pruning, at least with default configuration.

Let me know what you think about this scenario, I think the original test case is still valuable even without state pruning (latestValidHash verification was valuable to find an issue in geth).

@mkalinin
Copy link

mkalinin commented May 3, 2022

I think that there need to be a contract deployment in CommonAncestor with initialization of one of its slots with non-zero value. Then in P1 there should be a transaction that updates the value of the slot, this would induce removal of data bit of a trie that represents an initial value of the slot 128 blocks after CommonAncestor. Though, it needs to be double checked with client devs. Also, worth checking if pruning is enabled in Hive environment.

@marioevz
Copy link
Member Author

marioevz commented May 3, 2022

I think that there need to be a contract deployment in CommonAncestor with initialization of one of its slots with non-zero value. Then in P1 there should be a transaction that updates the value of the slot, this would induce removal of data bit of a trie that represents an initial value of the slot 128 blocks after CommonAncestor. Though, it needs to be double checked with client devs. Also, worth checking if pruning is enabled in Hive environment.

I checked with @MariusVanDerWijden and it seems like state pruning only occurs after 90,000 blocks, at least for Geth, which I think would make this test take extremely long for a regression test.

Would sending P2' first instead of P1' have a similar effect in regards to not having the prerequisite state in memory to validate ?

@marioevz marioevz force-pushed the reorg-missing-invalid-payload branch from e02e8fc to c04a549 Compare May 6, 2022 00:17
@marioevz
Copy link
Member Author

marioevz commented May 6, 2022

Hi @mkalinin, @MariusVanDerWijden, @djrtwo, I've added the Invalid Payload Sync test cases that have the Geth error.

The test case is structured as follows:

  • Have a canonical chain of length N (10 in this case), while in parallel construct an alternate chain which at some point contains an invalid payload (The height of the invalid payload is customizable).
  • Send all payloads of the alternate chain up to the point before the invalid payload to the secondary client.
  • Send all remaining payloads, starting from the invalid payload, to the primary client.
  • The primary client returns SYNCING for the newPayload and fcU calls with the invalid payloads, because it needs to sync with the secondary client to obtain the missing blocks.
  • The secondary client feeds the blocks because the chain, as far as it knows, is completely valid
  • We keep polling the primary client with fcU calls to the head of the alternate chain, until it returns INVALID.

At the moment none of the clients passes these tests because either:

  • The client detects the invalid payload, discards it, but the CL never knows about it because the EL keeps returning SYNCING.
  • The client returns VALID due to a bug.

I have added a test case per each possible header field that can be manipulated in order to produce an invalid payload, with a total of 14 test cases. If this is too exhaustive please let me know and we can reduce the test cases to the most significant ones.

@mkalinin
Copy link

mkalinin commented May 6, 2022

I have added a test case per each possible header field that can be manipulated in order to produce an invalid payload, with a total of 14 test cases. If this is too exhaustive please let me know and we can reduce the test cases to the most significant ones.

I always think the more variants of the test the better. Especially, if they run on CI. If tests consume a lot of time and this becomes an issue then we should think about the way to reduce this number, unless that I'd say run as many as we can.

@MariusVanDerWijden
Copy link
Member

@marioevz Does the geth bug happen with a particular test or on all of them?

@marioevz
Copy link
Member Author

marioevz commented May 6, 2022

@marioevz Does the geth bug happen with a particular test or on all of them?

I've only found one combination where the issue reproduces (Modified StateRoot + No Tx) out of all the combinations.

@marioevz marioevz marked this pull request as ready for review May 6, 2022 18:01
@marioevz
Copy link
Member Author

marioevz commented May 6, 2022

This PR should be ready for review/merge.

One consideration to take into account is that the new tests are currently failing for all clients due to the reasons explained in a previous comment.

@marioevz
Copy link
Member Author

marioevz commented May 9, 2022

Hi @mkalinin, I've added the check on the fcU LVH on the invalid ancestor chain.

@holiman
Copy link
Collaborator

holiman commented May 10, 2022

I checked with @MariusVanDerWijden and it seems like state pruning only occurs after 90,000 blocks, at least for Geth, which I think would make this test take extremely long for a regression test.

That's not quite the case. State pruning occurs after 128 blocks -- that's when we start removing the trie roots.
After 90K blocks, that's when we start removing sidechain blocks, so we only keep the canonical block data.

@marioevz
Copy link
Member Author

I checked with @MariusVanDerWijden and it seems like state pruning only occurs after 90,000 blocks, at least for Geth, which I think would make this test take extremely long for a regression test.

That's not quite the case. State pruning occurs after 128 blocks -- that's when we start removing the trie roots. After 90K blocks, that's when we start removing sidechain blocks, so we only keep the canonical block data.

Ah I see, the original idea for the test was that at some point the newPayload with payload.Parent equal to a very old block would have to return ACCEPTED, meaning that the payload could not be instantly validated due to state pruning. However I tested up to 2000 old block as parent and could not get the desired outcome.

I think the test scenario is correctly achieved by the sync test case, but it's open for discussion if there could be a way we could trigger the ACCEPTED response on newPayload on an old block somehow.

@marioevz
Copy link
Member Author

I have updated the test cases to be more in line with the implementation description here:
https://hackmd.io/GDc0maGsQeKfP8o2C7L52w#Implementation

The test cases now invalidate the second to last payload on the side chain, therefore, tip.Parent of the invalid ancestor chain points to the invalid payload, and newPayload(tip) should return invalid.

The test is still failing for all clients.

LMK if this change looks ok @mkalinin @MariusVanDerWijden


const (
InvalidParentHash InvalidPayloadField = "ParentHash"
InvalidStateRoot = "StateRoot"
Copy link
Collaborator

@fjl fjl May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the constants here will not have the type you expect. While InvalidParentHash has type InvalidPayloadField, the other constants are of type 'untyped string'. See https://go.dev/play/p/tyAGBY8nkuS for an example. The code compiles anyway because a value of type 'untyped string' can be assigned to a variables of type InvalidPayloadField (since the underlying type is string).

You can fix this by listing the type in every line. We usually do it like this:

const (
      InvalidParentHash = InvalidPayloadField("ParentHash")
      ...
)

@fjl fjl merged commit b87c710 into ethereum:master May 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants