
Calling engine_preparePayload in advance #2715

Closed
paulhauner opened this issue Oct 15, 2021 · 13 comments
Labels: RFC Request for comment

Comments

paulhauner (Member) commented Oct 15, 2021

Description

In a post-merge beacon chain, a CL (consensus layer/eth2) node will need to call two functions in order to prepare a block: preparePayload and getPayload.

The ultimate goal of these two calls is to return an ExecutionPayload, which is effectively an execution (eth1) block to be included in a consensus (eth2) block.

The reason there are separate preparePayload and getPayload calls is to allow the CL node to give the EL (execution layer/eth1) node some time to prepare the payload (i.e., find the best set of transactions it can). To this end, in the ideal case we should call preparePayload some time before we call getPayload.

The purpose of this issue is to establish when the CL nodes should call preparePayload and to consider the engineering requirements for CL implementations (e.g., Lighthouse).

When to call preparePayload

Let's start with three basic constraints about when and how to call preparePayload:

  1. preparePayload only needs to be called if we expect to call getPayload during some slot s.
    • I.e., only call preparePayload if a beacon node (BN) expects to propose a block in slot s.
  2. Since preparePayload accepts a parentHash, we can only call it after we know the parent of the block at slot s.
    • I.e., preparePayload needs to be called sometime during slot s - 1.
  3. preparePayload parameters are determined by what we expect to be the canonical head block at the start of slot s.

Given these constraints, we could say that preparePayload should be called whenever the canonical head changes during slot s - 1.

But alas, there is an edge-case. What if the node never receives a block at slot s - 1 (i.e., s - 1 is a "skip slot")? The head could remain unchanged (e.g. the block at slot s - 2) and therefore we'd never call preparePayload.

In light of skip slots, it seems we may need to decide at some point during slot s - 1 that we're probably not going to get a block and that we should call preparePayload with the current head (e.g., the block at slot s - 2). This point would be the threshold at which we assume there is a skip slot, so let's call it assumed_skip_slot_threshold.

We can now form a general definition of when to call preparePayload:

General definition

If a CL node expects to propose a block at slot s, then it should call preparePayload with values computed from the canonical head whenever the following events occur during slot s - 1:

  1. The canonical head changes.
  2. The assumed_skip_slot_threshold is reached, and the first condition (1) has not already been triggered.
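
As a rough illustration of this trigger logic, here is a sketch in Rust; PrepareState and prepare_payload are hypothetical names, not Lighthouse's actual types:

/// Hypothetical per-slot state tracking whether preparePayload has
/// already been triggered for proposal slot `s` during slot `s - 1`.
struct PrepareState {
    /// The proposal slot `s`.
    proposal_slot: u64,
    /// Set once condition (1) fires, so condition (2) is skipped.
    head_change_triggered: bool,
}

impl PrepareState {
    /// Condition (1): the canonical head changed during slot s - 1.
    fn on_head_change(&mut self, new_head_root: [u8; 32]) {
        self.head_change_triggered = true;
        prepare_payload(self.proposal_slot, new_head_root);
    }

    /// Condition (2): assumed_skip_slot_threshold was reached and
    /// condition (1) never fired during this slot.
    fn on_skip_threshold(&mut self, current_head_root: [u8; 32]) {
        if !self.head_change_triggered {
            prepare_payload(self.proposal_slot, current_head_root);
        }
    }
}

/// Stand-in for computing the preparePayload values from the canonical
/// head and issuing the request to the EL node.
fn prepare_payload(slot: u64, parent_root: [u8; 32]) {
    let _ = (slot, parent_root);
}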

The nitty gritty of implementation

Proposer shuffling

Our previous definition makes the assumption that we always know the proposers for slot s at slot s - 1. This is not strictly true. The proposer shuffling for epoch e can only be known after the final block in epoch e - 1 is processed.

This means that if slot s - 1 is the last slot of an epoch (i.e., s % SLOTS_PER_EPOCH == 0), we won't know what the proposer shuffling is until we either (a) receive a block at slot s - 1 or (b) hit assumed_skip_slot_threshold and assume that there is no block at s - 1.
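
The boundary check reduces to a single modular test (a sketch; slots_per_epoch stands in for the spec's SLOTS_PER_EPOCH):

/// The shuffling for proposal slot `s` is only unknown at slot `s - 1`
/// when `s` opens a new epoch (so `s - 1` closes the previous one).
fn shuffling_unknown_during_prior_slot(s: u64, slots_per_epoch: u64) -> bool {
    s % slots_per_epoch == 0
}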

With this in mind, we can create a more implementation-specific definition that is aware of proposer-shuffling constraints:

Proposer-shuffling aware definition

If the CL node is performing duties for any active validators, then it should run the maybe_prepare_payload routine whenever:

  1. The canonical head changes.
  2. The assumed_skip_slot_threshold is reached, and the first condition (1) has not already been triggered.

Where maybe_prepare_payload involves:

  1. Taking the canonical head block and running process_slots to advance it to slot s.
  2. Determining if the CL node is performing duties for the block proposer at slot s. If so, continue, else exit.
  3. Computing the values for preparePayload and issuing the request to the EL node.

Note: maybe_prepare_payload can be optimized in the non-epoch-boundary scenario to avoid calling process_slots, but this definition aims to be simple and general.
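
For concreteness, here is a minimal sketch of that routine; BeaconState and the helper stubs stand in for a client's real implementations, and only the shape of the three steps is the point:

/// Stand-in for the full consensus state.
struct BeaconState;

/// Stand-in: advance the state through (possibly empty) slots to
/// `target_slot`, crossing the epoch boundary if there is one.
fn process_slots(_state: &BeaconState, _target_slot: u64) -> BeaconState {
    BeaconState
}

/// Stand-in: the spec's proposer computation at the state's slot.
fn get_beacon_proposer_index(_state: &BeaconState) -> u64 {
    0
}

/// Stand-in: compute the preparePayload values and call the EL node.
fn issue_prepare_payload(_state: &BeaconState, _slot: u64) {}

fn maybe_prepare_payload(head_state: &BeaconState, s: u64, our_validators: &[u64]) {
    // 1. Advance the canonical head state to slot s, so the proposer
    //    shuffling for s is known even at an epoch boundary.
    let state_at_s = process_slots(head_state, s);

    // 2. Exit unless we perform duties for the slot-s proposer.
    let proposer = get_beacon_proposer_index(&state_at_s);
    if !our_validators.contains(&proposer) {
        return;
    }

    // 3. Compute the preparePayload values and issue the request.
    issue_prepare_payload(&state_at_s, s);
}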

Is the VC or BN driving?

You may notice that I've used "CL node" instead of referring to the duties of a beacon node (BN) or validator client (VC). That's because it's not immediately clear whether the BN or VC should be the one driving this series of events.

VC driving

In the "VC driving" scenario, the BN has no idea about which validators may produce blocks at slot s. It is up to the VC to ensure that the BN issues a relevant preparePayload request at the correct time(s). The "VC driving" process looks like this:

If the VC is performing duties for any active validators, then it should run the maybe_prepare_payload routine whenever:

  1. The canonical head changes (i.e., it receives a head SSE event).
  2. The assumed_skip_slot_threshold is reached, and the first condition (1) has not already been triggered.

Where maybe_prepare_payload involves:

  1. Determining the proposer duties for slot s
    • It may have these cached, or it may need to use the BN's duties/proposer endpoint.
  2. Determining if the VC is performing duties for the block proposer at slot s. If so, continue, else exit.
  3. Issuing a request to the BN API which, in turn, makes it issue a preparePayload request to the EL node.
    • Such a BN API does not yet exist, but let's call it validator/prepare_payload for the time being.

The definition of validator/prepare_payload requires some thought too. I propose it should take (slot, head_block_root) as parameters and return nothing. It will be the duty of the BN to hold the payloadId and provide it during a getPayload request. For the input parameters, slot is the slot in which the VC expects to propose a block (i.e., s) and head_block_root will be the root of the head block at the time of the call (i.e., the expected parent of the beacon block it expects to propose at s).

BN driving

In the "BN driving" scenario, the VC knows nothing of the preparePayload request. Instead, just tells the BN which validators it is managing and the BN transparently calls preparePayload when it sees fit.

The "BN driving" process looks like this:

  1. The VC sends a message to the BN with the list of validator indices it controls
    • The validator/beacon_committee_subscriptions endpoint could theoretically be repurposed to also do this.
    • Alternatively we could create a new validator/potential_beacon_proposers endpoint (naming can be improved).
    • It would probably make sense for this "subscription" to potential beacon proposers to expire after some time, since it does incur effort for the EL node and a once-and-forever subscription could end up wasteful.
  2. The BN follows exactly the steps described in the Proposer shuffling aware definition.

What does @paulhauner think about VC or BN driving?

At this stage, I think I prefer BN driving because it strives for simplicity in the VC (the scary secret-key-holding thing) and it also allows for more optimization inside the BN. Some clients (Lighthouse, Teku, at least) are already doing optimizations to compute the proposer duties for epoch e at the end of e - 1, these could be leveraged to make preparePayload more efficient.

Open Questions

I'm not sure what to define assumed_skip_slot_threshold as. One way to do it would be to set it at roughly the last time at which we usually expect a beacon block. In my experience this would be somewhere between 4-8s after slot start. However, it would be good to know if there's a point of diminishing returns regarding the delay between preparePayload and getPayload. For example, if it never takes the EL more than 3s to build the ideal ExecutionPayload, then let's just set it to 9s (12s - 3s) after slot start.
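
That arithmetic as a trivial sketch (the 3s EL build time is an assumed figure, not a measurement):

const SECONDS_PER_SLOT: u64 = 12;

/// Assumed upper bound on how long the EL needs to build an ideal
/// ExecutionPayload.
const MAX_EL_BUILD_SECONDS: u64 = 3;

/// Latest point in slot s - 1 at which calling preparePayload still
/// leaves the EL its full build time: 12 - 3 = 9 seconds after the
/// start of the slot.
fn assumed_skip_slot_threshold_seconds() -> u64 {
    SECONDS_PER_SLOT - MAX_EL_BUILD_SECONDS
}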

paulhauner added the RFC Request for comment label Oct 15, 2021
paulhauner (Member, Author) commented:

As pointed out by @djrtwo, engine_preparePayload will be integrated into engine_forkchoiceUpdated in this PR: ethereum/execution-apis#83

This issue remains relevant; I'll just change some naming when that PR merges.

mcdee (Contributor) commented Oct 15, 2021

With the current VC/BN split I would very much favor using the VC to send a preparation call. There are two reasons for this: it removes the assumption that a single BN is responsible for a given validator, which is not the case in more advanced architectures, and it allows the VC to provide the coinbase address (which you didn't mention in your API call, but should be in there).

paulhauner (Member, Author) commented:

> With the current VC/BN split I would very much favor using the VC to send a preparation call. There are two reasons for this: it removes the assumption that a single BN is responsible for a given validator, which is not the case in more advanced architectures, and it allows the VC to provide the coinbase address (which you didn't mention in your API call, but should be in there).

Good point about the feeRecipient address, I'll add that in. It could easily be provided in either scenario, so I don't think it's a reason for VC driving, though.

I think BN driving is just as suitable for 1:many VC:BN. A VC can tell multiple BNs that they should be preparing payloads. It may result in some unnecessary preparePayload calls, but I'm not sure that's an issue considering that in PoW the execution client always expects to produce the next block. I'm not sure about the "advanced architectures", but I'm certainly open to hearing what they might be.

Another way to think about BN driving is that it's exactly equivalent to VC driving, but calls the preparePayload in advance and defers the exact timing to the BN.

mcdee (Contributor) commented Oct 17, 2021

> Good point about the feeRecipient address, I'll add that in. It could easily be provided in either scenario, so I don't think it's a reason for VC driving, though.

The VC does, in general, have control of the information and flow. Graffiti is the most obvious current counterpart for information: if the BN held the fee recipient information then we'd end up having to configure both the VC (for graffiti) and the BN (for fee recipient) for each validator, which seems overly restrictive. As for flow, the VC already tells the BN what to do with calls like sync committee subscriptions, so this seems to fit the existing flow better. And if we ever end up with any proof-of-custody scheme we will have to have the VC call the BN, as it will need to provide a proof, so this seems more future-proofed as well.

> I think BN driving is just as suitable for 1:many VC:BN. A VC can tell multiple BNs that they should be preparing payloads. It may result in some unnecessary preparePayload calls, but I'm not sure that's an issue considering that in PoW the execution client always expects to produce the next block. I'm not sure about the "advanced architectures", but I'm certainly open to hearing what they might be.

As for advanced architectures, one example would be a pool of VCs working with a pool of BNs. Any of the BNs could be asked by a VC to prepare a block, but getting all of them to do so would be wasteful. It would also require the VC to notify all of the BNs after it had fetched its block from one of them, to tell them to stop preparing, which is a very odd way round of doing things compared to selecting a relevant BN and asking it and it alone to generate the block.

paulhauner (Member, Author) commented:

> if the BN held the fee recipient information then we'd end up having to configure both the VC (for graffiti) and the BN (for fee recipient) for each validator, which seems overly restrictive.

I've never suggested the BN should control the feeRecipient, I just forgot about it originally. Perhaps we have our wires crossed here.

After considering your points, I think the "BN driving" scenario can be designed in a way that allows simplistic VC implementations to go "hands-free" whilst still working just as well for the more complicated scenarios you describe.

Here's an updated description:

BN Driving

At some point in time, the VC publishes this message to the BN:

# POST validator/potential_beacon_proposers
[
  {
    "validator_index": "0",
    "feeRecipient": "0xabc.."
  }
]

Upon receiving that message, the BN knows that it should try to ensure a payload is prepared if it ever expects validator 0 to produce a block.

This approach has the following properties:

  • The message can be sent regardless of whether validator 0 will ever produce a block (the VC does not require this knowledge).
  • The message can be sent at arbitrary times (e.g., at slot 32 when validator 0 will produce a block at slot 38).
    • I think we should probably specify that this "subscription" only lasts for 2 epochs. (No firm reason for 2, just finger to the wind; see the sketch after this list.)
  • The VC does not need to track changes in the BN's head. It sends one message and the BN takes care of the rest.
  • The VC can send a subset of its validator indices.
  • The VC can decide to only send this message for a single validator, in the slot immediately before it proposes a block (therefore there is no wastage in a multi-BN setup).
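
To make the expiry idea concrete, here is a hypothetical sketch of the BN-side bookkeeping; ProposerSubscriptions and its methods are invented names, and the two-epoch lifetime is the finger-to-the-wind figure from above:

use std::collections::HashMap;

const SUBSCRIPTION_LIFETIME_EPOCHS: u64 = 2;

struct ProposerSubscriptions {
    /// validator_index -> (fee recipient, epoch the message arrived).
    entries: HashMap<u64, ([u8; 20], u64)>,
}

impl ProposerSubscriptions {
    /// Record (or refresh) a subscription from a VC.
    fn subscribe(&mut self, validator_index: u64, fee_recipient: [u8; 20], current_epoch: u64) {
        self.entries.insert(validator_index, (fee_recipient, current_epoch));
    }

    /// Treat the validator as a potential proposer only while its
    /// subscription is younger than the assumed two-epoch lifetime.
    fn live_fee_recipient(&self, validator_index: u64, current_epoch: u64) -> Option<[u8; 20]> {
        match self.entries.get(&validator_index) {
            Some(&(recipient, seen_at))
                if current_epoch.saturating_sub(seen_at) <= SUBSCRIPTION_LIFETIME_EPOCHS =>
            {
                Some(recipient)
            }
            _ => None,
        }
    }
}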

VC Driving

At exactly the correct time(s), the VC publishes this/these message(s) to the BN:

# POST validator/prepare_payload
{
  "slot": "42",
  "head_block_root": "0x123...",
  "feeRecipient": "0xabc.."
}
  • The VC must track changes in the BN's head and re-issue this command if the head changes.
    • There was some talk of a "cancellation" message. If that goes ahead, it would need to be integrated into the VC and a new VC<>BN API endpoint created.
  • The VC must issue this command at the precise time in which it is relevant.
  • The VC must ensure that the message is relevant for the correct BN (e.g., in a multiple BN scenario, it would be invalid to send this message to a BN that does not have the 0x123... block).
  • It is wasteful to publish this message unless the VC expects to propose a block in slot "42".

Summary

Hopefully this shows that "BN driving" is still able to avoid wastefulness in a multi-BN environment. I find it to be more flexible and just as powerful as "VC driving". I also find "BN driving" to be more amenable to optimization in the BN (in the scenario where it is given advance notice of proposers) and more agreeable to VC implementations that aim for simplicity. Additionally, "BN driving" further distances the BN<>VC API from changes in the EL<>CL workflow.

One thing to note is that "VC driving" allows the VC to specify the head_block_root and slot parameters. This might seem like more power to the VC, but given that GET validator/blocks only accepts a slot (not a head_block_root), I'd argue that it only gives the VC the power to be wrong.

realbigsean (Member) commented Oct 19, 2021

I'm not seeing why head_block_root would be included by the VC in the VC-driven example here. Is it meant to be a proxy for shuffling? Because in that case (slot, validatorId, feeRecipient) seems like it'd be reasonable.

The core question seems to be whether the BN or VC is responsible for mapping the shuffling to a specific proposer. Having the VC provide (slot, validator_index, feeRecipient) implies the VC is responsible for this, whereas having it provide (validator_index, feeRecipient) implies the BN.

This does seem like a reasonable responsibility for the validator client, but the new design of engine_forkChoiceUpdated (tightly coupling execution payload prep information with the fork choice) really seems to suggest the BN should have knowledge of this as the fork choice is being updated... which would strongly suggest the BN should be responsible for this. Unless we're expecting engine_forkChoiceUpdated to potentially be called multiple times with the same fork choice information but different execution payload prep information (which I don't think we are).

djrtwo commented Oct 19, 2021

I definitely think it is simpler for the BN to handle this responsibility and for the BN to also eventually call get_payload, bundling the result into a valid block to pass to the VC for signature.

I would not want one piece of software to call prepare and another call get, as you then have synchronization issues between the two -- e.g., was prepare already called?

Another reason I think this should be a BN responsibility is that the BN is aware of the head (and changes to it). Thus the BN is the most up-to-date entity to trigger build processes. Not to mention -- in the next wave of specs, prepare will be integrated into forkchoiceUpdated to avoid race conditions between the two (previously) separate calls. Such a race condition would only be worse if you split the BN and VC duties here (one calling forkchoiceUpdated, and the other calling prepare). I personally think that the engine API is much, much simpler when only one entity is in charge of making the EL do work that can only be done singularly -- updating the head and building a payload. (Note, an EL could handle process_payload requests from multiple sources in a sane way.)

mcdee (Contributor) commented Oct 19, 2021

Thank you for the detailed consideration, although I do think that the concerns about VC driving are largely unfounded, as these situations are all already dealt with today by a VC that creates block proposals.

With the revised "BN driving" option I think that this addresses most of my concerns. There remains an issue around a BN restarting after the potential_beacon_proposers call has been made but before a proposal is due, but that can most likely be handled suitably at the communications library level. I worry slightly about the implementation at scale (will potential_beacon_proposers be performant with 10K validators, for example) but that's something that can be looked at when the testnets are live.

djrtwo commented Oct 19, 2021

> although I do think that the concerns about VC driving are largely unfounded, as these situations are all already dealt with today by a VC that creates block proposals.

I'm curious -- in a VC driven prepare_payload who should call get_payload in your view?

The value returned from get_payload needs to be packaged into a beacon block to be of value, so if the VC calls get_payload it would have to then pass it to the BN when the BN is creating the beacon block. Is this pass-through what would be required if the VC was driving? Or am I missing something?

If get_payload would be called by the BN when the BN is tasked with making a beacon block, how would the BN be sure that a prepare_payload had already been called? This is the primary reason this seems prone to failure, imo.

Beyond that, because prepare_payload is coupled into forkchoiceUpdated, you now have two competing entities updating EL fork choice (the BN and VC), but only one head supported on the EL (by default), which can lead to race conditions and synchronicity errors between the BN and VC calling this method.

mkalinin commented Nov 2, 2021

> Open Questions
>
> I'm not sure what to define assumed_skip_slot_threshold as. One way to do it would be to set it at roughly the last time at which we usually expect a beacon block. In my experience this would be somewhere between 4-8s after slot start. However, it would be good to know if there's a point of diminishing returns regarding the delay between preparePayload and getPayload. For example, if it never takes the EL more than 3s to build the ideal ExecutionPayload, then let's just set it to 9s (12s - 3s) after slot start.

IMO, the assumed_skip_slot_threshold value should be set to 4 seconds. Then, if the head gets updated at whatever time during the slot in question, a subsequent forkchoiceUpdated call with an updated instance of PayloadAttributes should be sent, which results in a restart of the building process on the EL side. EL client software must respond immediately to the getPayload call, as is stated in the Engine API spec. This simple strategy should bring the desired result in the majority of cases. If a head change induced a restart of the building process e.g. 10ms before the getPayload call, then getPayload will likely respond with an empty payload, but it seems there is nothing to do in this case, as the proposer has to build on top of the current head regardless of how well the payload is filled.
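
Sketching that strategy (the struct here only mirrors the Engine API's PayloadAttributes in spirit; names and field layout are illustrative, not the spec's exact schema):

/// Illustrative subset of the payload build parameters.
struct PayloadAttributes {
    timestamp: u64,
    fee_recipient: [u8; 20],
}

/// Start building at the 4s mark, then restart the build on every
/// later head change by sending another forkchoiceUpdated carrying
/// fresh attributes; each call supersedes the previous build.
fn on_head_event(new_head_root: [u8; 32], attributes: PayloadAttributes) {
    engine_forkchoice_updated(new_head_root, Some(attributes));
}

/// Placeholder for the actual JSON-RPC call to the EL node.
fn engine_forkchoice_updated(_head_block_hash: [u8; 32], _attrs: Option<PayloadAttributes>) {}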

mcdee (Contributor) commented Nov 8, 2021

> I'm curious -- in a VC driven prepare_payload who should call get_payload in your view?

Sorry, for some reason I didn't receive notifications about your response.

> The value returned from get_payload needs to be packaged into a beacon block to be of value, so if the VC calls get_payload it would have to then pass it to the BN when the BN is creating the beacon block. Is this pass-through what would be required if the VC was driving? Or am I missing something?

The main thrust of "VC driving" for me would be that the VC would send a prepare call to the BN, which would trigger the BN calling engine_preparePayload with a combination of data from the VC call and the BN state. If the BN didn't receive a prepare it wouldn't do anything.

But as mentioned above, with the short-term subscriptions that Paul put forward for BN driving, I think the API makes everyone happy. Perhaps the arbitrary "2 epochs" could become an until_epoch parameter, but that may be something worth deciding on when the API is in place.

If @paulhauner is happy with me doing so I'll write up a PR for the beacon-apis repo that contains this design.

paulhauner (Member, Author) commented:

> If @paulhauner is happy with me doing so I'll write up a PR for the beacon-apis repo that contains this design.

That would be great, thank you!

michaelsproul (Member) commented:

I stumbled on this searching for something else. I think we can safely close it now that this feature is implemented and working!
