
Rebalance duty times #3433

Open. Wants to merge 6 commits into base: dev.
Conversation

@arnetheduck (Contributor) commented Jun 15, 2023

As the chain keeps growing and duties are being added to the block construction and verification pipeline, it becomes increasingly difficult for clients to complete block duties on time, leading to poor attestation performance and an increase in reorg frequency.

This PR proposes to rebalance the relative timings of the three main activities, namely block production, attestation and aggregation, such that they happen at 0/6/9 seconds into each slot instead of 0/4/8.

This reduces the time for attestations to reach aggregators and aggregates to reach block producers but increases the time the consensus and execution clients have to produce and validate blocks.

Each upgrade has so far increased complexity and processing requirements around block production, and so will future upgrades: due to their increased size, blocks with blobs will take longer to disseminate, and additional verification / cryptography is needed to validate them.

Attached are a few graphs that show the number of reorgs growing over the last 6 months, as well as typical receipt times of attestations and aggregates relative to the start of the slot.

Generally, ~95% of attestations and >99% of aggregates are submitted within 2s of their broadcast cutoff. This gives us some margin to reach even better numbers when 3s are allotted to each of these activities.

We can see that these numbers tell a story where attestations take longer than aggregates to produce. This could have a number of underlying reasons, including the fact that attestations are many and aggregates are few (putting load on the network), and that clients already delay attestation production slightly due to natural block processing delays (i.e. if the pipeline is already clogged with block verification, clients may be blocked and unable to produce attestations at the same time).

One could thus introduce an uneven balance, i.e. 0/7/10, but this seems premature: with more time dedicated to block verification, clients should be able to produce timely attestations with higher frequency.

This PR also introduces stronger language around the already existing requirement that clients send out attestations as soon as they have observed a block - doing so would help the network distribute load more evenly and thus better absorb continued growth.

Metrics for attestation / aggregate receipt as observed by a Nimbus node:

# HELP beacon_attestation_delay Time(s) between slot start and attestation reception
# TYPE beacon_attestation_delay histogram
beacon_attestation_delay_sum 1439984500.361385
beacon_attestation_delay_count 348930364.0
beacon_attestation_delay_created 1684694965.0
beacon_attestation_delay_bucket{le="2.0"} 26304030.0
beacon_attestation_delay_bucket{le="4.0"} 125419664.0
beacon_attestation_delay_bucket{le="6.0"} 327924668.0
beacon_attestation_delay_bucket{le="8.0"} 347277541.0
beacon_attestation_delay_bucket{le="10.0"} 348235076.0
beacon_attestation_delay_bucket{le="12.0"} 348526697.0
beacon_attestation_delay_bucket{le="14.0"} 348652289.0
beacon_attestation_delay_bucket{le="+Inf"} 348930364.0

The above metrics show that most attestations are observed between 4 and 6 seconds into the slot, while some arrive earlier than the current 4s cutoff.
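
For reference, a minimal sketch of how these cumulative Prometheus buckets translate into percentages (bucket values copied from the histogram above; the calculation is just cumulative count divided by total):

# Hedged sketch: compute the share of attestations received within each cutoff.
buckets = {2.0: 26304030, 4.0: 125419664, 6.0: 327924668, 8.0: 347277541}
total = 348930364
for cutoff, count in buckets.items():
    print(f"<= {cutoff:4.1f}s: {100 * count / total:5.1f}%")
# Roughly ~7.5% within 2s, ~36% within 4s, ~94% within 6s and ~99.5% within 8s,
# i.e. the bulk of attestations lands between 4s and 6s into the slot.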

# HELP beacon_aggregate_delay Time(s) between slot start and aggregate reception
# TYPE beacon_aggregate_delay histogram
beacon_aggregate_delay_sum 196980906.9307191
beacon_aggregate_delay_count 23475554.0
beacon_aggregate_delay_created 1684694965.0
beacon_aggregate_delay_bucket{le="2.0"} 247.0
beacon_aggregate_delay_bucket{le="4.0"} 700.0
beacon_aggregate_delay_bucket{le="6.0"} 2673.0
beacon_aggregate_delay_bucket{le="8.0"} 72015.0
beacon_aggregate_delay_bucket{le="10.0"} 23319966.0
beacon_aggregate_delay_bucket{le="12.0"} 23414399.0
beacon_aggregate_delay_bucket{le="14.0"} 23438978.0
beacon_aggregate_delay_bucket{le="+Inf"} 23475554.0

For aggregates, the cutoff is much clearer, since there exists no "early broadcast" rule. A possible early broadcast rule would be that all members of the committee have voted and a perfect aggregate has been reached.
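
As a rough sketch of what such an early-broadcast check could look like (is_perfect_aggregate is a hypothetical helper, not part of this PR; it assumes a spec-style aggregation_bits bitlist covering the whole committee):

def is_perfect_aggregate(aggregation_bits) -> bool:
    # "Perfect" here means every committee member is already included,
    # so waiting longer cannot improve the aggregate.
    return all(aggregation_bits)

# An aggregator could broadcast as soon as this returns True instead of
# always waiting for the aggregate deadline.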

[graph: attestation and aggregate receipt delays]

The same data, but in a graph over time.

Reorg frequencies for the past 6 months:

[graph: reorg frequencies over the past 6 months]

Among concerns are:

  • aggregation takes time due to the computational load of signature aggregation: a validator assigned to aggregation duties must today aggregate ~350 attestations, growing by 1 for roughly every 2k new validators (see the sketch after this list), so it is possible that aggregators may be delayed in sending out aggregates.
  • dissemination also takes time, and this may lead to a poorer selection of aggregates for block producers.
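
A back-of-the-envelope sketch of where the "~350, growing by 1 per 2k validators" figure comes from, assuming mainnet presets (SLOTS_PER_EPOCH = 32, MAX_COMMITTEES_PER_SLOT = 64) and one committee's worth of attestations per aggregator; the numbers are illustrative:

SLOTS_PER_EPOCH = 32
MAX_COMMITTEES_PER_SLOT = 64

def attestations_per_aggregator(active_validators: int) -> int:
    # With enough validators there are 64 committees per slot, so each
    # committee holds roughly active_validators / (32 * 64) attesters.
    return active_validators // (SLOTS_PER_EPOCH * MAX_COMMITTEES_PER_SLOT)

print(attestations_per_aggregator(700_000))  # ~341 at mid-2023 validator counts
print(attestations_per_aggregator(702_048))  # one more per 2048 additional validators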

This PR updates the validator spec - the fork choice spec and possibly other parts would have to be updated accordingly.

Marked draft pending further investigations into relative timings. This looks like a pretty good idea :)

@potuz (Contributor) commented Jun 15, 2023

As we discussed on Discord, these proposed numbers may represent a considerable rework of Prysm's design, at least. The way we handle aggregation currently requires, by default, all attestations to arrive 1.5 seconds before the aggregation cutoff (this would be 7.5 seconds into the slot). Similarly, our implementation of #3034 requires a couple of forkchoice calls before the end of the slot, and this in turn requires all aggregations to have arrived at the node, or we would risk several orphaned blocks on chain.

I do believe that these numbers are still within manageable boundaries for our implementation, and I fully support increasing the first bracket of the slot. However, I would want to see very good benchmarks of all client implementations with different parameters to justify these changes.

The benchmarks we used to justify the cut at 1.5 seconds before the aggregation time were here prysmaticlabs/prysm#12350

Similarly, we use 10 seconds into the slot as the time at which to decide whether or not to reorg; shortening this (i.e. moving closer to the boundary, as Lighthouse does) works fine in general, but it hurts local execution builders in the case of a failed reorg attempt.

@mcdee (Contributor) commented Jun 15, 2023

This PR also introduces stronger language around the already existing requirement that clients send out attestations as soon as they have observed a block - doing so would help the network distribute load more evenly and thus better absorb continued growth.

Personally I would like to see this happen first, and then measure again the distribution of attestation marks to see what impact increasing the attestation space in the slot would provide.

As an example: Vouch follows the spec and attests as soon as possible i.e. at 4 seconds, or before if a block arrives. A sample distribution of attestation marks (this is when the attestation process completes) for one of our nodes is as follows:

[graph: distribution of Vouch attestation completion marks]

It can be seen that the majority of attestations are created by the 3s mark, and the tail past 4s is basically down to missed slots. If all validator clients followed the spec here, it could well result in less pressure later in the slot, and somewhat mitigate the need to extend the first part of the slot. At the least, it would give a better view of what the in-slot timings should be.

Separately, I would like to point out that a significant jump in orphan rates started in April, at the point that MEV relays (unilaterally) took over the proposal broadcast responsibilities for MEV blocks. A breakdown of the orphan rate for locally-proposed blocks only would give a better view as to the health of the network.

@@ -95,7 +95,7 @@ def on_block(store: Store, signed_block: SignedBeaconBlock) -> None:
     # Add proposer score boost if the block is timely
     time_into_slot = (store.time - store.genesis_time) % SECONDS_PER_SLOT
-    is_before_attesting_interval = time_into_slot < SECONDS_PER_SLOT // INTERVALS_PER_SLOT
+    is_before_attesting_interval = time_into_slot < SECONDS_PER_SLOT // 2
A Collaborator commented on this diff:

This should be a constant
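
For illustration, one possible reading of this suggestion using the millisecond constants discussed later in the thread (the exact names and how this would be wired into on_block are hypothetical here):

ATTESTATION_DUE_MS = 6000

def is_before_attesting_interval(time_into_slot_ms: int) -> bool:
    # Proposer boost only applies to blocks arriving before the attestation
    # deadline, expressed as a named constant rather than SECONDS_PER_SLOT // 2.
    return time_into_slot_ms < ATTESTATION_DUE_MS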

@arnetheduck (Contributor, Author) commented Jun 16, 2023

It can be seen that the majority of attestations are created by the 3s mark, and the tail past 4s is basically down to missed slots.

It can also be seen in the graph that we're approaching the point where the current 4s rule will push more and more slots over the limit, due to an ever-increasing amount of work the client has to do before it can send out an attestation (i.e. in your graph, the 3-4s range is not empty either). What's happening is that we're pushing back the "block arrived" time with additional validation (blob dispersal and validation); future versions will push this back further.

The "attest-early" rule only helps those blocks which arrive early and are easy to validate, not the worst-case blocks that nonetheless are permitted by the spec (ie up to MAX_BLOBS_PER_BLOCK blobs etc). It is good for the network that clients support it, but we also need to acknowledge that the "staggering" effect where the network takes time to observe a block as it's being disseminated through the network will further increase with Deneb.

Separately, I would like to point out that a significant jump in orphan rates started in April, at the point that MEV relays (unilaterally) took over the proposal broadcast responsibilities for MEV blocks.

I've zoomed in on the graph a little (the gap is an out-of-disk-space event on the collector):

[graph: reorg frequency, zoomed in]

I've previously assumed that capella was more strongly involved, but it is indeed the case that the overall increase predates capella (Apr 12) by a few days.

cc @michaelneuder

@mcdee (Contributor) commented Jun 16, 2023

It can also be seen in the graph that we're approaching the point where the current 4s rule will push more and more slots over the limit, due to an ever-increasing amount of work the client has to do before it can send out an attestation (i.e. in your graph, the 3-4s range is not empty either). What's happening is that we're pushing back the "block arrived" time with additional validation (blob dispersal and validation); future versions will push this back further.

I agree, so perhaps attestation times aren't the best way of measuring this. You mention the "block arrived" time, I think that this roughly equates to the emission of the "head" event on the event stream, which is after the block has been processed and validated. In which case, here is a graph showing the average time in the slot in which we receive the "head" event (specifically, we take the median arrival time of the "head" event for all beacon nodes in our mainnet environment per slot, and then average this per day to give a single value for each day):

[graph: daily average "head" event delay]

So yes, we are definitely seeing an increase in the time at which these events are emitted. That said, the average is below 2s at the time of writing (but then again, there are various other ways in which we could slice this data that may bring different conclusions as to the headroom we really have, and note that this methodology definitely misses the impact of blocks that never become head). One thing I'll try to do is to see if I can find the difference between the "block" and "head" events emitted, as that could give a better indicator of the actual processing time.

@mcdee (Contributor) commented Jul 19, 2023

Following on from the previous comment, here is a graph of the processing time for blocks.

Methodology here is that we gather the timestamps of nodes emitting the 'block' and 'head' events on the /events stream, and calculate the difference for each (slot, node). We then take the median value for each slot and average those values over each day, as per the previous graph (note that we do not include Teku in this data because it is currently providing block events after head events).
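
A small sketch of the aggregation described above, assuming a hypothetical records list of (day, slot, node, block_time, head_time) tuples; the sample data and pandas usage are purely illustrative:

import pandas as pd

# Hypothetical sample input: one entry per (slot, node) with event timestamps.
records = [
    ("2023-07-01", 7000000, "node-a", 1.20, 1.55),
    ("2023-07-01", 7000000, "node-b", 1.25, 1.70),
    ("2023-07-01", 7000001, "node-a", 1.10, 1.40),
]
df = pd.DataFrame(records, columns=["day", "slot", "node", "block_time", "head_time"])
df["processing"] = df["head_time"] - df["block_time"]          # per (slot, node) difference
per_slot = df.groupby(["day", "slot"])["processing"].median()  # median across nodes per slot
per_day = per_slot.groupby(level="day").mean()                 # daily average of the medians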

[graph: daily average block processing time]

Broadly similar shape to that of the head delay graph. Not as high an increase at the capella hard fork as I would have guessed prior to seeing the data, and I assume that the pull-back soon after was due to improvements in client code, but still definitely seeing an upwards trend. However, the actual number still seems pretty low in the grand scheme of things, and I wonder if there are more places to optimize before looking at changing the timings.

(And unrelated to the above, but if we do change the timings I wonder if we should stop trying to stick to second boundaries and break a slot into 128 increments, which would allow us finer-grained control in future if necessary.)

@mcdee (Contributor) commented Jul 19, 2023

Additional data on the graph, showing the 90th percentiles and 95th percentiles. 90th percentile is probably the most interesting, in that it suggests that the increase in number of validators has had less of an impact than originally thought in the "normal" case. The 95th percentile does show that the work done in worst case situations has been increasing a fair bit since the merge.

[graph: block processing time, 90th and 95th percentiles]

And the same graph but topping the Y axis out at 200ms:

[graph: same data, Y axis capped at 200ms]

@rolfyone (Contributor) commented Jul 19, 2023

Not all networks use 12-second slots, and in the past there have been curious timings on account of networks choosing odd slot times that aren't divisible by 3...

Are we better off just making ATTESTATION_DUE and AGGREGATION_DUE network parameters, rather than baking math into the spec for when they're due, so that the individual settings can be tweaked per network?

This would allow someone with an odd timing to tweak it for their network rather than get stuck on division issues, as has happened in the past with some clients using millisecond and some clients using second granularity...

you'd end up with something like

SECONDS_PER_SLOT: 4
ATTESTATION_DUE: 2
AGGREGATION_DUE: 2.5

or in mainnet

SECONDS_PER_SLOT: 12
ATTESTATION_DUE: 6
AGGREGATION_DUE: 9

Potentially could set a sync committee message due setting, or expect it to align with attestation duties...

or bake MS into the numbers so that we used something like

MILLISECONDS_PER_SLOT: 12000
ATTESTATION_DUE_MILLISECONDS: 6000
AGGREGATION_DUE_MILLISECONDS: 9000
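
As a sketch of how a client might turn millisecond parameters like these into absolute duty deadlines (names follow the hypothetical config above; the helper itself is illustrative):

MILLISECONDS_PER_SLOT = 12000
ATTESTATION_DUE_MILLISECONDS = 6000
AGGREGATION_DUE_MILLISECONDS = 9000

def duty_deadline_ms(genesis_time_ms: int, slot: int, due_ms: int) -> int:
    # Absolute wall-clock time (ms) at which the duty for `slot` is due.
    return genesis_time_ms + slot * MILLISECONDS_PER_SLOT + due_ms

# e.g. the attestation deadline for slot 100:
# duty_deadline_ms(genesis_time_ms, 100, ATTESTATION_DUE_MILLISECONDS)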

Anyway this isn't a fully formed idea but figured it was worth sharing given we're talking about changing this area...

I should say I agree with the stronger wording about attestation production...

@dapplion (Collaborator) commented Jul 20, 2023

SECONDS_PER_SLOT: 4
ATTESTATION_DUE: 2
AGGREGATION_DUE: 2.5

Agree, it makes more sense to specify it like this.

@mkalinin (Collaborator) commented:

SECONDS_PER_SLOT: 4
ATTESTATION_DUE: 2
AGGREGATION_DUE: 2.5

Agree, it makes more sense to specify it like this.

I am personally in favour of integer parameters; from my perspective, milliseconds should work better.

@rolfyone (Contributor) commented:

The other thing I wasn't sure of is whether we have something for specifying the cutoff for blocks, like the late block reorg PR...

If we're doing these, that could also be just another parameter, so it could be set at whatever value is decided, giving us a consistent way of referencing the drop-dead point in a slot for late blocks...

LATE_BLOCK_MILLISECONDS: 4000 

as an example

Also note potential for out-of-order messages
@arnetheduck (Contributor, Author) commented:

af72748 introduces constants and notes that within a slot, messages may arrive out-of-order and that clients should be prepared to handle this case.

Comment on lines +51 to +54
| `ATTESTATION_DUE_MS` | `6000` |
| `AGGREGATE_DUE_MS` | `9000` |
| `SYNC_MESSAGE_DUE_MS` | `6000` |
| `CONTRIBUTION_DUE_MS` | `9000` |
A Contributor commented:

Ah, good idea splitting sync committee messages and contributions into their own constants, I like it.

Another Contributor commented:

I think sync messages should be aligned with attestations. The logic is the same as for attestations: they should be created once the block has been fully validated and it is reasonable to assume that it has propagated.

@rolfyone (Contributor) commented:

af72748 introduces constants and notes that within a slot, messages may arrive out-of-order and that clients should be prepared to handle this case.

Do we need to specifically call out that ATTESTATION_DUE_MS must be strictly less than AGGREGATION_DUE_MS, and same for SYNC_MESSAGE_DUE_MS being less than CONTRIBUTION_DUE_MS? It seems obvious, but I think this is the possibility you might be referring to?

@arnetheduck (Contributor, Author) commented:

I think this is the possibility you might be referring to?

the possibility is on the receiving end - i.e. an attestation might propagate faster than a block, so you might receive them out of order. To handle such cases, clients must cache attestations for the given slot if they have not yet observed the block (and hold off on propagating them until they have observed the block). On the sending end, I don't think we need to point it out (it follows from the fact that you're aggregating attestations and not the other way around).
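
A minimal sketch of the receive-side caching described above (structure and names are illustrative, not taken from any client; process_and_forward stands in for the normal validation and gossip path):

from collections import defaultdict

pending_attestations = defaultdict(list)  # attestations seen before their block, keyed by block root

def on_attestation(att, known_blocks):
    root = att.data.beacon_block_root
    if root in known_blocks:
        process_and_forward(att)                # block already observed: handle normally
    else:
        pending_attestations[root].append(att)  # cache and hold back propagation

def on_block(block_root, known_blocks):
    known_blocks.add(block_root)
    for att in pending_attestations.pop(block_root, []):
        process_and_forward(att)                # replay attestations that arrived early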

@arnetheduck marked this pull request as ready for review on August 16, 2023.
@hwwhww (Contributor) commented Sep 15, 2023

Hey @arnetheduck, I wanted to fix the conflicts, but I don't have permission to push commits to status-im. 😅

Do teams want to experiment with it at dencun-devnet-9?

@hwwhww (Contributor) commented Sep 15, 2023

@arnetheduck To test it with minimal specs (SECONDS_PER_SLOT: 6), we should move these new constants to Configuration & update configs/minimal.yaml and configs/mainnet.yaml. I may open a new PR due to the permission issue above.

@arnetheduck (Contributor, Author) commented:

I don't have permission
fixed, I think ;)

@etan-status (Contributor) commented Nov 29, 2023

Seems like there are trends to voluntarily reduce the first slot interval by 1s down to 3s by proposing late intentionally:

That's sort of the.. opposite direction of this PR

@casparschwa (Member) commented Nov 29, 2023

Rebalancing duty times, ATTESTATION_DUE_MS specifically, without mitigating timing games (paper, tldr thread) should not be done imo. In short, shifting the attestation deadline from 4000ms to 6000ms would simply shift the rational block releasing strategy back by ~2000ms.

The current honest validator specifications ask block proposers to propose their block at the beginning of the slot, which is not rational to do. A rational block proposer builds their block later in the slot to capture more MEV. They only need to release the block early enough for attestors to see it in time such that they vote for it, effectively making the block canonical.

Seems like there are trends to voluntarily reduce the first slot interval by 1s down to 3s by proposing late intentionally

We have not seen these timing games played to their fullest extent either imo.

@potuz (Contributor) commented Nov 29, 2023

I agree with Caspar; before clients shipped the late block reorg feature, seeing blocks at 11 seconds was common. However, I would support increasing the attestation deadline if it's shown that 4 seconds is not enough for honest validators building locally at Deneb.

@arnetheduck (Contributor, Author) commented:

timing games ... should not be done imo.

These are two separate problems - timing games exist before and after this PR.

The PR just rebalances the timings to better reflect the reality of the relative weights of the actions involved in a slot - it addresses a "cost misalignment" in the spec, where creating and propagating a block of hundreds of kilobytes with hundreds of signatures, transactions and so on is assumed to have the same cost as sending a 200-byte attestation with a single signature in it.

Whether these timings are appropriate for everyone is a topic for separate discussion, and mitigating timing games does not involve picking one specific value for the timeouts over another - it requires a different mechanism entirely. As such, I'd keep timing-game discussions entirely out of this thread.

@etan-status (Contributor) commented:

Would a rational attester also delay the attestation to capture extra rewards for timely voting? They only need to release the attestation early enough for aggregators to see it in time. If they were to attest on time, they may miss out on rewards if a block is received late.

@potuz (Contributor) commented Nov 29, 2023

Would a rational attester also delay the attestation to capture extra rewards for timely voting? They only need to release the attestation early enough for aggregators to see it in time. If they were to attest on time, they may miss out on rewards if a block is received late.

No, rational attesters would attest as early as possible; there's no need to delay the attestation, and in any case there's no need to attest to anything that arrived after the deadline, since it is very likely to be reorged.

@etan-status (Contributor) commented:

It's not necessarily likely to be reorged, for example, if it was received on time but is still pending validation at the 4s mark, e.g. on a low-resource system.

@casparschwa (Member) commented Nov 29, 2023

Rational attestors vote as soon as possible, but might delay their attestation deadline still:

At the limit, a rational proposer knows their block needs to receive only 40% of the committee's attestation votes (proposer boost). If they were to maximally delay their block proposal, they would target a split in the committee such that 40% of the committee hears the block before the attestation deadline (and so votes for it), and 60% does not hear the block before the attestation deadline (and so votes for the parent).

An attestor wants to get their head vote correct and thus wants to make sure they are not part of the 60%. They can achieve this by delaying their attestation slightly (while making sure they propagate it in time for aggregation - timing games all over). This could allow block proposers to delay their blocks even further, incentivizing attestors to delay their attestation deadline further... This could spiral towards the end of the slot.

Attestation committees are large enough that targeting splits should be feasible. Practically, a proposer would add safety margins around those timings, but in the extreme this is where things could be headed...

@etan-status (Contributor) commented:

If, say, the attestation deadline moved to 6s, could the timing games be limited by... simply stopping to gossip blocks 3s into the slot? As in, if late blocks are reorged anyway, does it make sense to still gossip them when they're already late?

@casparschwa (Member) commented:

So you make the slot longer, from 4s to 6s, but effectively the slot is now shorter in terms of allowed block propagation time (3s)? I don't follow.

@etan-status (Contributor) commented:

Right now, the timings are:

  • 1s waste to capture MEV
  • 3s block propagation, processing, attesting
  • 4s aggregation
  • 4s sending aggregates

the PR here proposes to change the deadline timings to 0/6/9, effectively promoting this behaviour:

  • 3s waste to capture MEV
  • 3s block propagation, processing, attesting
  • 3s aggregation
  • 3s sending aggregates

If 3s is enough for block propagation, what could be done is:

  • 3s block propagation (after this, disable blocks gossip)
  • 3s block processing, attesting
  • 3s aggregation
  • 3s sending aggregates

@etan-status (Contributor) commented:

As in, if you run a low-resource system, you would be guaranteed to have at least 3s to process the block and determine a validity verdict. That could not be reduced by MEV games.

@mcdee (Contributor) commented Nov 29, 2023

If 3s is enough for block propagation...

If it really is enough for the whole block creation/signing/propagation process, then there should be no need to change the timings. But as @arnetheduck says in the OP, this is increasingly not going to be the case, especially as blocks become larger.

I'd be against stopping block gossip at an arbitrary time within the first period; it would likely push people towards higher-end hardware and faster 'net connections, and both of those would go against home staking.

@rolfyone (Contributor) commented:

I'd be against stopping block gossip at an arbitrary time within the first period; it would likely push people towards higher-end hardware and faster 'net connections, and both of those would go against home staking.

The advantage of ignoring blocks after a deadline (maybe not stopping gossip altogether, but ignoring them) would be that the people pushing later and later blocks then know the drop-dead time... but you could bet they'd be getting as close to that line as possible given the financial advantages... I'm not sure what the answer is...

@potuz (Contributor) commented Nov 30, 2023

Stopping gossip before the attestation deadline is simply a bad idea: any late block that becomes canonical will have to be imported by RPC, and any split view caused by delivery timing is exacerbated, since clients that did not get the block on time will have to wait much longer to get it, possibly across several hops if their peers don't have it. It risks network partitioning for no reason.

@ppopth (Member) commented Nov 2, 2024

Any chance we can push this further to make it real? 0/6/9 is far better than 0/4/8. Did I miss anything? Did we reach a consensus to keep it at 0/4/8?

@mcdee (Contributor) commented Nov 2, 2024

I don't think that we saw much consensus to change these values. Most of the agreement was around moving to absolute values rather than fractions of the slot time in the spec, which is reasonable but doesn't involve changes in the practical operation of the chain.
