Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Withdrawals into queue in single function #3042

Closed
wants to merge 3 commits into from
Closed

Conversation

djrtwo
Copy link
Contributor

@djrtwo djrtwo commented Oct 16, 2022

Rework of @rolfyone's PR (#3032)

There were some minor bugs introduced in the last one. I think I cleaned it up but have not reworked tests.
I'd prefer to only do so once we have consensus on the general feature

Original comment:

Partially to initiate a discussion.

Basically use a single pipeline for withdrawal processing to simplify the flow and reduce the need for full validator scans during epoch processing.

This should make the feature much more testable amongst other benefits, and makes withdrawal of partial or full amounts an equal priority based on your position in the validator list.

The timing remains approximately equal to previous, except there may have previously been 4 exits per epoch to add that are now just done during sweeping, where previously they were done as a priority

Danny's comments:

Note, that this does have some shifts in "fairness" wrt when you become withdrawable and when you actually are withdrawn. It used to be immediate but not it could be quite a long time between when you become withdrawable and when withdrawn (on the order of a week at current vset size). Additioanlly, some people might become withdrawable sooner than others (thus exited sooner in some cases) but might then be later in the sweep.

I like the previous iteration a bit better for a few reasons:

  1. The two mechanisms were isolated so were easier to test on the spec side. That said reworking tests shouldn't be too hard
  2. The timing and fairness thing. In the prior mechanism, once you become withdrawable you get to get all of your ETH out immediately rather than waiting for what might be a very long partial withdrawals sweep. I understand the complexity of doing a full validator sweep to find the withdrawable folks but note that it's not like those that are withdrawable can change quickly, you can easily have the ones to check precomputed.

Re (2). If this is a major engineering complexity/hurdle, then I'm okay with the changes. But it does seem a bit weird to prioritize partial withdrawals (which usually all the set are ready for) over the much smaller (and more important) set of full withdrawals.

@djrtwo djrtwo mentioned this pull request Oct 16, 2022
@mcdee
Copy link
Contributor

mcdee commented Oct 16, 2022

Like the simplicity of this. the downside is that full withdrawals won't necessarily be processed immediately the validator is withdrawable, but I don't see that being a major issue.

I do want to go back to MAX_WITHDRAWALS_PER_EPOCH, however. The current value of 256 will work through the entire current validator active set in around 7 days. Looking at current rewards, an average validator would expect to earn around 0.025 ETH in that 7 day period, That seems like quite a small number as the "unit of value" to be going to the effort of transferring across chains.

The other related, but somewhat more nebulous, concern I have is that I can envisage smart contracts wanting to obtain proof on-chain that a withdrawal has come from a certain validator. Providing and verifying a proof for a 0.025 ETH amount could end up eating a fair chunk of the value of the withdrawal.

If MAX_WITHDRAWALS_PER_EPOCH was reduced to 64 it would still get through the current active set in a month, and would be skimming values of around 0.1 ETH for active validators. It would reduce the number of withdrawal operations on-chain, and would results in proofs taking up significantly less ETH as a percentage of the transfer. Is there a reason outside of responsiveness (which I do understand) to keep this at 256?

@ethereum ethereum deleted a comment from Where-2 Oct 16, 2022
@potuz
Copy link
Contributor

potuz commented Oct 16, 2022

I'll leave the same comment here as in Paul's PR, an option is to process full withrawals in the same loop as effective balance updates. That may be equivalent to the original approach without the extra loop

@dapplion
Copy link
Collaborator

dapplion commented Oct 17, 2022

At least for Lodestar, (on a preliminary look, not implemented yet) the previous approach won't require a full sweep of the validator set. We do all sweeps in epoch processing in one go, essentially merging multiple spec functions into one.

@potuz
Copy link
Contributor

potuz commented Oct 17, 2022

At least for Lodestar, (on a preliminary look, not implemented yet) the previous approach won't require a full sweep of the validator set. We do all sweeps in epoch processing in one go, essentially merging multiple spec functions into one.

In order to do this you need to prove that doing it that way is equivalent to the specified version which explicitly adds a loop after balances changes. It needs to be checked that these processes are independent

@dapplion
Copy link
Collaborator

dapplion commented Oct 17, 2022

At least for Lodestar, (on a preliminary look, not implemented yet) the previous approach won't require a full sweep of the validator set. We do all sweeps in epoch processing in one go, essentially merging multiple spec functions into one.

In order to do this you need to prove that doing it that way is equivalent to the specified version which explicitly adds a loop after balances changes. It needs to be checked that these processes are independent

Right. Then you can collect an initial list of fully withdraw-able validators at the beginning of epoch processing, and then re-check that balance > 0. The diff between those lists will be none in most cases. I can only see cases like a validator being slashed for all its balance causing a diff there

@rolfyone
Copy link
Contributor

To me, I don't see why full withdrawals should be prioritised, although there are definite periods where you may exit second and get processed before an earlier full exit.

Full withdrawals getting bumped to the top of the queue immediately does basically unfairly advantage them compared to active contributing members of the network, I'm not sure why the prioritisation call was made in the initial instance...
Processing them during updated balance calculation is probably ok, but the spec calling out that it's right at the end and the code being completely at odds does seem to be a little odd (if balance calculation also exits validators)...

The 'fairness' is not the reason I made the change though, it was purely reducing processing for the epoch transition.

@hwwhww
Copy link
Contributor

hwwhww commented Oct 17, 2022

I fixed the tests. Some notes:

  • Merged run_process_full_withdrawals and run_process_partial_withdrawals into run_process_withdrawals_into_queue
  • Since full and partial withdrawals are now in the same function, I removed randomize_state call from run_random_partial_withdrawals_test because random_state might also trigger the full withdrawal cases and lead to unexpected partial withdrawal num.

@hwwhww hwwhww added the Capella label Oct 17, 2022
@ethereum ethereum deleted a comment Oct 18, 2022
Copy link
Contributor

@hwwhww hwwhww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new queue solution looks good to me.

@potuz: I'll leave the same comment here as in Paul's PR, an option is to process full withrawals in the same loop as effective balance updates. That may be equivalent to the original approach without the extra loop

I think the drawback is that the optimized solution is less easy to test and debug the new withdrawal feature?

To all CL clients: we are going to cut a new spec release soon and are likely to include this PR. Let me know if there is any strong objection 🙏

Copy link
Contributor Author

@djrtwo djrtwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great job!

@with_capella_and_later
@spec_state_test
def test_random_5(spec, state):
yield from run_random_partial_withdrawals_test(spec, state, Random(5))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want a randomizer that does both but the full randomized block tests do as well

@dapplion
Copy link
Collaborator

IMO opinion this PR does not add any tangible benefits in performance, while slowing full withdrawals for no reason. Partial withdrawals are processed "slowly" because we have to. Full withdrawals are processed "faster" because we can. It's purely technical, no fairness argument influences those facts.

@rkapka
Copy link
Contributor

rkapka commented Oct 18, 2022

I think we should benchmark this. How long does it realistically take to loop through the validator set? Isn't this premature optimization?

@ajsutton
Copy link
Contributor

We know in teku that looping through the full set of validators is expensive and it makes any algorithm O(n) in the number of validators. So yes, we would definitely have to use an optimised version of this.

The argument for having a separate loop for full vs partial withdrawals was to make testing simpler, but that isn't the case once we need to add a cache for which validators have become withdrawn since tests would need to cover that caching. In other words, a separate loop actually expands the scope of testing required rather than reducing it since we now need to cover every way a validator may become withdrawable.

Besides which, we aim for simplicity in the way consensus works and having a single approach to withdrawal is simpler than having two different approaches depending on why the validator has balance that could be withdrawn.

@hwwhww
Copy link
Contributor

hwwhww commented Oct 20, 2022

I tried to make a summary table:

Solution Pros Cons
Separate partial func and full func prioritize full withdrawal? (but most devs don't think it's benefit) [@ajsutton] we're all saying we'll optimise it in clients, but then the tests for it will be written based on the simple implementation in the spec and won't provide coverage for the optimisations actual clients would need to do. Given Danny's argument for the separatation of withdrawals was to make it simpler to test, the fact it makes the tests less effective seems to be shooting ourselves in the foot.
#3042 simplest [@dapplion] I don't find justifiable to add delays on withdrawals for the rationale of the PR. There are optimizations possible that achieve the same end without worsen the validator UX
[@etan-status] Downside is that it adds this arbitrary delay of 3-4 days on avg (but more in future) for trying to get out, during which no rewards are earned anymore. This means that exits need to be scheduled as late as possible into a withdrawal interval for a specific validator to not miss out on those rewards. This can be quite tricky because of the exit queue. Tracking the exited-but-not-yet-withdrawn validators separately is straight forward as an optimization. But please provide tests for edge case such as depositing into a validator that has already exited or fully withdrawn to re-enter the priority queue accordingly.
#3042 + loop in effective balance updates (@potuz's suggestion) most efficient
probably closer to client's optimization?
[hww] more difficult to debug and test (but only in specs?)

My impression:

  1. Clients will optimize the loop anyway, so the spec is less likely to be as same as the production implementation.
  2. If (1) is true, the epoch processing tests (process_withdrawals_into_queue) may not be applied on the client side anyway.
  3. Most devs like the simplicity of Withdrawals into queue in single function #3042 on the spec side.
  4. Withdrawals into queue in single function #3042 may introduce some side effects, i.e., the delay. It needs to be tested properly if we go with Withdrawals into queue in single function #3042.

My read is that it’s slightly toward 👍…? 😬

@ralexstokes
Copy link
Member

There are good arguments on both sides here and I'll chime in with another view against this change :)

I think we should prioritize spec simplicity above all else, weighted against speed to ship the feature to mainnet.

There will always be things we need to do to optimize implementations in clients so I don't think this is a big ask for client teams. Caches etc have been a large source of bugs in the past and to mitigate the concern we develop/maintain a large corpus of consensus test cases. And the more the spec itself is optimized for the production/maintenance of these test cases, the better mitigation it provides against bugs due to implementation complexity.

And alongside this, I want to get to a place where we stop changing the capella spec as soon as possible so there is something stable for client teams to target by end of the year. And the best way to get stability is to stop changing the spec, i.e. do not make this change.

@potuz
Copy link
Contributor

potuz commented Oct 28, 2022

One big difference that this PR introduces is that the withdrawal queue is much smaller. This is the first time we have a queue in the specs and I am worried about memory fragmentation. I started implementing this in Go and all implementations we are evaluating either leak memory or suffer from continuous reallocations. Either way, if this PR passes I think it would be good to put WITHRAWAL_QUEUE_LIMIT = MAX_WITHRAWALS_PER_EPOCH or some aggressive limit like this, effectively capping the queue size to a constant bound since anyway we add to this queue on epoch transition and we cannot empty it during the coming epoch.

Having the queue of constant size allows us to optimize this further by keeping a rolling index and adding on place. This would be application specific, but I imagine all languages will benefit from having a constant size rather than potentially a big chunk of the validator set.

@potuz
Copy link
Contributor

potuz commented Oct 28, 2022

Coming to think about this: if we go with rolling sweeps as in this PR, there's no need for the queue at all: knowing what are the next withdrawal indices determines uniquely the next withdrawals that a block can include, so we can just check against that and there's not even the need to keep the queue in the state it seems.

@michaelsproul
Copy link
Contributor

michaelsproul commented Oct 29, 2022

@potuz One of the nice things about the queue is that it fixes the balance of the withdrawal at the point where it was processed in process_epoch. One thing that concerns me about the queueless approach (and #3068) is that we have to deal with the complexity of balances changing after every block, meaning that whatever cache we use for withdrawals needs to be updated on every slot and needs to handle more frequent re-orgs, etc. It makes withdrawals feel a lot more like attestations, which we know are more complex to pack than e.g. voluntary exits which are epoch-based.

@etan-status
Copy link
Contributor

Balances can already change on each slot due to slashing.

@michaelsproul
Copy link
Contributor

@etan-status That doesn't effect the queued approach though, because the balance decrement for the withdrawal happens immediately on the epoch transition. If a validator that is getting partially withdrawn gets slashed before their withdrawal is processed then the slashing just applies to the remaining balance (which I think is fine).

@hwwhww
Copy link
Contributor

hwwhww commented Nov 11, 2022

closing due to #3068

@hwwhww hwwhww closed this Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.