
[AIP-1][Discussion] Proposer selection improvements within Consensus Protocol #9

Closed
igor-aptos opened this issue Dec 7, 2022 · 10 comments

igor-aptos commented Dec 7, 2022

AIP Discussion

Discussion and feedback thread for AIP.

Link to AIP: https://github.com/aptos-foundation/AIPs/blob/main/aips/aip-1.md

Summary

This change brings two simple improvements to proposer selection:

  • it looks at more recent voting history, making the system react faster
  • it makes proposer selection much less predictable, reducing the attack surface for malicious actors

Background

In the Aptos blockchain, progress is organized into rounds. In each round, a new block is proposed and voted on. There is a special proposer role, deterministically decided for each round, that is responsible for collecting votes from the previous round and proposing a new block for the current round. The goals of proposer selection (deciding which node should be the proposer in a round) are:

  • be fair to all nodes - both so that all nodes are asked to do their fair share of work, and so that they can get their fair share of rewards (in combination with the staking rewards logic). Fair share means proportional to their stake.
  • prefer nodes that are operating correctly, as round failures increase commit latency and reduce throughput

Proposer selection is currently done via the ProposerAndVoter LeaderReputation algorithm. It looks at past history, in one window for proposer history and a smaller window for voting history. A reputation_weight is then chosen for each node:

  • if the node's round failure rate as proposer within the proposer window is strictly above a threshold, use failed_weight (currently 1)
  • otherwise, if the node had no proposal rounds and no successful votes, use inactive_weight (currently 10)
  • otherwise, use the default active_weight (currently 1000)

Then reputation_weight is scaled by staked_amount, and the next proposer is pseudo-randomly selected given those weights, as in the sketch below.
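To make the weighting concrete, here is a minimal Rust sketch of the logic described above. The names (NodeHistory, reputation_weight, pick_proposer), the failure-rate threshold value, and the modulo-based weighted pick are illustrative assumptions, not the actual aptos-core implementation:

```rust
/// Per-node history gathered from the proposer and voter windows.
struct NodeHistory {
    proposal_rounds: u64,        // rounds in which this node was the proposer
    failed_proposal_rounds: u64, // of those, how many failed
    successful_votes: u64,       // successful votes within the voter window
    staked_amount: u128,
}

const ACTIVE_WEIGHT: u128 = 1000;
const INACTIVE_WEIGHT: u128 = 10;
const FAILED_WEIGHT: u128 = 1;
const FAILURE_THRESHOLD: f64 = 0.1; // assumed value; the AIP only says "above threshold"

fn reputation_weight(h: &NodeHistory) -> u128 {
    let failure_rate = if h.proposal_rounds == 0 {
        0.0
    } else {
        h.failed_proposal_rounds as f64 / h.proposal_rounds as f64
    };
    if failure_rate > FAILURE_THRESHOLD {
        FAILED_WEIGHT
    } else if h.proposal_rounds == 0 && h.successful_votes == 0 {
        INACTIVE_WEIGHT
    } else {
        ACTIVE_WEIGHT
    }
}

/// Pick the next proposer pseudo-randomly, with chances proportional to
/// reputation_weight * staked_amount. `seed` must be derived identically
/// on every node so that all nodes agree on the choice.
fn pick_proposer(nodes: &[NodeHistory], seed: u64) -> usize {
    let weights: Vec<u128> = nodes
        .iter()
        .map(|h| reputation_weight(h) * h.staked_amount)
        .collect();
    let total: u128 = weights.iter().sum();
    assert!(total > 0, "validators always have stake");
    let mut point = (seed as u128) % total; // crude; real code would use a PRNG
    for (i, w) in weights.iter().enumerate() {
        if point < *w {
            return i;
        }
        point -= *w;
    }
    unreachable!("point always falls within the total weight")
}
```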

Window sizes are chosen to be large enough to provide enough signal to be reasonably stable, yet not too large, so that the system can adapt to changes quickly. Every block gives us a proposer signal for only a single node, but a voting signal for two-thirds of the nodes. That means the proposer window needs to be larger, while the voting window can be kept shorter.

Motivation and Specification

This proposal upgrades the ProposerAndVoter selection algorithm to ProposerAndVoterV2. The new algorithm makes two changes to the logic:

  • voter history window
    • For historical node performance, we look at proposals within the (round - 10*num_validators - 20, round - 20) window. For voters, we currently look at (round - 10*num_validators - 20, round - 9*num_validators - 20). We ignore the last 20 rounds because history is computed from committed information, and consensus and commit are decoupled, so there can be a few rounds of delay between them. Beyond that lag, though, the current voter window is unnecessarily stale. With the new change, we will look at the (round - num_validators - 20, round - 20) range for voters (see the sketch after this list).
    • The main effect of this change is that nodes that are joining the validator set, or were offline/lagging for a while and have just caught up, will have a significantly shorter delay before being treated as active and becoming selectable as proposer.
  • seed for pseudo-random selection
    • Currently the seed used for pseudo-random selection is the tuple (epoch, round). This makes every round an independent random choice, but also makes it predictable: it is relatively easy to figure out who the selected proposers will be for future rounds, which gives malicious actors easier ways to attack/exploit the network. There are various known ways in which predictable leader election simplifies attacks - denial-of-service can be achieved more easily by attacking only the leaders, front-running of transactions is easier if the leader is known in advance, etc. With the new change, the seed becomes (root_hash, epoch, round), making it much less predictable.
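Here is a minimal Rust sketch of both changes. The names (Windows, windows_v1/windows_v2, seed_v1/seed_v2) are hypothetical; the ranges follow the formulas above, with saturating_sub guarding the early rounds:

```rust
/// A (start, end) range of rounds, per the formulas above.
struct Windows {
    proposer: (u64, u64),
    voter: (u64, u64),
}

/// Old (ProposerAndVoter) placement: the voter window sits at the far,
/// stale end of the proposer window.
fn windows_v1(round: u64, num_validators: u64) -> Windows {
    Windows {
        proposer: (
            round.saturating_sub(10 * num_validators + 20),
            round.saturating_sub(20),
        ),
        voter: (
            round.saturating_sub(10 * num_validators + 20),
            round.saturating_sub(9 * num_validators + 20),
        ),
    }
}

/// New (ProposerAndVoterV2) placement: the voter window is moved right
/// behind the 20-round commit lag, so recently (re)activated nodes are
/// noticed quickly.
fn windows_v2(round: u64, num_validators: u64) -> Windows {
    Windows {
        proposer: (
            round.saturating_sub(10 * num_validators + 20),
            round.saturating_sub(20),
        ),
        voter: (
            round.saturating_sub(num_validators + 20),
            round.saturating_sub(20),
        ),
    }
}

/// Old seed: fully predictable arbitrarily far into the future.
fn seed_v1(epoch: u64, round: u64) -> (u64, u64) {
    (epoch, round)
}

/// New seed: includes the parent state root hash, which is unknown until
/// the previous block commits, so upcoming proposers are only known for a
/// short window.
fn seed_v2(root_hash: [u8; 32], epoch: u64, round: u64) -> ([u8; 32], u64, u64) {
    (root_hash, epoch, round)
}
```

For example, with 100 validators at round 10,000, the old voter window is (8980, 9080) while the new one is (9880, 9980).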

Reference Implementation

aptos-labs/aptos-core#4253
aptos-labs/aptos-core#4973

Risks and Drawbacks

Future Potential

Suggested implementation timeline

The above PRs have been committed, are being tested, and are being prepared for release to mainnet.
To enable the change, an additional governance proposal for the consensus on-chain config needs to be executed. An E2E smoke test has landed as well, to confirm the governance proposal can be executed smoothly.

The change has been running on devnet for more than a week, though devnet is limited and only Aptos Labs runs validators there, so the change has not been stress-tested.
We will test it out on testnet in a week or two.
If no further changes are needed, the proposal is planned to be created and sent for voting by the end of December.

@igor-aptos

AIP PR - #7

@sherry-x changed the title from "AIP: Proposer selection improvements within Consensus Protocol" to "[AIP-1][Discussion] Proposer selection improvements within Consensus Protocol" on Dec 7, 2022
@PolkachuIntern

That's a great proposal. It is better to reduce attack vectors early before we have to. We know from Solana that motivated "attackers" are willing to spam the network and take down proposal leaders in order to get a few JPEGs.

@dylanschultzie

Will this have any periphery effect of changing the behavior around validators that go offline without exiting the active set?

Right now I see that as a primary attack vector which this addresses indirectly, but I'm curious about known knock-on effects.

@artifactstaking

It is not clear exactly where the root_hash param is coming from. Is the root_hash generated by OS data on each individual validator host for its own use? Could the seed tuple be thought of as [local host derived data, epoch, round] ?

A follow up question if the above is true: Is the root_hash deterministic over time and OS config? For example, will this seed always be the same over time as long as the validator node config does not change? If so, it may be worth looking into a way to roll the root_hash seed so that it is changing at some rate, either constantly or a few times per day. There are many ways to do this that do not involve a system clock.

@kinrokinro

Looks like an awesome improvement which should have been added a bit earlier, but maybe we can add some additional hashes/data to the seed?

@igor-aptos

For the root_hash, I should've been clearer: it is a hash (Merkle tree root hash) of the on-chain state (i.e. all accounts and their resources/data).
The root hash of the state after block x is unknown before block x is executed and committed, and is known and deterministic afterwards. For proposer selection purposes, in the current protocol:

  • it needs to be deterministic, so that all nodes agree on who the leader is
  • it needs to be known before the proposer proposes a block, as the previous round's votes are collected by the next round's proposer. I.e. the proposer builds a chain on top of a previous block, with a quorum certificate for that previous block.

So previously, you could compute proposers for rounds far in the future, whereas after this change, the window in which the next proposer is known is short - up to a few seconds.
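To illustrate the determinism requirement, here is a minimal std-only Rust sketch that folds the (root_hash, epoch, round) tuple into a single integer seed. The derive_seed name and the FNV-1a hash are illustrative assumptions, not the hashing scheme used by aptos-core:

```rust
/// Fold (root_hash, epoch, round) into one deterministic u64 seed using
/// FNV-1a (illustrative only). Every node computes the same value once the
/// previous block, and therefore root_hash, has committed.
fn derive_seed(root_hash: &[u8; 32], epoch: u64, round: u64) -> u64 {
    const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    let epoch_bytes = epoch.to_le_bytes();
    let round_bytes = round.to_le_bytes();
    let mut hash = FNV_OFFSET;
    for &b in root_hash.iter().chain(&epoch_bytes).chain(&round_bytes) {
        hash ^= b as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}
```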
We are thinking about stronger sources of randomness, and can improve on this once those are implemented.

@igor-aptos

Will this have any periphery effect of changing the behavior around validators that go offline without exiting the active set?
Right now I see that as a primary attack vector which this addresses indirectly, but I'm curious about known knock-on effects.

Voting history is only used for determining whether a node is active or not, and it is combined with the longer proposer history to make that determination. For active nodes, the proposal success/failure rate is the primary signal for deciding the chances of selection.

That means the above change mostly affects things when a node is joining the network, and doesn't affect much when it is leaving (as the longer proposal window will keep treating it as active for longer). But that is OK: after a failure or a few, the node will be selected less often, and that signal is generally incorporated quickly.

@sirouk

sirouk commented Dec 12, 2022

Would it be possible to chart the difference between current vs. proposed proposer/voter algos?

@mathiyarasu65

This proposal looks good to me.
