
[Fix] Increase the inbound queue capacity (that leads into the memory pool) #3124

Merged
merged 6 commits into mainnet-staging from increase-cache-size on Mar 12, 2024

Conversation

raychu86
Contributor

@raychu86 raychu86 commented Feb 23, 2024

Motivation

This PR increases the capacity of the inbound queue (that leads into the memory pool) to the following:

  • Deployments: 10 -> 1024
  • Executions: 40 -> 1024
  • Solutions: 50 -> 1024

It also bounds the number of deployments the BFT processes at any given interval: at most 1 deployment moves from the inbound queue into the memory pool per interval.

Previously the capacity was too low, causing nodes to drop transactions during times of high load. These numbers are not final and should be tuned to support the expected network throughput.
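
For concreteness, a minimal sketch of the new bounds, using hypothetical constant names (the actual constants and values in snarkOS evolved during review; see the discussion below):

/// Hypothetical names; inbound queue capacities per transmission type.
const DEPLOYMENT_QUEUE_CAPACITY: usize = 1024; // previously 10
const EXECUTION_QUEUE_CAPACITY: usize = 1024; // previously 40
const SOLUTION_QUEUE_CAPACITY: usize = 1024; // previously 50
/// At most this many deployments move from the inbound queue into the memory pool per interval.
const MAX_DEPLOYMENTS_PER_INTERVAL: usize = 1;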

@raychu86 raychu86 requested review from kpandl and vicsn February 23, 2024 18:52
@apruden2008
Contributor

Don't we want the nodes to drop transactions during high load in the event of a flood? What is the "expected behavior" supposed to be in this case?

/// Percentage of mempool transactions capacity reserved for executions.
const CAPACITY_FOR_EXECUTIONS: usize = 80;
/// The capacity of the mempool reserved for deployments.
const CAPACITY_FOR_DEPLOYMENTS: usize = 1 << 8;
Contributor

@vicsn vicsn Feb 24, 2024

The use of this constant further below (line 302) is now broken.

To avoid missing it in the future, I suggest we introduce a new constant, e.g. DEPLOY_VERIFICATION_RATE. This constant would ensure that, of the transactions we are going to verify, at most DEPLOY_VERIFICATION_RATE percent are deployments, preventing DoS from bogus deployments.

In this PR a rate of 20 was approved, but we can in the future go up to 50 to enhance deployment throughput.
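
As a rough illustration of this suggestion (DEPLOY_VERIFICATION_RATE is the proposed name; the queue type and selection logic below are a hypothetical sketch, not the actual snarkOS implementation):

use std::collections::VecDeque;

/// Proposed: at most this percentage of each verification batch may be deployments.
const DEPLOY_VERIFICATION_RATE: usize = 20;

enum Transmission {
    Deployment,
    Execution,
    Solution,
}

/// Drain up to `batch_size` transmissions for verification, admitting at most
/// `batch_size * DEPLOY_VERIFICATION_RATE / 100` deployments. Excess deployments
/// are returned to the front of the queue for a later batch.
fn next_verification_batch(queue: &mut VecDeque<Transmission>, batch_size: usize) -> Vec<Transmission> {
    let deployment_cap = batch_size * DEPLOY_VERIFICATION_RATE / 100;
    let mut batch = Vec::with_capacity(batch_size);
    let mut deployments = 0;
    let mut deferred = Vec::new();

    while batch.len() < batch_size {
        let Some(tx) = queue.pop_front() else { break };
        let is_deployment = matches!(tx, Transmission::Deployment);
        if is_deployment && deployments >= deployment_cap {
            // Over the per-batch deployment budget: defer this deployment.
            deferred.push(tx);
        } else {
            if is_deployment {
                deployments += 1;
            }
            batch.push(tx);
        }
    }
    // Put deferred deployments back at the front, preserving their order.
    for tx in deferred.into_iter().rev() {
        queue.push_front(tx);
    }
    batch
}

fn main() {
    let mut queue = VecDeque::from(vec![
        Transmission::Deployment,
        Transmission::Deployment,
        Transmission::Deployment,
        Transmission::Execution,
        Transmission::Solution,
    ]);
    // With a batch size of 10, at most 10 * 20 / 100 = 2 deployments are admitted.
    let batch = next_verification_batch(&mut queue, 10);
    println!("verifying {} transmissions, {} deferred", batch.len(), queue.len());
}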

@vicsn
Contributor

vicsn commented Feb 24, 2024

Don't we want the nodes to drop transactions during high load in the event of a flood? What is the "expected behavior" supposed to be in this case?

@apruden2008 The expected behaviour is that we don't drop transactions too quickly, as this hurts usability. If the mempool had a hypothetical size of 1, almost all transactions would be dropped, rendering the chain unusable. Increasing the mempool size does not hurt performance because transactions are just waiting in a queue; the only limiting factor is how much memory the nodes have.

Deployments: 10 -> 256

@raychu86 If we assume 300 MB apiece (as noted here), we will run out of memory with 256 deployments. It seems plausible that programs can be that big; I trivially made one of 50 MB. A maximum of 50 deployments seems prudent. Moreover, in contrast to the execution queue, the deployment queue can more easily be filled with bogus transactions, so a big queue can also hurt usability.
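
(For scale: 256 × 300 MB ≈ 75 GB of queued data in the worst case, versus 50 × 300 MB ≈ 15 GB; even at the 50 MB size actually demonstrated, 256 deployments is roughly 12.8 GB.)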

@apruden2008
Contributor

In my view, we need to prioritize liveness over usability for now. If each program can be 300 MB, then even 10 deployments is 3 GB of not only memory but also bandwidth. Plus, I'm assuming all the circuits need to be synthesized.

Given the prior issues we had with consensus and syncing, it feels like increasing this parameter only opens up a DoS vector, and demand for transactions is clearly nowhere near this level based on data from existing explorers.

I would argue for putting this down to the lowest possible setting per the acceptance criteria for launch. We can raise it later as chain usage grows.

@vicsn
Contributor

vicsn commented Feb 27, 2024

In my view, we need to prioritize liveness over usability for now. If each program can be 300 MB, then even 10 deployments is 3 GB of not only memory but also bandwidth. Plus, I'm assuming all the circuits need to be synthesized.

Just so we're on the same page: increasing mempool capacity consumes neither more bandwidth nor more (synthesis) processing time.

But it does use memory and your other concerns are valid.

@raychu86
Contributor Author

@apruden2008 I've noted your concerns and want to make sure we are all on the same page. As Victor has mentioned, the mempool capacity is just in-memory storage of transactions that we have yet to verify; they hold no resources other than space in memory.

Memory capacity is usually not the limiting factor of node operation, as we typically see memory usage hover in the single-digit percentages. The verification and processing of these transactions is the more important metric, which is why we bound the number of deployments we can process in any given round to reduce CPU usage.

In any case, to further address these concerns, I've lowered the bounds for deployments to 32 deployments in the mempool and a verification/processing rate of 15% (which is at most 7 deployments per certificate).
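
(With 50 transmissions per batch proposal, 15% works out to 50 × 0.15 = 7.5, i.e. at most 7 deployments per certificate.)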

@howardwu
Contributor

howardwu commented Mar 2, 2024

I just read this thread and there appears to be a game of telephone happening here.

Programs cannot be 300MBs, not even close.

@vicsn can you provide clear context as to how we arrived at this number?

@howardwu
Contributor

howardwu commented Mar 2, 2024

I just read this thread and there appears to be a game of telephone happening here.

Programs cannot be 300MBs, not even close.

@vicsn can you provide clear context as to how we arrived at this number?

Discussed offline. The 300 MB figure is a network message size limit; it is unrelated to programs. A dummy program was written that reached roughly 50 MB by leveraging struct construction.

Following this discussion, the decision was made to reduce the number of structs that can be defined in a program, and to bound the other constructs of a program so that its size is inherently restricted.

@vicsn and @d0cd will be responsible for implementing a PR to contain this.

@raychu86 raychu86 changed the title Increase the mempool capacity Increase the inbound queue capacity (that leads into the memory pool) Mar 4, 2024
@howardwu
Contributor

howardwu commented Mar 4, 2024

Limiting the queue to 16 deployments is reasonable to mitigate potential DoS vectors (using malformed deployments), and restricting each interval to 1 deployment limits what the memory pool introduces into consensus. We will want to wait for the PR with program size restrictions, which will further reduce the networking overhead of deployments.

Note: this PR does not change the memory pool limits (which remain gated at 50 transmissions per batch proposal). Rather, this PR limits the inbound queue, which leads into the memory pool (recall Narwhal is the memory pool by definition).
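
To make the two limits concrete, a hypothetical sketch (the names, the eviction behaviour, and the plumbing are illustrative, not the snarkOS implementation; the LRU policy of the inbound queue is discussed further below):

use std::collections::VecDeque;

/// Capacity of the inbound queue, which this PR raises.
const INBOUND_QUEUE_CAPACITY: usize = 1 << 10; // 1024
/// Existing memory pool gate: transmissions per batch proposal (unchanged by this PR).
const MAX_TRANSMISSIONS_PER_BATCH: usize = 50;

struct Transmission;

/// Stage 1: buffer unconfirmed transmissions as they arrive from peers.
/// When the queue is full, the oldest entry is evicted to make room.
fn enqueue(queue: &mut VecDeque<Transmission>, tx: Transmission) {
    if queue.len() >= INBOUND_QUEUE_CAPACITY {
        let _ = queue.pop_front();
    }
    queue.push_back(tx);
}

/// Stage 2: the memory pool (Narwhal) pulls at most 50 transmissions per batch proposal.
fn drain_for_proposal(queue: &mut VecDeque<Transmission>) -> Vec<Transmission> {
    let take = queue.len().min(MAX_TRANSMISSIONS_PER_BATCH);
    queue.drain(..take).collect()
}

fn main() {
    let mut queue = VecDeque::new();
    for _ in 0..60 {
        enqueue(&mut queue, Transmission);
    }
    let queued = queue.len();
    let proposal = drain_for_proposal(&mut queue);
    println!("queued: {queued}, pulled into next proposal: {}", proposal.len());
}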

@vicsn
Contributor

vicsn commented Mar 11, 2024

With https://github.com/AleoHQ/snarkOS/issues/3141 merged, deployments are at most 100 KB. Given that we don't want to enable attackers to fill the queue with malformed deployments, @howardwu do you want to keep CAPACITY_FOR_DEPLOYMENTS limited to 16?

@howardwu
Contributor

howardwu commented Mar 12, 2024 via email

@vicsn
Contributor

vicsn commented Mar 12, 2024

At 16 deployments, we technically run the risk that a malicious actor could prevent program deployments by DoSing the network with a large quantity of new deployment transactions, thereby evicting the earlier deployment transactions, because we operate on an LRU policy in the inbound queue. As such, it is advisable for us to raise this limit as well. At 100 KB, we may want to consider a limit of 1024. Is there an alternative value you might propose?

After additional thought, I agree 1024 is the way to go. Pushed the adjustment: 12cca35

In the scenario where a validator is compromised, whatever the queue length, it is very susceptible to attacks. But in the scenario where no validator is compromised, 1024 is strictly better.
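
(Back-of-the-envelope: with deployments capped at 100 KB, a full queue of 1024 deployments occupies at most about 1024 × 100 KB ≈ 100 MB of memory, whereas at a capacity of 16 it would take only 16 bogus deployments to evict every pending legitimate one.)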

@howardwu howardwu merged commit f8d3c90 into mainnet-staging Mar 12, 2024
23 of 24 checks passed
@howardwu howardwu deleted the increase-cache-size branch March 12, 2024 23:32
@howardwu howardwu changed the title Increase the inbound queue capacity (that leads into the memory pool) [Fix] Increase the inbound queue capacity (that leads into the memory pool) Mar 26, 2024