Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug][Aptos Framework][API][Transaction][VM] Multisig v2 FIFO invariant facilitates DoS attacks #8411

Closed
alnoki opened this issue May 28, 2023 · 21 comments · Fixed by #12241
Closed
Labels
bug Something isn't working move move-framework Issues related to the Framework modules/libraries stale-exempt Prevents issues from being automatically marked and closed as stale

Comments

@alnoki
Copy link
Contributor

alnoki commented May 28, 2023

@banool @chen-robert @davidiw @JackyWYX @jjleng @LeviHHH @lightmark @movekevin @wrwg @zekun000

Executive summary

Presently, the Multisig v2 paradigm added in #5894 enforces a first-in-first-out (FIFO) queue for pending multisig transaction proposals: on-chain multisig transaction proposals must be either executed or rejected in order of submission (sequence number).

This invariant facilitates a denial-of-service (DoS) attack vector whereby a malicious signatory, or an attacker who has access to the compromised private key of an honest signatory, can flood the transaction queue with malicious transactions and essentially gridlock multisig activity. To mitigate this attack vector, a flushing operation is proposed, such that an approved flush transaction can be executed out-of-order and reset the sequence number flow.

Example attack

Ace, Bee, Cad, Dee, and Edd are owners of a 3-of-5 multisig v2 account that controls:

  1. A CoinStore with $1 M of Coin<AptosCoin> (APT)
  2. A protocol package with $10 M total value locked (TVL)
  3. A protocol coin package (PRO) with a market cap of $100 M.

The multisig has executed fifteen legitimate transactions and has no pending transactions:

  • aptos_framework::multisig_account::MultisigAccount.last_executed_sequence_number = 15
  • aptos_framework::multisig_account::MultisigAccount.next_sequence_number = 16

Each signatory has only $50 k worth of APT.

An attacker compromises Edd's private key and uses it to rotate the authentication key on Edd's aptos_framework::account::Account, sends in an additional $200 k of APT to 0xedd..., opens up a short position on PRO, then prepares to DoS:

Owner Status APT holdings
0xace Honest $50 k
0xbee Honest $50 k
0xcad Honest $50 k
0xdee Honest $50 k
0xedd Attacker $250 k

The attacker then attempts to seize control of the entire multisig by sending the same transaction proposal ad infinitum:
remove all signatories from the multisig other than 0xedd. Minutes later Ace notices an attack is underway, but at this point the attacker has already submitted hundreds of thousands of transactions using a bot, exhausting tens of thousands of dollars of gas in the process. Ace quickly sends in a transaction to remove 0xedd from the account, and the attacker starts proposing a different transaction: send all APT to 0xedd. The FIFO queue is now as follows:

Sequence number Proposer Content
15 0xbee Legitimate governance transaction
16 0xedd Remove all signatories other than 0xedd
17 0xedd Remove all signatories other than 0xedd
.. 0xedd Remove all signatories other than 0xedd
300,000 0xedd Remove all signatories other than 0xedd
300,001 0xedd Remove all signatories other than 0xedd
300,002 0xace Remove 0xedd from the multisig
300,003 0xedd Transfer all APT to 0xedd
300,004 0xedd Transfer all APT to 0xedd
... 0xedd Transfer all APT to 0xedd

At this point, due to the FIFO invariant, the following steps are necessary just to prevent the attacker from proposing more APT transfers:

  1. All honest signatories must vote no on proposals 16 through 300,001
  2. At least one honest signatory must call aptos_framework::multisig_account:execute_rejected_transaction for each such rejection, thereby incurring additional gas costs
  3. All honest signatories must vote yes on proposal 300,002
  4. At least one honest signatory must invoke the VM-level execution on proposal 300,002

Note that the attacker is a well-financed savvy programmer who has:

  1. Already automated the above pre-meditated attack
  2. Calculated from transaction simulations that Ace, Bee, Cad, and Dee will collectively exhaust all of their APT by rejection 200,000
  3. Performed contact tracing on each honest signatory's account and has estimated with two-sigma (95%) likelihood that they collectively only have enough APT in linked on-chain accounts to cover rejections up through proposal 275,000

Even if the honest signatories can cobble together enough funds and computational resources to coordinate 300 k transaction rejections, they will still then have to deal with the backlog of APT transfer proposals starting at sequence number 300,003:
having exhausted their personal funds they appeal to the community for funding to cover the gas costs, and a bout of PRO panic selling ensues.

Having opened up a massive short position on PRO before the attack, the attacker easily re-coups all gas funds expended during the attack and profits handsomely.

Proposed mitigant

A flush transaction with the following Move function signature:

module aptos_framework::multisig_account {

    /// Flush all pending transactions and update owner schema.
    entry fun flush_queue_and_update_owner_schema(
        owner: &signer,
        multisig_account: address,
        owners_to_remove: vector<address>,
        owners_to_add: vector<address>,
        new_num_signatures_required: u64
    ) acquires MultisigAccount {
        // Increment last_executed_sequence_number by 1
        // Insert this transaction proposal's `aptos_framework::multisig_account::MultisigTransaction`
        //    at last_executed_sequence_number
        // Set next_sequence_number to last_executed_sequence_number + 1
        // Add `owners_to_add`
        // Set num_signatures_required to 1
        // Remove `owners_to_remove`
        // Set num_signatures_required to `new_num_signatures_required`
    }

}

This mitigant requires that malicious transactions in the queue be overwritten over time, rather than during the flush operation, because a flush operation resulting in 300 k table operations will exhaust per-transaction gas limits. Hence in this solution, the flush operation essentially overwrites the pointer to the head of the queue, such that transactions enqueued after the flush overwrite DoS proposals.

Note that this flush operation will require a violation of the FIFO invariant, likely through a new transaction type that incorporates a nested Multisig type defined at

/// A multisig transaction that allows an owner of a multisig account to execute a pre-approved
/// transaction as the multisig account.
#[derive(Clone, Debug, Hash, Eq, PartialEq, Serialize, Deserialize)]
pub struct Multisig {
pub multisig_address: AccountAddress,
// Transaction payload is optional if already stored on chain.
pub transaction_payload: Option<MultisigTransactionPayload>,
}

The proposed new transaction type is as follows:

/// A multisig transaction that allows an owner of a multisig account to execute a pre-approved
/// multisig flush transaction as the multisig account.
#[derive(Clone, Debug, Hash, Eq, PartialEq, Serialize, Deserialize)]
pub struct MultisigFlush {
    pub Multisig: Multisig,
    /// Sequence number of the flush transaction proposal
    pub sequence_number: u64,
}

The VM will also have to be updated to support this new transaction type, per the following:

  1. The VM needs to verify that the enclosed MultisigTransactionPayload corresponds to a aptos_framework::multisig_account::flush_queue_and_update_owner_schema invocation, for both full-payload and hash-only proposals, at the specified sequence number
  2. The VM needs to support invocations that violate the FIFO invariant
@alnoki alnoki added the bug Something isn't working label May 28, 2023
@movekevin
Copy link
Contributor

movekevin commented May 28, 2023

We actually thought through this during the design and sec audit. If not consider this a bug or dos exploit as this is consistent with how other multisig systems on other chains are designed. Technically you can reject the spammy txs right? Why add a flush operation that needs voting when you can reject?

We can introduce a batch reject tx if theres a worry of having to send many reject txs for the spammy proposals.

@alnoki
Copy link
Contributor Author

alnoki commented May 28, 2023

Why add a flush operation that needs voting when you can reject?

We can introduce a batch reject tx if theres a worry of having to send many reject txs for the spammy proposals.

@movekevin rejecting becomes prohibitively expensive in a DoS attack with many duplicate transactions (per above example attack), and if a batch reject has to write to thousands of transaction table entries then the gas costs could exceed the per-transaction gas limit, rendering the batch reject operation unusable. Hence the quasi-pointer overwrite I suggest above.

Your thoughts?

@movekevin
Copy link
Contributor

movekevin commented May 28, 2023

The attacker would need to pay gas as well and in fact more to create these txs. In my opinion, this is more of a convenience issue to mass reject txs. It's only a serious prob if attacker pays much less gas than mitigation.

Plus gas is currently pretty low anyway. Im not against introducing a flush operation if it is a pressing use case to mass reject txs though

@chen-robert
Copy link
Contributor

As additional context, creating the transaction should be more expensive than just casting a vote because the transaction proposer is responsible for setting up all transaction-related states, including creating a new storage slot.

We didn't end up profiling this during the security review, but I think that fact makes it probably harder to pull off in reality. Would be curious what the numbers look like, especially after the recent gas refactor.

@movekevin
Copy link
Contributor

movekevin commented May 28, 2023

There's also a plan to add support for executing (or rejecting) a tx in a single tx with the signatures from owners passed in as args (#8412). This would also reduce the cost of rejecting + executing (all can be in done in a single tx). We could even add a batch function for this which would allow rejecting multiple txs in a single tx, which would be roughly equivalent to the proposed flush function above

@alnoki
Copy link
Contributor Author

alnoki commented May 29, 2023

@movekevin @chen-robert indeed the attacker will incur the more expensive per-item global storage costs to open up a table entry for each bogus transaction, but any such gas costs may be rendered negligible depending on specific economic incentives: in the example I gave above, the opportunity to simultaneously short and cripple a successful protocol is well worth even tens of thousands of dollars of gas costs. In reality the relative amounts to DoS or to mitigate might differ from what I present above, but the determinant for whether an attack is profitable or not is ultimately the associated TVL/treasury/market cap to be exploited.

As for bulk rejection transactions or multi-agent transactions per #8412 , these approaches could potentially reduce the number of operations required to clear out a backlogged queue but they are still limited by per-transaction limits. Even in the case of a multi-agent bulk rejection (combining both theoretical approaches), there is an upper limit on the number of loopwise table operations and function calls that can be affected in a single transaction, due to per-transaction gas limits. Hence bulk rejections might only allow, say, 1000 proposals at a time, and if this is the case then honest signatories are faced with a significant coordination problem requiring a non-negligible level of technical sophistication (estimate max bulk rejection size, create a loop to make proposals, have all signatories operate in concert, etc.). Meanwhile the attacker is adding to the backlog (transfer APT in my example above).

Another potential disadvantage of forcing DoS transactions to pass through the FIFO flow is that it could become a bit of an indexing nightmare to catalog legitimate multisig activity: nodes might prune all events prior to the DoS attack, inspecting the execution logs would yield a wall of bogus activity, etc. Perhaps most perniciously, however, any such DoS attacks will eat into global storage and contribute to chain bloat. Again, with significant enough economic upside on a potential attack, a malicious attacker may be willing to submit even tens of gigabytes of DoS content.

The flush operation I propose above mitigates these risks, in a manner that I believe completely deters DoS vectors: if honest signatories can sidestep the FIFO queue and reset the pointer to the head of the queue with a single operation, it is essentially guaranteed that an attacker is wasting money by trying to clog the queue. This holds true regardless of relative gas amounts.

I'm glad to submit a draft PR on the Move code side of things to stimulate further discussion.

@alnoki
Copy link
Contributor Author

alnoki commented Jun 5, 2023

The attacker would need to pay gas as well and in fact more to create these txs. In my opinion, this is more of a convenience issue to mass reject txs. It's only a serious prob if attacker pays much less gas than mitigation.

@movekevin

As additional context, creating the transaction should be more expensive than just casting a vote because the transaction proposer is responsible for setting up all transaction-related states, including creating a new storage slot.

@chen-robert

Proposing a DoS transaction costs less than rejecting the transaction when the payload is sufficiently large, per the following testnet demonstration:

Hence it costs less to propose the attack transaction than it does to get the required two rejection votes (not to mention the rejection execution transaction), and the ratio of reject cost to proposal cost grows linearly with the number of signatures required: a 4-of-7 multisig would have to pay twice as much gas to reject compared with the gas required to propose a 50 kilobyte transaction, for example.

Hence my earlier suggestion to simply change the header to the pointer of the queue, since a savvy attacker will almost necessarily send large transactions that are expensive to reject in bulk.

@movekevin
Copy link
Contributor

Thanks for the detailed analysis, @alnoki. Although I do still think this attack is possible, I'm not sure why an attacker would do this in practice. If the goal of the exploit is to withdraw funds, they could have already done so immediately. If the attacker only gains access to one key in a multisig that has signature threshold > 1, spamming the multisig would cause annoyance and some gas costs to the owners of the multisig, but that would not be a significant amount and doesn't seem to benefit the attacker at all in anyway.

With that said, I do agree with adding more convenient functions for owners to clear the queue of invalid transactions (which was somewhat of a pain point with Gnosis Safe, the most popular multisig wallet product on EVM chains) and perhaps exploring the ability to execute out of order. However, I just want to make sure we think this through and design it well as out-of-order execution, especially when it allows changing adding/removing owners in one go, can be dangerous. This could allow attackers to gain faster control over the multisig for example.

Furthermore, the gas cost associating with rejecting and remove-rejecting transactions are likely associated with writing the simple map entries as changing anything there counts as writing it entirely. And removing a storage slot doesn't currently receive any gas refund and may well do in the future. Thus, I think the flush operation proposed here would still increase very high gas cost. IMO, it could be simpler to just offer a bulk reject function, where owners can provide signed messages certifying they want to remove txs up to a given transaction_id. This function should not do anything else (add/remove owners) and instead only clears the queue. This would greatly simplify things as we'd not need to specifically store this flush transaction on chain (making the data structure more complex). In the near future, once multisig_account also supports adding and executing a tx in one go by providing all the required signatures, this would be a nice consistent UX.

@alnoki
Copy link
Contributor Author

alnoki commented Jun 5, 2023

Although I do still think this attack is possible, I'm not sure why an attacker would do this in practice.

@movekevin if an attacker can profit $500k USD by shorting a protocol then DoSing the queue to prevent critical operations, it it is worth it to spend $400k USD in gas (which would require $800k to mitigate in a 4-of-7 in the current design).

However, I just want to make sure we think this through and design it well as out-of-order execution, especially when it allows changing adding/removing owners in one go, can be dangerous. This could allow attackers to gain faster control over the multisig for example.

Agreed.

Furthermore, the gas cost associating with rejecting and remove-rejecting transactions are likely associated with writing the simple map entries as changing anything there counts as writing it entirely.

Agreed on the rejection vote transaction, but not for the remove-reject transaction: after the second required rejection vote, the remove-reject transaction for the above examples costs 0.00001 APT in gas.

Thus, I think the flush operation proposed here would still increase very high gas cost.

I am confused about this conclusion, as the functionality I propose here and in #8424 does not write to any simple maps. Are you suggesting that high gas costs will be incurred when overwriting DoS transactions, after the head of the queue is reset? I do not this would be the case because I would expect the write costs to be assessed on the write set (small legitimate payload when overwriting large DoS payload).

In the near future, once multisig_account also supports adding and executing a tx in one go by providing all the required signatures, this would be a nice consistent UX.

I am opposed to this, as off-chain signature assemblage is cumbersome (e.g. "hey everyone in this group chat, will you please reply to me with a signed blob for the BCS blob I'm sending over?" vs "hey everyone in this group chat please vote on transaction 3"). The advantage of Multisig v2 is that signature assemblage is on chain.

IMO, it could be simpler to just offer a bulk reject function, where owners can provide signed messages certifying they want to remove txs up to a given transaction_id.

It sounds like this approach would simply increment last_resolved_sequence_number instead of decrementing next_sequence_number as I suggest, which is still a head queue pointer, which would still work. But this would still lead to long stretches of DoS transactions in history.

This function should not do anything else (add/remove owners) and instead only clears the queue.

If the owner schema can't be updated at the same time as the queue flush, then what is to stop an attacker from immediately submitting more DoS transactions after the flush transactions, blocking owner schema modifications?

@movekevin
Copy link
Contributor

movekevin commented Jun 5, 2023

I am confused about this conclusion, as the functionality I propose here and in #8424 does not write to any simple maps. Are you suggesting that high gas costs will be incurred when overwriting DoS transactions, after the head of the queue is reset? I do not this would be the case because I would expect the write costs to be assessed on the write set (small legitimate payload when overwriting large DoS payload).

The flush operation prototype you have currently does not update the records regarding how many owners agreed to "flush" a pending transaction. If it was to update this, this would lead to the same gas cost incurred by reject_transaction

I am opposed to this, as off-chain signature assemblage is cumbersome (e.g. "hey everyone in this group chat, will you please reply to me with a signed blob for the BCS blob I'm sending over?" vs "hey everyone in this group chat please vote on transaction 3"). The advantage of Multisig v2 is that signature assemblage is on chain.

Storing the flush operation on chain can be very complex as it needs a lot of safeguards such as a time bound. Without it, these txs can potentially be executed much later than intended and thus wipe out more txs than originally intended. I think some of the pain points can be abstracted away with a UI and/or CLI, and instantly executing the flush would be both simpler and safer and can lead to few edge cases due to the out-of-order execution cadence

If the owner schema can't be updated at the same time as the queue flush, then what is to stop an attacker from immediately submitting more DoS transactions after the flush transactions, blocking owner schema modifications?
In order to prevent the attacker from clogging up the queue, you can both remove the pending transactions and propose a new one to remove the compromised owners in a single tx.

Here's a prototype PR I just quickly wrote up: #8523

@alnoki
Copy link
Contributor Author

alnoki commented Jun 5, 2023

The flush operation prototype you have currently does not update the records regarding how many owners agreed to "flush" a pending transaction. If it was to update this, this would lead to the same gas cost incurred by reject_transaction

@movekevin the prototype is submitted as a standard transaction proposal payload or payload hash, and hence the votes are tabulated against the ensuing MultisigTransaction stored on-chain. This will have a low payload and thus low gas.

Storing the flush operation on chain can be very complex as it needs a lot of safeguards such as a time bound. Without it, these txs can potentially be executed much later than intended and thus wipe out more txs than originally intended.

Good point. I can update the PR to include a time bound.

I think some of the pain points can be abstracted away with a UI and/or CLI, and instantly executing the flush would be both simpler and safer and can lead to few edge cases due to the out-of-order execution cadence

Here I would just want to ensure there is not off-chain assemblage required, since the process is cumbersome.

Here's a prototype PR I just quickly wrote up: #8523

I've commented there accordingly.

I'm thinking even moreso now that what I suggest won't be so different from what you suggest if, instead of resetting the queue head by decrementing next_sequence_number as I originally suggested, I simply revise to a last_resolved_sequence_number increment. I'll also update #8424 to include this.

alnoki added a commit to alnoki/aptos-core that referenced this issue Jun 5, 2023
@thepomeranian thepomeranian added move move-framework Issues related to the Framework modules/libraries labels Jun 15, 2023
@sausagee sausagee added the stale-exempt Prevents issues from being automatically marked and closed as stale label Jul 22, 2023
@junkil-park
Copy link
Contributor

junkil-park commented Feb 7, 2024

@alnoki , @movekevin , @chen-robert , to address the DoS attacks without necessitating out-of-order execution, I'd like to propose setting a cap on the transaction queue size, MAX_PENDING_TXNS, at a practical limit like 100 or 1000. Thus, we enforce that next_sequence_number - (last_executed_sequence_number + 1) <= MAX_PENDING_TXNS. This approach limits the number of transactions and gas fees required to counteract a DoS attack. Consider the following scenario:

  1. A compromised owner fills the transaction queue with MAX_PENDING_TXNS bogus transactions.
  2. Good owners vote "no" for the first bogus transaction.
  3. The final good owner submits a script to atomically do the following actions:
    3.1. vote "no" for the first bogus transaction
    3.2 execute-reject the first bogus transaction, removing it from the queue
    3.3 create a new transaction ("eviction transaction") to remove the compromised owner
  4. Good owners clear the remaining bogus transactions.
  5. The compromised owner may continue adding bogus transactions, but these go to the end of the queue.
  6. Good owners execute the eviction transaction, removing the compromised owner from the owner list.
  7. Good owners clear any remaining bogus transactions.

The effort required in steps 4 and 7 is capped by O(MAX_PENDING_TXNS), making the task manageable. We can further simplify them by supporting batch removals and/or off-chain signature execution/rejection (#8412).

@alnoki
Copy link
Contributor Author

alnoki commented Feb 8, 2024

@alnoki , @movekevin , @chen-robert , to address the DoS attacks without necessitating out-of-order execution, I'd like to propose setting a cap on the transaction queue size, MAX_PENDING_TXNS, at a practical limit like 100 or 1000. Thus, we enforce that next_sequence_number - (last_executed_sequence_number + 1) <= MAX_PENDING_TXNS. This approach limits the number of transactions and gas fees required to counteract a DoS attack. Consider the following scenario:

  1. A compromised owner fills the transaction queue with MAX_PENDING_TXNS bogus transactions.
  2. Good owners vote "no" for the first bogus transaction.
  3. The final good owner submits a script to atomically do the following actions:
    3.1. vote "no" for the first bogus transaction
    3.2 execute-reject the first bogus transaction, removing it from the queue
    3.3 create a new transaction ("eviction transaction") to remove the compromised owner
  4. Good owners clear the remaining bogus transactions.
  5. The compromised owner may continue adding bogus transactions, but these go to the end of the queue.
  6. Good owners execute the eviction transaction, removing the compromised owner from the owner list.
  7. Good owners clear any remaining bogus transactions.

The effort required in steps 4 and 7 is capped by O(MAX_PENDING_TXNS), making the task manageable. We can further simplify them by supporting batch removals and/or off-chain signature execution/rejection (#8412).

@junkil-park generally I think this approach solves the major problem, however I think there are some important UX considerations:

  1. Steps (4) and (7) can actually be quite a hassle in practice, because it is often the case that multiple signatories on a multisig account are not as technically savvy as the executor of (3) would be expected to be. Here it would be crucial to have a batched voting function, e.g. vote no on 10 proposals all at the same time.
  2. Step (3) would similarly benefit from a consolidated entry function, especially so that is clear from the Move module itself that DoS prevention is manageable using in-built APIs (as opposed to requiring sophisticated off-chain orchestration, in which I am including writing a Move script). Ideally this transaction would also include batch rejection of all DoS transactions.
  3. Assuming batchwise functions are available, a sophisticated attacker would then be inclined to submit txns with the highest payload as possible, so as to increase I/O and storage gas costs for the batchwise txns. Hence I would suggest that the MAX_PENDING_TXNS value be determined empirically, e.g. on an n-of-32 multisig (I believe 32 is the max, but if not, then whatever the max is) with the maximum possible metadata, with DoS transactions containing the largest permissible payload. This could be incorporated into a unit test, for example.

Hence per the simplified schema I am suggesting:

  1. A compromised owner fills the transaction queue with MAX_PENDING_TXNS bogus transactions.
  2. (n - 1) laymen good owners submit a consolidated public entry txn that batch-wise votes "no" for all DoS txns
  3. Savvy good owner submits a consolidated public entry txn that
    1. Vote-rejects the first DoS txn (this could be via an implicit vote-rejection per [Feature Request][CLI][Multisig v2] Enable implicit yes/no vote for multisig execution/rejection #11011)
    2. Creates an eviction txn at the tail of the queue
  4. (n - 1) Laymen good owners approve eviction txn at tail of queue
  5. Savvy good owner submits public entry txn that
    1. Vote-rejects all pending DoS txns
    2. Vote executes the eviction txn

How does this sound?

@junkil-park
Copy link
Contributor

@alnoki , that sounds good!

  1. Savvy good owner submits a consolidated public entry txn that
  2. Savvy good owner submits public entry txn that

Just to clarify, are you suggesting to have some designated entry functions for these operations?

@alnoki
Copy link
Contributor Author

alnoki commented Feb 8, 2024

Just to clarify, are you suggesting to have some designated entry functions for these operations?

@junkil-park yes, exactly. Per above flow, I think this would require the following function signatures:

/// Vote on multiple txns in a batch
public entry fun vote_transactions(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
    final_sequence_number: u64,
    approved: bool,
)
/// Vote-reject first DoS txn, enqueue eviction txn.
public entry fun prepare_flush(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
    final_sequence_number: u64,
    /// These last args should be passed to update_owner_schema(), the eviction txn
    new_owners: vector<address>,
    owners_to_remove: vector<address>,
    optional_new_num_signatures_required: Option<u64>,
)
/// Vote-reject all pending txns, vote execute eviction txn (assumed at end of txn queue)
public entry fun execute_flush(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
)

@junkil-park
Copy link
Contributor

  1. Assuming batchwise functions are available, a sophisticated attacker would then be inclined to submit txns with the highest payload as possible, so as to increase I/O and storage gas costs for the batchwise txns. Hence I would suggest that the MAX_PENDING_TXNS value be determined empirically, e.g. on an n-of-32 multisig (I believe 32 is the max, but if not, then whatever the max is) with the maximum possible metadata, with DoS transactions containing the largest permissible payload. This could be incorporated into a unit test, for example.

@alnoki , Could you elaborate a bit more about the experiment and unit test that are suggested above ^^?

@alnoki
Copy link
Contributor Author

alnoki commented Feb 8, 2024

alnoki , Could you elaborate a bit more about the experiment and unit test that are suggested above ^^?

@junkil-park For the proposed flow to work, it must be guaranteed that vote_transactions(), prepare_flush(), and execute_flush() do not abort due to maximum gas exceeded errors. Hence, a sophisticated attacker would put the highest payload possible into each proposal in order to exhuast execution gas in the mitigation transactions.

Thus I am suggesting setting MAX_PENDING_TXNS low enough so that even if each proposal contains the highest payload possible, all 3 mitigating txns can still execute. For example MAX_PENDING_TXNS set to 2 shouldn't be a problem, but setting it to 10_000 might because of the O(n) operations required. Additionally, txn gas will be higher for more data in the multisig account resource, so a unit test that ensures all txns can execute without exceeding gas limits should additionally pack as much data into the multisig account resource as possible.

Does this help?

@junkil-park
Copy link
Contributor

junkil-park commented Feb 26, 2024

Just to clarify, are you suggesting to have some designated entry functions for these operations?

@junkil-park yes, exactly. Per above flow, I think this would require the following function signatures:

/// Vote on multiple txns in a batch
public entry fun vote_transactions(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
    final_sequence_number: u64,
    approved: bool,
)
/// Vote-reject first DoS txn, enqueue eviction txn.
public entry fun prepare_flush(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
    final_sequence_number: u64,
    /// These last args should be passed to update_owner_schema(), the eviction txn
    new_owners: vector<address>,
    owners_to_remove: vector<address>,
    optional_new_num_signatures_required: Option<u64>,
)
/// Vote-reject all pending txns, vote execute eviction txn (assumed at end of txn queue)
public entry fun execute_flush(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
)

@alnoki , it seems to be impossible to implement the function execute_flush mentioned above ^^ because "execution" should be a separate transaction. However, it sounds great to have the entry functions for batch votes, batch rejections and reject-and-remove for a better UX.

@alnoki
Copy link
Contributor Author

alnoki commented Feb 26, 2024

Just to clarify, are you suggesting to have some designated entry functions for these operations?

@junkil-park yes, exactly. Per above flow, I think this would require the following function signatures:

/// Vote on multiple txns in a batch
public entry fun vote_transactions(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
    final_sequence_number: u64,
    approved: bool,
)
/// Vote-reject first DoS txn, enqueue eviction txn.
public entry fun prepare_flush(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
    final_sequence_number: u64,
    /// These last args should be passed to update_owner_schema(), the eviction txn
    new_owners: vector<address>,
    owners_to_remove: vector<address>,
    optional_new_num_signatures_required: Option<u64>,
)
/// Vote-reject all pending txns, vote execute eviction txn (assumed at end of txn queue)
public entry fun execute_flush(
    owner: &signer,
    multisig_account: address,
    starting_sequence_number: u64,
)

@alnoki , it seems to be impossible to implement the function execute_flush mentioned above ^^ because "execution" should be a separate transaction. However, it sounds great to have the entry functions for batch votes, batch rejections and reject-and-remove.

@junkil-park Noted that execution should be a separate txn because it is a special txn type, hence I suggest implementation as close as functionally possible to the above prototypes

@junkil-park
Copy link
Contributor

junkil-park commented Feb 26, 2024

@alnoki , I made a draft implementation: In my draft PR,

I think these can be good building blocks for DoS mitigation, if not sufficient. We can improve the UX in M-Safe and CLI for a better experience for recovery from DoS.

Please let me know what you think.

@alnoki
Copy link
Contributor Author

alnoki commented Feb 26, 2024

@junkil-park thanks for tagging me. I've left a review with a breakdown of what this looks like in practice:

#12241 (review)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working move move-framework Issues related to the Framework modules/libraries stale-exempt Prevents issues from being automatically marked and closed as stale
Projects
None yet
6 participants