feat(congestion): reject new transactions on RPC level #11419

jakmeier · 2024-05-29T17:26:17Z

Summary: In this PR, we introduce a new failure mode on the RPC level when a transaction is submitted under congestion. The error is of type InvalidTxError and called ShardCongested with a single field shard_id referencing the congested shard.

Details

With cross-shard congestion control being stabilized soon, we must deal with the case when a shard rejects new transactions.

On the chunk producer level, all transactions going to a congested shard will be dropped. This keeps the memory requirements of chunk producers bounded. Further, we decided to go for a relatively low threshold in order to keep the latency of accepted transactions low, preventing new transactions as soon as we hit 25% congestion on a specific shard. Consequently, when shards are congested, it will not be long before transactions are rejected.

This has consequences for the users. On the positive side, they will no longer have to wait for a long time not knowing if their transaction will be accepted or not. Either, it is executed within a bounded time (at most 20 blocks after inclusion) or it will be rejected immediately.

But on the negative side, when a shard is congested, they will have to actively retry sending the transaction until it gets accepted. We hope that this can be automated by wallets, which can also provide useful live updates to the user about what is happening. But for this, they will need to understand and handle the new error ShardCongested differently from existing errors. The key difference is that the same signed transaction can be sent again and will be accepted if congestion has gone down.

jakmeier · 2024-05-30T08:26:55Z

One of the points raised in near/NEPs#539 was that we have to clearly propagate the information about rejected transactions to users.

This PR is my suggestion how to expose the information on the RPC node level.

Summary: In this PR, we introduce a new failure mode on the RPC level when a transaction is submitted under congestion. The error is of type `InvalidTxError` and called `ShardCongested` with a single field `shard_id` referencing the congested shard. ## Details With [cross-shard congestion control](near/NEPs#539) being stabilized soon, we want to reject new transactions as early as possible when the receiver shard is already overloaded with traffic. On the chunk producer level, all transactions going to a congested shard will be dropped. This keeps the memory requirements of chunk producers bounded. Further, we decided to go for a relatively low threshold in order to keep the latency of accepted transactions low, preventing new transactions as soon as we hit 25% congestion on a specific shard. Consequently, when shards are congested, it will not be long before transactions are rejected. This has consequences for the users. On the positive side, they will no longer have to wait for a long time not knowing if their transaction will be accepted or not. Either, it is executed within a bounded time (at most 20 blocks after inclusion) or it will be rejected immediately. But on the negative side, when a shard is congested, they will have to actively retry sending the transaction until it gets accepted. We hope that this can be automated by wallets, which can also provide useful live updates to the user about what is happening. But for this, they will need to understand and handle the new error `ShardCongested` different from existing errors.

Some tests should run without congestion control to function the same way the did. And some congestion control tests will observe the new error, which is expected.

codecov · 2024-05-31T07:54:52Z

Codecov Report

Attention: Patch coverage is 88.09524% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 71.42%. Comparing base (96af8f7) to head (f1b492d).
Report is 1 commits behind head on master.

Files	Patch %	Lines
chain/client/src/client.rs	91.66%	0 Missing and 2 partials ⚠️
core/primitives/src/errors.rs	0.00%	2 Missing ⚠️
chain/chain/src/runtime/mod.rs	93.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #11419      +/-   ##
==========================================
- Coverage   71.45%   71.42%   -0.03%     
==========================================
  Files         785      785              
  Lines      158807   158846      +39     
  Branches   158807   158846      +39     
==========================================
- Hits       113473   113463      -10     
- Misses      40429    40479      +50     
+ Partials     4905     4904       -1

Flag	Coverage Δ
backward-compatibility	`0.24% <0.00%> (-0.01%)`	⬇️
db-migration	`0.24% <0.00%> (-0.01%)`	⬇️
genesis-check	`1.37% <0.00%> (-0.01%)`	⬇️
integration-tests	`37.46% <88.09%> (-0.06%)`	⬇️
linux	`68.95% <59.52%> (+0.03%)`	⬆️
linux-nightly	`70.92% <88.09%> (-0.02%)`	⬇️
macos	`52.54% <59.52%> (-0.01%)`	⬇️
pytests	`1.59% <0.00%> (-0.01%)`	⬇️
sanity-checks	`1.38% <0.00%> (-0.01%)`	⬇️
unittests	`66.18% <78.57%> (-0.02%)`	⬇️
upgradability	`0.28% <0.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jakmeier · 2024-05-31T08:53:39Z

@wacban and @bowenwang1996 this PR is what I propose for how we handle rejected transactions.

In summary:

RPC nodes return InvalidTxError::ShardCongested when they see that the receiver shard is congested. (introduced in this PR)
When an RPC forwards a TX, chunk producers may drop it later due to congestion. (no change in this PR)

In the second case, RPC nodes will wait for a tx_status timeout to occur and provide no useful information back to the API user. This may happen every time a shard transitions from not congested to congested, with slightly unlucky tx submission timing.

I don't really like that it is marked as InvalidTxError but that's minor IMO. But I really don't like the case where we rely on timeouts. If transactions are sent with TxExecutionStatus::Included or stronger guarantees, getting dubious timeouts due to congestion is a bad developer and user experience. I just don't know anything better. I only see two alternatives:

Don't drop transactions from the tx pool due to congestion, with the downside that, during congestion, the pool will quickly fill up with transactions that cannot currently be processed. Potentially that starves all other shards.
Or, somehow signal the transactions dropped from the transaction pool back to the RPCs. That seems it would be a major effort. First, we send multiple RoutedMessageBody::ForwardTx from one RPC to several chunk producers. Then, transactions are inserted in the pool successfully, or perhaps the tx already exsisted, or maybe the tx pool is full. Already at this point we don't report back to the RPC in any way, regardless of congestion. RPCs just poll the chain state for inclusion of the tx. With congestion control enabled, we add the possibility that we drop a transaction from the pool in the future, at which point we have no idea which RPC(s) originally sent us the transaction. So yeah, reporting it back without a large overhead seems tricky to me.

Please let me know if you have any ideas. Otherwise, I suggest we discuss it again in our next congestion control call.

Also cc @Akashin and @telezhnaya. I believe for dropped transactions on a full transaction pool, we already have the same problem today, right? The problem being that RPC nodes just time out, without a useful error message. What is the recommended way for wallets to deal with it in that case?

bowenwang1996 · 2024-06-02T20:32:04Z

If transactions are sent with TxExecutionStatus::Included or stronger guarantees, getting dubious timeouts due to congestion is a bad developer and user experience.

I agree. I also don't have a good solution in mind and would like to hear what others think. At the same time, this would only be a problem when a shard goes from noncongested to congested, so overall the impact should not be that large.

bowenwang1996

Could we test this behavior?

wacban

LGTM
We can also consider doing the same for the transaction signer shard but it's optional and can be done separately.

wacban · 2024-06-03T14:30:40Z

core/primitives/src/errors.rs

+    /// The receiver shard of the transaction is too congestion to accept new
+    /// transactions at the moment.


typo: too congested

wacban · 2024-06-03T14:35:05Z

chain/chain/src/runtime/mod.rs

@@ -643,9 +643,25 @@ impl RuntimeAdapter for NightshadeRuntime {
        verify_signature: bool,
        epoch_id: &EpochId,
        current_protocol_version: ProtocolVersion,
+        receiver_congestion_info: Option<ExtendedCongestionInfo>,


How about the congestion info of the sender? If the sender is congested their tx limit will be lowered and some transactions may also be rejected because of that.

I don't think we reject them. We just lower the gas limit. When we reach the gas limit, we keep following transactions in the pool rather than rejecting (dropping) them. Eventually, new transactions are dropped but that's because the pool is full, not due to something we added in nep-539.

We could still try to be smart about it, like reject something on the RPC if we think the pool is filling up with transactions all using the same sender shard. But it seems a bit tricky to get the heuristics right. If we are too aggressive, we risk frustrating users by rejecting their transactions for no good reason. Not to mention that "rogue RPCs" could abuse the system by filling the transaction pool beyond what our RPCs do, which would starve the honest RPCs completely under congestion.

Sounds good, we can observe what happens in the wild and make adjustments if we find it necessary.

wacban · 2024-06-03T14:37:56Z

chain/chain/src/runtime/mod.rs

+                congestion_info.congestion_info,
+                congestion_info.missed_chunks_count,
+            );
+            if !congestion_control.shard_accepts_transactions() {


You may consider making that more flexible by making the threshold a configurable value rather than using the runtime parameter of 0.25. The drawback is that those would need to be kept somewhat in sync. I also dislike that some rpc node may artificially lower it in order to get their users priority but this is something they can do anyway.

Good point. We can consider adding the flexibility.

But just to be clear, there should not much to be gained by setting a higher threshold. Unless the congestion level lowers between now and when the chunk producer considers it, it will be rejected and dropped anyway. Most of the time, the RPC will just end up waiting for a timeout before it inevitably fails, annoying the user and wasting resources on the RPC in the process.
At least that is my theory, I guess it could be different in practice. Maybe the congestion level keeps going below and above the threshold, which could potentially give an advantage to more aggressive forwarding on the RPC node compared to the retrying on the wallet/client side.

For some users the UX doesn't matter as much as just maximizing the chances of getting their transaction in. For those it may be useful. It may also be useful for us if we discover that we did the math wrong and need to change it quickly :)

jakmeier · 2024-06-03T15:04:13Z

Could we test this behavior?

Yes makes sense. I added an integration test checking that we the new error is returned when the receiver is congested.

jakmeier mentioned this pull request May 30, 2024

test(congestion_control) - Implemented nayduck tests for congestion control #11429

Merged

jakmeier added 2 commits May 31, 2024 09:17

fix existing tests

f161e54

Some tests should run without congestion control to function the same way the did. And some congestion control tests will observe the new error, which is expected.

jakmeier force-pushed the rpc-rejects-tx branch 2 times, most recently from 52ef5aa to f161e54 Compare May 31, 2024 07:32

jakmeier mentioned this pull request May 31, 2024

feat(loadtest): add ShardCongestedError #11438

Merged

jakmeier requested review from wacban and bowenwang1996 May 31, 2024 08:15

jakmeier marked this pull request as ready for review May 31, 2024 08:15

jakmeier requested a review from a team as a code owner May 31, 2024 08:15

bowenwang1996 reviewed Jun 2, 2024

View reviewed changes

add a test for RPCs rejecting new transactions

6891891

wacban approved these changes Jun 3, 2024

View reviewed changes

jakmeier added 2 commits June 3, 2024 17:01

fix typo

5b25b6c

Merge remote-tracking branch 'upstream/master' into HEAD

f1b492d

bowenwang1996 approved these changes Jun 3, 2024

View reviewed changes

jakmeier mentioned this pull request Jun 4, 2024

document ShardCongested error on tx submission near/docs#2018

Merged

jakmeier added this pull request to the merge queue Jun 4, 2024

Merged via the queue into near:master with commit dd1e278 Jun 4, 2024
29 checks passed

jakmeier deleted the rpc-rejects-tx branch June 4, 2024 16:01

jancionear mentioned this pull request Jun 5, 2024

test_in_memory_trie_node_consistency is flaky: shard is currently congested and rejects new transactions #11496

Closed

jakmeier mentioned this pull request Jul 5, 2024

[ProjectTracking]: congestion control near/near-one-project-tracking#48

Open

14 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(congestion): reject new transactions on RPC level #11419

feat(congestion): reject new transactions on RPC level #11419

jakmeier commented May 29, 2024 •

edited

Loading

jakmeier commented May 30, 2024

codecov bot commented May 31, 2024 •

edited

Loading

jakmeier commented May 31, 2024

bowenwang1996 commented Jun 2, 2024

bowenwang1996 left a comment

wacban left a comment

wacban Jun 3, 2024

wacban Jun 3, 2024

jakmeier Jun 3, 2024

wacban Jun 4, 2024

wacban Jun 3, 2024

jakmeier Jun 3, 2024

wacban Jun 4, 2024

jakmeier commented Jun 3, 2024

		/// The receiver shard of the transaction is too congestion to accept new
		/// transactions at the moment.

feat(congestion): reject new transactions on RPC level #11419

feat(congestion): reject new transactions on RPC level #11419

Conversation

jakmeier commented May 29, 2024 • edited Loading

Details

jakmeier commented May 30, 2024

codecov bot commented May 31, 2024 • edited Loading

Codecov Report

jakmeier commented May 31, 2024

bowenwang1996 commented Jun 2, 2024

bowenwang1996 left a comment

Choose a reason for hiding this comment

wacban left a comment

Choose a reason for hiding this comment

wacban Jun 3, 2024

Choose a reason for hiding this comment

wacban Jun 3, 2024

Choose a reason for hiding this comment

jakmeier Jun 3, 2024

Choose a reason for hiding this comment

wacban Jun 4, 2024

Choose a reason for hiding this comment

wacban Jun 3, 2024

Choose a reason for hiding this comment

jakmeier Jun 3, 2024

Choose a reason for hiding this comment

wacban Jun 4, 2024

Choose a reason for hiding this comment

jakmeier commented Jun 3, 2024

jakmeier commented May 29, 2024 •

edited

Loading

codecov bot commented May 31, 2024 •

edited

Loading