-
Notifications
You must be signed in to change notification settings - Fork 648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(congestion): reject new transactions on RPC level #11419
Conversation
One of the points raised in near/NEPs#539 was that we have to clearly propagate the information about rejected transactions to users. This PR is my suggestion how to expose the information on the RPC node level. |
Summary: In this PR, we introduce a new failure mode on the RPC level when a transaction is submitted under congestion. The error is of type `InvalidTxError` and called `ShardCongested` with a single field `shard_id` referencing the congested shard. ## Details With [cross-shard congestion control](near/NEPs#539) being stabilized soon, we want to reject new transactions as early as possible when the receiver shard is already overloaded with traffic. On the chunk producer level, all transactions going to a congested shard will be dropped. This keeps the memory requirements of chunk producers bounded. Further, we decided to go for a relatively low threshold in order to keep the latency of accepted transactions low, preventing new transactions as soon as we hit 25% congestion on a specific shard. Consequently, when shards are congested, it will not be long before transactions are rejected. This has consequences for the users. On the positive side, they will no longer have to wait for a long time not knowing if their transaction will be accepted or not. Either, it is executed within a bounded time (at most 20 blocks after inclusion) or it will be rejected immediately. But on the negative side, when a shard is congested, they will have to actively retry sending the transaction until it gets accepted. We hope that this can be automated by wallets, which can also provide useful live updates to the user about what is happening. But for this, they will need to understand and handle the new error `ShardCongested` different from existing errors.
Some tests should run without congestion control to function the same way the did. And some congestion control tests will observe the new error, which is expected.
52ef5aa
to
f161e54
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #11419 +/- ##
==========================================
- Coverage 71.45% 71.42% -0.03%
==========================================
Files 785 785
Lines 158807 158846 +39
Branches 158807 158846 +39
==========================================
- Hits 113473 113463 -10
- Misses 40429 40479 +50
+ Partials 4905 4904 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@wacban and @bowenwang1996 this PR is what I propose for how we handle rejected transactions. In summary:
In the second case, RPC nodes will wait for a I don't really like that it is marked as
Please let me know if you have any ideas. Otherwise, I suggest we discuss it again in our next congestion control call. Also cc @Akashin and @telezhnaya. I believe for dropped transactions on a full transaction pool, we already have the same problem today, right? The problem being that RPC nodes just time out, without a useful error message. What is the recommended way for wallets to deal with it in that case? |
I agree. I also don't have a good solution in mind and would like to hear what others think. At the same time, this would only be a problem when a shard goes from noncongested to congested, so overall the impact should not be that large. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we test this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
We can also consider doing the same for the transaction signer shard but it's optional and can be done separately.
core/primitives/src/errors.rs
Outdated
/// The receiver shard of the transaction is too congestion to accept new | ||
/// transactions at the moment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: too congested
@@ -643,9 +643,25 @@ impl RuntimeAdapter for NightshadeRuntime { | |||
verify_signature: bool, | |||
epoch_id: &EpochId, | |||
current_protocol_version: ProtocolVersion, | |||
receiver_congestion_info: Option<ExtendedCongestionInfo>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the congestion info of the sender? If the sender is congested their tx limit will be lowered and some transactions may also be rejected because of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we reject them. We just lower the gas limit. When we reach the gas limit, we keep following transactions in the pool rather than rejecting (dropping) them. Eventually, new transactions are dropped but that's because the pool is full, not due to something we added in nep-539.
We could still try to be smart about it, like reject something on the RPC if we think the pool is filling up with transactions all using the same sender shard. But it seems a bit tricky to get the heuristics right. If we are too aggressive, we risk frustrating users by rejecting their transactions for no good reason. Not to mention that "rogue RPCs" could abuse the system by filling the transaction pool beyond what our RPCs do, which would starve the honest RPCs completely under congestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, we can observe what happens in the wild and make adjustments if we find it necessary.
congestion_info.congestion_info, | ||
congestion_info.missed_chunks_count, | ||
); | ||
if !congestion_control.shard_accepts_transactions() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may consider making that more flexible by making the threshold a configurable value rather than using the runtime parameter of 0.25. The drawback is that those would need to be kept somewhat in sync. I also dislike that some rpc node may artificially lower it in order to get their users priority but this is something they can do anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We can consider adding the flexibility.
But just to be clear, there should not much to be gained by setting a higher threshold. Unless the congestion level lowers between now and when the chunk producer considers it, it will be rejected and dropped anyway. Most of the time, the RPC will just end up waiting for a timeout before it inevitably fails, annoying the user and wasting resources on the RPC in the process.
At least that is my theory, I guess it could be different in practice. Maybe the congestion level keeps going below and above the threshold, which could potentially give an advantage to more aggressive forwarding on the RPC node compared to the retrying on the wallet/client side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some users the UX doesn't matter as much as just maximizing the chances of getting their transaction in. For those it may be useful. It may also be useful for us if we discover that we did the math wrong and need to change it quickly :)
Yes makes sense. I added an integration test checking that we the new error is returned when the receiver is congested. |
Summary: In this PR, we introduce a new failure mode on the RPC level when a transaction is submitted under congestion. The error is of type
InvalidTxError
and calledShardCongested
with a single fieldshard_id
referencing the congested shard.Details
With cross-shard congestion control being stabilized soon, we must deal with the case when a shard rejects new transactions.
On the chunk producer level, all transactions going to a congested shard will be dropped. This keeps the memory requirements of chunk producers bounded. Further, we decided to go for a relatively low threshold in order to keep the latency of accepted transactions low, preventing new transactions as soon as we hit 25% congestion on a specific shard. Consequently, when shards are congested, it will not be long before transactions are rejected.
This has consequences for the users. On the positive side, they will no longer have to wait for a long time not knowing if their transaction will be accepted or not. Either, it is executed within a bounded time (at most 20 blocks after inclusion) or it will be rejected immediately.
But on the negative side, when a shard is congested, they will have to actively retry sending the transaction until it gets accepted. We hope that this can be automated by wallets, which can also provide useful live updates to the user about what is happening. But for this, they will need to understand and handle the new error
ShardCongested
differently from existing errors. The key difference is that the same signed transaction can be sent again and will be accepted if congestion has gone down.