Limit size of source_receipt_proofs inside ChunkStateWitness #11295

Open
jancionear opened this issue May 13, 2024 · 11 comments
Labels: A-stateless-validation (Area: stateless validation)

Comments

@jancionear
Contributor

There could be a lot of large incoming receipts, which could cause the source_receipt_proofs field inside ChunkStateWitness to be really large. There are known problems with distributing large instances of ChunkStateWitness, so we should analyze how large the incoming receipts can be, and add a size limit to make sure that their size stays reasonable.

Refs: #10259, #11103, #11274

/cc @wacban @Longarithm, I know you already spent some time thinking about incoming receipts.

I think we should at least look into it before stateless validation launch, to estimate how bad things can get.

@jancionear jancionear added the A-stateless-validation Area: stateless validation label May 13, 2024
@jancionear
Contributor Author

One solution would be to allow each shard to send out large receipts only on every nth block, round-robin style.

Something like:

if block_height % num_shards == my_shard_id {
    // allowed to send out 4MB of receipts
} else {
    // allowed to send out 50kB of receipts, the rest of the outgoing receipts should wait in a queue
}

This would ensure that incoming receipts from one block height are at most 4MB + (num_shards - 1) * 50kB in size.

There's still a problem when we have missing chunks and the incoming receipts are from multiple block heights, but this could be solved by adding a condition that a chunk is allowed to send out large receipts only when the previous chunk wasn't missing.
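A minimal sketch of what that per-height allowance could look like, using the numbers above (the constants, names and the missing-chunk guard are illustrative, not an actual nearcore API):

```rust
/// Illustrative only: how many bytes of outgoing receipts a shard may send
/// at a given height, following the round-robin idea above.
const BIG_LIMIT: u64 = 4_000_000; // ~4MB on the shard's "turn"
const SMALL_LIMIT: u64 = 50_000; // ~50kB otherwise

fn outgoing_size_limit(
    block_height: u64,
    my_shard_id: u64,
    num_shards: u64,
    prev_chunk_missing: bool,
) -> u64 {
    // Only take the big allowance when the previous chunk wasn't missing,
    // so a shard recovering from missed chunks can't double up on large receipts.
    if block_height % num_shards == my_shard_id && !prev_chunk_missing {
        BIG_LIMIT
    } else {
        SMALL_LIMIT
    }
}
```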

@jancionear jancionear self-assigned this May 14, 2024
@wacban
Contributor

wacban commented May 15, 2024

@jancionear I think it's best to reason about the aggregated sum of the receipts rather than just the size of an individual receipt. At the end of the day it's the total size of all receipts that contributes to the state witness.

With that in mind I would suggest deploying a new metric to a canary node showing the typical total size of outgoing receipts per receiver shard. Assuming that on average this metric remains somewhat reasonable, we can then consider what you suggested - limiting the number of bytes a shard is allowed to send to another shard in one block, with a round-robin extra allowance.
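For example, something like the following with the prometheus crate (the metric name and labels are made up for illustration; in practice this would live in nearcore's own metrics module):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Hypothetical metric: total size of outgoing receipts per (sender, receiver) shard pair.
static OUTGOING_RECEIPTS_SIZE: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "near_outgoing_receipts_total_size",
        "Total size in bytes of outgoing receipts per receiver shard, per chunk",
        &["shard_id", "receiver_shard_id"]
    )
    .unwrap()
});

// At chunk production time, something like:
// OUTGOING_RECEIPTS_SIZE
//     .with_label_values(&[&shard_id.to_string(), &receiver_shard.to_string()])
//     .observe(total_size_bytes as f64);
```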

@jancionear
Contributor Author

jancionear commented May 20, 2024

Another thing to consider is smaller receipts being blocked by large ones. If we were to just process receipts in FIFO order, waiting until an outgoing receipt can be sent out, we could end up in a situation where the first receipt in the queue is really large, and we'd wait for a few blocks until we're able to send it out, without sending anything else in the meantime. A single big receipt blocks all small receipts.

To fix this we could have two separate queues of outgoing receipts - a queue for big ones that can only be sent in the special blocks, and another for small ones that can be sent from any block.
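Roughly something like this (an illustrative sketch; the types and the 50 kB threshold are assumptions, not actual nearcore structures):

```rust
use std::collections::VecDeque;

/// Hypothetical receipt type; only the serialized size matters for this sketch.
struct OutgoingReceipt { size: u64 /* , payload, receiver, ... */ }

const SMALL_RECEIPT_LIMIT: u64 = 50_000;

/// Two outgoing queues: small receipts can go out at every height,
/// big ones only when this shard has the "big sender" turn.
struct OutgoingQueues {
    small: VecDeque<OutgoingReceipt>,
    big: VecDeque<OutgoingReceipt>,
}

impl OutgoingQueues {
    fn push(&mut self, receipt: OutgoingReceipt) {
        if receipt.size <= SMALL_RECEIPT_LIMIT {
            self.small.push_back(receipt);
        } else {
            self.big.push_back(receipt);
        }
    }

    /// Pop receipts until the byte budget for this height is exhausted.
    fn drain(&mut self, mut budget: u64, big_turn: bool) -> Vec<OutgoingReceipt> {
        let mut out = Vec::new();
        if big_turn {
            Self::drain_queue(&mut self.big, &mut budget, &mut out);
        }
        Self::drain_queue(&mut self.small, &mut budget, &mut out);
        out
    }

    fn drain_queue(
        q: &mut VecDeque<OutgoingReceipt>,
        budget: &mut u64,
        out: &mut Vec<OutgoingReceipt>,
    ) {
        // Keep popping while the next receipt still fits into the remaining budget.
        while q.front().map_or(false, |r| r.size <= *budget) {
            let r = q.pop_front().unwrap();
            *budget -= r.size;
            out.push(r);
        }
    }
}
```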

@wacban
Contributor

wacban commented May 23, 2024

My current take for it would be as follows:

  • Let's pick a target maximum size of source_receipt_proofs and call it MAX.
  • Let's set a reasonable limit on the number of bytes that one shard can send to another. I think this limit can be as high as MAX or some high percentage thereof (e.g. 75%). This way we still allow full throughput in the case where two shards just exchange a lot of bytes. It can be done by leveraging congestion control. However, while this may work in the average case and in the case where one shard spams another, it wouldn't cover the worst case where all shards spam a single target shard.
  • In the worst case scenario the state witness would still exceed the size limit and may not be distributed in time to the chunk validators. We can rely on a combination of other solutions to handle that situation:

@jancionear
Contributor Author

jancionear commented May 23, 2024

> Thanks to #11344 eventually a new chunk will be produced in the affected shard and we can move past the troublesome state witness.

Does #11344 help with large chunk state witnesses? My understanding was that it helps when it takes a long time to apply a chunk, but there could still be a situation when a chunk isn't endorsed in time because the validator didn't receive a witness for this chunk in time.

@wacban
Contributor

wacban commented May 23, 2024

> Does #11344 help with large chunk state witnesses? My understanding was that it helps when it takes a long time to apply a chunk, but there could still be a situation when a chunk isn't endorsed in time because the validator didn't receive a witness for this chunk in time.

I would hope so but definitely needs testing and checking. Perhaps you can introduce a way to manually increase the size of the state witness (good for testing in forknet) or slow down the state witness distribution (good for testing in nayduck) and see if the chain can recover? I think both are worth doing. You can follow the example of slow_chunk.py where neard is compiled with the test_features feature and you can add adversarial behaviour, custom host functions or whatever else you need for simulating difficult behaviour.

@jancionear
Contributor Author

jancionear commented May 24, 2024

During the last meeting Bowen said that #11344 might not help with large witnesses, we might need a separate fix to deal with witnesses that take a long time to distribute.

@jancionear
Contributor Author

Even if we can recover from extra large witnesses, I feel that we still need to properly limit witness size. A malicious actor could try to cause the worst case scenario on every block, which would make the chain struggle.

@jancionear
Contributor Author

jancionear commented Jun 6, 2024

The approach from #11295 (comment), where one shard is allowed to send more than the others, isn't the best, because receipts may have to wait for up to num_shards blocks before being sent.

Maybe it would be possible to make a better solution by having the shards negotiate how much they're allowed to send out at each block height. At each height every shard would publish an intent to send out X MB of receipts to some shard. Then at the next height every shard sees how much every other shard wants to send out, and can deterministically decide how much it is allowed to send. In scenarios of low congestion this would allow large receipts to be sent in ~2 blocks, instead of num_shards blocks. With high congestion the deterministic algorithm would determine who is allowed to send out a large receipt at each height, in a fair way.

That sounds like a better solution, although it's more complex, so for now we can go with the simple round robin.
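A rough sketch of how such a deterministic split could look (the names, the constant, and the rotation-based fairness rule are illustrative assumptions, not a worked-out design):

```rust
/// Hypothetical per-receiver cap on incoming receipt bytes per height.
const MAX_INCOMING_BYTES: u64 = 4_718_592;

/// `intents[s]` = bytes shard `s` declared (at the previous height) it wants to
/// send to the receiver shard. Returns `grants[s]` = bytes shard `s` may send now.
/// Every shard can compute this deterministically from the published intents.
fn allocate(intents: &[u64], height: u64) -> Vec<u64> {
    let n = intents.len();
    let mut grants = vec![0u64; n];
    let mut remaining = MAX_INCOMING_BYTES;
    // Rotate the starting sender with the height so every shard is periodically first in line.
    let start = (height as usize) % n;
    for i in 0..n {
        let s = (start + i) % n;
        let granted = intents[s].min(remaining);
        grants[s] = granted;
        remaining -= granted;
    }
    grants
}
```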

github-merge-queue bot pushed a commit that referenced this issue Jun 7, 2024
…ize of source_receipt_proofs under control (#11492)

This is a basic fix for: #11295

The problem is that the size of `source_receipt_proofs` could be really
large in some scenarios. If all 6 shards send a 4MB outgoing receipt to
shard 0, then `source_receipt_proofs` for shard 0 will be of size 6 *
4MB = 24MB.
That's way too big; the network probably won't be able to distribute
that in time. And as we add more shards to the network, the effect will
get worse and worse.

This fix deals with the problem by allowing only one chosen shard to
send large receipts to the other shard. All other shards are only
allowed to send ~100kB of receipts. So instead of 6 shards sending 4MB,
we end up with 5 shards sending 100kB and one shard sending 4MB, which
adds up to 4.5MB, much more manageable.

The mapping of "who is able to send a lot of outgoing receipts to whom"
changes with every block height:


![image](https://github.com/near/nearcore/assets/149345204/3b571d7c-da24-4cd9-ad8f-19686fc0f055)

In this example at block height 2:
* shard 0 can send:
    * 100kB of receipts to shard 0
    * 100kB of receipts to shard 1
    * 4.5MB of receipts to shard 2
* shard 1 can send:
    * 4.5MB of receipts to shard 0
    * 100kB of receipts to shard 1
    * 100kB of receipts to shard 2
* shard 2 can send:
    * 100kB of receipts to shard 0
    * 4.5MB of receipts to shard 1
    * 100kB of receipts to shard 2

At every height a receiving shard will receive large receipts from only
one shard, so the size of `source_receipt_proofs` stays under control.
The mapping changes, so every large receipt will eventually be sent out
when the mapping allows sending it to the destination.

The new limits are:
* `outgoing_receipts_usual_size_limit`: 102_400 (100kiB)
* `outgoing_receipts_big_size_limit`: 4_718_592 (4.5MiB - a bit larger
than the 4MiB receipt limit to make sure that 4MiB receipts can get
through)
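For illustration, one rotation rule that is consistent with the example above (the exact mapping function used in #11492 may differ):

```rust
/// At each height, shard `sender` gets the big allowance towards exactly one
/// receiver, and the assignment rotates with the block height.
fn big_receipt_receiver(sender: u64, height: u64, num_shards: u64) -> u64 {
    (sender + height) % num_shards
}

// E.g. with 3 shards at height 2: shard 0 -> shard 2, shard 1 -> shard 0,
// shard 2 -> shard 1, so each receiver gets at most one 4.5 MiB sender
// plus (num_shards - 1) * 100 kiB from the others.
```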

### Flaws

This is a basic solution which has some flaws. It limits the witness
size, but it affects throughput in certain scenarios.
It can serve as a starting point for further improvements, something
that we can get in before the code freeze.

* Flaw 1: big receipts can block small receipts
A shard tries to send outgoing receipts in the order in which they were
created. When a large receipt is at the front of the queue, the shard
won't send anything until it can send out this large receipt. This means
that the shard might not send out anything for a few blocks.
This could be fixed by having a separate queue for large outgoing
receipts.
* Flaw 2: missing chunks
When a chunk is missing, the next chunk receives receipts from two block
heights. This means that it could receive two 4MB receipts. This could
be fixed by disallowing sending large receipts to a shard that just had
missing chunks.

### TODO
The implementation is pretty much ready, I should probably write some
tests, but for now I have to do other stuff.
Posting the PR as is for now.
@jancionear
Contributor Author

In #11492 I implemented a basic receipt size limit using the approach where one chosen shard is allowed to send out more at each block height, while other shards wait for their turn.

It limits the receipts, but there are some flaws.
We had a meeting discussing it, here's a short summary:

The first problem is that small receipts can be blocked by big receipts. If a small receipt is enqueued after a big one, then the small one will have to wait until the big one is sent out, which happens only once every num_shards blocks.

Another problem is that the change might not play well with congestion control. Congestion control reacts aggressively to buffered receipts. Once there is 500 TGas worth of enqueued receipts, the shard is considered fully congested and cannot receive any more receipts. Having 500 TGas of enqueued receipts could become fairly common because of small receipts getting stuck behind the big ones.

One way to solve the problem of small receipts getting stuck would be to have a separate queue for large receipts. But this would be problematic for the transaction priorities NEP. With transaction priorities we would like to send receipts with higher priority before the ones with low priority. What should happen when a big receipt has higher priority, but we're unable to send it at this block height? Should we send a smaller receipt with lower priority? Or wait for the big receipt? Two queues cause a headache there.

There was also the question of whether we need this kind of limit at all. If we reduced the receipt size limit to 1.5 MiB, then with six shards we could receive at most 6*1.5 MiB = 9 MiB of incoming receipts, which sounds somewhat manageable.
I believe that having a limit on incoming receipts to a single shard is consistent with the sharded design of NEAR. A single shard can only process so much, and can only receive so much. As we scale horizontally and keep adding more shards, the issue will become more pronounced. With 100 shards we could have 150 MiB of incoming receipts, which isn't going to work. We need to incorporate a limit on incoming data into our design sooner or later.

There is also the concern of DoS attacks. What happens when someone submits a lot of large receipts aimed at a single shard? Cross shard throughput is limited, so it will cause a lot of congestion. I think that gas-based congestion control has to deal with the same problem, so maybe there can be some universal solution to discourage sending a lot of data to a single shard - higher gas prices or something like that.

@jancionear
Contributor Author

Because of the trouble with transaction priorities I became less enthusiastic about the two-queue solution. I think it would help in the short term, increasing throughput and lowering latency in the presence of large receipts, but it's incompatible with transaction priorities.

But I have another idea that should work better: partial receipts
A lot of headache is caused by the fact that we can only send whole receipts, which can be really large. What if we added a way to split a large receipt into many small partial receipts? The sender shard could send a limited number of partial receipts at every block height instead of sending one large receipt. This way we wouldn't face the dilemma of whether to send a small receipt with low priority before the large receipt. Every receipt (or at least a part of one) is sendable at every height, so we can have a single queue/buffer of outgoing partial receipts and send some of them out to the receiver shard at every block height. The receiver has to reconstruct the large receipt from the partial ones, but that sounds doable.
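A minimal sketch of the splitting/reassembly, assuming the receipt is just a byte blob with an id (none of these types or the 100 kiB part size exist in nearcore today):

```rust
/// Illustrative part size; in practice it would be tied to the per-height byte budget.
const PART_SIZE: usize = 100 * 1024;

struct PartialReceipt {
    receipt_id: [u8; 32],
    part_index: usize,
    total_parts: usize,
    data: Vec<u8>,
}

/// Sender side: split the serialized receipt into parts that fit the budget.
fn split_receipt(receipt_id: [u8; 32], bytes: &[u8]) -> Vec<PartialReceipt> {
    let total_parts = (bytes.len() + PART_SIZE - 1) / PART_SIZE;
    bytes
        .chunks(PART_SIZE)
        .enumerate()
        .map(|(part_index, chunk)| PartialReceipt {
            receipt_id,
            part_index,
            total_parts,
            data: chunk.to_vec(),
        })
        .collect()
}

/// Receiver side: once all parts for a receipt have arrived, reassemble it.
/// (A real implementation would also validate ids, duplicates and ordering.)
fn try_reassemble(mut parts: Vec<PartialReceipt>) -> Option<Vec<u8>> {
    let total = parts.first()?.total_parts;
    if parts.len() != total {
        return None; // still waiting for more parts
    }
    parts.sort_by_key(|p| p.part_index);
    Some(parts.into_iter().flat_map(|p| p.data).collect())
}
```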
