
In sc-network, isolate accesses to client, txpool, finality proof builder, ... in separate tasks. #559

Open
tomaka opened this issue Mar 27, 2020 · 8 comments
Labels
I2-bug The node fails to follow expected behavior. I4-refactor Code needs refactoring.

Comments

@tomaka
Contributor

tomaka commented Mar 27, 2020

Right now the sc-network directly calls the client, tx-pool, finality proof builder, and so on (full list in the config module).

However whenever one of these components takes a long time to respond, the entire networking stack freezes for everyone.
Not only does this happen right now due to some runtime performance issues, but it also makes us quite vulnerable to attacks.

Instead of performing these calls directly, we should, in my opinion, spawn a few background tasks (or even a new background task per call when that is possible) and perform the calls there.

My personal opinion is that isolating the calls to the client/tx-pool/... should be done by sc-service, but from a pragmatic perspective let's do it in sc-network.

Instead of passing an Arc<dyn Client> directly to the sync state machine, we should pass some sort of Isolated<Arc<dyn Client>>, where Isolated exposes an asynchronous API.
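
For illustration, here's a rough sketch of what such a wrapper could look like, assuming tokio is available as the executor. The `Client` trait and its single method are placeholders, not the actual sc-client-api traits:

```rust
use std::sync::Arc;

/// Hypothetical synchronous client API, standing in for `Arc<dyn Client>`.
pub trait Client: Send + Sync + 'static {
    fn block_body(&self, number: u64) -> Option<Vec<u8>>;
}

/// Wraps a synchronous component and exposes an asynchronous API by
/// offloading every call to tokio's blocking pool.
pub struct Isolated<T>(Arc<T>);

impl<T: Client> Isolated<T> {
    pub fn new(inner: Arc<T>) -> Self {
        Isolated(inner)
    }

    /// Asynchronous counterpart of `Client::block_body`. The networking
    /// task is never blocked; the heavy work happens on another thread.
    pub async fn block_body(&self, number: u64) -> Option<Vec<u8>> {
        let inner = self.0.clone();
        tokio::task::spawn_blocking(move || inner.block_body(number))
            .await
            .expect("client call panicked")
    }
}
```

Whether each call gets its own blocking task (as above) or goes through a shared background worker is an implementation detail; the important part is that the caller awaits instead of blocking the networking task.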

@seunlanlege
Contributor

seunlanlege commented Mar 30, 2020

Isn't this related to #3230? In terms of effort, wouldn't it be best to just coordinate work on #3230?

@tomaka
Contributor Author

tomaka commented Mar 30, 2020

Yes and no. It depends on how paritytech/substrate#3230 is implemented.

What I had in mind for paritytech/substrate#3230 is that when we call client.call(...) we should not put the thread to sleep while waiting for something to happen. The example I gave in paritytech/substrate#3230 is that we shouldn't block the thread while waiting for a network message to be received.

But even if we do that, it is still possible for client.call(...) to take several seconds of CPU time (because of heavy computation), during which the network will be frozen.

But we could also consider using separate tasks as part of paritytech/substrate#3230; I don't know.

@seunlanlege
Contributor

We should start a channel on matrix and continue there.

@bkchr
Member

bkchr commented Mar 30, 2020

We should start a channel on matrix and continue there.

Why? This is the right place to discuss the implementation.

@tomaka
Contributor Author

tomaka commented Apr 16, 2020

The code that is relevant for this issue is here: https://github.com/paritytech/substrate/blob/ae36c62223a8191e5bbca14ca5df3bc7d0edb6ea/client/network/src/chain.rs#L45
Ideally all the traits mentioned in this file would have asynchronous methods.

Since this is far from trivial, we should, as mentioned above, add to the networking code an IsolatedClient struct and an IsolatedFinalityProofProvider struct that wrap implementations of these traits and do the work in a background task.
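
As a sketch of that shape, the wrapper could forward requests over a bounded channel to a single background task. The `FinalityProofProvider` trait and its method below are simplified placeholders for the real traits in chain.rs, and tokio is assumed as the executor:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, oneshot};

/// Hypothetical synchronous provider, standing in for the real trait in `chain.rs`.
pub trait FinalityProofProvider: Send + Sync + 'static {
    fn prove_finality(&self, block: u64) -> Result<Option<Vec<u8>>, String>;
}

/// One request sent to the background task: the block to prove and a
/// channel on which the answer is returned.
struct Request {
    block: u64,
    reply: oneshot::Sender<Result<Option<Vec<u8>>, String>>,
}

/// Async front-end; the wrapped provider only ever runs in the background
/// task, so a slow proof cannot stall the networking future.
#[derive(Clone)]
pub struct IsolatedFinalityProofProvider {
    tx: mpsc::Sender<Request>,
}

impl IsolatedFinalityProofProvider {
    pub fn new<P: FinalityProofProvider>(provider: Arc<P>) -> Self {
        // Bounded channel: if the provider falls behind, callers see
        // back-pressure instead of an ever-growing queue.
        let (tx, mut rx) = mpsc::channel::<Request>(64);
        tokio::spawn(async move {
            while let Some(req) = rx.recv().await {
                let block = req.block;
                let provider = provider.clone();
                // Run the potentially heavy call on the blocking pool.
                let result = tokio::task::spawn_blocking(move || provider.prove_finality(block))
                    .await
                    .unwrap_or_else(|_| Err("provider task panicked".into()));
                let _ = req.reply.send(result);
            }
        });
        Self { tx }
    }

    pub async fn prove_finality(&self, block: u64) -> Result<Option<Vec<u8>>, String> {
        let (reply, rx) = oneshot::channel();
        self.tx
            .send(Request { block, reply })
            .await
            .map_err(|_| "background task gone".to_string())?;
        rx.await.map_err(|_| "background task dropped the request".to_string())?
    }
}
```

A bounded channel also gives us a natural point at which to notice that the wrapped component is falling behind.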

When it comes to making the networking code async-friendly, it shouldn't be very hard to adjust light_client_handler.rs and block_requests.rs. However, protocol.rs would be very tough to change.

I suggest we keep protocol.rs synchronous and only adjust the new protocols.

@arkpar
Member

arkpar commented Apr 16, 2020

However whenever one of these components takes a long time to respond, the entire networking stack freezes for everyone.

How so? Isn't it run on a thread pool?

I don't see how this proposed change will solve anything. Simply moving a bottleneck to a different place won't help much. You'd just get an overflow in the task queue.

@tomaka
Contributor Author

tomaka commented Apr 16, 2020

How so? Isn't it run on a thread pool?

Answering pings, parsing messages, and everything "close to the socket" is done in a thread pool. Answering requests when access to the client is needed is not.

Simply moving a bottleneck to a different place won't help much.

Right now, if the client takes a long time, we just accumulate network messages from nodes. After this change, we would process all messages quickly/immediately and instead accumulate the requests that we need to answer.

This means that:

1- We could detect when the client is over-capacity and return some sort of "sorry we're busy" error to the remote.
2- We could continue processing the block announces, transactions, and so on that the nodes send us.

As a more general problem, we call the client from within Future::poll, which is supposed to return quickly. Blocking in Future::poll is normally a no-no.
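
To make the second point concrete, here is a hypothetical fragment (names made up, using futures' FuturesUnordered) of how a handler could accumulate the answers it still owes instead of blocking inside poll:

```rust
use std::{future::Future, pin::Pin, task::{Context, Poll}};
use futures::stream::{FuturesUnordered, StreamExt};

/// A pending answer: the request id plus a future resolving to the encoded
/// response, produced e.g. by an `Isolated*` wrapper as sketched above.
type PendingResponse = Pin<Box<dyn Future<Output = (u64, Vec<u8>)> + Send>>;

/// Fragment of a protocol handler: instead of calling the client inside
/// `poll`, incoming requests are turned into futures and accumulated here.
pub struct Handler {
    pending_responses: FuturesUnordered<PendingResponse>,
}

impl Handler {
    pub fn new() -> Self {
        Self { pending_responses: FuturesUnordered::new() }
    }

    /// Called when a request arrives: the answer is produced elsewhere and
    /// merely queued here, so message processing stays fast.
    pub fn push_response(&mut self, response: PendingResponse) {
        self.pending_responses.push(response);
    }

    /// Called from `poll`; never blocks, only drains responses that are ready.
    pub fn poll_responses(&mut self, cx: &mut Context<'_>) -> Poll<(u64, Vec<u8>)> {
        match self.pending_responses.poll_next_unpin(cx) {
            Poll::Ready(Some(response)) => Poll::Ready(response),
            // Nothing ready (or nothing pending): let the rest of the
            // networking machinery keep running.
            _ => Poll::Pending,
        }
    }
}
```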

@arkpar
Member

arkpar commented Apr 16, 2020

Answering requests when access to the client is needed is not.

Why not? The client itself is thread-safe. I'm pretty sure handling requests does not lock anything for the duration of the client query. At least it did not initially.

1- We could detect when the client is over-capacity and return some sort of "sorry we're busy" error to the remote.

Request handling needs to be fair, e.g. bootnode resources need to scale evenly across all connecting peers. If the node is overloaded, the request needs to wait until it times out rather than be denied immediately.

2- We could continue processing the block announces, transactions, and so on that the nodes send us.

See above. I don't see why it is not the case right now.

This sounds like a lot of additional complexity to work around the actual problem. Handling network requests should be fast. If your network handler takes seconds to execute, it needs to be optimized first, not shoved into a queue. And if such a queue were to be organized, I'd argue that it should be managed explicitly, instead of relying on the tokio executor or creating another unbounded channel, as it will most certainly require some kind of per-peer throttling.
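
For what it's worth, a minimal sketch of what explicitly managed per-peer throttling could look like, using one tokio Semaphore per peer; the PeerId alias and the policy around the limit are placeholders:

```rust
use std::{collections::HashMap, sync::Arc};
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Hypothetical peer identifier; real code would use libp2p's `PeerId`.
type PeerId = [u8; 32];

/// Explicitly managed throttling: each peer may have at most `limit`
/// client-backed requests in flight at any time.
pub struct PerPeerThrottle {
    limit: usize,
    peers: HashMap<PeerId, Arc<Semaphore>>,
}

impl PerPeerThrottle {
    pub fn new(limit: usize) -> Self {
        Self { limit, peers: HashMap::new() }
    }

    /// Try to reserve a slot for `peer`. `None` means the peer is already at
    /// its limit; the caller can then apply whatever policy is deemed fair
    /// (let the request wait, or answer with a "busy" error).
    pub fn try_acquire(&mut self, peer: PeerId) -> Option<OwnedSemaphorePermit> {
        let limit = self.limit;
        let sem = self
            .peers
            .entry(peer)
            .or_insert_with(|| Arc::new(Semaphore::new(limit)))
            .clone();
        // The permit is released automatically when dropped, i.e. once the
        // response for this peer has been sent.
        sem.try_acquire_owned().ok()
    }
}
```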

@altonen altonen transferred this issue from paritytech/substrate Aug 24, 2023
@the-right-joyce the-right-joyce added I2-bug The node fails to follow expected behavior. I4-refactor Code needs refactoring. and removed I3-bug labels Aug 25, 2023