rpc: improve latency by not blocking worker threads polling IO notifications #3242
base: master
Conversation
This is a rebase of solana-labs#34675
If this PR represents a change to the public RPC API:
Thank you for keeping the RPC clients in sync with the server API, @alessandrod.
accounts.push(
    self.runtime
        .spawn_blocking(move || {
            get_encoded_account(&bank, &pubkey, encoding, data_slice, None)
Does it make sense to spawn a task per account? Why not fetch all accounts in one task?
parallelism - so that one big request doesn't monopolize the thread for too long and multiple requests can make progress together.
I understand that point, but if we have 500 req/s, and let's assume most of them are gMA, don't we spend more time on spawning tasks? gMA (which can be at most 100 gAI lookups) is not the same as gPA (especially gTABO).
> don't we spend more time on spawning tasks

The overhead of spawning a task is minimal; it's a matter of putting the task descriptor in a channel. The background thread already exists and just needs to pop the item off the channel.
The tradeoff with this approach vs doing it all in one go is that with this approach all requests potentially get a little slower (because of parallelism), whereas doing it all in one go makes some requests faster but leaves others at risk of getting starved and timing out.
OK, then why not use `join_all`? Because right now we wait until the previous task is finished.
Because waiting is the point - so that if there are concurrent requests (which most likely there are), the background threads can serve those. If we `join_all`, all the tasks fire up at the same time, load increases, and the background threads have less capacity to process other things.
The point of the current approach is to maximize overall throughput, not single-response time. I think that this is the best approach, but I'm open to changing my mind!
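For illustration, here's a minimal sketch of the two strategies under discussion, using a stand-in `load_account` in place of `get_encoded_account()` and plain `u64`/`String` types; this is not the actual rpc.rs code, and it assumes the `futures` crate for `join_all`:

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

// Stand-in for the CPU/IO-heavy `get_encoded_account()` path.
fn load_account(key: u64) -> Option<String> {
    Some(format!("account-{key}"))
}

// Current approach: one blocking task per account, awaited before the next is
// spawned, so tasks from concurrent requests can interleave on the blocking pool.
async fn load_sequentially(runtime: Arc<Runtime>, keys: Vec<u64>) -> Vec<Option<String>> {
    let mut accounts = Vec::with_capacity(keys.len());
    for key in keys {
        let account = runtime
            .spawn_blocking(move || load_account(key))
            .await
            .expect("load_account panicked");
        accounts.push(account);
    }
    accounts
}

// join_all alternative: every task fires at once; this request finishes sooner
// but can hog the blocking pool and starve other requests.
async fn load_all_at_once(runtime: Arc<Runtime>, keys: Vec<u64>) -> Vec<Option<String>> {
    let handles: Vec<_> = keys
        .into_iter()
        .map(|key| runtime.spawn_blocking(move || load_account(key)))
        .collect();
    futures::future::join_all(handles)
        .await
        .into_iter()
        .map(|result| result.expect("load_account panicked"))
        .collect()
}
```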
Some other methods that should probably go in background (intel from RPC provider 🙏 )
validator/src/cli.rs
Outdated
@@ -2325,6 +2335,7 @@ impl DefaultArgs {
            .retry_pool_max_size
            .to_string(),
        rpc_threads: num_cpus::get().to_string(),
        rpc_blocking_threads: (num_cpus::get() / 2).to_string(),
this needs to be set a lot lower
@@ -2325,6 +2335,7 @@ impl DefaultArgs {
            .retry_pool_max_size
            .to_string(),
        rpc_threads: num_cpus::get().to_string(),
this needs to be set a lot lower
I'll tweak these and see how the choice of cpus affects the benchmark latencies.
I'll keep going with this.
- Apply the same mitigation to:
  - getConfirmedBlocks
  - getSupply
  - getFirstAvailableBlock
  - getConfirmedTransaction
  - getBlock
  - getTokenSupply
- Add RpcBench routines to cover all of them
- Tweak the CPU numbers to find some lower bound
- Paste before/after benchmark results into the PR description
let response = self
    .runtime
    .spawn_blocking({
        let bank = Arc::clone(&bank);
        move || get_encoded_account(&bank, &pubkey, encoding, data_slice, None)
    })
    .await
    .expect("rpc: get_encoded_account panicked")?;
Let me say this back to you, just to make sure that I understand it correctly.
On first glance, creating a yield point here seems strange; there's very little work above and below, and this apparently synchronous call to `get_encoded_account()` in the middle. Why not just do all the work in one go?
However, if you hyperclick into `get_encoded_account()` long enough, you eventually bottom out at this ‘read from disk’ call. This tells me that though this call appears sync, it can bottom out at an IO wait and cause the thread to go idle.
If `get_encoded_account()` and every function all the way down were `async`, we would naturally get yieldiness through the use of `await`. Since it isn't, the change here introduces a yield point in recognition that `get_encoded_account()` may wait on IO.
All correct! I would add that there are other hidden costs in loading from accountsdb (the index lookup is very slow, especially if the account doesn't exist), and that encoding is also very slow on large inputs (it can do compression etc.), and accounts can be up to 10MB.
Force-pushed from 5435ef5 to e1ab1b6
Force-pushed from e1ab1b6 to 30e4e0a
@t-nelson @alessandrod: I've completed a first pass, and I'm working on blasting out some measurements right now. Feel free to ignore this, but if you would like to take an early crack at feedback, please do. Also: #4048 when you have a sec.
Great stuff! Just one note about selecting the right number of rpc_threads: I think that we'll need help from operators to properly test that part, because the results you'll get from the bench wrt number of threads won't be representative of what happens while doing RPC requests against a mnb node, which has 400+ threads busy doing other stuff. As a default I think we should be fairly conservative and use small numbers.
I've completed some measurements, as seen in the chart, but I can't say that I really know what to set it to. Given that this is configurable via CLI arguments, what's your guidance?
Hey @NicolasPennie, @WilfredAlmeida, would you be willing to build an RPC using this PR, tweak
Force-pushed from 6fcf0b2 to 234f234
Computing the benchmarks using the script in the PR description will take a long time. That said, I have not run a benchmark locally yet that hasn't shown positive results. I invite you to take the position that this is ‘probably better’ without the need to quantify exactly how much, and to iterate from here.
Force-pushed from 234f234 to 9ba5d62
validator/src/cli.rs
Outdated
.validator(|value| {
    value
        .parse::<u64>()
        .map_err(|err| format!("error parsing '{value}': {err}"))
        .and_then(|threads| {
            if threads > 0 {
                Ok(())
            } else {
                Err(format!("value must be >= 1"))
            }
        })
})
I added some rigor to the CLI here.
validator/src/cli.rs
Outdated
@@ -2362,6 +2384,7 @@ impl DefaultArgs {
            .retry_pool_max_size
            .to_string(),
        rpc_threads: num_cpus::get().to_string(),
        rpc_blocking_threads: 1.max(num_cpus::get() / 2).to_string(),
I tweaked the default to never be less than 1.
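For reference, a minimal sketch of where a value like this would land: tokio's runtime builder caps the blocking pool via `max_blocking_threads` (the default cap is 512). The function name, thread name, and parameter wiring below are illustrative, not the actual rpc_service code.

```rust
use tokio::runtime::{Builder, Runtime};

// Illustrative only: shows how rpc_threads / rpc_blocking_threads could map
// onto tokio's Builder knobs; the real wiring in the validator may differ.
fn build_rpc_runtime(rpc_threads: usize, rpc_blocking_threads: usize) -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(rpc_threads)
        // Keep the blocking pool well below tokio's 512 default so
        // spawn_blocking work can't crowd out the rest of the validator.
        .max_blocking_threads(rpc_blocking_threads.max(1))
        .thread_name("solRpc")
        .enable_all()
        .build()
}
```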
Force-pushed from 9ba5d62 to b6b71a7
@steveluscher we are going to make a build with this PR in Triton, but it's hard to do if the PR contains force-pushes. I already spent almost 2 hours trying to resolve conflicts in our rpc code, and now I do not know what changed after 9ba5d62...
I gotchu!
I completely screwed up the
By default tokio allows up to 512 blocking threads. We don't want that many threads, as they'd slow down other validator threads.
Make the function async and use tokio::task::spawn_blocking() to execute CPU-bound code in the background. This prevents stalling the worker threads that poll IO notifications and serve other non-CPU-bound rpc methods.
get_filtered_program_accounts can be used to retrieve _a list_ of accounts that match some filters. This is CPU bound and can block the calling thread for a significant amount of time when copying many/large accounts.
Pass the custom runtime to JsonRpcRequestProcessor and use it to spawn blocking tasks from rpc methods.
When these methods reach out to Blockstore, yield the thread
When this method reaches out to accounts_db (through a call to `calculate_non_circulating_supply()`), yield the thread.
When this method reaches out to blockstore, yield the thread
When this method reaches out to blockstore, yield the thread
When this method reaches out to methods on bank that do reads, yield the thread
Force-pushed from b6b71a7 to 02f5c94
Updated!
Generally looks great! Left some comments
})?)
self.runtime
    .spawn_blocking(move || {
        bank.get_filtered_indexed_accounts(
Maybe we should make get_filtered_indexed_accounts do the spawn_blocking internally, like get_filtered_program_accounts and get_filtered_spl_token_accounts_by_owner?
.spawn_blocking({
    let bank = bank.clone();
    move || {
        calculate_non_circulating_supply(&bank).map_err(|e| RpcCustomError::ScanError {
calculate_non_circulating_supply is called from some other methods, maybe we should make it async?
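A rough sketch of the shape that suggestion could take, with placeholder `Bank`/`Supply` types and a hypothetical `scan_non_circulating_accounts` standing in for the real accounts-db scan (not the actual agave signatures):

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

// Placeholder types, not the real agave ones.
struct Bank;
struct Supply(u64);

// Stand-in for the slow, blocking accounts-db scan.
fn scan_non_circulating_accounts(_bank: &Bank) -> Supply {
    Supply(0)
}

// If the helper itself is async and owns the spawn_blocking, every caller
// automatically gets the yield point instead of each call site repeating it.
async fn calculate_non_circulating_supply(runtime: &Runtime, bank: &Arc<Bank>) -> Supply {
    let bank = Arc::clone(bank);
    runtime
        .spawn_blocking(move || scan_non_circulating_accounts(&bank))
        .await
        .expect("non-circulating supply scan panicked")
}
```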
rpc/src/rpc.rs
Outdated
.runtime
.spawn_blocking({
    let blockstore = Arc::clone(&self.blockstore);
    move || blockstore.get_first_available_block().unwrap_or_default()
What's the reasoning for doing this in the bg pool? Looking at the implementation it doesn't look slow?
rpc/src/rpc.rs
Outdated
        move || blockstore.get_rooted_block(slot, true)
    })
    .await
    .expect("Failed to spawn blocking task");
self.check_blockstore_root(&result, slot)?;
let encode_block = |confirmed_block: ConfirmedBlock| -> Result<UiConfirmedBlock> {
encoding probably needs to go in the blocking pool too?
rpc/src/rpc.rs
Outdated
.runtime
.spawn_blocking({
    let blockstore = Arc::clone(&self.blockstore);
    move || blockstore.get_first_available_block().unwrap_or_default()
same here, not sure that this needs to go in the bg pool
rpc/src/rpc.rs
Outdated
let blockstore = Arc::clone(&self.blockstore);
move || -> Result<Vec<_>> {
    let blocks = blockstore
        .rooted_slot_iterator(max(start_slot, lowest_blockstore_slot))
same here: I think that this iterator is fast, it just gets a bunch of u64s from rocks.
In general, I'm not sure that this function should be async; it has a confusing name but it just returns a vec of slot numbers.
rpc/src/rpc.rs
Outdated
async move {
    let unix_timestamp = self
        .runtime
        .spawn_blocking(move || bank.clock().unix_timestamp)
I don't think that this should go in the bg pool? The clock sysvar account is smol and pretty much guaranteed to be cached.
let bank = Arc::clone(&bank);
let mint = Pubkey::clone(mint);
move || {
    bank.get_account(&mint).ok_or_else(|| {
Token accounts are small, so not sure that we should send this to the bg pool? Especially considering it's memecoin szn, so this is probably going to be called a lot? We must be careful to not put too many things in the bg pool, or we accidentally serialize frequent, fast requests behind chonky, slow ones we put in the bg pool.
Problem
Some RPC operations are CPU bound and run for a significant amount of time. These operations end up blocking worker threads that are also used to handle IO notifications, so notifications are not polled often enough and the whole RPC server can become slow and exhibit high latency. When latency gets high enough it can exceed request timeouts, leading to failed requests.
Summary of Changes
This PR makes some of the most CPU-expensive RPC methods use `tokio::task::spawn_blocking` to run CPU-hungry code. This way the worker threads doing IO don't get blocked, and latency is improved. The methods changed so far include:
- getMultipleAccounts
- getProgramAccounts
- getAccountInfo
- getTokenAccountsByDelegate
- getTokenAccountsByOwner
I'm not super familiar with RPC, so I've changed what, looking at the code, seems to be loading/copying a lot of data around. Please feel free to suggest more!
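As a rough illustration of the shape of the change (placeholder `Bank` type and `encode_account` helper, not the real rpc.rs signatures), the difference is whether the CPU-heavy work runs inline on the async worker thread or on the blocking pool:

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

// Placeholders standing in for the real bank and the load+encode path.
struct Bank;

fn encode_account(_bank: &Bank, pubkey: &str) -> String {
    // Imagine an accounts-db read plus base64/zstd encoding here.
    format!("encoded:{pubkey}")
}

// Before: the heavy call runs inline, blocking the IO worker thread.
async fn get_account_info_before(bank: Arc<Bank>, pubkey: String) -> String {
    encode_account(&bank, &pubkey)
}

// After: the heavy call runs on the blocking pool; the worker thread yields at
// the .await and keeps polling IO for other requests in the meantime.
async fn get_account_info_after(runtime: Arc<Runtime>, bank: Arc<Bank>, pubkey: String) -> String {
    runtime
        .spawn_blocking(move || encode_account(&bank, &pubkey))
        .await
        .expect("encode_account panicked")
}
```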
Test plan
Methodology for selection of CPU defaults
Run this `blocks` benchmark script while tweaking the CPU params (`rpc_threads`, `rpc_blocking_threads`). This was run on a 48-CPU machine.
Methodology
Using this script for computing metrics: https://gist.github.com/steveluscher/b4959b9601093b0009f1d7646217b030, I ran each of these `account-cluster-bench` suites before and after this PR:
- account-info
- block
- blocks
- first-available-block
- multiple-accounts
- slot
- supply
- token-accounts-by-delegate
- token-accounts-by-owner
- token-supply
- transaction
- transaction-parsed
- version
Using a command similar to this:
Note
You can adjust the `sleep 15` if you want the validator to stack up more slots before starting the bench.
Warning
When running benches that require token accounts, supply a `mint` and `space`, and actually create the token account using the fixture found here.
Results
Warning
These results are a little messed up, because what's actually happening here is that the benchmark script is spitting out averages in 3s windows. The avg/p50/p90/p99 of those numbers is what you're seeing in this table. Not correct, but directionally correct.
Note
Filling in this grid would take a long time, especially if run against a mainnet RPC with production traffic. We may just choose to land this as ‘certainly better, how much we can't say exactly.’
(Per-suite before/after latency tables omitted here: account-info, block, blocks, first-available-block, multiple-accounts, slot, supply, token-accounts-by-delegate, token-accounts-by-owner, token-supply, transaction, transaction-parsed, version.)