
rpc: improve latency by not blocking worker threads polling IO notifications #3242

Open · wants to merge 15 commits into master

Conversation


@alessandrod alessandrod commented Oct 21, 2024

Problem

Some RPC operations are CPU bound and run for a significant amount of time. Those operations end up blocking worker threads that are also used to handle IO notifications, so notifications don't get polled often enough and the whole RPC server can become slow and exhibit high latency. When latency gets high enough it can exceed request timeouts, leading to failed requests.

Summary of Changes

This PR makes some of the most CPU-expensive RPC methods use tokio::task::spawn_blocking to run CPU-hungry code. This way the worker threads doing IO don't get blocked and latency improves.

The methods changed so far include:

  • getMultipleAccounts
  • getProgramAccounts
  • getAccountInfo
  • getTokenAccountsByDelegate
  • getTokenAccountsByOwner

I'm not super familiar with RPC, so I've changed what, from reading the code, seems to load and copy a lot of data around. Please feel free to suggest more!
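
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the general shape of the change (the `Bank` type and `load_and_encode` helper are illustrative stand-ins, not the actual agave code): the async handler keeps running on the IO worker threads, and only the heavy load/encode work is shipped to tokio's blocking pool.

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

// Stand-in for the bank/accounts-db handle used by the real RPC methods.
struct Bank;

impl Bank {
    // Stand-in for loading an account and encoding it (base64/base64+zstd, up to 10MB).
    fn load_and_encode(&self, pubkey: &str) -> Option<String> {
        Some(format!("encoded({pubkey})"))
    }
}

// Before: this work ran directly on the async worker threads, starving IO polling.
// After: it runs on the blocking pool, and the worker thread is yielded at `.await`.
async fn get_account_info(
    runtime: Arc<Runtime>,
    bank: Arc<Bank>,
    pubkey: String,
) -> Option<String> {
    runtime
        .spawn_blocking(move || bank.load_and_encode(&pubkey))
        .await
        .expect("blocking task panicked")
}
```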

Test plan

Methodology for selecting CPU defaults

Run this blocks benchmark script while tweaking the CPU params. This was run on a 48-CPU machine.

| rpc_threads | rpc_blocking_threads | Average | Median | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| cpus | cpus / 2 | 21880 | 22136 | 22546 | 22572 |
| cpus | cpus / 4 | 20617 ($${\color{green}-5.7\%}$$) | 20627 ($${\color{green}-6.8\%}$$) | 21040 ($${\color{green}-6.7\%}$$) | 21149 ($${\color{green}-6.3\%}$$) |
| cpus | cpus / 8 | 21366 ($${\color{green}-2.4\%}$$) | 21367 ($${\color{green}-3.8\%}$$) | 21434 ($${\color{green}-4.9\%}$$) | 21477 ($${\color{green}-4.9\%}$$) |
| cpus / 2 | cpus / 2 | 21642 ($${\color{green}-1.1\%}$$) | 21525 ($${\color{green}-2.8\%}$$) | 23202 ($${\color{red}+2.9\%}$$) | 23235 ($${\color{red}+2.9\%}$$) |
| cpus / 2 | cpus / 4 | 20033 ($${\color{green}-8.4\%}$$) | 20044 ($${\color{green}-9.4\%}$$) | 20430 ($${\color{green}-9.4\%}$$) | 20598 ($${\color{green}-8.7\%}$$) |
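
For context on what the two knobs in this table control, here is a sketch of how they might be wired into the tokio runtime. This is an assumption-labelled illustration (mapping --rpc-threads to worker_threads and --rpc-blocking-threads to max_blocking_threads), not necessarily the PR's exact builder code:

```rust
use tokio::runtime::{Builder, Runtime};

// Illustrative only: assumes rpc_threads -> worker_threads and
// rpc_blocking_threads -> max_blocking_threads; the PR may wire these differently.
fn build_rpc_runtime(rpc_threads: usize, rpc_blocking_threads: usize) -> Runtime {
    Builder::new_multi_thread()
        .worker_threads(rpc_threads)                // threads that poll IO and run async handlers
        .max_blocking_threads(rpc_blocking_threads) // cap on the spawn_blocking pool (tokio's default is 512)
        .enable_all()
        .build()
        .expect("failed to build RPC runtime")
}
```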

Methodology

Using this script for computing metrics (https://gist.github.com/steveluscher/b4959b9601093b0009f1d7646217b030), I ran each of these account-cluster-bench suites before and after this PR:

  • account-info
  • block
  • blocks
  • first-available-block
  • multiple-accounts
  • slot
  • supply
  • token-accounts-by-delegate
  • token-accounts-by-owner
  • token-supply
  • transaction
  • transaction-parsed
  • version

Using a command similar to this:

 % (
       bash -c 'while ! curl -s localhost:8899/health | grep -q "ok"; do echo "Waiting for validator" && sleep 1; done;' \
           ${IFS# Set this higher if you want the test to run with more blocks having been committed } \
           && sleep 15 \
           && echo "Running bench" \
           && cd accounts-cluster-bench \
           && cargo run --release -- \
               -u l \
               --identity ~/.config/solana/id.json \
               ${IFS# Optional for benches that require token accounts} \
               ${IFS# https://gist.github.com/steveluscher/19261b5321f56a89dc75804070b61dc4} \
               ${IFS# --mint UhrKsjtPJJ8ndhSdrcCbQaiw8L8a6gH1FbmtJ4XpVJR } \
               --iterations 100 \
               --num-rpc-bench-threads 100 \
               --rpc-bench supply 2>&1 \
           | grep -Po "Supply average success_time: \K(\d+)" \
           | ~/stats.sh 
   ) \
       & (
           (cd accounts-cluster-bench && cargo build --release) \
           && (
               cd validator \
                   && rm -rf test-ledger/ \
                   && cargo run --release \
                       --manifest-path=./Cargo.toml \
                       --bin solana-test-validator -- \
                           ${IFS# Put this in ~/fixtures/ } \
                           ${IFS# https://gist.github.com/steveluscher/19261b5321f56a89dc75804070b61dc4 } \
                           --account-dir ~/fixtures \
                           --quiet
               ) \
       )
Average: 34293.3
Median: 31708.5
90th Percentile: 44640
99th Percentile: 45166

Note

You can adjust the sleep 15 if you want the validator to stack up more slots before starting the bench.

Warning

When running benches that require token accounts, supply a mint, space, and actually create the token account using the fixture found here.

Results

Warning

These results are a little messed up, because what's actually happening here is that the benchmark script is spitting out averages in 3s windows. The avg/p50/p90/p99 of those numbers is what you're seeing in this table. Not correct, but directionally correct.

Note

Filling in this grid would take a long time, especially if run against a mainnet RPC with production traffic. We may just choose to land this as ‘certainly better, how much we can't say exactly.’

| Suite | Average | Median | p90 | p99 |
| --- | --- | --- | --- | --- |
| account-info | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| block | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| blocks | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| first-available-block | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| multiple-accounts | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| slot | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| supply | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| token-accounts-by-delegate | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| token-accounts-by-owner | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| token-supply | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| transaction | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| transaction-parsed | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |
| version | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) | TBD ($${\color{green}-99\%}$$) |

@alessandrod (Author)

This is a rebase of solana-labs#34675

mergify bot commented Oct 21, 2024

If this PR represents a change to the public RPC API:

  1. Make sure it includes a complementary update to rpc-client/ (example)
  2. Open a follow-up PR to update the JavaScript client @solana/web3.js (example)

Thank you for keeping the RPC clients in sync with the server API @alessandrod.

accounts.push(
    self.runtime
        .spawn_blocking(move || {
            get_encoded_account(&bank, &pubkey, encoding, data_slice, None)

does it make sense to spawn a task per account? why not fetch all accounts in one task?

Author

parallelism - so that one big request doesn't monopolize the thread for too long and multiple requests can make progress together.


I understand that point, but if we have 500 req/s, and let's assume most of them are gMA, don't we spend more time on spawning tasks? gMA (which can be at most 100 gAI) is not the same as gPA (especially gTABO).

Author

> don't we spend more time on spawning tasks

The overhead of spawning a task is minimal; it's a matter of putting the task descriptor in a channel. The background thread already exists, it just needs to pop the item off the channel.

The tradeoff between this approach and doing it all in one go is that with this approach all requests potentially get a little slower (because of parallelism), whereas doing it all in one go makes some requests faster but lets others get starved and time out.

@fanatid fanatid Oct 24, 2024

ok, then why not use join_all? because right now we wait until the previous task has finished

Author

Because waiting is the point: if there are concurrent requests (which most likely there are), the background threads can serve those. If we join_all, all the tasks fire up at the same time, load increases, and the background threads have less capacity to process other things.

The point of the current approach is to maximize overall throughput, not single response time. I think that this is the best approach, but I'm open to changing my mind!
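
To make the tradeoff concrete, here is a self-contained sketch of the two shapes being discussed, with a hypothetical load_account standing in for the real load-and-encode work; neither function is the PR's literal code:

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

// Hypothetical stand-in for loading and encoding one account.
fn load_account(key: u64) -> String {
    format!("account-{key}")
}

// Current approach: one blocking task per account, awaited one at a time, so
// tasks from other concurrent requests can interleave on the blocking pool.
async fn one_at_a_time(runtime: Arc<Runtime>, keys: Vec<u64>) -> Vec<String> {
    let mut accounts = Vec::with_capacity(keys.len());
    for key in keys {
        let account = runtime
            .spawn_blocking(move || load_account(key))
            .await
            .expect("blocking task panicked");
        accounts.push(account);
    }
    accounts
}

// join_all-style alternative: every task is queued up front, which minimizes the
// latency of this one request but can monopolize the blocking pool under load.
async fn all_at_once(runtime: Arc<Runtime>, keys: Vec<u64>) -> Vec<String> {
    let handles: Vec<_> = keys
        .into_iter()
        .map(|key| runtime.spawn_blocking(move || load_account(key)))
        .collect();
    let mut accounts = Vec::with_capacity(handles.len());
    for handle in handles {
        accounts.push(handle.await.expect("blocking task panicked"));
    }
    accounts
}
```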

@alessandrod (Author)

Some other methods that should probably go to the background pool (intel from an RPC provider 🙏):

getConfirmedBlocks - 58.3 s
getSupply - 30.1 s
getFirstAvailableBlock - 1.94 s
getConfirmedTransaction - 4.28 s
getBlock - 1.49 s
getTokenSupply - 1.30 s

@@ -2325,6 +2335,7 @@ impl DefaultArgs {
.retry_pool_max_size
.to_string(),
rpc_threads: num_cpus::get().to_string(),
rpc_blocking_threads: (num_cpus::get() / 2).to_string(),
Author

this needs to be set a lot lower

@@ -2325,6 +2335,7 @@ impl DefaultArgs {
.retry_pool_max_size
.to_string(),
rpc_threads: num_cpus::get().to_string(),
Author

this needs to be set a lot lower


I'll tweak these and see how the choice of cpus affects the benchmark latencies.

@steveluscher steveluscher self-assigned this Dec 2, 2024

@steveluscher steveluscher left a comment


I'll keep going with this.

  1. Apply the same mitigation to:
    • getConfirmedBlocks
    • getSupply
    • getFirstAvailableBlock
    • getConfirmedTransaction
    • getBlock
    • getTokenSupply
  2. Add RpcBench routines to cover all of them
  3. Tweak the CPU numbers to find some lower bound
  4. Paste before/after benchmark results into the PR description

Comment on lines +475 to +501
let response = self
    .runtime
    .spawn_blocking({
        let bank = Arc::clone(&bank);
        move || get_encoded_account(&bank, &pubkey, encoding, data_slice, None)
    })
    .await
    .expect("rpc: get_encoded_account panicked")?;


Let me say this back to you, just to make sure that I understand it correctly.

At first glance, creating a yield point here seems strange; there's very little work above and below, and this apparently synchronous call to get_encoded_account() in the middle. Why not just do all the work in one go?

However, if you hyperclick into get_encoded_account() long enough, you eventually bottom out at this ‘read from disk’ call. This tells me that though this call appears sync, it can bottom out at an IOWait and cause the thread to go idle.

If get_encoded_account() and every function all the way down were async, we would naturally get yieldiness through the use of await. Since they aren't, the change here introduces a yield point in recognition that get_encoded_account() may wait on IO.

Author

All correct! I would add that there are other hidden costs in loading from accountsdb (especially when the account doesn't exist, the index lookup is very slow), and that encoding is also very slow on large inputs (it can do compression etc.), and accounts can be up to 10MB.


@steveluscher

@t-nelson @alessandrod: I've completed a first pass, and I'm working on blasting out some measurements right now. Feel free to ignore this, but if you'd like to take an early crack at feedback, please do.

Also: #4048 when you have a sec.

@alessandrod (Author)

Great stuff! Just one note about selecting the right number of rpc_threads: I think that we'll need help from operators to properly test that part, because the results you'll get from the bench wrt number of threads won't be representative of what happens while doing RPC requests against a mnb node, which has 400+ threads busy doing other stuff.

As a default I think we should be fairly conservative and use small numbers.

@steveluscher

> As a default I think we should be fairly conservative and use small numbers.

I've completed some measurements, as seen in the chart, but I can't say that I really know what to set it to. Given that this is configurable via CLI arguments, what's your guidance?

  1. Set it to 1 and let RPC providers configure it if they would like, or
  2. Ask RPC providers to gift a sensible default to us through experimentation?

@steveluscher

Hey @NicolasPennie, @WilfredAlmeida, would you be willing to build an RPC node using this PR, tweak --rpc-blocking-threads, and let me know what a sensible default for that value might be (preferably as a percentage of available CPUs), based on your observations?

@steveluscher steveluscher force-pushed the rpc-spawn-blocking-new branch 2 times, most recently from 6fcf0b2 to 234f234 on December 11, 2024 23:09
@steveluscher steveluscher marked this pull request as ready for review December 11, 2024 23:19
@steveluscher

Computing the benchmarks using the script in the PR description will take a long time. That said, I have not run a benchmark locally yet that hasn't shown positive results. I invite you to take the position that this is ‘probably better’ without the need to quantify exactly how much, and to iterate from here.

@steveluscher steveluscher force-pushed the rpc-spawn-blocking-new branch from 234f234 to 9ba5d62 on December 11, 2024 23:24
Comment on lines 967 to 978
.validator(|value| {
    value
        .parse::<u64>()
        .map_err(|err| format!("error parsing '{value}': {err}"))
        .and_then(|threads| {
            if threads > 0 {
                Ok(())
            } else {
                Err(format!("value must be >= 1"))
            }
        })
})


I added some rigor to the CLI here.

@@ -2362,6 +2384,7 @@ impl DefaultArgs {
.retry_pool_max_size
.to_string(),
rpc_threads: num_cpus::get().to_string(),
rpc_blocking_threads: 1.max(num_cpus::get() / 2).to_string(),


I tweaked the default to never be less than 1.

@steveluscher steveluscher force-pushed the rpc-spawn-blocking-new branch from 9ba5d62 to b6b71a7 on December 12, 2024 18:22
@fanatid

fanatid commented Dec 12, 2024

@steveluscher we are going to make a build with this PR in Triton, but that's hard to do if the PR contains force-pushes. I already spent almost 2 hours trying to resolve conflicts in our rpc code, and now I do not know what changed after 9ba5d62...

@steveluscher

steveluscher commented Dec 12, 2024

I gotchu!

git fetch
git remote add alessandrod https://github.com/alessandrod/solana
git fetch alessandrod
git checkout 5435ef54081d79bfc623668db75e535459c59f4c
git checkout -b tmp-3242-before
git rebase 3a6d593401bfbbf3ecf459e1fa7ead1613041954
git checkout bb8d89b
git checkout -b tmp-3242-after
git rebase 3a6d593401bfbbf3ecf459e1fa7ead1613041954
git diff tmp-3242-before tmp-3242-after

@steveluscher

I completely screwed up the ? / unwrap() semantics here, because I don't know how to program. Fixing! Will have an update in a moment.

alessandrod and others added 10 commits December 13, 2024 00:07
  • By default tokio allows up to 512 blocking threads. We don't want that many threads, as they'd slow down other validator threads.
  • Make the function async and use tokio::task::spawn_blocking() to execute CPU-bound code in the background. This prevents stalling the worker threads polling IO notifications and serving other non-CPU-bound rpc methods.
  • get_filtered_program_accounts can be used to retrieve _a list_ of accounts that match some filters. This is CPU bound and can block the calling thread for a significant amount of time when copying many/large accounts.
  • Pass the custom runtime to JsonRpcRequestProcessor and use it to spawn blocking tasks from rpc methods.
  • When these methods reach out to Blockstore, yield the thread.
  • When this method reaches out to accounts_db (through a call to `calculate_non_circulating_supply()`), yield the thread.
  • When this method reaches out to blockstore, yield the thread.
  • When this method reaches out to blockstore, yield the thread.
  • When this method reaches out to methods on bank that do reads, yield the thread.
@steveluscher steveluscher force-pushed the rpc-spawn-blocking-new branch from b6b71a7 to 02f5c94 on December 13, 2024 00:07
@steveluscher

Updated!

  • Replaced Box::pin(...) with async move { /* ... */ }.boxed()
  • Corrected rogue replacement of ? with unwrap() now that I understand the runtime semantics (see the sketch below).
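
For anyone else reading along, the distinction is sketched below with a hypothetical fallible_work standing in for something like get_encoded_account:

```rust
use tokio::runtime::Runtime;

// Hypothetical stand-in for a fallible call such as get_encoded_account.
fn fallible_work() -> Result<u64, String> {
    Ok(42)
}

// Awaiting a spawn_blocking handle yields Result<T, JoinError>, and here T is
// itself a Result, so there are two layers to unwind:
//   - the outer JoinError only occurs if the closure panicked, so expect() is appropriate;
//   - the inner Result is the RPC-level error and must be propagated with `?`, not unwrapped.
async fn example(runtime: &Runtime) -> Result<u64, String> {
    let value = runtime
        .spawn_blocking(fallible_work)
        .await
        .expect("blocking task panicked")?;
    Ok(value)
}
```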

@alessandrod alessandrod left a comment

Generally looks great! Left some comments

})?)
self.runtime
    .spawn_blocking(move || {
        bank.get_filtered_indexed_accounts(
Author

Maybe we should make get_filtered_indexed_accounts do the spawn_blocking
internally like get_filtered_program_accounts and get_filtered_spl_token_accounts_by_owner?

.spawn_blocking({
    let bank = bank.clone();
    move || {
        calculate_non_circulating_supply(&bank).map_err(|e| RpcCustomError::ScanError {
Author

calculate_non_circulating_supply is called from some other methods, maybe we
should make it async?
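
One possible shape for that suggestion, sketched with placeholder types (the real function signature, error type, and runtime plumbing will differ): the async wrapper owns the spawn_blocking, so every caller yields its worker thread the same way.

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

// Placeholder types standing in for the real bank and supply structures.
struct Bank;
struct NonCirculatingSupply;

// Stand-in for the real scan over accounts-db.
fn calculate_non_circulating_supply(_bank: &Bank) -> Result<NonCirculatingSupply, String> {
    Ok(NonCirculatingSupply)
}

// Async wrapper: callers just `.await` it and never block a worker thread themselves.
async fn calculate_non_circulating_supply_async(
    runtime: &Runtime,
    bank: &Arc<Bank>,
) -> Result<NonCirculatingSupply, String> {
    let bank = Arc::clone(bank);
    runtime
        .spawn_blocking(move || calculate_non_circulating_supply(&bank))
        .await
        .expect("non-circulating supply task panicked")
}
```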

rpc/src/rpc.rs Outdated
.runtime
.spawn_blocking({
    let blockstore = Arc::clone(&self.blockstore);
    move || blockstore.get_first_available_block().unwrap_or_default()
Author

What's the reasoning for doing this in the bg pool? Looking at the
implementation it doesn't look slow?

rpc/src/rpc.rs Outdated
        move || blockstore.get_rooted_block(slot, true)
    })
    .await
    .expect("Failed to spawn blocking task");
self.check_blockstore_root(&result, slot)?;
let encode_block = |confirmed_block: ConfirmedBlock| -> Result<UiConfirmedBlock> {
Author

encoding probably needs to go in the blocking pool too?

rpc/src/rpc.rs Outdated
.runtime
.spawn_blocking({
    let blockstore = Arc::clone(&self.blockstore);
    move || blockstore.get_first_available_block().unwrap_or_default()
Author

same here, not sure that this needs to go in the bg pool

rpc/src/rpc.rs Outdated
let blockstore = Arc::clone(&self.blockstore);
move || -> Result<Vec<_>> {
    let blocks = blockstore
        .rooted_slot_iterator(max(start_slot, lowest_blockstore_slot))
Author

same here: I think that this iterator is fast, it just gets a bunch of u64s from rocks.

In general I'm not sure that this function should be async; it has a confusing name, but it just returns a vec of slot numbers.

rpc/src/rpc.rs Outdated
async move {
    let unix_timestamp = self
        .runtime
        .spawn_blocking(move || bank.clock().unix_timestamp)
Author

I don't think that this should go in the bg pool? The clock sysvar account is
smol and pretty much guaranteed to be cached

let bank = Arc::clone(&bank);
let mint = Pubkey::clone(mint);
move || {
    bank.get_account(&mint).ok_or_else(|| {
Author

Token accounts are small, so I'm not sure that we should send this to the bg pool? Especially considering it's memecoin szn, this is probably going to be called a lot. We must be careful not to put too many things in the bg pool, or we'll accidentally serialize frequent, fast requests behind chonky, slow ones we put in the bg pool.

This blockstore method doesn't actually do expensive reads.

This reverts commit 3bbc57f.
Kept the `spawn_blocking` around:

* Call to `get_rooted_block`
* Call to `get_complete_block`

This reverts commit 710f9c6.