rpc: improve latency by not blocking worker threads polling IO notifications #34675
Conversation
I've been testing with the new benches from #34520. I've tested with 100 threads; things get much worse without the changes as you add more threads. As you can see in the before run, the average success_time gets increasingly higher. In the after run, success_time is constant and snappy.
Before
After
rpc_bigtable_config: Option::default(),
max_multiple_accounts: Option::default(),
account_indexes: AccountSecondaryIndexes::default(),
rpc_threads: 1,
Should I add a dep on num_cpus here so I can num_cpus::get()?
Could. I doubt that's what we want long term*, but it's no dumber than what we do today.
* I'm looking into making all our thread pools cli-configurable atm :hidethepainharold:
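As a side note on the num_cpus question above: a std-only alternative exists. This is a hypothetical sketch (not what the PR does); the function name default_rpc_threads is illustrative.

```rust
use std::thread;

// Sketch: since Rust 1.59, std::thread::available_parallelism() can
// stand in for a num_cpus dependency when picking a default thread
// count. It returns Err in unusual environments, so fall back to 1.
fn default_rpc_threads() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    let n = default_rpc_threads();
    assert!(n >= 1);
    println!("default rpc_threads: {n}");
}
```

Whether std or num_cpus is used, the point stands that a hardcoded 1 is only a test default here.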
Then I'll leave it as is - the validator always overrides these fields anyway; the default values are only used in tests.
Codecov Report
Attention:
Additional details and impacted files

@@ Coverage Diff @@
##            master   #34675   +/-  ##
=========================================
- Coverage     81.8%    81.8%   -0.1%
=========================================
  Files          824      824
  Lines       222687   222857    +170
=========================================
+ Hits        182298   182387     +89
- Misses       40389    40470     +81
By default tokio allows up to 512 blocking threads. We don't want that many threads, as they'd slow down other validator threads.
Make the function async and use tokio::task::spawn_blocking() to execute CPU-bound code in the background. This prevents stalling the worker threads polling IO notifications and serving other non-CPU-bound rpc methods.
get_filtered_program_accounts can be used to retrieve _a list_ of accounts that match some filters. This is CPU-bound and can block the calling thread for a significant amount of time when copying many/large accounts.
Pass the custom runtime to JsonRpcRequestProcessor and use it to spawn blocking tasks from rpc methods.
Thanks for working this up, @alessandrod ! I'm in favor. I assume this will also improve perf for heavy loads of bigtable queries, since those are already using the runtime...?
As for which methods to treat, my leaning is to do all of them. Wdyt? You don't need to do all that legwork, though. We could merge this PR with its existing subset, then I can pick up extending it to the rest of the methods.
enable_rpc_transaction_history: Default::default(),
enable_extended_tx_metadata_storage: Default::default(),
Huh, clippy or CI doesn't complain about all these Default::default()s?
I'd prefer these be specific to type.
It does not, and yeah it's weird. I'll change to use specific types.
get_encoded_account(&bank, &pubkey, encoding, data_slice)
})
.await
.expect("rpc: get_encoded_account panicked")?,
I think we should handle any errors nicely, shouldn't we?
My thinking was: spawn_blocking returns Err if the spawned code panics. If the spawned code panics, it sounds like it's a bug? Not sure it makes sense to handle gracefully?
it also returns an error upon cancellation
I don't think it can in this case? We're creating the JoinHandle and immediately awaiting it. We're not calling abort() and we're not giving out any AbortHandle(s), so nobody can cancel?
the runtime won't cancel outstanding tasks when its threads catch SIGINT?
i'm expecting to see reports of this at process exit
I don't think so? https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#shutdown
Blocking functions spawned through Runtime::spawn_blocking keep running until they return.
Tokio cancels tasks by no longer polling them, so they stop at the next await point. But code inside spawn_blocking doesn't await, so it can only run to completion (or process exit). pthread_cancel isn't a thing etc
@@ -429,11 +465,18 @@ impl JsonRpcRequestProcessor {
    })?;
    let encoding = encoding.unwrap_or(UiAccountEncoding::Binary);

    let response = get_encoded_account(&bank, pubkey, encoding, data_slice)?;
Looks like we can shrink the diff quite a bit by passing the runtime into get_encoded_account() and doing all the async magick down there?
Should we give some of the slower ledger accessors the same treatment?
This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave
Problem
Some RPC operations are CPU-bound and run for a significant amount of time. These operations end up blocking worker threads that are also used to handle IO notifications, so notifications aren't polled often enough and the whole RPC server can become slow and exhibit high latency. When latency gets high enough it can exceed request timeouts, leading to failed requests.
Summary of Changes
This PR makes some of the most CPU-expensive RPC methods use
tokio::task::spawn_blocking
to run CPU-hungry code. This way the worker threads doing IO don't get blocked and latency is improved. The methods changed so far include:
getMultipleAccounts
getProgramAccounts
getAccount
getTokenAccountsByDelegate
getTokenAccountsByOwner
I'm not super familiar with RPC, so I've changed the methods that, from reading the code, appear to load and copy a lot of data around. Please feel free to suggest more!