
Make transaction status service multi-threaded. #4032

Open · wants to merge 8 commits into base: master

Conversation


@fkouteib fkouteib commented Dec 10, 2024

Problem

As part of an investigation into Agave OOM issues in internal private cluster tests, I found that the TSS receiver channel would get severely backed up (80k+ pending messages) when the cluster runs at a sustained 40k TPS (bench-tps workload; 80-20 FD to Agave node ratio). This slows down the rest of the system and builds up memory usage until the node OOMs (it crashed a 256 GB Agave node, and the Agave tile on a 512 GB FD node, in my tests). This issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled. Without those flags (third graph), the OOM did not reproduce, but there were still memory spikes.

Summary of Changes

  • Make the transaction status receiver multi-threaded, running 4 worker threads (sketched below).
  • With this change, the queue backlog stays in the 1k to 5k pending-message range.
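
For illustration, a rough sketch of the multi-worker pattern, with placeholder names (`StatusBatch`, `handle_batch`, and the thread label are not the actual diff):

```rust
use std::{thread, time::Duration};
use crossbeam_channel::{Receiver, RecvTimeoutError};

// Placeholder for the real transaction status message payload.
struct StatusBatch;

// Placeholder for the existing per-message work (write_transaction_status_batch, etc.).
fn handle_batch(_batch: StatusBatch) {}

fn spawn_workers(receiver: Receiver<StatusBatch>, num_workers: usize) -> Vec<thread::JoinHandle<()>> {
    (0..num_workers)
        .map(|i| {
            // crossbeam receivers are Clone and multi-consumer, so the workers
            // compete for messages from the same queue.
            let receiver = receiver.clone();
            thread::Builder::new()
                .name(format!("solTssWorker{i:02}"))
                .spawn(move || loop {
                    match receiver.recv_timeout(Duration::from_secs(1)) {
                        Ok(batch) => handle_batch(batch),
                        Err(RecvTimeoutError::Timeout) => continue,
                        Err(RecvTimeoutError::Disconnected) => break,
                    }
                })
                .expect("failed to spawn transaction status worker")
        })
        .collect()
}
```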

Original issue:

The FD node failures are the Agave tile OOM'ing.
[Screenshot 2024-12-09 at 21:46:32]

Improved state:

tiv1 and tiv2 are Agave nodes running the fix. The other nodes are running the same FD code as before.
[Screenshot 2024-12-09 at 21:27:03]

Original code (without tx history flags):

[Screenshot 2024-12-09 at 21:58:04]

@alessandrod

Thanks for looking at this! Haven't done a proper review yet, but skimming through the code, it looks like it would parallelize well using rayon instead?

@fkouteib
Author

Thanks for the feedback Alessandro. That makes sense, and spinning off just the write_transaction_status_batch() into a rayon thread after a message is dequeued would be cleaner and achieve the same desired outcome. One follow-up, mostly because I am not super familiar with how we manage this across Agave: should we do it with a private rayon thread pool that is still capped, rather than the global rayon pool? I am worried about introducing other performance variations and resource starvation by tapping the global pool. Is that what you have in mind?
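
To make the question concrete, roughly what I mean by a private capped pool (a sketch, not the patch; the thread name and the commented usage are placeholders):

```rust
use rayon::{ThreadPool, ThreadPoolBuilder};

// A private, capped pool so this service cannot starve other users of the
// global rayon pool. Four threads is an arbitrary cap for the sketch.
fn build_tss_pool() -> ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(4)
        .thread_name(|i| format!("solTxStatus{i:02}"))
        .build()
        .expect("failed to build transaction status thread pool")
}

// Hypothetical usage after a message is dequeued:
// tss_pool.spawn(move || write_transaction_status_batch(batch));
```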

@alessandrod

Thanks for the feedback Alessandro. That makes sense, and spinning off just the write_transaction_status_batch() into a rayon thread after a message is dequeued would be cleaner and achieve the same desired outcome. One follow-up, mostly because I am not super familiar with how we manage this across Agave: should we do it with a private rayon thread pool that is still capped, rather than the global rayon pool? I am worried about introducing other performance variations and resource starvation by tapping the global pool. Is that what you have in mind?

This is a tricky one because the global rayon pool is actually 99.9% unused. But it does have a bajillion threads (num_cpus() I think?), so we should be careful to not crank it too hard. Between adding another pool and using the global one I'd vote for using the global one, and then hopefully someone makes that pool smaller.
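
For comparison, using the global pool is just a `rayon::spawn` per dequeued message — something like this sketch (`Batch` and `write_batch` are placeholders, not the service's real names):

```rust
use crossbeam_channel::Receiver;

// Sketch only: hand each dequeued message to rayon's shared global pool,
// with no dedicated pool object. `Batch` and `write_batch` stand in for the
// service's real message type and per-batch work.
fn drain_with_global_pool<Batch: Send + 'static>(
    receiver: Receiver<Batch>,
    write_batch: fn(Batch),
) {
    while let Ok(batch) = receiver.recv() {
        rayon::spawn(move || write_batch(batch));
    }
}
```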

@bw-solana

+1 to using the global pool

@steviez

steviez commented Dec 11, 2024

This is a tricky one because the global rayon pool is actually 99.9% unused. But it does have a bajillion threads (num_cpus() I think?), so we should be careful to not crank it too hard. Between adding another pool and using the global one I'd vote for using the global one, and then hopefully someone makes that pool smaller.

I'll counter that I think this one may deserve a dedicated pool. I view the global pool as something to be used for startup operations, or for "occasional work". I know the incoming transactions are "bursty" so this isn't a perfect analogy, but at 10k TPS, that is 1 transaction coming in every 100us. Every 10us for 100k TPS. So, I think this work would keep the global pool pretty busy.

@steviez

steviez commented Dec 11, 2024

This issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled.

Just want to make sure I follow, you're saying things get worse when running with --enable-extended-tx-metadata-storage right? If someone doesn't set --enable-rpc-transaction-history, then this service is not created at all


@steviez steviez left a comment


Thanks for looking at this! Haven't done a proper review yet, but skimming through the code, it looks like it would parallelize well using rayon instead?

That makes sense, and spinning off just the write_transaction_status_batch() into a rayon thread after a message is dequeued would be cleaner and achieve the same desired outcome

Do you know where we're spending time? Is it in the transaction status preparation, or in the actual commit to Blockstore (blockstore.write_batch())?

Inline review comments on rpc/src/transaction_status_service.rs (outdated, resolved)
@fkouteib
Author

This issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled.

Just want to make sure I follow, you're saying things get worse when running with --enable-extended-tx-metadata-storage right? If someone doesn't set --enable-rpc-transaction-history, then this service is not created at all

The service is spun up even without the rpc tx history flag enabled, because the other check is tx notifier (tied to geyser plugin support). That is consistent with the third graph showing memory activity and spikes, and with the logs on the Agave nodes. It also matches the code structure: the processing work for the tx notifier happens unconditionally, and only the items that involve writing to the blockstore (rpc tx history and extended metadata) are wrapped in conditional checks.

In my experiments (and in the default config for the internal tests that originally OOM'd), both rpc tx history and extended metadata are enabled. I disabled them together for one experiment to get a sense of how much worse they make the situation when added, but I have not tested with rpc history enabled and extended metadata disabled. I expect that to be a no-op here, because the only place in this service where it would reduce work is cancelled out by tx notifier support.

@fkouteib
Author

All - I updated the PR to use a local thread pool of 4 threads while the global vs. local discussion continues.

I am still not convinced that giving TSS unrestricted access to the global thread pool is a good idea. I am worried about the case where it ends up using all threads almost all of the time (as we increase TPS). For example, with 4 threads the queue can still have 500-1000 pending items, but that is not an issue yet (maybe it will be later on the road to 1M). In a global pool, those items would simply be processed sooner because they can be. A 4-thread cap is also a forcing function to revisit this: future debugging would explicitly flag that this service is falling behind and can't keep up at TPS rate X. Without it, TSS could starve other global pool users of threads, and we would have to start debugging from some downstream manifestation of the issue.

@steviez

steviez commented Dec 12, 2024

The service is spun up even without the rpc tx history flag enabled, because the other check is tx notifier (tied to geyser plugin support).

Yup, you're correct that enabling geyser on the validator will also result in this service getting spun up. I didn't see mention of geyser in the PR description, and I was acting under the assumption (possibly incorrectly) that these nodes did not have a geyser plugin enabled.

@fkouteib
Author

The service is spun up even without the rpc tx history flag enabled, because the other check is tx notifier (tied to geyser plugin support).

Yup, you're correct that enabling geyser on the validator will also result in this service getting spun up. I didn't see mention of geyser in the PR description, and I was acting under the assumption (possibly incorrectly) that these nodes did not have a geyser plugin enabled.

Ah, good call. You are right. I missed this at first and just checked my running setup. The notifiers are not enabled.

[2024-12-11T06:21:10.613550936Z INFO solana_core::validator] Geyser plugin: accounts_update_notifier: false, transaction_notifier: false, entry_notifier: false

@fkouteib
Author

Do you know where we're spending time? Is it in the transaction status preparation, or in the actual commit to Blockstore (blockstore.write_batch())?

It's the tx processing and constructing the write batch.

The following data was collected running a local cluster test on my mac and filtered to only count tx batches with 64 txs, which provided 278 data points. The sample size is small, but it should be directionally correct enough to answer the question.

  • All processing excluding writing the batch:
    min: 3.27 ms, max: 108.96 ms, mean: 44.75 ms, stddev: 23.97 ms
  • The batch write to blockstore:
    min: 3.87 ms, max: 19.21 ms, mean: 7.05 ms, stddev: 2.68 ms

@bw-solana

The following data was collected running a local cluster test on my mac

While your conclusion matches my assumption, I'm hesitant to draw any strong performance conclusions from a laptop run. It's probably running thousands of threads on a dozen or so cores and getting thrashed to hell.

Would be good to run on a dev server and make sure split times still hold.

@fkouteib
Author

fkouteib commented Dec 13, 2024

Would be good to run on a dev server and make sure split times still hold.

I re-ran on both Linux and Mac in release builds (because my earlier Mac data was accidentally from a debug build). The conclusion holds: processing and batching dominate over the batch write step.

But the MBP with an M3 Max and 64 GB RAM outperforming the dev server with an AMD EPYC 7513 (64 vCores) and 512 GB RAM caught me by surprise, so I kept all the data together here for review.

|  | Linux proc (release), ms | Mac proc (release), ms | Mac proc (debug), ms | Linux db (release), ms | Mac db (release), ms | Mac db (debug), ms |
| --- | --- | --- | --- | --- | --- | --- |
| min | 9.34 | 0.50 | 3.27 | 1.70 | 0.75 | 3.88 |
| max | 292.50 | 105.83 | 108.97 | 48.11 | 16.31 | 19.22 |
| mean | 141.23 | 26.34 | 44.90 | 9.10 | 2.87 | 7.05 |
| std dev | 49.23 | 16.86 | 23.97 | 4.44 | 1.34 | 2.69 |
| num elements | 1165 | 899 | 277 | 1165 | 899 | 277 |

@steviez

steviez commented Dec 13, 2024

Ah, good call. You are right. I missed this at first and just checked my running setup. The notifiers are not enabled.

No worries. So given your earlier comment

This issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled.

Have you seen the node OOM without these --enable-* flags? I think you definitely found an issue with the transaction status message channel, but I just want to make sure there isn't some other non-rpc/geyser channel that is backing up and leading to the OOM.

@fkouteib
Author

Have you seen the node OOM without these --enable-* flags? I think you definitely found an issue with the transaction status message channel, but I just want to make sure there isn't some other non-rpc/geyser channel that is backing up and leading to the OOM.

No, not an OOM. I updated the wording to be clearer. If you look at the third graph in the PR description, that is with both flags disabled; there was still a steep climb in memory usage to 80-120 GB that seems to have self-corrected back down. It was a one-off run. I have not looked into the reason for the spikes, as the OOM was the main thing I was chasing.

@fkouteib
Author

All - I have addressed all review feedback and questions. In the latest patch I combined both requests: using the global rayon thread pool, and a more cooperative 50 ms sleep (a friendlier service, and mitigation for the concern about tapping the global thread pool). With that, the TSS backlog never builds to more than 4000 messages (at up to 64 txs each), consistently works it down to zero, hits the sleep, and wakes to a backlog somewhere between 1k and 4k messages. This is while sustaining above 40k TPS of synthetic workload. So I think my earlier starvation concern for other users of the global pool is mitigated.
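
Roughly the shape of the loop described above (a sketch with placeholder names, not the exact patch; exit handling omitted):

```rust
use std::{thread, time::Duration};
use crossbeam_channel::Receiver;
use rayon::prelude::*;

// Sketch: drain whatever has queued up, process it on the global rayon pool,
// then back off for 50 ms so other pool users get a turn. `Batch` and
// `process` stand in for the real message type and per-batch work.
fn run_service<Batch: Send>(receiver: Receiver<Batch>, process: fn(Batch)) {
    loop {
        let pending: Vec<Batch> = receiver.try_iter().collect();
        pending.into_par_iter().for_each(process);

        // Cooperative sleep between drain passes.
        thread::sleep(Duration::from_millis(50));
    }
}
```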

@fkouteib fkouteib requested a review from steviez December 16, 2024 21:15