
Make transaction status service multi-threaded. #4032

Open · wants to merge 8 commits into base: master

Conversation


@fkouteib fkouteib commented Dec 10, 2024

Problem

As part of an investigation into Agave OOM issues in internal private cluster tests, I found that the TSS receiver channel would get severely backed up (80k+ pending messages) when the cluster runs at a sustained 40k TPS (bench-tps workload; 80-20 FD to Agave node ratio). This slows down the rest of the system and builds up memory usage until the node OOMs (it crashed a 256 GB Agave node, and the Agave tile on a 512 GB FD node, in my tests). This issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled. Without those flags (third graph), the OOM did not reproduce, but there were still memory spikes.

Summary of Changes

  • Make the transaction status receiver multi-threaded, running 4 worker threads (sketched below).
  • With this change, the queue backlog stays in the 1k to 5k pending-message range.
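
For illustration, a rough sketch of the multi-worker pattern, with placeholder names (`StatusBatch`, `handle_batch`, and the thread label are not the actual diff):

```rust
use std::{thread, time::Duration};
use crossbeam_channel::{Receiver, RecvTimeoutError};

// Placeholder for the real transaction status message payload.
struct StatusBatch;

// Placeholder for the existing per-message work (write_transaction_status_batch, etc.).
fn handle_batch(_batch: StatusBatch) {}

fn spawn_workers(receiver: Receiver<StatusBatch>, num_workers: usize) -> Vec<thread::JoinHandle<()>> {
    (0..num_workers)
        .map(|i| {
            // crossbeam receivers are Clone and multi-consumer, so the workers
            // compete for messages from the same queue.
            let receiver = receiver.clone();
            thread::Builder::new()
                .name(format!("solTssWorker{i:02}"))
                .spawn(move || loop {
                    match receiver.recv_timeout(Duration::from_secs(1)) {
                        Ok(batch) => handle_batch(batch),
                        Err(RecvTimeoutError::Timeout) => continue,
                        Err(RecvTimeoutError::Disconnected) => break,
                    }
                })
                .expect("failed to spawn transaction status worker")
        })
        .collect()
}
```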

Original issue:

The FD node failures are the Agave tile OOM'ing.
[Screenshot 2024-12-09 at 21:46:32]

Improved state:

tiv1 and tiv2 are Agave nodes running the fix. The other nodes are running the same FD code as before.
[Screenshot 2024-12-09 at 21:27:03]

Original code (without tx history flags):

[Screenshot 2024-12-09 at 21:58:04]

@alessandrod

Thanks for looking at this! Haven't done a proper review yet, but skimming through the code, it looks like it would parallelize well using rayon instead?

@fkouteib
Author

Thanks for the feedback Alessandro. That makes sense, and spinning off just the write_transaction_status_batch() into a rayon thread after a message is dequeued would be cleaner and achieve the same desired outcome. One follow-up, mostly because I am not super familiar with how we manage this across Agave: should we do it with a private rayon thread pool that is still capped, rather than the global rayon pool? I am worried about introducing other performance variations and resource starvation by tapping the global pool. Is that what you have in mind?
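
To make the question concrete, roughly what I mean by a private capped pool (a sketch, not the patch; the thread name and the commented usage are placeholders):

```rust
use rayon::{ThreadPool, ThreadPoolBuilder};

// A private, capped pool so this service cannot starve other users of the
// global rayon pool. Four threads is an arbitrary cap for the sketch.
fn build_tss_pool() -> ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(4)
        .thread_name(|i| format!("solTxStatus{i:02}"))
        .build()
        .expect("failed to build transaction status thread pool")
}

// Hypothetical usage after a message is dequeued:
// tss_pool.spawn(move || write_transaction_status_batch(batch));
```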

@alessandrod

Thanks for the feedback Alessandro. That makes sense, and spinning off just the write_transaction_status_batch() into a rayon thread after a message is dequeued would be cleaner and achieve the same desired outcome. One follow-up, mostly because I am not super familiar with how we manage this across Agave: should we do it with a private rayon thread pool that is still capped, rather than the global rayon pool? I am worried about introducing other performance variations and resource starvation by tapping the global pool. Is that what you have in mind?

This is a tricky one because the global rayon pool is actually 99.9% unused. But it does have a bajillion threads (num_cpus() I think?), so we should be careful to not crank it too hard. Between adding another pool and using the global one I'd vote for using the global one, and then hopefully someone makes that pool smaller.
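
For comparison, using the global pool is just a `rayon::spawn` per dequeued message — something like this sketch (`Batch` and `write_batch` are placeholders, not the service's real names):

```rust
use crossbeam_channel::Receiver;

// Sketch only: hand each dequeued message to rayon's shared global pool,
// with no dedicated pool object. `Batch` and `write_batch` stand in for the
// service's real message type and per-batch work.
fn drain_with_global_pool<Batch: Send + 'static>(
    receiver: Receiver<Batch>,
    write_batch: fn(Batch),
) {
    while let Ok(batch) = receiver.recv() {
        rayon::spawn(move || write_batch(batch));
    }
}
```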

@bw-solana

+1 to using the global pool

@steviez

steviez commented Dec 11, 2024

This is a tricky one because the global rayon pool is actually 99.9% unused. But it does have a bajillion threads (num_cpus() I think?), so we should be careful to not crank it too hard. Between adding another pool and using the global one I'd vote for using the global one, and then hopefully someone makes that pool smaller.

I'll counter that I think this one may deserve a dedicated pool. I view the global pool as something to be used for startup operations, or for "occasional work". I know the incoming transactions are "bursty" so this isn't a perfect analogy, but at 10k TPS, that is 1 transaction coming in every 100us. Every 10us for 100k TPS. So, I think this work would keep the global pool pretty busy.

@steviez

steviez commented Dec 11, 2024

This issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled.

Just want to make sure I follow, you're saying things get worse when running with --enable-extended-tx-metadata-storage right? If someone doesn't set --enable-rpc-transaction-history, then this service is not created at all


@steviez steviez left a comment


Thanks for looking at this! Haven't done a proper review yet, but skimming through the code, it looks like it would parallelize well using rayon instead?

That makes sense, and spinning off just the write_transaction_status_batch() into a rayon thread after a message is dequeued would be cleaner and achieve the same desired outcome

Do you know where we're spending time? Is it in the transaction status preparation, or in the actual commit to Blockstore (blockstore.write_batch())?

Inline review comments on rpc/src/transaction_status_service.rs (outdated, resolved)
@fkouteib
Author

This issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled.

Just want to make sure I follow, you're saying things get worse when running with --enable-extended-tx-metadata-storage right? If someone doesn't set --enable-rpc-transaction-history, then this service is not created at all

The service is spun up even without the rpc tx history flag enabled, because the other check is tx notifier (tied to geyser plugin support). That is consistent with the third graph showing memory activity and spikes, and with the logs on the Agave nodes. It also matches the code structure: the processing work for the tx notifier happens unconditionally, and only the items that involve writing to the blockstore (rpc tx history and extended metadata) are wrapped in conditional checks.

In my experiments (and in the default config for the internal tests that originally OOM'd), both rpc tx history and extended metadata are enabled. I disabled them together for one experiment to get a sense of how much worse they make the situation when added, but I have not tested with rpc history enabled and extended metadata disabled. I expect that to be a no-op here, because the only place in this service where it would reduce work is cancelled out by tx notifier support.

@fkouteib
Author

All - I updated the PR to use a local thread pool of 4 threads while the global vs. local discussion continues.

I am still not convinced that giving TSS unrestricted access to the global thread pool is a good idea. I am worried about the case where it ends up using all threads almost all of the time (as we increase TPS). For example, with 4 threads the queue can still have 500-1000 pending items, but that is not an issue yet (maybe it will be later on the road to 1M). In a global pool, those items would simply be processed sooner because they can be. A 4-thread cap is also a forcing function to revisit this: future debugging would explicitly flag that this service is falling behind and can't keep up at TPS rate X. Without it, TSS could starve other global pool users of threads, and we would have to start debugging from some downstream manifestation of the issue.

@steviez

steviez commented Dec 12, 2024

The service is spun up even without the rpc tx history flag enabled, because the other check is tx notifier (tied to geyser plugin support).

Yup, you're correct that enabling geyser on the validator will also result in this service getting spun up. I didn't see mention of geyser in the PR description, and I was acting under the assumption (possibly incorrectly) that these nodes did not have a geyser plugin enabled.

@fkouteib
Author

The service is spun up even without the rpc tx history flag enabled, because the other check is tx notifier (tied to geyser plugin support).

Yup, you're correct that enabling geyser on the validator will also result in this service getting spun up. I didn't see mention of geyser in the PR description, and I was acting under the assumption (possibly incorrectly) that these nodes did not have a geyser plugin enabled.

Ah, good call. You are right. I missed this at first and just checked my running setup. The notifiers are not enabled.

[2024-12-11T06:21:10.613550936Z INFO solana_core::validator] Geyser plugin: accounts_update_notifier: false, transaction_notifier: false, entry_notifier: false

@fkouteib
Author

Do you know where we're spending time? Is it in the transaction status preparation, or in the actual commit to Blockstore (blockstore.write_batch())?

It's the tx processing and constructing the write batch.

The following data was collected running a local cluster test on my mac and filtered to only count tx batches with 64 txs, which provided 278 data points. The sample size is small, but it should be directionally correct enough to answer the question.

  • All processing excluding writing the batch:
    min: 3.27 ms, max: 108.96 ms, mean: 44.75 ms, stddev: 23.97 ms
  • The batch write to blockstore:
    min: 3.87 ms, max: 19.21 ms, mean: 7.05 ms, stddev: 2.68 ms

@bw-solana

The following data was collected running a local cluster test on my mac

While your conclusion matches my assumption, I'm hesitant to draw any strong performance conclusions from a laptop run. It's probably running thousands of threads on a dozen or so cores and getting thrashed to hell.

Would be good to run on a dev server and make sure split times still hold.

@fkouteib
Author

fkouteib commented Dec 13, 2024

Would be good to run on a dev server and make sure split times still hold.

I re-ran on both Linux and Mac in release builds (because my earlier Mac data was accidentally from a debug build). The conclusion holds: processing and batching dominate over the batch write step.

But the MBP with an M3 Max and 64 GB RAM outperforming the dev server with an AMD EPYC 7513 (64 vCores) and 512 GB RAM caught me by surprise, so I kept all the data together here for review.

|  | Linux proc (release), ms | Mac proc (release), ms | Mac proc (debug), ms | Linux db (release), ms | Mac db (release), ms | Mac db (debug), ms |
| --- | --- | --- | --- | --- | --- | --- |
| min | 9.34 | 0.50 | 3.27 | 1.70 | 0.75 | 3.88 |
| max | 292.50 | 105.83 | 108.97 | 48.11 | 16.31 | 19.22 |
| mean | 141.23 | 26.34 | 44.90 | 9.10 | 2.87 | 7.05 |
| std dev | 49.23 | 16.86 | 23.97 | 4.44 | 1.34 | 2.69 |
| num elements | 1165 | 899 | 277 | 1165 | 899 | 277 |

@steviez

steviez commented Dec 13, 2024

Ah, good call. You are right. I missed this at first and just checked my running setup. The notifiers are not enabled.

No worries. So given your earlier comment

This issue reproduces more prominently when running with '--enable-rpc-transaction-history' and '--enable-extended-tx-metadata-storage' enabled.

Have you seen the node OOM without these --enable-* flags? I think you definitely found an issue with the transaction status message channel, but I just want to make sure there isn't some other non-rpc/geyser channel that is backing up and leading to the OOM.

@fkouteib
Author

Have you seen the node OOM without these --enable-* flags? I think you definitely found an issue with the transaction status message channel, but I just want to make sure there isn't some other non-rpc/geyser channel that is backing up and leading to the OOM.

No, not an OOM. I updated the wording to be clearer. If you look at the third graph in the PR description, that is with both flags disabled; there was still a steep climb in memory usage to 80-120 GB that seems to have self-corrected back down. It was a one-off run. I have not looked into the reason for the spikes, as the OOM was the main thing I was chasing.

@fkouteib
Author

All - I have addressed all review feedback and questions. In the latest patch I combined both requests: using the global rayon thread pool, and a more cooperative 50 ms sleep (a friendlier service, and mitigation for the concern about tapping the global thread pool). With that, the TSS backlog never builds to more than 4000 messages (at up to 64 txs each), consistently works it down to zero, hits the sleep, and wakes to a backlog somewhere between 1k and 4k messages. This is while sustaining above 40k TPS of synthetic workload. So I think my earlier starvation concern for other users of the global pool is mitigated.
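
Roughly the shape of the loop described above (a sketch with placeholder names, not the exact patch; exit handling omitted):

```rust
use std::{thread, time::Duration};
use crossbeam_channel::Receiver;
use rayon::prelude::*;

// Sketch: drain whatever has queued up, process it on the global rayon pool,
// then back off for 50 ms so other pool users get a turn. `Batch` and
// `process` stand in for the real message type and per-batch work.
fn run_service<Batch: Send>(receiver: Receiver<Batch>, process: fn(Batch)) {
    loop {
        let pending: Vec<Batch> = receiver.try_iter().collect();
        pending.into_par_iter().for_each(process);

        // Cooperative sleep between drain passes.
        thread::sleep(Duration::from_millis(50));
    }
}
```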

@fkouteib fkouteib requested a review from steviez December 16, 2024 21:15