Wire unified scheduler into banking experimentally #3946
base: master
Conversation
sdk/src/scheduling.rs
Outdated
nothing new in sdk, please 🙏
okay, you spotted that. :) 723f765
```rust
// A write lock for the poh recorder must be grabbed for the entire duration of inserting a
// new tpu bank into the bank forks. That's because any buffered transactions could
// immediately be executed after the bank forks update, when unified scheduler is enabled for
// block production. If transactions were executed prematurely, the unified scheduler would
// otherwise be hit with false errors due to having no bank in the poh recorder.
let mut poh_recorder = poh_recorder.write().unwrap();

let tpu_bank = bank_forks
    .write()
    .unwrap()
    .insert_with_scheduling_mode(SchedulingMode::BlockProduction, tpu_bank);
poh_recorder.set_bank(tpu_bank, track_transaction_indexes);
```
this (backref: #4123 (comment))
a concern here is that elsewhere we might grab these locks in the opposite order.
I'm fairly certain we don't, but given that both locks are so extensively used (smelly), I'm not entirely convinced.
Eventually there should be some other delivery system for TPU banks, other than locking poh. I'm not quite sure what that should look like though. Any thoughts?
core/src/validator.rs
Outdated
```rust
// bank_forks could be write-locked temporarily here, if unified scheduler is enabled for
// block production. That's because ReplayStage, started inside Tvu, could immediately try
// to insert a tpu bank into bank_forks, as with local development clusters consisting of a
// single staked node. Such an insertion could be blocked by the lock held by unified
// scheduler until it's fully set up by the banking stage. This is intentional, because
// completion of the insertion needs to strictly correspond with the initiation of block
// production in unified scheduler. Because Tpu setup follows Tvu setup, there's a corner
// case where the banking stage hasn't yet called register_banking_stage() to finish the
// unified scheduler setup.
// This means any setup which accesses bank_forks must be done before or after the short
// duration of the Tvu and Tpu setups to avoid deadlocks. So RootBankCache needs to be
// created here in advance, as one such setup.
```
this (backref: #4123 (comment))
See the above comment on the lock order - I feel it may be worth investing in a better way to deliver working banks to the tpu, to avoid these weird edge cases where we need to be very careful.
it seems that wall of text couldn't help and i crossed your code-smell threshold nevertheless.. ;)
how about this? 110c1bc (note: ci won't pass)
if it's okay with you that unified scheduler code breaks out of the BankingStage code organization more aggressively like that, i can remove this weird edge case: 9839ab8
> feel it may be worth investing in a better way to deliver working banks to tpu to avoid these weird edge cases where we need to be very careful.
fyi, this is completely different from the `update_bank_forks_and_poh_recorder_for_new_tpu_bank` thing (tpu delivery as you said), which i'll plan to address later with a different commit.
Problem
Currently, the unified scheduler can't be used as a block-production method.
Summary of Changes
(See the Proposition/justification section at #3832 for the justification of this pr in the first place.)

This pr (and any spin-off ones) implements all the necessary plumbing from the cli down to the innermost part (`SchedulingStateMachine`) of the unified scheduler, with minimal functionality. Note that, even after all the changes in this pr are merged, unified scheduler isn't fully functional, performant, or secure for the banking stage in any sense.

As part of the plumbing, a new thread management mechanism is introduced inside `unified-scheduler-pool`, parallel to the already-existing one for the unified scheduler in block verification. As a quick recap, unified scheduler assigns an independent thread pool to each active bank of the ongoing forks while verifying blocks. On the other hand, unified scheduler maintains at most one thread pool (as a singleton object) for the current working_bank of PohRecorder when producing blocks. This means these two kinds of thread pools (one for block verification, the other for block production) are managed differently. Other than that, most of the core code (thread looping and `SchedulingStateMachine`) is still shared between them with a limited number of branches (i.e. I'd say it's still unified lol).

This differentiated behavior at the higher layer is intentional, to accommodate banking-stage-specific requirements:
(1) The existing relevant code base strongly assumes non-concurrent block production even when there are multiple competing forks, which I think is a justifiable design decision for impl simplicity. This means the channel receivers (`nonvote_receiver`, `tpu_receiver`, `gossip_receiver`) for incoming transactions are kind of a singleton set of resources and aren't fork-aware. Also, unified scheduler needs some preprocessing (`SanitizedTransaction`/`TransactionView` creation, alt resolution, and `Pubkey`->`UsageQueue` lookups). With all these considerations in mind, the channels are directly connected to the handler-thread main loop of a particular unified-scheduler instance, in order to do the offloaded processing there multi-threadedly while avoiding an extra messaging hop for efficiency.

The actual integration is implemented with the above-mentioned design goals. At validator startup, if `BankingStage::new()` is branched into `new_unified_scheduler()`, it does minimal setup without even spawning threads. Specifically, it just calls `scheduler_pool.register_banking_stage()`. Its notable arg is the callback closure to convert `BankingPacketBatch`es to `SanitizedTransaction`s, which is actually run on handler threads. The closure type is slightly involved because of the need to respawn the thread pool from time to time to mitigate unbounded `UsageQueueLoader` growth. So the closure is nested: the inner closure for conversion is called `BatchConverter`; the outer closure that creates the converter is called `BatchConverterCreator`. Along with the closure, the actual transaction-incoming channel and a small monitor object (called `BankingStageMonitor`) are registered to the unified scheduler pool. These args are retained as-is, typed as `BlockProductionSchedulerRespawner`, internally. This separation incurs a single dynamic dispatch, but is done intentionally for separation of concerns.

(2) Transaction buffering (and prioritization ordering) needs to run prior to the first bank creation, and this buffer must be carried over to the first bank of the leader slots, then to the next, etc. So the block-verifying unified scheduler's tight coupling of `1 Bank`-to-`1 thread pool` isn't appropriate here.

Towards that end, the thread pool can now be started without a bank, by making `SchedulingContext::bank` `Option<_>`-ed. It then enters the buffering state immediately after setup. Previously, this wasn't possible.

Also, the scheduling session needed some adjustment to reach a safe suspension point quickly, so that the next child bank can resume timely. As a quick recap, `SchedulingStateMachine` previously didn't buffer any non-conflicting ready-to-execute transactions. That meant those transactions were buffered inside crossbeam channels, and that (potentially large) buffer had to be cleared before switching to the next bank, to maintain a runtime invariant. So `SchedulingStateMachine` now buffers transactions internally up to a fixed cap instead, and the scheduler thread's main loop introduces a new session state, paired with the new buffering mechanism: `session_pausing = true`.
Note that this scheduling resumption mechanism doesn't work nicely when switching forks in the middle of leader slots. Specifically, it doesn't recover already-processed higher-paying txes which are yet to be executed on the new fork. This behavior is also true for the central scheduler.

As a bonus, `session_resetting = true` is also introduced, to clear any remaining unprocessed buffered transactions after the leader schedule.

The actual change list:
- Introduce `BankingTracer::create_channels()`, to use the same channels for all of `{nonvote,tpu,gossip}_channels`: Define BankingTracer::create_channels() #4041
- Promote receiver payload types (`SigverifyTracerPacketStats` (EDIT: obsoleted, thanks to Remove Tracer packet as a concept #4043), `BankingPacketBatch`, `BankingPacketReceiver`) from `solana-core` to `solana-perf`: Apply cleanups to solana-core for unified scheduler #4123
- Use `BankingPacketBatch` / `BankingPacketReceiver` inside `solana-unified-scheduler-pool`.
- Support `AbiExample` for `std::num::Saturating<T>`. (EDIT: obsoleted, thanks to Remove Tracer packet as a concept #4043) This is needed because `SigverifyTracerPacketStats` used solana_sdk's saturating_add macro previously, and now `solana_perf` doesn't depend on `solana-sdk`.
- Make the TaskHandler trait's api more general: Make TaskHandler::handle() more generalized #4050
- Move `RootBankCache` creation to avoid deadlock: Apply cleanups to solana-core for unified scheduler #4123
- Define `SchedulerInner`, a private trait for internal use by `solana-unified-scheduler-pool`: Define SchedulerInner for fine-grained cleaning #4133
- Introduce `TransactionError::CommitFailed` to propagate poh failure immediately before bank commit between `solana-unified-scheduler-pool`, `solana-runtime`, `solana-ledger`; this touched the `load_execute_and_commit_transactions` api as well: Support tx poh recording in unified scheduler #4150
- To beautify the upcoming diffs: Apply misc small cleanups to unified scheduler #4080, Apply more cosmetic changes to unified scheduler #4111
- Introduce `SchedulingMode::{BlockVerification, BlockProduction}`, finally!
- Make `BankForks` / `InstalledScheduler` aware of `SchedulingMode`.
- Make `SchedulingStateMachine` buffer ready-to-execute tasks internally up to `max_executing_task`.
- Accept `unified-scheduler` for `--block-production-method`, with an override flag (`--enable-experimental-block-production-method`).
- Make `PohRecorder::reset()` return the overwritten `BankWithScheduler`, to reap the now-unused scheduler from the bank wrapper.
- Add a `local-cluster` test: `test_randomly_mixed_block_production_methods_between_bootstrap_and_not`.
- Make `agave-ledger-tool simulate-block-production` support `--block-production-method unified-scheduler`.
- Make `solana-banking-bench` support `--block-production-method unified-scheduler`.
Note that this pr is extracted from #2325