
Introduce experimental FIFO compaction settings for shreds in blockstore #22140

Merged · 1 commit · Feb 10, 2022
1 change: 1 addition & 0 deletions ledger-tool/src/main.rs
@@ -678,6 +678,7 @@ fn open_blockstore(
access_type,
recovery_mode: wal_recovery_mode,
enforce_ulimit_nofile: true,
..BlockstoreOptions::default()
},
) {
Ok(blockstore) => blockstore,
1 change: 1 addition & 0 deletions ledger/src/blockstore.rs
@@ -3781,6 +3781,7 @@ pub fn create_new_ledger(
access_type,
recovery_mode: None,
enforce_ulimit_nofile: false,
..BlockstoreOptions::default()
},
)?;
let ticks_per_slot = genesis_config.ticks_per_slot;
134 changes: 130 additions & 4 deletions ledger/src/blockstore_db.rs
@@ -9,8 +9,9 @@ use {
self,
compaction_filter::CompactionFilter,
compaction_filter_factory::{CompactionFilterContext, CompactionFilterFactory},
ColumnFamily, ColumnFamilyDescriptor, CompactionDecision, DBIterator, DBRawIterator,
DBRecoveryMode, IteratorMode as RocksIteratorMode, Options, WriteBatch as RWriteBatch, DB,
ColumnFamily, ColumnFamilyDescriptor, CompactionDecision, DBCompactionStyle, DBIterator,
DBRawIterator, DBRecoveryMode, FifoCompactOptions, IteratorMode as RocksIteratorMode,
Options, WriteBatch as RWriteBatch, DB,
},
serde::{de::DeserializeOwned, Serialize},
solana_runtime::hardened_unpack::UnpackError,
@@ -35,6 +36,17 @@
};

const MAX_WRITE_BUFFER_SIZE: u64 = 256 * 1024 * 1024; // 256MB
const FIFO_WRITE_BUFFER_SIZE: u64 = 2 * MAX_WRITE_BUFFER_SIZE;
// Maximum size of cf::ShredData. Used when `shred_storage_type`
// is set to ShredStorageType::RocksFifo. The default value is set
// to 125GB, assuming 500GB total storage for the ledger and 25% of
// it is used by data shreds.
const DEFAULT_FIFO_COMPACTION_DATA_CF_SIZE: u64 = 125 * 1024 * 1024 * 1024;
// Maximum size of cf::ShredCode. Used when `shred_storage_type`
// is set to ShredStorageType::RocksFifo. The default value is set
// to 100GB, assuming 500GB total storage for the ledger and 20% of
// it is used by coding shreds.
const DEFAULT_FIFO_COMPACTION_CODING_CF_SIZE: u64 = 100 * 1024 * 1024 * 1024;

// Column family for metadata about a leader slot
const META_CF: &str = "meta";
@@ -305,8 +317,40 @@ impl Rocks {
new_cf_descriptor::<BankHash>(&access_type, &oldest_slot),
new_cf_descriptor::<Root>(&access_type, &oldest_slot),
new_cf_descriptor::<Index>(&access_type, &oldest_slot),
new_cf_descriptor::<ShredData>(&access_type, &oldest_slot),
new_cf_descriptor::<ShredCode>(&access_type, &oldest_slot),
match options.shred_storage_type {
@carllin (Contributor) commented on Feb 3, 2022:

Out of curiosity, any potential issues if a validator opens a ledger in one compaction type, then shuts down and opens the ledger with the other compaction type? i.e., are they always locked into a path after making a decision?

@yhchiang-sol (Contributor, Author) replied:

Good question. It's actually covered in the proposal #21358.

From Level to FIFO is simple, because every single file under level compaction has a disjoint key range, which makes it a valid FIFO format.

From FIFO to Level, the DB will hit a write stall immediately because there are too many L0 files. Running a manual compaction before reopening the DB is the suggested workaround.
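A minimal sketch of that manual-compaction step, assuming the rust-rocksdb crate; the column-family names and helper are illustrative, not code from this PR:

```rust
use rocksdb::{Options, DB};

// Hypothetical one-off helper: fully compact the shred column families
// while the validator is stopped, so that reopening under level
// compaction does not immediately hit the L0 write stall.
fn compact_before_switching_to_level(ledger_path: &str) -> Result<(), rocksdb::Error> {
    // Open every existing column family; RocksDB requires all CFs to be
    // listed when opening a DB for read-write access.
    let cf_names = DB::list_cf(&Options::default(), ledger_path)?;
    let db = DB::open_cf(&Options::default(), ledger_path, &cf_names)?;
    for name in ["data_shred", "code_shred"] {
        // "data_shred"/"code_shred" are assumed CF names for illustration.
        if let Some(cf) = db.cf_handle(name) {
            // Compact the full key range, merging the many FIFO L0 files.
            db.compact_range_cf::<&[u8], &[u8]>(cf, None, None);
        }
    }
    Ok(())
}
```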

@carllin (Contributor) commented on Feb 4, 2022:

I would like to test that forward migration path from Level to FIFO just to ensure it doesn't stall. For instance, run a validator for a while with level compaction, then switch it to FIFO, ideally on a couple of different nodes and a couple of different times, to make sure they don't stall.

The second, rollback case is a bit trickier; we need to think about how to expose the details of that case to operators.

Is there a way to tell that the db was formerly opened in FIFO and is now switching to level, so we can notify the operator that they should perform a manual compaction?

@yhchiang-sol (Contributor, Author) replied:

> Is there a way to tell that the db was formerly opened in FIFO and is now switching to level, so we can notify the operator that they should perform a manual compaction?

As I remember, RocksDB itself does not store its options. The usual way to do this is to store the options in an options file (https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File). Eventually, we might want to use the RocksDB options file to manage RocksDB options.

For the FIFO change, what we can do is first land the FIFO code, then follow up with another PR that makes it configurable via command-line arguments. We then stay at that point for a while to make sure we never revert back to a release that cannot configure whether to use FIFO or Level. Once we reach that point, we can safely assume that even after a rollback, the validator will still understand the command-line argument and correctly use FIFO or Level.
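As a hypothetical illustration of the detection idea discussed above (not part of this PR), the chosen storage type could be persisted in a marker file next to the ledger and checked on startup:

```rust
use std::{fs, io, path::Path};

// Hypothetical helper: remember which shred storage type the ledger was
// last opened with, and warn the operator when switching from FIFO back
// to level so they can run a manual compaction first.
fn check_storage_type_marker(ledger_path: &Path, current: &str) -> io::Result<()> {
    let marker = ledger_path.join("shred_storage_type");
    if let Ok(previous) = fs::read_to_string(&marker) {
        if previous.trim() == "rocks_fifo" && current == "rocks_level" {
            eprintln!(
                "ledger was last opened with FIFO compaction; consider a \
                 manual compaction before reopening with level compaction"
            );
        }
    }
    fs::write(&marker, current)
}
```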

ShredStorageType::RocksLevel => {
new_cf_descriptor::<ShredData>(&access_type, &oldest_slot)
}
ShredStorageType::RocksFifo => {
if options.shred_data_cf_size > FIFO_WRITE_BUFFER_SIZE {
new_cf_descriptor_fifo::<ShredData>(&options.shred_data_cf_size)
} else {
warn!(
"shred_data_cf_size is must be greater than {} for RocksFifo.",
FIFO_WRITE_BUFFER_SIZE
);
warn!("Fall back to ShredStorageType::RocksLevel for cf::ShredData.");
new_cf_descriptor::<ShredData>(&access_type, &oldest_slot)
}
}
},
match options.shred_storage_type {
ShredStorageType::RocksLevel => {
new_cf_descriptor::<ShredCode>(&access_type, &oldest_slot)
}
ShredStorageType::RocksFifo => {
if options.shred_code_cf_size > FIFO_WRITE_BUFFER_SIZE {
new_cf_descriptor_fifo::<ShredCode>(&options.shred_code_cf_size)
} else {
warn!(
"shred_code_cf_size is must be greater than {} for RocksFifo.",
FIFO_WRITE_BUFFER_SIZE
);
warn!("Fall back to ShredStorageType::RocksLevel for cf::ShredCode.");
new_cf_descriptor::<ShredCode>(&access_type, &oldest_slot)
}
}
},
new_cf_descriptor::<TransactionStatus>(&access_type, &oldest_slot),
new_cf_descriptor::<AddressSignatures>(&access_type, &oldest_slot),
new_cf_descriptor::<TransactionMemos>(&access_type, &oldest_slot),
@@ -952,10 +996,40 @@ pub struct WriteBatch<'a> {
map: HashMap<&'static str, &'a ColumnFamily>,
}

pub enum ShredStorageType {
// Stores shreds under RocksDB's default compaction (level).
RocksLevel,
// (Experimental) Stores shreds under RocksDB's FIFO compaction which
// allows the ledger store to reclaim storage more efficiently with
// lower I/O overhead.
RocksFifo,
}

pub struct BlockstoreOptions {
// The access type of the blockstore. Default: PrimaryOnly.
pub access_type: AccessType,
// Whether to open a blockstore under a recovery mode. Default: None.
pub recovery_mode: Option<BlockstoreRecoveryMode>,
// Whether to allow unlimited number of open files. Default: true.
pub enforce_ulimit_nofile: bool,
// Determines how to store both data and coding shreds. Default: RocksLevel.
pub shred_storage_type: ShredStorageType,
// The maximum storage size for storing data shreds in column family
// [`cf::ShredData`]. Typically, data shreds contribute around 25% of the
// ledger store storage size if the RPC service is enabled, or 50% if the
// RPC service is not enabled.
//
// Currently, this setting is only used when shred_storage_type is set to
// [`ShredStorageType::RocksFifo`].
pub shred_data_cf_size: u64,
// The maximum storage size for storing coding shreds in column family
// [`cf::ShredCode`]. Typically, coding shreds contribute around 20% of the
// ledger store storage size if the RPC service is enabled, or 40% if the
// RPC service is not enabled.
//
// Currently, this setting is only used when shred_storage_type is set to
// [`ShredStorageType::RocksFifo`].
pub shred_code_cf_size: u64,
}

impl Default for BlockstoreOptions {
@@ -965,6 +1039,13 @@
access_type: AccessType::PrimaryOnly,
recovery_mode: None,
enforce_ulimit_nofile: true,
shred_storage_type: ShredStorageType::RocksLevel,
// Maximum size of cf::ShredData. Used when `shred_storage_type`
// is set to ShredStorageType::RocksFifo.
shred_data_cf_size: DEFAULT_FIFO_COMPACTION_DATA_CF_SIZE,
// Maximum size of cf::ShredCode. Used when `shred_storage_type`
// is set to ShredStorageType::RocksFifo.
shred_code_cf_size: DEFAULT_FIFO_COMPACTION_CODING_CF_SIZE,
}
}
}
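Taken together, a caller opting into the experimental mode only needs to override the new fields; everything else keeps its default. A sketch, assuming `Blockstore::open_with_options` is the entry point used by the call sites above:

```rust
// Sketch: enable FIFO compaction for shreds, keeping this PR's default
// column-family size budgets (125GB data shreds, 100GB coding shreds).
let options = BlockstoreOptions {
    shred_storage_type: ShredStorageType::RocksFifo,
    shred_data_cf_size: 125 * 1024 * 1024 * 1024,
    shred_code_cf_size: 100 * 1024 * 1024 * 1024,
    ..BlockstoreOptions::default()
};
let blockstore = Blockstore::open_with_options(ledger_path, options)?;
```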
@@ -1355,6 +1436,51 @@ fn get_cf_options<C: 'static + Column + ColumnName>(
options
}

fn new_cf_descriptor_fifo<C: 'static + Column + ColumnName>(
max_cf_size: &u64,
) -> ColumnFamilyDescriptor {
ColumnFamilyDescriptor::new(C::NAME, get_cf_options_fifo::<C>(max_cf_size))
}

/// Returns the RocksDB Column Family Options which use FIFO Compaction.
///
/// Note that these CF options are optimized for workloads whose write
/// keys are mostly monotonically increasing over time. Workloads whose
/// write keys do not follow any particular order should use
/// get_cf_options instead.
///
/// - [`max_cf_size`]: the maximum allowed column family size. Note that
/// rocksdb will start deleting the oldest SST file when the column family
/// size reaches `max_cf_size` - `FIFO_WRITE_BUFFER_SIZE` to strictly
/// maintain the size limit.
fn get_cf_options_fifo<C: 'static + Column + ColumnName>(max_cf_size: &u64) -> Options {
let mut options = Options::default();

options.set_max_write_buffer_number(8);
options.set_write_buffer_size(FIFO_WRITE_BUFFER_SIZE as usize);
// FIFO always has its files in L0 so we only have one level.
options.set_num_levels(1);
// Since FIFO puts all its files in L0, it is suggested to have an
// unlimited number of open files. The actual total number of open files
// will be close to max_cf_size / write_buffer_size.
options.set_max_open_files(-1);

let mut fifo_compact_options = FifoCompactOptions::default();

// Note that the following actually specifies the size trigger for deleting
// the oldest SST file, instead of specifying the size limit as its name
// might suggest. As a result, we should trigger the file deletion when
// the size reaches `max_cf_size - write_buffer_size` in order to correctly
// maintain the storage size limit.
fifo_compact_options
.set_max_table_files_size((*max_cf_size).saturating_sub(FIFO_WRITE_BUFFER_SIZE));

options.set_compaction_style(DBCompactionStyle::Fifo);
options.set_fifo_compaction_options(&fifo_compact_options);

options
}
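To make the deletion trigger concrete, here is a worked example (not code from the PR) using the default data-shred budget and the 512MB FIFO write buffer defined above; the oldest SST file starts being dropped once the column family reaches 124.5GB:

```rust
const MAX_WRITE_BUFFER_SIZE: u64 = 256 * 1024 * 1024; // 256MB
const FIFO_WRITE_BUFFER_SIZE: u64 = 2 * MAX_WRITE_BUFFER_SIZE; // 512MB
const DEFAULT_FIFO_COMPACTION_DATA_CF_SIZE: u64 = 125 * 1024 * 1024 * 1024; // 125GB

fn main() {
    // RocksDB starts deleting the oldest SST file at this size, leaving
    // headroom for one full write buffer so the 125GB cap is not exceeded.
    let trigger = DEFAULT_FIFO_COMPACTION_DATA_CF_SIZE.saturating_sub(FIFO_WRITE_BUFFER_SIZE);
    assert_eq!(trigger, 127_488 * 1024 * 1024); // 124.5GB
    println!("FIFO deletion trigger: {trigger} bytes");
}
```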

fn get_db_options(access_type: &AccessType) -> Options {
let mut options = Options::default();

1 change: 1 addition & 0 deletions local-cluster/tests/common.rs
@@ -76,6 +76,7 @@ pub fn open_blockstore(ledger_path: &Path) -> Blockstore {
access_type: AccessType::TryPrimaryThenSecondary,
recovery_mode: None,
enforce_ulimit_nofile: true,
..BlockstoreOptions::default()
},
)
.unwrap_or_else(|e| {
1 change: 1 addition & 0 deletions local-cluster/tests/local_cluster_flakey.rs
@@ -331,6 +331,7 @@ fn do_test_optimistic_confirmation_violation_with_or_without_tower(with_tower: b
access_type: AccessType::TryPrimaryThenSecondary,
recovery_mode: None,
enforce_ulimit_nofile: true,
..BlockstoreOptions::default()
},
)
.unwrap();