BlobDB
BlobDB is essentially RocksDB for large-value use cases. The basic idea, which was proposed in the WiscKey paper, is key-value separation: by storing large values in dedicated blob files and storing only small pointers to them in the LSM tree, we avoid copying the values over and over again during compaction. This reduces write amplification, which has several potential benefits like improved SSD lifetime, and better write and read performance. On the other hand, this comes with the cost of some space amplification due to the presence of blobs that are no longer referenced by the LSM tree, which have to be garbage collected.
⚠️ WARNING: There are two BlobDB implementations in the codebase: the legacy `StackableDB`-based one (see `rocksdb::blob_db::BlobDB`) and the new integrated one (which uses the well-known `rocksdb::DB` interface). The legacy implementation is primarily geared towards FIFO/TTL use cases that can tolerate some data loss. It is incompatible with many widely used RocksDB features, for example `Merge`, column families, checkpoints, backup/restore, transactions, etc., and its performance is significantly worse than that of the integrated implementation. Note that the API for this version is not in the public header directory, it is not actively developed, and we expect it to be eventually deprecated. This page focuses on the new integrated BlobDB.
RocksDB's LSM tree works by buffering writes in memtables, which are then persisted in SST files during flush. SST files form a tree, and are continuously merged and rewritten in the background by compactions, which eliminate any obsolete key-values in the process. This repeated rewriting of data leads to write amplification, which is detrimental to flash lifetime and consumes a significant amount of bandwidth, especially when the key-values in question are large. Also, in the case of write-heavy workloads, compactions might not be able to keep up with the incoming load, creating backpressure that limits write throughput and potentially results in write stalls.
To address the above issues, BlobDB uses a technique called key-value separation: instead of storing large values (blobs) in the SST files, it writes them to a dedicated set of blob files, and stores only small pointers to them in the SST files. (Values smaller than a configurable threshold are stored in the LSM tree as usual.) With the blobs stored outside the LSM tree, compactions have to rewrite much less data, which can dramatically reduce overall write amplification and thus improve flash endurance. BlobDB can also provide much higher throughput by reducing or eliminating the backpressure mentioned above, and for many workloads, it can even improve read performance (see our benchmark results here).
With key-value separation, updating or deleting a key-value results in an unreferenced blob in the corresponding blob file. Space occupied by such garbage blobs is reclaimed using garbage collection. BlobDB’s garbage collector is integrated with the LSM tree compaction process, and can be fine-tuned to strike the desired balance between space amplification and write amplification.
Offloading blob file building to RocksDB’s background jobs, i.e. flushes and compactions, has several advantages. It enables BlobDB to provide the same consistency guarantees as RocksDB itself. There are also several performance benefits:
- Similarly to SSTs, any given blob file is written by a single background thread, which eliminates the need for synchronization.
- Blob files can be written using large I/Os; unlike with the legacy BlobDB, there is no need to flush them after each write. This approach is also a better fit for network-based file systems where small writes might be expensive.
- Compressing blobs in the background can improve latencies.
- Blob files are immutable, which enables making blob files a part of the Version. This in turn makes the read-path essentially lock-free.
- Similarly to SST files, blob files are sorted by key, which enables performance improvements like using readahead during compaction and iteration.
- When it comes to garbage collection, blobs can be relocated and the corresponding blob references can be updated at the same time, as they are encountered during compaction (without any additional LSM tree operations).
- It opens up the possibility of file format optimizations that involve buffering (like dictionary compression).
In terms of functionality, BlobDB is near feature parity with vanilla RocksDB. In particular, it supports the following:
- write APIs: `Put`, `Merge`, `Delete`, `SingleDelete`, `DeleteRange`, and `Write` with all write options
- read APIs: `Get`, `MultiGet` (including batched `MultiGet`), iterators, and `GetMergeOperands`
- flush, including atomic and manual flush
- compaction (with integrated garbage collection), subcompactions, and the manual compaction APIs `CompactFiles` and `CompactRange`
- WAL and the various recovery modes
- tracking blob files in the MANIFEST
- snapshots
- per-blob compression and checksums (CRC32c)
- column families
- compaction filters (with a BlobDB-specific optimization)
- checkpoints
- backup/restore
- transactions
- per-file checksums
- SST file manager integration for tracking and rate-limited deletion of blob files
- blob file cache of frequently used blob files
- statistics
- DB properties
- metadata APIs: `GetColumnFamilyMetaData`, `GetAllColumnFamilyMetaData`, and `GetLiveFilesStorageInfo`
- `EventListener` interface
- direct I/O
- I/O rate limiting
- I/O tracing
- C and Java bindings
- tooling (`ldb` and `sst_dump` integration, `blob_dump` tool)
The BlobDB-specific aspects of some of these features are detailed below.
BlobDB can be configured (on a per-column family basis if needed) simply by using the following column family options:
- `enable_blob_files`: set it to `true` to enable key-value separation.
- `min_blob_size`: values at or above this threshold will be written to blob files during flush or compaction.
- `blob_file_size`: the size limit for blob files. (Note that a single flush or (sub)compaction may write multiple blob files.) Since space is reclaimed in blob file increments, the value of this parameter heavily influences space amplification.
- `blob_compression_type`: the compression type to use for blob files. All blobs in the same file are compressed using the same algorithm.
- `enable_blob_garbage_collection`: set this to `true` to make BlobDB actively relocate valid blobs from the oldest blob files as they are encountered during compaction.
- `blob_garbage_collection_age_cutoff`: the cutoff that the GC logic uses to determine which blob files should be considered “old.” For example, the default value of 0.25 signals to RocksDB that blobs residing in the oldest 25% of blob files should be relocated by GC. This parameter can be tuned to adjust the trade-off between write amplification and space amplification.
- `blob_garbage_collection_force_threshold`: if the ratio of garbage in the oldest blob files exceeds this threshold, targeted compactions are scheduled in order to force garbage collecting the blob files in question, assuming they are all eligible based on the value of `blob_garbage_collection_age_cutoff` above. This can help reduce space amplification in the case of skewed workloads where the affected files would not otherwise be picked up for compaction. This option is currently only supported with leveled compaction.
- `blob_compaction_readahead_size`: when set, BlobDB will prefetch data from blob files in chunks of the configured size during compaction. This can improve compaction performance when the database resides on higher-latency storage like HDDs or remote filesystems.

The above options are all dynamically adjustable via the `SetOptions` API; changing them will affect subsequent flushes and compactions but not ones that are already in progress.
As mentioned above, BlobDB now also supports compaction filters. Key-value separation actually enables an optimization here: if the compaction filter of an application can make a decision about a key-value solely based on the key, it is unnecessary to read the value from the blob file. Applications can take advantage of this optimization by implementing the new `FilterBlobByKey` method of the `CompactionFilter` interface. This method gets called by RocksDB first whenever it encounters a key-value where the value is stored in a blob file. If this method returns a “final” decision like `kKeep`, `kRemove`, `kChangeValue`, or `kRemoveAndSkipUntil`, RocksDB will honor that decision; on the other hand, if the method returns `kUndetermined`, RocksDB will read the blob from the blob file and call `FilterV2` with the value in the usual fashion.
The integrated implementation supports the tickers `BLOB_DB_BLOB_FILE_BYTES_{READ,WRITTEN}`, `BLOB_DB_BLOB_FILE_SYNCED`, and `BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, as well as the histograms `BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` and `BLOB_DB_(DE)COMPRESSION_MICROS`. Note that the vast majority of the legacy BlobDB's tickers/histograms are not applicable to the new implementation, since they e.g. pertain to calling dedicated BlobDB APIs (which the integrated BlobDB does not have) or are tied to the legacy BlobDB's design of writing blob files synchronously when a write API is called. Such statistics are marked "legacy BlobDB only" in `statistics.h`.
We support the following BlobDB-related properties:
- `rocksdb.num-blob-files`: the number of blob files in the current Version.
- `rocksdb.blob-stats`: returns the total number and size of all blob files, as well as the total amount of garbage (in bytes) in the blob files in the current Version and the corresponding space amplification.
- `rocksdb.total-blob-file-size`: the total size of all blob files aggregated across all Versions.
- `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
- `rocksdb.estimate-live-data-size`: this is a non-BlobDB specific property that was extended to also consider the live data bytes residing in blob files (which can be computed exactly by subtracting garbage bytes from total bytes and summing over all blob files in the current Version).
For BlobDB, the `ColumnFamilyMetaData` structure has been extended with the following information:
- a vector of `BlobMetaData` objects, one for each live blob file, which contain the file number, file name and path, file size, total number and size of all blobs in the file, total number and size of all garbage blobs in the file, as well as the file checksum method and checksum value
- the total number and size of all live blob files

This information can be retrieved using the `GetColumnFamilyMetaData` API for any given column family. You can also retrieve a consistent view of all column families using the `GetAllColumnFamilyMetaData` API.
We expose the following BlobDB-related information via the `EventListener` interface:
- Job-level information: `FlushJobInfo` and `CompactionJobInfo` contain information about the blob files generated by flush and compaction jobs, respectively. Both structures contain a vector of `BlobFileInfo` objects corresponding to the newly generated blob files; in addition, `CompactionJobInfo` also contains a vector of `BlobFileGarbageInfo` structures that describe the additional amount of unreferenced garbage produced by the compaction job in question.
- File-level information: RocksDB notifies the listener about events related to the lifecycle of any given blob file through the functions `OnBlobFileCreationStarted`, `OnBlobFileCreated`, and `OnBlobFileDeleted`.
- Operation-level information: the `OnFile*Finish` notifications are also supported for blob files.
In terms of compaction styles, we recommend using leveled compaction with BlobDB. The rationale behind universal compaction in general is to provide lower write amplification at the expense of higher read amplification; however, according to our benchmarks, BlobDB can provide very low write amp and good read performance with leveled compaction. Therefore, there is really no reason to take the hit in read performance that comes with universal compaction.
In addition to the BlobDB options above, consider tuning the following non-BlobDB specific options:
- `write_buffer_size`: this is the memtable size. You might want to increase it for large-value workloads to ensure that SST and blob files contain a decent number of keys.
- `target_file_size_base`: the target size of SST files. Note that even when using BlobDB, it is important to have an LSM tree with a “nice” shape, with multiple levels and files per level, to prevent heavy compactions. Since BlobDB extracts and writes large values to blob files, it makes sense to make this parameter significantly smaller than the memtable size, for instance by dividing up the memtable size proportionally based on the ratio of key size to value size.
- `max_bytes_for_level_base`: consider setting this to a multiple (e.g. 8x or 10x) of `target_file_size_base`.
- `compaction_readahead_size`: this is the readahead size for SST files during compactions. Again, it might make sense to set this when the database is on slower storage.
- `writable_file_max_buffer_size`: the buffer size used when writing SST and blob files. Increasing it results in larger I/Os, which might be beneficial on certain types of storage.
There are a couple of remaining features that are not yet supported by BlobDB; namely, we don’t currently support secondary instances and ingestion of blob files. We will continue to work on closing this gap.
We also have further plans when it comes to performance. These include optimizing garbage collection, introducing a dedicated cache for blobs, improving iterator performance, and evolving the blob file format, among others.