Subcompaction
Here we explain subcompaction, which is used in both Leveled and Universal compaction.
The goal of subcompaction is to speed up a compaction job by partitioning it among multiple threads.
It is employed when one of the following conditions holds (see Compaction::ShouldFormSubcompactions()):
- Leveled Compaction: L0 -> Lo where o > 0, or a manual compaction.
  - Why: an L0 -> Lo compaction cannot run in parallel with another L0 -> Lo compaction, since L0 files usually overlap, so partitioning is the only way to speed it up. A manual compaction is invoked by the user and is usually not parallelized, so it also benefits from partitioning.
- Universal Compaction: everything except L0 -> L0, i.e., any compaction with output level > 0 (a simplified sketch of this check follows the list).
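Below is a minimal sketch of that eligibility check, not the actual Compaction::ShouldFormSubcompactions() implementation; the parameter names are illustrative and additional conditions (for example, max_subcompactions > 1) are omitted.

```cpp
// Minimal sketch of the eligibility rules above; not the real
// Compaction::ShouldFormSubcompactions(). Parameter names are illustrative
// and further checks (e.g., max_subcompactions > 1) are omitted.
bool ShouldFormSubcompactionsSketch(bool is_leveled, int start_level,
                                    int output_level, bool is_manual) {
  if (is_leveled) {
    // Leveled: L0 -> Lo with o > 0, or any manual compaction.
    return (start_level == 0 && output_level > 0) || is_manual;
  }
  // Universal: anything except L0 -> L0, i.e., output level > 0.
  return output_level > 0;
}
```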
A compaction job is executed in three stages: Prepare(), Run(), and Install(). We describe how subcompaction works in each of these stages.
In the Prepare() stage, a compaction's input key range is partitioned into n ranges to allow for n subcompactions later. Ideally, we want to partition the input into subranges such that the workload is evenly distributed among the subcompaction jobs.
The logic for partitioning (introduced in #10393) is mostly in CompactionJob::GenSubcompactionBoundaries(). It works roughly as follows:
- For every compaction input file, we sample up to 128 anchor points that evenly partition the file into subranges. The sampled anchor points also include the range sizes and may look something like this:
File1: (a1, 1000), (b1, 1200), (c1, 1100)
File2: (a2, 1100), (b2, 1000), (c2, 1000)
- Combine, sort, and deduplicate the anchor points from all compaction input files. For the above example, we get:
(a1, 1000), (a2, 1100), (b1, 1200), (b2, 1000), (c1, 1100), (c2, 1000)
- Determine the target_input_size of each subcompaction. With max_subcompactions = N, we may not need all N subcompactions if the input is not big enough, so we first determine the target_input_size of a subcompaction. It is calculated as max(input size / N, target file size of the output level).
- Iterate through the anchor points and use the accumulated sum of range sizes to determine the boundaries of each subcompaction. The input size of each subcompaction is roughly the target_input_size determined in the previous step (see the sketch below).
Suppose target_input_size = 3200. Then, based on the anchor points, we take "b1" as the partition key, since the accumulated size of the first three ranges reaches 3200:
(a1, 1000), (a2, 1100), (b1, 1200), (b2, 1000), (c1, 1100), (c2, 1000)
This gives two subcompactions:
subcompaction1: all keys < b1
subcompaction2: all keys >= b1
Note that since the ranges defined by anchor points from different compaction input files can overlap, there may be some inaccuracy in estimating the input size this way. This should be mitigated by having a large number (128) of anchor points from each input file.
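A minimal sketch of this partitioning, assuming anchors have already been sampled from each input file; the Anchor struct and function names are illustrative and this is not the actual CompactionJob::GenSubcompactionBoundaries() code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-in for a sampled anchor point: a user key plus the
// estimated size of the data range ending at that key.
struct Anchor {
  std::string user_key;
  uint64_t range_size;
};

// Sketch of the boundary selection described above (not the actual
// CompactionJob::GenSubcompactionBoundaries()).
std::vector<std::string> PickSubcompactionBoundaries(
    std::vector<Anchor> anchors, uint64_t total_input_size,
    uint32_t max_subcompactions, uint64_t output_level_target_file_size) {
  assert(max_subcompactions > 0);

  // Sort and deduplicate anchors from all input files. Lexicographic order
  // is used here for simplicity; RocksDB orders keys with the column
  // family's comparator.
  std::sort(anchors.begin(), anchors.end(),
            [](const Anchor& a, const Anchor& b) {
              return a.user_key < b.user_key;
            });
  anchors.erase(std::unique(anchors.begin(), anchors.end(),
                            [](const Anchor& a, const Anchor& b) {
                              return a.user_key == b.user_key;
                            }),
                anchors.end());

  // target_input_size = max(input size / N, output level's target file
  // size), so a small compaction is not split N ways.
  const uint64_t target_input_size =
      std::max(total_input_size / max_subcompactions,
               output_level_target_file_size);

  // Accumulate range sizes and cut a boundary each time the running total
  // reaches target_input_size.
  std::vector<std::string> boundaries;
  uint64_t accumulated = 0;
  for (const Anchor& a : anchors) {
    accumulated += a.range_size;
    if (accumulated >= target_input_size) {
      boundaries.push_back(a.user_key);
      accumulated = 0;
    }
  }
  return boundaries;
}
```

Applied to the example above with target_input_size = 3200, the running total first reaches the target at b1, so b1 becomes the only boundary and the compaction is split into two subcompactions.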
Before #10393, the partitioning was based on a heuristic that had worked out reasonably well but could be improved in many ways. That heuristic worked roughly as follows:
- Select boundaries based on the natural boundaries of the input levels/files:
  - the first and last key of each L0 file
  - the first and last key of each non-zero, non-last level
  - the first key of each SST file of the last level
- Use Versions::ApproximateSize to estimate the data size between consecutive boundaries.
- Merge boundaries to eliminate empty and smaller-than-average ranges (a sketch of this merge step follows the list):
  - find the average size of the ranges
  - starting from the beginning, greedily merge adjacent ranges until their total size exceeds the average
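The greedy merge step of that earlier heuristic can be sketched as follows; the SizedRange struct and function name are illustrative, not actual RocksDB code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative range between two candidate boundaries, with an estimated
// data size (obtained via Versions::ApproximateSize in the real code).
struct SizedRange {
  std::string start;
  std::string limit;
  uint64_t size;
};

// Sketch of the greedy merge: starting from the beginning, absorb adjacent
// ranges into the current one until its size exceeds the average.
std::vector<SizedRange> MergeSmallRanges(const std::vector<SizedRange>& in) {
  if (in.empty()) return {};
  uint64_t total = 0;
  for (const auto& r : in) total += r.size;
  const uint64_t average = total / in.size();

  std::vector<SizedRange> out;
  SizedRange current = in[0];
  for (size_t i = 1; i < in.size(); ++i) {
    if (current.size <= average) {
      // Current range is still at or below average: merge in the next one.
      current.limit = in[i].limit;
      current.size += in[i].size;
    } else {
      out.push_back(current);
      current = in[i];
    }
  }
  out.push_back(current);
  return out;
}
```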
In the Run() stage, the compaction job spawns n-1 additional threads, so in total n threads execute the subcompactions in parallel. Each subcompaction executes CompactionJob::ProcessKeyValueCompaction() and generates compaction output SST files. If an EventListener is configured, each subcompaction invokes the OnSubcompactionBegin() and OnSubcompactionCompleted() callbacks. After all subcompactions finish, compaction-related statistics are aggregated, recorded in RocksDB statistics, and shared with the EventListener via the OnCompactionCompleted() callback.
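As an illustration, a listener along these lines can be registered via options.listeners to observe subcompaction activity. This is only a sketch: the callback signatures follow rocksdb/listener.h, but the SubcompactionJobInfo/CompactionJobInfo fields used here (cf_name, output_level) should be checked against your RocksDB version.

```cpp
#include <atomic>
#include <cstdio>

#include "rocksdb/db.h"
#include "rocksdb/listener.h"

// Sketch: count completed subcompactions and log when a compaction finishes.
// cf_name and output_level are assumed fields; verify in rocksdb/listener.h.
class SubcompactionLogger : public rocksdb::EventListener {
 public:
  void OnSubcompactionBegin(
      const rocksdb::SubcompactionJobInfo& info) override {
    std::fprintf(stderr, "subcompaction begin (cf %s)\n",
                 info.cf_name.c_str());
  }
  void OnSubcompactionCompleted(
      const rocksdb::SubcompactionJobInfo& /*info*/) override {
    completed_.fetch_add(1);
  }
  void OnCompactionCompleted(rocksdb::DB* /*db*/,
                             const rocksdb::CompactionJobInfo& info) override {
    std::fprintf(stderr,
                 "compaction to L%d done; %d subcompactions completed so far\n",
                 info.output_level, completed_.load());
  }

 private:
  std::atomic<int> completed_{0};
};
```

The listener can be attached before opening the database, e.g. options.listeners.push_back(std::make_shared<SubcompactionLogger>()).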
In the Install() stage, the compaction output files from all subcompactions are sorted by their key ranges and installed.
The option max_subcompactions limits the maximum number of subcompactions for each compaction. By default, max_subcompactions = 1, which means subcompaction is disabled. Note that the round-robin compaction priority (kRoundRobin) under leveled compaction allows subcompactions by default, and in that case the number of subcompactions can be larger than max_subcompactions; see CompactionJob::GenSubcompactionBoundaries() for details.
Caveat (Issue #9291): subcompactions ignore the limits max_background_jobs and max_background_compactions. Each compaction is allowed to be split into up to max_subcompactions subcompactions, so in total there may be up to max_background_jobs * max_subcompactions background threads running compaction.
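For instance, with the illustrative settings below, up to 4 * 4 = 16 threads could be doing compaction work at the same time, even though only 4 background jobs are scheduled:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  // At most 4 background compaction/flush jobs are scheduled at a time...
  options.max_background_jobs = 4;
  // ...but each compaction may split into up to 4 subcompactions, so up to
  // 4 * 4 = 16 threads may be running compaction work (see Issue #9291).
  options.max_subcompactions = 4;
  return options;
}
```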