Subcompaction
Here we explain subcompaction, which is used in both Leveled and Universal compaction.
The goal of subcompaction is to speed up a compaction job by partitioning it among multiple threads.
It is employed when one of the following conditions holds (see Compaction::ShouldFormSubcompactions()):
- Leveled Compaction: L0 -> Lo where o > 0, or a manual compaction.
  - Why: an L0 -> Lo compaction cannot run in parallel with another L0 -> Lo compaction, since L0 files usually overlap, so partitioning is the only way to speed it up. A manual compaction is invoked by the user and is usually not parallelized, so it also benefits from partitioning.
- Universal Compaction: everything except L0 -> L0, i.e., any compaction with output level > 0 (a simplified sketch of this check follows the list).
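Below is a minimal sketch of that eligibility check, not the actual Compaction::ShouldFormSubcompactions() implementation; the parameter names are illustrative and additional conditions (for example, max_subcompactions > 1) are omitted.

```cpp
// Minimal sketch of the eligibility rules above; not the real
// Compaction::ShouldFormSubcompactions(). Parameter names are illustrative
// and further checks (e.g., max_subcompactions > 1) are omitted.
bool ShouldFormSubcompactionsSketch(bool is_leveled, int start_level,
                                    int output_level, bool is_manual) {
  if (is_leveled) {
    // Leveled: L0 -> Lo with o > 0, or any manual compaction.
    return (start_level == 0 && output_level > 0) || is_manual;
  }
  // Universal: anything except L0 -> L0, i.e., output level > 0.
  return output_level > 0;
}
```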
A compaction job is executed in three stages: Prepare(), Run(), and Install(). We describe how subcompaction works in each of these stages.
In the Prepare() stage, a compaction's input key range is partitioned into n ranges to allow for n subcompactions later. Ideally, we want to partition the input into subranges such that the workload is evenly distributed among the subcompaction jobs.
The logic for partitioning (introduced in #10393) is mostly in CompactionJob::GenSubcompactionBoundaries(). It works roughly as follows:
- For every compaction input file, we sample up to 128 anchor points that evenly partition the file into subranges. The sampled anchor points also include the range sizes and may look something like this:
File1: (a1, 1000), (b1, 1200), (c1, 1100)
File2: (a2, 1100), (b2, 1000), (c2, 1000)
- Combine, sort, and deduplicate the anchor points from all compaction input files. For the above example, we get:
(a1, 1000), (a2, 1100), (b1, 1200), (b2, 1000), (c1, 1100), (c2, 1000)
- Determine the target_input_size of each subcompaction. With max_subcompactions = N, we may not need all N subcompactions if the input is not big enough, so we first determine the target_input_size of a subcompaction. It is calculated as max(input size / N, target file size of the output level).
- Iterate through the anchor points and use the accumulated sum of range sizes to determine the boundaries of each subcompaction. The input size of each subcompaction is roughly the target_input_size determined in the previous step (see the sketch below).
Suppose target_input_size = 3200. Then, based on the anchor points, we take "b1" as the partition key, since the accumulated size of the first three ranges reaches 3200:
(a1, 1000), (a2, 1100), (b1, 1200), (b2, 1000), (c1, 1100), (c2, 1000)
This gives two subcompactions:
subcompaction1: all keys < b1
subcompaction2: all keys >= b1
Note that since the ranges defined by anchor points from different compaction input files can overlap, there may be some inaccuracy in estimating the input size this way. This should be mitigated by having a large number (128) of anchor points from each input file.
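A minimal sketch of this partitioning, assuming anchors have already been sampled from each input file; the Anchor struct and function names are illustrative and this is not the actual CompactionJob::GenSubcompactionBoundaries() code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-in for a sampled anchor point: a user key plus the
// estimated size of the data range ending at that key.
struct Anchor {
  std::string user_key;
  uint64_t range_size;
};

// Sketch of the boundary selection described above (not the actual
// CompactionJob::GenSubcompactionBoundaries()).
std::vector<std::string> PickSubcompactionBoundaries(
    std::vector<Anchor> anchors, uint64_t total_input_size,
    uint32_t max_subcompactions, uint64_t output_level_target_file_size) {
  assert(max_subcompactions > 0);

  // Sort and deduplicate anchors from all input files. Lexicographic order
  // is used here for simplicity; RocksDB orders keys with the column
  // family's comparator.
  std::sort(anchors.begin(), anchors.end(),
            [](const Anchor& a, const Anchor& b) {
              return a.user_key < b.user_key;
            });
  anchors.erase(std::unique(anchors.begin(), anchors.end(),
                            [](const Anchor& a, const Anchor& b) {
                              return a.user_key == b.user_key;
                            }),
                anchors.end());

  // target_input_size = max(input size / N, output level's target file
  // size), so a small compaction is not split N ways.
  const uint64_t target_input_size =
      std::max(total_input_size / max_subcompactions,
               output_level_target_file_size);

  // Accumulate range sizes and cut a boundary each time the running total
  // reaches target_input_size.
  std::vector<std::string> boundaries;
  uint64_t accumulated = 0;
  for (const Anchor& a : anchors) {
    accumulated += a.range_size;
    if (accumulated >= target_input_size) {
      boundaries.push_back(a.user_key);
      accumulated = 0;
    }
  }
  return boundaries;
}
```

Applied to the example above with target_input_size = 3200, the running total first reaches the target at b1, so b1 becomes the only boundary and the compaction is split into two subcompactions.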
Before #10393, the partitioning was based on a heuristic that had worked out reasonably well but could be improved in many ways. That heuristic worked roughly as follows:
- Select boundaries based on the natural boundaries of the input levels/files:
  - the first and last key of each L0 file
  - the first and last key of each non-zero, non-last level
  - the first key of each SST file of the last level
- Use Versions::ApproximateSize to estimate the data size between consecutive boundaries.
- Merge boundaries to eliminate empty and smaller-than-average ranges (a sketch of this merge step follows the list):
  - find the average size of the ranges
  - starting from the beginning, greedily merge adjacent ranges until their total size exceeds the average
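The greedy merge step of that earlier heuristic can be sketched as follows; the SizedRange struct and function name are illustrative, not actual RocksDB code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative range between two candidate boundaries, with an estimated
// data size (obtained via Versions::ApproximateSize in the real code).
struct SizedRange {
  std::string start;
  std::string limit;
  uint64_t size;
};

// Sketch of the greedy merge: starting from the beginning, absorb adjacent
// ranges into the current one until its size exceeds the average.
std::vector<SizedRange> MergeSmallRanges(const std::vector<SizedRange>& in) {
  if (in.empty()) return {};
  uint64_t total = 0;
  for (const auto& r : in) total += r.size;
  const uint64_t average = total / in.size();

  std::vector<SizedRange> out;
  SizedRange current = in[0];
  for (size_t i = 1; i < in.size(); ++i) {
    if (current.size <= average) {
      // Current range is still at or below average: merge in the next one.
      current.limit = in[i].limit;
      current.size += in[i].size;
    } else {
      out.push_back(current);
      current = in[i];
    }
  }
  out.push_back(current);
  return out;
}
```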
In the Run() stage, the compaction job spawns n-1 additional threads, so in total n threads execute the subcompactions in parallel. Each subcompaction executes CompactionJob::ProcessKeyValueCompaction() and generates compaction output SST files. If an EventListener is configured, each subcompaction invokes the OnSubcompactionBegin() and OnSubcompactionCompleted() callbacks. After all subcompactions finish, compaction-related statistics are aggregated, recorded in RocksDB statistics, and shared with the EventListener via the OnCompactionCompleted() callback.
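As an illustration, a listener along these lines can be registered via options.listeners to observe subcompaction activity. This is only a sketch: the callback signatures follow rocksdb/listener.h, but the SubcompactionJobInfo/CompactionJobInfo fields used here (cf_name, output_level) should be checked against your RocksDB version.

```cpp
#include <atomic>
#include <cstdio>

#include "rocksdb/db.h"
#include "rocksdb/listener.h"

// Sketch: count completed subcompactions and log when a compaction finishes.
// cf_name and output_level are assumed fields; verify in rocksdb/listener.h.
class SubcompactionLogger : public rocksdb::EventListener {
 public:
  void OnSubcompactionBegin(
      const rocksdb::SubcompactionJobInfo& info) override {
    std::fprintf(stderr, "subcompaction begin (cf %s)\n",
                 info.cf_name.c_str());
  }
  void OnSubcompactionCompleted(
      const rocksdb::SubcompactionJobInfo& /*info*/) override {
    completed_.fetch_add(1);
  }
  void OnCompactionCompleted(rocksdb::DB* /*db*/,
                             const rocksdb::CompactionJobInfo& info) override {
    std::fprintf(stderr,
                 "compaction to L%d done; %d subcompactions completed so far\n",
                 info.output_level, completed_.load());
  }

 private:
  std::atomic<int> completed_{0};
};
```

The listener can be attached before opening the database, e.g. options.listeners.push_back(std::make_shared<SubcompactionLogger>()).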
In the Install() stage, the compaction output files from all subcompactions are sorted by their key ranges and installed.
The option max_subcompactions limits the maximum number of subcompactions for each compaction. By default, max_subcompactions = 1, which means subcompaction is disabled. Note that the round-robin compaction priority (kRoundRobin) under leveled compaction allows subcompactions by default, and in that case the number of subcompactions can be larger than max_subcompactions; see CompactionJob::GenSubcompactionBoundaries() for details.
Caveat (Issue #9291): subcompactions ignore the limits max_background_jobs and max_background_compactions. Each compaction is allowed to be split into up to max_subcompactions subcompactions, so in total there may be up to max_background_jobs * max_subcompactions background threads running compaction.
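For instance, with the illustrative settings below, up to 4 * 4 = 16 threads could be doing compaction work at the same time, even though only 4 background jobs are scheduled:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  // At most 4 background compaction/flush jobs are scheduled at a time...
  options.max_background_jobs = 4;
  // ...but each compaction may split into up to 4 subcompactions, so up to
  // 4 * 4 = 16 threads may be running compaction work (see Issue #9291).
  options.max_subcompactions = 4;
  return options;
}
```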