Rocksdb BlockBasedTable Format

This page is forked from LevelDB's document on table format, and reflects changes we have made during the development of RocksDB.

BlockBasedTable is the default SST table format in RocksDB.

File format

<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]                  (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block]  (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block]          (see section: "range deletion" Meta Block)
[meta block 5: stats block]                   (see section: "properties" Meta Block)
...
[meta block K: future extended block]  (we may add more meta blocks in the future)
[metaindex block]
[Footer]                               (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>

The file contains internal pointers, called BlockHandles, containing the following information:

offset:         varint64
size:           varint64

See this document for an explanation of varint64 format.

(1) The sequence of key/value pairs in the file are stored in sorted order and partitioned into a sequence of data blocks. These blocks come one after another at the beginning of the file. Each data block is formatted according to the code in block_builder.cc (see code comments in the file), and then optionally compressed.

(2) After the data blocks, we store a bunch of meta blocks. The supported meta block types are described below. More meta block types may be added in the future. Each meta block is again formatted using block_builder.cc and then optionally compressed.

(3) A metaindex block contains one entry for every meta block, where the key is the name of the meta block and the value is a BlockHandle pointing to that meta block.

(4) An index block contains one entry per data block, where the key is a string >= last key in that data block and before the first key in the successive data block. The value is the BlockHandle for the data block. If kTwoLevelIndexSearch is used as IndexType the index block is a 2nd level index on index partitions, i.e., each entry points to another index block that contains one entry per data block. In this case, the format will be

[index block - 1st level]
[index block - 1st level]
...
[index block - 1st level]
[index block - 2nd level]

Up to RocksDB version 5.14, BlockBasedTableOptions::format_version=2, the format of index and data blocks are the same, where the index blocks use same key format of <user_key,seq> but special values, <offset,size>, that point to data blocks. format_version=3,4 offer more optimized, yet forward-incompatible format for index blocks.

format_version=3 (Since RocksDB 5.15): In most of the cases the sequence number seq is not necessary for keys in the index blocks. In such cases, this format_version skips encoding the sequence number and sets index_key_is_user_key in TableProperties, which is used by the reader to know how to decode the index block.
format_version=4 (Since RocksDB 5.16): Changes the format of index blocks by delta encoding the index values, which are the block handles. This saves the encoding of BlockHandle::offset of the non-head index entries in each restart interval. If used, TableProperties::index_value_is_delta_encoded is set, which is used by the reader to know how to decode the index block. The format of each key is (shared_size, non_shared_size, shared, non_shared). The format of each value, i.e., block handle, is (offset, size) whenever the shared_size is 0, which included the first entry in each restart point. Otherwise the format is delta-size = block handle size - size of last block handle.

The index format in format_version=4 would be as follows:

restart_point   0: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
restart_point   1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
...
restart_point n-1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
where, k is key, v is value, and its encoding is in parenthesis.

(5) At the very end of the file is a fixed length footer that contains the BlockHandle of the metaindex and index blocks as well as a magic number.

   metaindex_handle: char[p];      // Block handle for metaindex
   index_handle:     char[q];      // Block handle for index
   padding:          char[40-p-q]; // zeroed bytes to make fixed length
                                   // (40==2*BlockHandle::kMaxEncodedLength)
   magic:            fixed64;      // 0x88e241b785f4cff7 (little-endian)

`Filter` Meta Block

Full filter

In this filter there is one filter block for the entire SST file.

Partitioned Filter

The full filter is partitioned into multiple blocks. A top-level index block is added to map keys to corresponding filter partitions. Read more here.

Block-based filter

Note: the below explains block based filter, which is deprecated.

If a "FilterPolicy" was specified when the database was opened, a filter block is stored in each table. The "metaindex" block contains an entry that maps from "filter." to the BlockHandle for the filter block, where "" is the string returned by the filter policy's Name() method.

The filter block stores a sequence of filters, where filter i contains the output of FilterPolicy::CreateFilter() on all keys that are stored in a block whose file offset falls within the range

[ i*base ... (i+1)*base-1 ]

Currently, "base" is 2KB. So, for example, if blocks X and Y start in the range [ 0KB .. 2KB-1 ], all of the keys in X and Y will be converted to a filter by calling FilterPolicy::CreateFilter(), and the resulting filter will be stored as the first filter in the filter block.

The filter block is formatted as follows:

 [filter 0]
 [filter 1]
 [filter 2]
 ...
 [filter N-1]

 [offset of filter 0]                  : 4 bytes
 [offset of filter 1]                  : 4 bytes
 [offset of filter 2]                  : 4 bytes
 ...
 [offset of filter N-1]                : 4 bytes

 [offset of beginning of offset array] : 4 bytes
 lg(base)                              : 1 byte

The offset array at the end of the filter block allows efficient mapping from a data block offset to the corresponding filter.

`Properties` Meta Block

This meta block contains a bunch of properties. The key is the name of the property. The value is the property.

The stats block is formatted as follows:

 [prop1]    (Each property is a key/value pair)
 [prop2]
 ...
 [propN]

Properties are guaranteed to sort with no duplication.

By default, each table provides the following properties.

 data size               // the total size of all data blocks. 
 index size              // the size of the index block.
 filter size             // the size of the filter block.
 raw key size            // the size of all keys before any processing.
 raw value size          // the size of all value before any processing.
 number of entries
 number of data blocks

RocksDB also provides users the "callback" to collect their interested properties about this table. Please refer to UserDefinedPropertiesCollector.

`Compression Dictionary` Meta Block

This metablock contains the dictionary used to prime the compression library before compressing/decompressing each block. Its purpose is to address a fundamental problem with dynamic dictionary compression algorithms on small data blocks: the dictionary is built during a single pass over the block, so small data blocks always have small and thus ineffective dictionaries.

Our solution is to initialize the compression library with a dictionary built from data sampled from previously seen blocks. This dictionary is then stored in a file-level meta-block for use during decompression. The upper-bound on the size of this dictionary is configurable via CompressionOptions::max_dict_bytes. By default it is zero, i.e., the block is not generated or stored. Currently this feature is supported with kZlibCompression, kLZ4Compression, kLZ4HCCompression, and kZSTDNotFinalCompression.

More specifically, the compression dictionary is built only during compaction to the bottommost level, where the data is largest and most stable. To avoid iterating over input data multiple times, the dictionary includes samples from the subcompaction's first output file only. Then, the dictionary is applied to and stored in meta-blocks of all subsequent output files. Note the dictionary is not applied to or stored in the first file since its contents are not finalized until that file has been fully processed.

Currently the sampling is uniformly random and each sample is 64 bytes. We do not know in advance the size of the output file when selecting the sample offsets, so we assume it'll reach the maximum size, which is usually true since it's the first file in the subcompaction. In case the file is smaller, some sample intervals will refer to offsets beyond EOF, which just means the dictionary will be a bit smaller than CompressionOptions::max_dict_bytes.

`Range Deletion` Meta Block

This metablock contains the range deletions in the file's key-range and seqnum-range. Range deletions cannot be inlined in the data blocks together with point data since the ranges would then not be binary searchable.

The block format is the standard key-value format. A range deletion is encoded as follows:

User key: the range's begin key
Sequence number: the sequence number at which the range deletion was inserted to the DB
Value type: kTypeRangeDeletion
Value: the range's end key

Range deletions are assigned sequence numbers when inserted using the same mechanism as non-range data types (puts, deletes, etc.). They also traverse through the LSM using the same flush/compaction mechanism as point data. They can be obsoleted (i.e., dropped) only during compaction to the bottommost level.

Contents

RocksDB Wiki
Overview
RocksDB FAQ
Terminology
Requirements
Contributors' Guide
Release Methodology
RocksDB Users and Use Cases
RocksDB Public Communication and Information Channels
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
Options
- Setup Options and Basic Tuning
- Option String and Option Map
- RocksDB Options File
MemTable
Journal
- Write Ahead Log (WAL)
- MANIFEST
- Track WAL in MANIFEST
Cache
- Block Cache
- SecondaryCache (Experimental)
Write Buffer Manager
Compaction
- Leveled Compaction
- Universal compaction style
- FIFO compaction style
- Manual Compaction
- Subcompaction
- Choose Level Compaction Files
- Managing Disk Space Utilization
- Trivial Move Compaction
- Remote Compaction (Experimental)
SST File Formats
- Block-based Table Format
- PlainTable Format
- CuckooTable Format
- Index Block Format
- Bloom Filter
- Data Block Hash Index
IO
- Rate Limiter
- SST File Manager
- Direct I/O
Compression
- Dictionary Compression
Full File Checksum and Checksum Handoff
Background Error Handling
Huge Page TLB Support
Tiered Storage (Experimental)
Logging and Monitoring
- Logger
- Statistics
- Compaction Stats and DB Status
- Perf Context and IO Stats Context
- EventListener
Known Issues
Troubleshooting Guide
Tests
- Stress Test
- Fuzzing
- Benchmarking
Tools / Utilities
- Administration and Data Access Tool
- How to Backup RocksDB?
- Replication Helpers
- Checkpoints
- How to persist in-memory RocksDB database
- Third-party language bindings
- RocksDB Trace, Replay, Analyzer, and Workload Generation
- Block cache analysis and simulation tools
- IO Tracer and Parser
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
Extending RocksDB
- RocksDB Configurable Objects
- The Customizable Class
- Object Registry
RocksJava
- RocksJava Basics
- Logging in RocksJava
- JNI Debugging
- RocksJava API TODO
- RocksJava Performance on Flash Storage
- Tuning RocksDB from Java
Lua
- Lua CompactionFilter
Performance
- Performance Benchmarks
- In Memory Workload Performance
- Read-Modify-Write (Merge) Performance
- Delete A Range Of Keys
- Write Stalls
- Pipelined Write
- MultiGet Performance
- Tuning Guide
- Memory usage in RocksDB
- Speed-Up DB Open
- Implement Queue Service Using RocksDB
- Asynchronous IO
- Off-peak in RocksDB
Projects Being Developed
Misc
- Building on Windows
- Developing with an IDE
- Open Projects
- Talks
- Publication
- Features Not in LevelDB
- How to ask a performance-related question?
- Articles about Rocks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rocksdb BlockBasedTable Format

File format

`Filter` Meta Block

Full filter

Partitioned Filter

Block-based filter

`Properties` Meta Block

`Compression Dictionary` Meta Block

`Range Deletion` Meta Block

Clone this wiki locally

Rocksdb BlockBasedTable Format

File format

Filter Meta Block

Full filter

Partitioned Filter

Block-based filter

Properties Meta Block

Compression Dictionary Meta Block

Range Deletion Meta Block

Clone this wiki locally

`Filter` Meta Block

`Properties` Meta Block

`Compression Dictionary` Meta Block

`Range Deletion` Meta Block