MultiGet Performance
Looking up a key in RocksDB involves a lot of complexity in the underlying implementation. That complexity results in significant computational overhead, mainly from CPU cache misses when probing bloom filters, virtual function call dispatches, key comparisons, and IO. Users who need to look up many keys to process an application-level request end up calling Get() in a loop to read the required KVs. By providing a MultiGet() API that accepts a batch of keys, RocksDB can make the lookup more CPU efficient by reducing the number of virtual function calls and pipelining cache misses. Furthermore, latency can be reduced by doing IO in parallel.
A typical RocksDB database instance has multiple levels, with each level containing a few tens to hundreds of SST files. A point lookup goes through the following stages (to keep it simple, we ignore merge operands and assume everything is a Put):
1. The mutable memtable is looked up. If a bloom filter is configured for the memtable, the filter is probed using either the whole key or the prefix. If the result is positive, the memtable rep lookup happens.
2. If the key was not found, zero or more immutable memtables are looked up using the same process as in step 1.
3. Next, the SST files in successive levels are looked up as follows:
   a. In L0, every SST file is looked up in reverse chronological order.
   b. For L1 and above, each level has a vector of SST file metadata objects, with each metadata object containing, among other things, the highest and lowest key in the file. A binary search is performed in this vector to determine the file that overlaps the desired key. There is an auxiliary index that uses pre-calculated information about file ranges in the LSM to determine the set of files in the next level that overlap a given file. A full binary search is performed in L1, and this index is used to narrow down the binary search bounds in subsequent levels. This is known as fractional cascading.
   c. Once a candidate file is found, the file's bloom filter block is loaded (either from the block cache or from disk) and probed for the key. The probe is likely to result in a CPU cache miss. In many cases, the bottommost level will not have a bloom filter.
   d. If the probe result is positive, the SST file index block is loaded and binary searched to find the target data block. The filter and index blocks may have to be read from disk, but typically they are either pinned in memory or accessed frequently enough to be found in the block cache.
   e. The data block is loaded and binary searched to find the key. Data block lookups are more likely to miss in the block cache and result in an IO. It is important to note that each block cache lookup is also likely to result in a CPU cache miss, since the block cache is indexed by a hash table.
4. Step 3 is repeated for each level, with the only difference in L2 and higher being the fractional cascading for SST file lookup.
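The per-level file search for L1 and above can be sketched as follows. This is a minimal illustration under stated assumptions, not RocksDB's implementation: `FileMeta` and `FindCandidateFile` are hypothetical names, keys are compared as plain byte strings rather than through a configurable comparator, and the fractional cascading index is left out, so every level gets a full binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the per-file metadata kept for a level.
struct FileMeta {
  std::string smallest;  // lowest key in the file
  std::string largest;   // highest key in the file
};

// Returns the index of the file whose key range may contain `key`, or -1 if
// no file in the level overlaps it. Assumes the level is sorted and its files
// do not overlap, as is the case for L1 and above.
int FindCandidateFile(const std::vector<FileMeta>& level,
                      const std::string& key) {
  // Precondition: files are ordered by their largest key.
  assert(std::is_sorted(level.begin(), level.end(),
                        [](const FileMeta& a, const FileMeta& b) {
                          return a.largest < b.largest;
                        }));
  // Binary search for the first file whose largest key is >= key.
  auto it = std::lower_bound(
      level.begin(), level.end(), key,
      [](const FileMeta& f, const std::string& k) { return f.largest < k; });
  // The key may fall past the last file or into a gap between two files.
  if (it == level.end() || key < it->smallest) {
    return -1;
  }
  return static_cast<int>(it - level.begin());
}
```

A return value of -1 means the level can be skipped without touching any filter, index, or data block. Only when a candidate file is found do the filter probe and block reads from the later sub-steps come into play.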
Let us consider the case of a workload with good locality of reference. Successive point lookups in such a workload are likely to repeatedly access the same SST files and index/data blocks. For such workloads, MultiGet provides the following optimizations:
- When options.cache_index_and_filter_blocks=true is set, filter and index blocks for an SST file are fetched from the block cache on each key lookup. On a system with many threads performing reads, this results in significant lock contention on the LRU mutex. MultiGet looks up the filter and index block in the block cache only once for a whole batch of keys overlapping an SST file key range, thus drastically reducing the LRU mutex contention.
- In steps 1, 2 and 3c, CPU cache misses occur due to bloom filter probes. Assuming a database with 6 levels and most keys being found in the bottommost level, with an average of 2 L0 files, we will have ~6 cache misses due to filter lookups in SST files. There may be an additional 1-2 cache misses if memtable bloom filters are configured. By batching the lookups at each stage, the filter cache line accesses can be pipelined, thus hiding the cache miss latency.
- In a large database, data block reads are highly likely to require IO, which introduces latency. MultiGet can issue IO requests for multiple data blocks in the same SST file in parallel, thus reducing latency. This depends on support for parallel reads in the same thread from the underlying Env implementation. On Linux, PosixEnv can do parallel IO for MultiGet() using the io_uring interface. io_uring is an asynchronous IO interface available in the Linux kernel starting from version 5.1.
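To make the first optimization concrete, the sketch below counts block cache lookups for filter and index blocks with and without batching. It is a toy model under stated assumptions, not RocksDB code: `CountingBlockCache`, `LookupsPerKey`, and `LookupsBatched` are hypothetical names, each key is assumed to be already mapped to the id of the single SST file it overlaps, and every filter or index fetch is counted as one cache lookup.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for the block cache. It only counts lookups, each of
// which would take the LRU mutex in a real cache.
struct CountingBlockCache {
  std::size_t lookups = 0;
  // Fetching a file's filter block and index block costs two lookups.
  void LookupFilterAndIndex(int /*file_id*/) { lookups += 2; }
};

// Get()-in-a-loop model: every key fetches its file's filter and index blocks.
std::size_t LookupsPerKey(const std::vector<int>& file_of_key) {
  CountingBlockCache cache;
  for (int file_id : file_of_key) {
    cache.LookupFilterAndIndex(file_id);
  }
  return cache.lookups;
}

// Batched model: keys are grouped by file, and the filter and index blocks
// are fetched once per file for the whole group.
std::size_t LookupsBatched(const std::vector<int>& file_of_key) {
  CountingBlockCache cache;
  std::map<int, std::size_t> keys_per_file;
  for (int file_id : file_of_key) {
    ++keys_per_file[file_id];
  }
  for (const auto& entry : keys_per_file) {
    cache.LookupFilterAndIndex(entry.first);
  }
  return cache.lookups;
}
```

For a batch of eight keys that fall into three distinct files, this model gives 16 lookups for the per-key path and 6 for the batched path. The better the locality of the workload, the fewer distinct files a batch touches and the larger the reduction.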