MultiGet Performance

Introduction

Looking up a key involves a lot of complexity in the underlying RocksDB implementation, which translates into computational overhead, mainly due to cache misses when probing bloom filters, virtual function call dispatches, key comparisons, and IO. Users that need to look up many keys in order to process an application-level request end up calling Get() in a loop to read the required KVs. By accepting a batch of keys, the MultiGet() API allows RocksDB to make the lookup more CPU-efficient, reducing the number of virtual function calls and pipelining cache misses. Furthermore, latency can be reduced by doing IO in parallel.
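As a quick illustration, here is a minimal usage sketch of the std::vector-based MultiGet() overload. The database path and keys are placeholders, not part of this page:

```cpp
#include <cstdio>
#include <string>
#include <vector>

#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/multiget_example", &db);
  if (!s.ok()) return 1;

  // One call covers the whole batch; RocksDB can share filter/index
  // block lookups across keys and issue data block reads in parallel.
  std::vector<rocksdb::Slice> keys = {"key1", "key2", "key3"};
  std::vector<std::string> values;
  std::vector<rocksdb::Status> statuses =
      db->MultiGet(rocksdb::ReadOptions(), keys, &values);

  for (size_t i = 0; i < keys.size(); ++i) {
    // statuses[i] is OK, NotFound, or an error; values[i] is valid
    // only when statuses[i].ok().
    if (statuses[i].ok()) {
      printf("%s -> %s\n", keys[i].ToString().c_str(), values[i].c_str());
    }
  }

  delete db;
  return 0;
}
```

More recent RocksDB releases also expose a batched overload that takes raw Slice* and PinnableSlice* arrays plus a sorted_input hint, avoiding the per-call vector allocations.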

Read Path

A typical RocksDB database instance has multiple levels, each containing a few tens to hundreds of SST files. A point lookup goes through the following stages (to keep it simple, we ignore merge operands and assume every operation is a Put):

  1. The mutable memtable is looked up. If a bloom filter is configured for the memtable, the filter is probed using either the whole key or the prefix. If the result is positive, the memtable rep is searched.
  2. If the key was not found, zero or more immutable memtables are looked up using the same process as step #1.
  3. Next, the SST files in successive levels are looked up as follows:
    1. In L0, every SST file is looked up in reverse chronological order
    2. For L1 and above, each level has a vector of SST file metadata objects, with each metadata object containing, among other things, the highest and lowest keys in the file. A binary search is performed on this vector to determine the file whose key range overlaps the desired key (a simplified sketch of this per-level lookup appears after this list). There is an auxiliary index that uses pre-calculated information about file ranges in the LSM to determine the set of files in the next level that overlap a given file. A full binary search is performed in L1, and this index is used to narrow down the binary search bounds in subsequent levels. This is known as fractional cascading.
    3. Once a candidate file is found, the file's bloom filter block is loaded (either from the block cache or disk) and probed for the key. The probe is likely to result in a CPU cache miss. In many cases, the bottommost level will not have a bloom filter.
    4. If the probe result is positive, the SST file index block is loaded and binary searched to find the target data block. The filter and index blocks may have to be read from disk, but typically they are either pinned in memory or accessed frequently enough to be found in the block cache.
    5. The data block is loaded and binary searched to find the key. Data block lookups are more likely to miss in the block cache and result in an IO. It is important to note that each block cache lookup is also likely to result in a CPU cache miss, since the block cache is indexed by a hash table.
  4. Step #3 is repeated for each level, the only difference in L2 and higher being the use of fractional cascading for the SST file lookup.
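To make step 3b concrete, here is a simplified sketch of the per-level file lookup. The FileMetadata struct and function are illustrative stand-ins, not RocksDB's actual data structures, and the fractional-cascading bound narrowing is omitted (a full binary search over the level is shown):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative stand-in for RocksDB's per-file metadata.
struct FileMetadata {
  std::string smallest_key;
  std::string largest_key;
};

// Returns the index of the file in a sorted, non-overlapping level whose
// key range may contain `key`, or -1 if the key falls in a gap.
int FindOverlappingFile(const std::vector<FileMetadata>& level_files,
                        const std::string& key) {
  // Binary search for the first file whose largest key is >= key.
  auto it = std::lower_bound(
      level_files.begin(), level_files.end(), key,
      [](const FileMetadata& f, const std::string& k) {
        return f.largest_key < k;
      });
  if (it == level_files.end() || key < it->smallest_key) {
    return -1;  // no file in this level can contain the key
  }
  return static_cast<int>(it - level_files.begin());
}
```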

MultiGet Performance Optimizations

Let us consider the case of a workload with good locality of reference. Successive point lookups in such a workload are likely to repeatedly access the same SST files and index/data blocks. For such workloads, MultiGet provides the following optimizations:

  1. When options.cache_index_and_filter_blocks=true, filter and index blocks for an SST file are fetched from the block cache on each key lookup. On a system with many threads performing reads, this results in significant lock contention on the LRU mutex. MultiGet looks up the filter and index blocks in the block cache only once for the whole batch of keys overlapping an SST file's key range, thus drastically reducing the LRU mutex contention.
  2. In steps 1, 2 and 3c, CPU cache misses occur due to bloom filter probes. Assuming a database with 6 levels, most keys being found in the bottommost level, and an average of 2 L0 files, we will have ~6 cache misses due to filter lookups in SST files. There may be an additional 1-2 cache misses if memtable bloom filters are configured. By batching the lookups at each stage, the filter cache line accesses can be pipelined, thus hiding the cache miss latency (see the sketch after this list).
  3. In a large database, data block reads are highly likely to require IO. This introduces latency. MultiGet has the capability to issue IO requests for multiple data blocks in the same SST file in parallel, thus reducing latency. This depends on the underlying Env implementation supporting parallel reads from the same thread. On Linux, PosixEnv can do parallel IO for MultiGet() using the io_uring interface, an asynchronous IO interface introduced in Linux kernel 5.1.
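The pipelining in optimization #2 can be pictured with the following sketch. This is not RocksDB's actual filter code; the single-cache-line bloom check is a hypothetical stand-in, and __builtin_prefetch is a GCC/Clang builtin. The point is that issuing all prefetches before any probe lets the CPU overlap the resulting memory stalls instead of paying them serially:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical single-cache-line bloom check: test a few bits derived
// from the hash, all within one 64-byte (512-bit) cache line.
bool CheckBitsInCacheLine(const char* line, uint32_t h) {
  for (int i = 0; i < 6; ++i) {
    uint32_t bit = h % 512;
    if ((static_cast<unsigned char>(line[bit / 8]) & (1u << (bit % 8))) == 0) {
      return false;  // definitely not present
    }
    h = h * 0x9E3779B9u + 1;  // derive the next probe position
  }
  return true;  // possibly present
}

struct BloomProbe {
  const char* cache_line;  // address of the filter cache line for this key
  uint32_t hash;           // precomputed hash of the key
};

// Probe a whole batch: prefetch every cache line first, then test the
// bits. Each prefetch starts an independent memory access, so the cache
// misses are serviced in parallel instead of one after another.
std::vector<bool> BatchedProbe(const std::vector<BloomProbe>& probes) {
  for (const BloomProbe& p : probes) {
    __builtin_prefetch(p.cache_line, /*rw=*/0, /*locality=*/3);
  }
  std::vector<bool> may_match;
  may_match.reserve(probes.size());
  for (const BloomProbe& p : probes) {
    may_match.push_back(CheckBitsInCacheLine(p.cache_line, p.hash));
  }
  return may_match;
}
```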
