
mv dynamic block size

Matthew Von-Maszewski edited this page Dec 4, 2013 · 26 revisions

Status

  • merged to master
  • code complete
  • development started November 12, 2013

History / Context

Warning: there are some environments where this feature does not help performance. Please review the Notes section for discussion.

leveldb's random reads perform best when the data is already in the block cache. The second best performance comes when the potential files of each level are already in the file cache. This branch works to automatically maximize the latter, file cache case as the entire data set grows beyond the combined block cache and file cache size. There is a heavy cost when a needed file is not in the file cache: easily a 50 to 66% performance degradation. The goal is to manipulate the block size so as to keep as many files as possible in the file cache.

Stated another way, the tuning goal is to maximize the number of open files in the file cache so that a random read only has to request data from disk, rather than open a file, read the file metadata, and only then read the data from disk. Random read performance drops rapidly and significantly (at least 50%) once file opens start, a.k.a. file cache thrashing.

Key Hierarchy

leveldb tracks the storage location of a key via a three layer hierarchy:

  • MANIFEST layer: stores first and last key of every leveldb .sst table file

  • .sst metadata layer: stores the last key of each data block in the file, along with the offset within the file to that data block

  • block metadata layer: stores a small index of "restart points" within the block, along with the offset to each restart point within the block

To find the value of a key: leveldb first determines the appropriate .sst table via the MANIFEST. It then locates the appropriate data block within the .sst file via the file's metadata index. leveldb next finds a starting point within the data block that is "close" to the actual key/value pair via the block's metadata index. Finally, it performs a linear walk from that restart point in the block to the actual key/value data.
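The lookup path above can be sketched with a deliberately simplified model. The structure names and in-memory layout here are illustrative only; real leveldb uses Version/Table/Block classes and binary search over encoded byte buffers:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified, illustrative model of the three-layer hierarchy.
struct Block {
  // restart points: a sparse index of (key, entry position) inside the block
  std::vector<std::pair<std::string, size_t>> restarts;
  std::vector<std::pair<std::string, std::string>> entries;  // sorted key/value
};

struct SstFile {
  std::string first_key, last_key;     // what the MANIFEST layer records
  std::map<std::string, Block> index;  // last key of each block -> block
};

// MANIFEST -> .sst metadata index -> restart point -> linear walk.
bool Lookup(const std::vector<SstFile>& files, const std::string& key,
            std::string* value) {
  for (const auto& f : files) {
    if (key < f.first_key || key > f.last_key) continue;  // MANIFEST check
    auto it = f.index.lower_bound(key);  // first block whose last key >= key
    if (it == f.index.end()) continue;
    const Block& b = it->second;
    size_t start = 0;  // last restart point at or before the key
    for (const auto& r : b.restarts)
      if (r.first <= key) start = r.second;
    for (size_t i = start; i < b.entries.size(); ++i) {  // linear walk
      if (b.entries[i].first == key) {
        *value = b.entries[i].second;
        return true;
      }
    }
  }
  return false;
}
```

Each layer narrows the search so that only one data block's worth of entries is ever walked linearly.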

The size of a data block is controlled by the block_size parameter of the Options structure (sst_block_size in Riak's app.config). The size represents a threshold, not a fixed count. Whenever a newly created block reaches this uncompressed size, leveldb considers it full and writes the block with its metadata to disk. The number of keys contained in the block depends upon the size of the values and keys.

The default block_size setting is 4096 bytes. Four key/value pairs of roughly 1024 bytes each will fit in a block; more will fit if the pairs are smaller. Key/value pairs larger than 4096 bytes will fit one per block (remember the block size is a threshold, not a byte limit).
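The threshold behavior can be illustrated with a small counting sketch. The sizes here are assumptions for illustration; real blocks also carry restart arrays and per-entry encoding overhead:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many data blocks a stream of key/value pairs produces when a
// block is considered full as soon as its accumulated (uncompressed) size
// reaches block_size.  Per-entry encoding overhead is ignored for simplicity.
size_t CountBlocks(const std::vector<size_t>& pair_sizes, size_t block_size) {
  size_t blocks = 0, current = 0;
  for (size_t s : pair_sizes) {
    current += s;
    if (current >= block_size) {  // threshold reached: flush the block
      ++blocks;
      current = 0;
    }
  }
  if (current > 0) ++blocks;  // final partial block
  return blocks;
}
```

With block_size 4096, four ~1024-byte pairs share one block, while a single 8000-byte pair gets a block to itself, matching the threshold description above.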

Performance Scenarios

When leveldb starts a new database (Riak vnode), there is plenty of space in the file cache for .sst metadata. The default block_size emphasizes keeping a greater number of keys in the .sst metadata layer. This emphasis allows the data blocks to be small and read more quickly from disk. The quicker disk read results in a faster response to the user.

Riak has modified leveldb to allow the total size of .sst files in the higher levels to approach 500 Mbytes. The larger file sizes assist with write throughput (which is not discussed here). The larger file sizes, combined with the 4096 default block_size, can result in .sst metadata that approaches 4 Mbytes. The large file sizes maintain the great read performance … as long as the file and its metadata remain in the file cache.

There comes a point where the number of files and the size of their metadata exceed the size of the file cache. Riak 2.0's flex-cache postpones that point, but it is still reached eventually. The random read of an individual key/value pair becomes quite expensive when leveldb must first load the file and its metadata. The 4096 byte read of a data block now additionally requires a file open (costly at the operating system level) and a read of all the metadata (maybe 4 Mbytes): 4,096 bytes becomes 4,004,096. These larger random read operations are a performance killer.
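The byte counts in this scenario work out as follows. A warm read transfers one data block; a cold read additionally pays the full metadata load (the open() syscall itself is not counted in bytes). The ~4 Mbyte figure is the illustrative one from the text:

```cpp
#include <cassert>

// Bytes transferred for one random read.  metadata_reload is 0 when the
// file is already in the file cache, and the full .sst index size when not.
long ReadCost(long block_read, long metadata_reload) {
  return block_read + metadata_reload;
}
```

A file-cache miss turns a 4,096-byte operation into a 4,004,096-byte one, roughly a thousand-fold increase in I/O before the key/value pair is even reached.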

A first-step improvement is to have leveldb adjust the block_size parameter dynamically once the file cache approaches its full state. leveldb can adjust the block_size higher, reducing the metadata size of newly created .sst files. This dynamic block_size adjustment makes individual read actions against an already open file a little slower due to more data per block. That read size cost is offset by two gains: more files fit into the file cache, reducing the number of open operations, and metadata reads of a newly opened file are less costly. Over time, leveldb will recompact all active .sst files using the dynamic block_size.
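Why raising the block_size shrinks the metadata: the .sst index holds roughly one entry (last key plus offset) per data block, so its size scales as file_size / block_size. A rough sizing sketch, where the 32 bytes per index entry is an assumed average rather than a leveldb constant:

```cpp
#include <cassert>
#include <cstddef>

// Approximate .sst metadata size: one index entry (last key + offset) per
// data block.  bytes_per_entry is an assumed average, not a leveldb constant.
size_t MetadataBytes(size_t file_size, size_t block_size,
                     size_t bytes_per_entry) {
  return (file_size / block_size) * bytes_per_entry;
}
```

For a 500 Mbyte file at the 4096-byte default this gives roughly 4 Mbytes of metadata; doubling the block_size halves it. That is the lever the dynamic adjustment pulls.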

There is a secondary performance opportunity with the dynamic block_size. Not all .sst files contain the same size key/value pairs. Riak's 2i key/value data is quite small compared to most user key/value data, and the 2i key/value data clumps together as leveldb compacts it into the higher levels. The 2i centric .sst files can therefore use a dynamic block_size different from that of the user data centric .sst files, optimizing both. No more "one size fits all" non-optimization.
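One way to realize a per-file block_size is a heuristic that caps the number of index entries each file may carry, so 2i-heavy files and user-data files arrive at different sizes through their own profiles. This is an illustrative sketch only, not Basho's actual selection logic, and target_entries is a hypothetical tuning knob:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical per-file heuristic: choose the smallest block_size, at or
// above the default, that keeps the file's index at target_entries entries
// or fewer.  Bigger files are pushed toward bigger blocks automatically.
size_t ChooseBlockSize(size_t file_size, size_t default_block_size,
                       size_t target_entries) {
  size_t needed = (file_size + target_entries - 1) / target_entries;  // ceil
  return std::max(default_block_size, needed);
}
```

A 500 Mbyte higher-level file would get a much larger block_size than a small, freshly written one, which keeps every file's metadata footprint bounded regardless of its contents.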
