Skip to content

mv dynamic block size

Matthew Von-Maszewski edited this page Dec 5, 2013 · 26 revisions

Status

  • merged to master
  • code complete
  • development started November 12, 2013

History / Context

Warning: there are some environments where this feature does not help performance. Please review the Notes section for discussion.

leveldb's random reads perform best when the data is already in the block cache. The second best performance is when the potential files of each level that could contain the data are already in the file cache. The third best performance is when the file open operation and data read are minimized. The branch works to automatically maximize the second case so that the file cache has the potential to contain the entire, active set of files even as the dataset grows beyond the block cache and file cache size. Once the dataset grows beyond any possible maximization of the second case, the files' metadata has already been incrementally rebuilt to maximize the third case.

This branch's motivation is from heavy, long write tests that create large datasets. These tests' throughput suddenly degrade by 50 to 60% when the file cache is full. The underlying cause if file thrashing: opening and closing hundreds to thousands of files per second. The problem can be postponed by manually changing the block_size to a larger value, and rerunning the tests. This pushes severe degradation point to a larger dataset size, but also slows down random access. The better solution is to manipulate the block_size parameter dynamically to keep is small during early operations, then gradually raise it as needed to both maximum number of files in the file cache and maintain the highest random access rates.

Key Hierarchy

The underlying technique for dynamic block sizing is created based upon knowledge of how leveldb maintains it key indexes. leveldb tracks the storage location of a key via a three layer hierarchy:

  • MANIFEST layer: stores first and last key of every leveldb .sst table file. The manifest is kept completely in memory. There is one entry per file. Every file is constantly represented in memory. Larger file sizes result in fewer entries.

  • .sst metadata layer: stores one representative key per file data block. The metadata size increases as the block size decreases. This layer is stored in the file cache. It is loaded and unloaded based upon file usage and available file cache space.

  • block metadata layer: stores an index of "restart points" within block along with offset to that location within the block. This layer is stored in the block cache. It is loaded and unload based upon block access and available block cache space. Each block contains at least block_size of user data before compression. The block's index size is influenced by the block_restart_interval option. The higher the number, the smaller the block index (but more time needed to linearly scan the block for the exact key).

To find the value of a key:

** leveldb first determines the appropriate .sst table via the MANIFEST.
** It then locates the appropriate data block within the .sst file via the file's metadata index.
** leveldb next finds a starting point within the data block that is "close" to the actual key/value pair via the block's metadata index.
** And finally performs a linear walk from the restart point in the block to the actual key/value data.

So there is a simple size tradeoff between the file cache's metadata and the block cache's blocks. Smaller blocks cause larger metadata and are quicker to read from disk IF the file is already present in the file cache. Larger blocks are slower to read from the disk, but they result in smaller file cache entries. Smaller file cache entries allow more files to be simultaneously present in the file cache.

about leveldb's block_size

The size of a data block is controlled by the block_size parameter of the Options structure (sst_block_size in Riak's app.config). The size represents a threshold, not a fixed count. Whenever a newly created block reaches this uncompressed size, leveldb considers it full and writes the block with its metadata to disk. The number of keys contained in the block depends upon the size of the values and keys.

This branch dynamically raises block_size as space in the file cache becomes constrained.

Performance Stages

Stage 1: Physical RAM larger than dataset

When leveldb starts a new database (Riak vnode), there is plenty of space in the file cache for .sst metadata. The default block_size of 4096 is reasonable value that for many scenarios emphasizes keeping a greater number of keys in the .sst metadata layer. This emphasis allows the data blocks to be small and read more quickly from disk. The quicker disk read results in a faster response to the user.

User could now chose to lower the initial block_size to match their value size (assuming values smaller than 4096 bytes) because of this dynamic block_size branch. A block_size matching the user's value size will enhance random read operations. As already discussed, the smaller block_size will likely inflate the file metadata and fill the file cache sooner. But the branch's dynamic component should gradually raise the block_size automatically before the file cache actually starts thrashing.

If physical RAM is dramatically larger that the total dataset size and is expected to remain that way for an extended period of time, also consider enabling this branch: https://github.com/basho/leveldb/wiki/mv-fadvise-control

Stage 2: Dataset size greater than physical RAM, but all file metadata fits in file cache allocation

There comes a point where the number of files and the size of their metadata exceeds the size of the file cache. Riak 2.0's flex-cache extends when that point is reached, but it still happens eventually. The random read of an individual key/value pair becomes quite expensive when leveldb must first load the file and its metadata. The 4096 byte read of a data block now additionally requires a file open (operating system costly) and the read of all the metadata (maybe 4Mbytes): 4,096 bytes becomes 4,004,096. These larger random read operations are a performance killer.

This branch's first improvement is to have leveldb adjust the block_size parameter dynamically once the file_cache approaches its full state. leveldb adjusts the block_size incrementally higher, reducing the metadata size of newly created .sst files. This dynamic block_size adjustment makes individual read actions of an already open file a little slower due to the data block being larger. That read size cost is offset by two gains: more files can fit into the file cache reducing the number of open operations, and metadata reads of a newly opened file will be less costly. Over time, leveldb will recompact all active .sst files using the dynamic block_size.

This branch's second improvement is a result of adjusting the block_size based upon the size of data currently being compacted. Not all .sst table files contain the same size key/value pairs. The size of Riak's 2i key/value data is quite small compared to most user key/value data. And the 2i key/value data clumps together as leveldb compacts it into higher .sst files. The 2i centric .sst files can use a dynamic block_size that is different from the block_size of the user data centric .sst files, optimizing both. No more "one size fits all" non-optimization.

This branch's third improvement is thanks to the Riak 2.0 flexcache feature. flexcache automatically unloads file cache entries after 5 days if they have no activity. The unload of stale files creates an opportunity for more file cache space. The dynamic block_size algorithm watches for this opportunity and will decrease the block_size to again emphasize random read performance over .sst metadata size.

Stage 3: Total file metadata exceeds file cache size

Riak has modified leveldb to allow total size of the .sst files to approach 500Mbytes in higher levels. The larger file sizes assist with write throughput (which is not discussed here). The larger file sizes along with the 4096 default block_size can result in .sst metadata that approaches 4Mbytes. The large file sizes maintain the great read performance … as long as the file and its metadata remain in the file_cache.

Block size algorithm

Clone this wiki locally