-
Notifications
You must be signed in to change notification settings - Fork 178
mv dynamic block size
- merged to master
- code complete
- development started November 12, 2013
Warning: there are some environments where this feature does not help performance. Please review the Notes section for discussion.
leveldb's random reads perform best when the data is already in the block cache. The second best performance is when the potential files of each level that could contain the data are already in the file cache. The third best performance is when the file open operation and data read are minimized. The branch works to automatically maximize the second case so that the file cache has the potential to contain the entire, active set of files even as the dataset grows beyond the block cache and file cache size. Once the dataset grows beyond any possible maximization of the second case, the files' metadata has already been incrementally rebuilt to maximize the third case.
This branch's motivation is from heavy, long write tests that create large datasets. These tests' throughput suddenly degrade by 50 to 60% when the file cache is full. The underlying cause if file thrashing: opening and closing hundreds to thousands of files per second. The problem can be postponed by manually changing the block_size to a larger value, and rerunning the tests. This pushes severe degradation point to a larger dataset size, but also slows down random access. The better solution is to manipulate the block_size parameter dynamically to keep is small during early operations, then gradually raise it as needed to both maximum number of files in the file cache and maintain the highest random access rates.
Stated another way, the tuning goal is maximize the number of open files in the file cache so that a random read only has to request data from disk, not open a file and read the file metadata and finally read the data from disk. Random read performance drops rapidly and significantly (at least 50%) once file opens start, a.k.a. file cache thrashing.
leveldb tracks the storage location of a key via a three layer hierarchy:
-
MANIFEST layer: stores first and last key of every leveldb .sst table file. The manifest is kept completely in memory.
-
.sst metadata layer: stores last key of each data block in the file along with an offset in file to that data block. The metadata size increases as the block size decreases. This layer is stored in the file cache. It is loaded and unloaded based upon file usage and available file cache space.
-
block metadata layer: stores an index of "restart points" within block along with offset to that location within the block. This layer is stored in the block cache. It is loaded and unload based upon block access and available block cache space.
To find the value of a key: leveldb first determines the appropriate .sst table via the MANIFEST. It then locates the appropriate data block within the .sst file via the file's metadata index. leveldb next finds a starting point within the data block that is "close" to the actual key/value pair via the block's metadata index. And finally performs a linear walk from the restart point in the block to the actual key/value data.
There is a simple size tradeoff between the file cache's metadata and the block cache's blocks. Smaller blocks cause larger metadata and are quicker to read from disk IF the file is already present in the file cache. Larger blocks are slower to read from the disk, but they result in smaller file cache entries. Smaller file cache entries allow more files to be simultaneously present in the file cache.
The size of a data block is controlled by the block_size parameter of the Options structure (sst_block_size in Riak's app.config). The size represents a threshold, not a fixed count. Whenever a newly created block reaches this uncompressed size, leveldb considers it full and writes the block with its metadata to disk. The number of keys contained in the block depends upon the size of the values and keys.
This branch dynamically raises the block size as space in the file cache becomes constrained.
When leveldb starts a new database (Riak vnode), there is plenty of space in the file cache for .sst metadata. The default block_size emphasizes keeping a greater number of keys in the .sst metadata layer. This emphasis allows the data blocks to be small and read more quickly from disk. The quicker disk read results in a faster response to the user.
Riak has modified leveldb to allow total size of the .sst files to approach 500Mbytes in higher levels. The larger file sizes assist with write throughput (which is not discussed here). The larger file sizes along with the 4096 default block_size can result in .sst metadata that approaches 4Mbytes. The large file sizes maintain the great read performance … as long as the file and its metadata remain in the file_cache.
There comes a point where the number of files and the size of their metadata exceeds the size of the file cache. Riak 2.0's flex-cache extends when that point is reached, but it still happens eventually. The random read of an individual key/value pair becomes quite expensive when leveldb must first load the file and its metadata. The 4096 byte read of a data block now additionally requires a file open (operating system costly) and the read of all the metadata (maybe 4Mbytes): 4,096 bytes becomes 4,004,096. These larger random read operations are a performance killer.
A first step improvement is to have leveldb adjust the block_size parameter dynamically once the file_cache approaches its full state. leveldb could adjust the block_size higher, reducing the metadata size of newly created .sst files. This dynamic block_size adjustment will make individual read actions of an already open file a little slower due to more data per block. That read size cost is offset by two gains: more files can fit into the file cache reducing the number of open operations, and metadata reads of a newly opened file will be less costly. Over time, leveldb will recompact all active .sst files using the dynamic block_size.
There is a secondary performance opportunity with the dynamic block_size. Not all .sst files contain the same size key/value pairs. The size of Riak's 2i key/value data is quite small compared to most user key/value data. And the 2i key/value data clumps together as leveldb compacts it into higher .sst files. The 2i centric .sst files can use a dynamic block_size that is different from the block_size of the user data centric .sst files, optimizing both. No more "one size fits all" non-optimization.