Mv filecache tuning2
- merged to "master" May 24, 2013
- developer tested code checked-in May 17, 2013
- development started May 14, 2013
The mv-level-work1 branch changed levels 1 and 2 from sorted files to overlapped files. This change allows writes to occur more quickly and almost halves the write amplification. The overlapped strategy takes all files from level 0 and creates one file at level 1 during a compaction. Similarly, all files from level 1 become one file at level 2. Level 2 files are "large". They compact across all the sorted files of level 3.
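The following is a minimal sketch of the overlapped strategy using a simplified in-memory model; the names and types are illustrative stand-ins, not leveldb's actual compaction code:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct File { std::string data; };   // stand-in for an .sst file
using Level = std::vector<File>;

// An overlapped-level compaction consumes every file in the source level
// and emits exactly one output file in the next level.
void CompactOverlapped(Level& src, Level& dst) {
  File out;
  for (const File& f : src) out.data += f.data;  // placeholder for a k-way merge
  dst.push_back(out);
  src.clear();                                   // all inputs consumed
}

int main() {
  Level level0 = {{"a"}, {"b"}, {"c"}};
  Level level1;
  CompactOverlapped(level0, level1);             // level 0 -> one level 1 file
  std::printf("level 1 now holds %zu file\n", level1.size());  // prints 1
}
```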
The above process works fine when fewer files exist in levels 0 through 3 than the max_open_files limit. When max_open_files is small, many of the lower-level files are likely to be closed during the level 2 to 3 compaction. Reopening the larger, overlapped level files is CPU and disk intensive ... and these are the files most likely to be read.
The mv-level-work1 branch continued a previous concept of creating larger and larger files as the levels increased. This concept was based upon write performance observations. Recent analysis shows that larger files really hurt random reads and iterator operations.
One quick concept was to leave the overlapped files of levels 0, 1, and 2 large, then make the sorted files of levels 3 and above smaller. Reducing the size of the sorted files in level 3 and beyond, however, increased the chance that the lower levels' larger files would be closed and later require the expensive reopen.
This branch contains three distinct changes.
There exists a previous Riak change to the cache logic that exempts any object with active references from cache ejection. This branch adds hard-coded logic that takes an extra reference on every level 0, 1, and 2 file. These extra references prevent the files from ever being ejected by compactions with large file counts and/or iterations that access many files.
The extra reference is only removed when the file is deleted from the disk. That happens only after the file's content has been compacted to a higher level and all open iterators using the file have closed.
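A minimal sketch of the pinning trick, assuming a reference-counted cache entry; the names (FileCacheEntry, OnFileOpen, CanEject) are hypothetical, not Basho's actual file cache API:

```cpp
#include <atomic>

struct FileCacheEntry {
  std::atomic<int> refs{0};  // entries with active references are exempt from ejection
  int level{0};
  // ... table handle, file descriptor, bloom filter, etc.
};

void OnFileOpen(FileCacheEntry& e) {
  e.refs.fetch_add(1);                    // normal cache reference
  if (e.level <= 2) e.refs.fetch_add(1);  // extra "pin" for levels 0, 1, and 2
}

// Called only when the file is deleted from disk, i.e. after its content
// has been compacted upward and all iterators over it have closed.
void OnFileDeleted(FileCacheEntry& e) {
  if (e.level <= 2) e.refs.fetch_sub(1);  // drop the pin; ejection now possible
}

bool CanEject(const FileCacheEntry& e) { return e.refs.load() == 0; }
```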
The benefit of this change is that the larger indexes and bloom filters are now unpacked and CRC checked only once in the life of the file. Speedy random access to the content of these larger files is therefore guaranteed.
This branch reduces the size threshold for levels 3, 4, 5, and 6 by a factor of 10. The size change greatly reduces the time required to access a random key in a level file that is not currently open. An open operation must read three blocks of metadata, CRC check all three blocks, and decompress all three blocks. Only then can the file be examined to see if the key actually exists within it. Smaller files mean less metadata processing overhead.
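A hypothetical illustration of the threshold change; the numbers and the function name are made up for the example and are not the branch's actual constants:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical per-level size thresholds in megabytes; the real values
// live in the leveldb source and are not reproduced here.
static const std::uint64_t kOldThresholdMB[7] = {
    60, 120, 1024, 10240, 102400, 1024000, 10240000};

std::uint64_t ThresholdForLevel(int level) {
  // Levels 0-2 (the overlapped levels) keep their previous thresholds;
  // levels 3-6 are reduced by a factor of 10 by this branch.
  return level <= 2 ? kOldThresholdMB[level] : kOldThresholdMB[level] / 10;
}

int main() {
  for (int level = 0; level <= 6; ++level)
    std::printf("level %d: %llu MB\n", level,
                (unsigned long long)ThresholdForLevel(level));
}
```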
One of the two best ways to control the Linux page cache is via the posix_fadvise call. A previous Riak change set all files to the "may need" state (POSIX_FADV_WILLNEED). That was the simplest solution at the time and increased performance. The change in this branch overrides the previous "may need" call with "don't need" (POSIX_FADV_DONTNEED) when a compaction input file is closed. "don't need" lets Linux flush the cached pages immediately instead of waiting for vm.dirty_expire_centisecs. The immediate flush also reduces the chance that Linux will charge the Riak process' time slice with the flushing overhead.
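A sketch of the close-path change, assuming a raw file descriptor; posix_fadvise and the POSIX_FADV_* constants are the real POSIX API, but the surrounding function is illustrative:

```cpp
#include <fcntl.h>    // posix_fadvise, POSIX_FADV_DONTNEED
#include <unistd.h>   // close

void CloseCompactionInput(int fd) {
  // A length of 0 applies the advice from the offset to the end of the
  // file. DONTNEED overrides the earlier WILLNEED hint and lets the
  // kernel drop the file's page-cache pages right away.
  posix_fadvise(fd, /*offset=*/0, /*len=*/0, POSIX_FADV_DONTNEED);
  close(fd);
}
```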