riak tuning 2
leveldb added thread names when executing on Linux operating systems. This allows direct viewing of per-thread activity via the command line tool "top", e.g. "top -H -p $(pgrep beam)". The surprise was that leveldb and its Erlang interface, eleveldb, used very little CPU time, while the Erlang scheduler threads consumed a constant 75 to 95% of each CPU's time (Erlang release R16B02). A quick review with Linux's "perf top" tool showed that 10 to 30% of the total server time was spent in the Erlang routine "scheduler_wait".
"scheduler_wait" is Erlang's busy wait routine. The scheduler threads do nothing but spin, waiting for a new message (event) to arrive. The CPU time spent spinning could often be put to better use by leveldb, network / disk drivers, and other external programs such as Solr/yokozuna.
The obvious solution was to use Erlang's "+sbwt" parameter to lower the busy wait spin time. In testing, however, both "+sbwt none" and "+sbwt very_short" tended to reduce overall throughput. Some workloads and disk arrays benefited from "+sbwt", but not consistently enough to recommend it generically.
The "+S" parameter, by contrast, was consistently equal to or better than the Erlang defaults. The recommendation is to reduce "+S x:x" by one for every six logical CPUs in the server, e.g. with 24 logical CPUs use "+S 20:20", with 12 logical CPUs use "+S 10:10", etc. Throughput increased by as much as 25%. Latencies in the 95th percentile and worst case measurements often decreased 10 to 50%.
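The "one fewer scheduler for every six logical CPUs" rule can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, and rounding down for CPU counts that are not a multiple of six is an assumption, since the text gives only the 12 and 24 CPU cases.

```python
def recommended_schedulers(logical_cpus: int) -> int:
    """Return a scheduler count reduced by one per six logical CPUs.

    Assumption: partial groups of six are ignored (floor division),
    which the source text does not specify.
    """
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    return logical_cpus - logical_cpus // 6


def plus_s_flag(logical_cpus: int) -> str:
    """Format the Erlang "+S total:online" argument for a given CPU count."""
    n = recommended_schedulers(logical_cpus)
    return f"+S {n}:{n}"


if __name__ == "__main__":
    # The two cases given in the text: 12 and 24 logical CPUs.
    for cpus in (12, 24):
        print(cpus, "logical CPUs ->", plus_s_flag(cpus))
```

For the two cases named in the text this yields "+S 10:10" and "+S 20:20". Any other CPU count is an extrapolation and should be checked against measurements on the actual server.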
The conclusion is that Erlang R16B02 depends heavily upon busy wait to control its latencies, but the entire CPU infrastructure does not need to be dedicated to Erlang. Implicitly reserving CPU resources via "+S" for threads and processes that are independent of Erlang's schedulers can increase total system throughput.
erlang.schedulers.total = 14
erlang.schedulers.online = 14
[
{vm_args, [{"+S", "20:20"}]}
].
Below is a link to the summary of several Riak test scenarios. It contains percentages instead of raw numbers (counts or times). 100% on a line represents the highest performing setting. The lower percentages to the left and right show how much worse the alternate settings performed. All tests execute with Riak's anti_entropy set to active (AAE). The 8 vnode tests executed on r2s06 in the Boston colo. The 64 vnode tests executed on r2s09. All tests were single server. The 2i tests use n_val of 3. All other tests are n_val of 1.
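As a rough illustration of how one line of such a summary could be derived, the sketch below scales each setting's result to a percentage of the best result on that line. The function name and the sample throughput figures are hypothetical and are not taken from the tests. It also assumes a higher-is-better metric such as throughput; the text does not say how latency lines were scaled.

```python
def percent_of_best(values):
    """Scale each value to a percentage of the best (largest) value.

    Assumes a higher-is-better metric such as operations per second.
    """
    if not values:
        raise ValueError("values must not be empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    best = max(values)
    return [round(100.0 * v / best, 1) for v in values]


if __name__ == "__main__":
    # Made-up throughput figures for three alternate settings on one line.
    sample = [800, 1000, 900]
    print(percent_of_best(sample))
```

The best setting on a line always reads 100%, and the other entries show how far each alternate setting fell short of it.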