
Scheduler Settings #1826

Closed
martinsumner opened this issue May 3, 2022 · 17 comments

@martinsumner
Contributor

martinsumner commented May 3, 2022

As an offshoot of #1820 ... a thread related to testing different scheduler settings

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

Firstly, of note, there are some changes from the standard erl flag defaults in Riak:

Load compaction

  • this is disabled by default in Riak 3.0.9

Utilisation balancing

  • this is enabled by default in Riak 3.0.9

Force wake-up

  • this is set to 500ms in Riak 3.0.9

Async threads

  • this is set to 64 in Riak 3.0.9

The first three items are changed from the defaults, I believe, due to issues with scheduler collapse in Riak on earlier OTP versions. When using the leveled backend, for Riak 3.0.10, it makes sense to revert these non-default settings, as almost all I/O now goes through the inbuilt dirty I/O NIFs. It is expected that scheduler collapse should no longer be an issue.

Likewise the async threads are now redundant when using leveled.
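For reference, a minimal sketch of what these settings look like as emulator flags in a vm.args file, alongside the approximate erl defaults (flag names are taken from the erl documentation; the exact values shipped by any given Riak package should be checked against its vm.args):

## Riak 3.0.9 settings as described above
+scl false     ## disable scheduler compaction of load
+sub true      ## enable scheduler utilisation balancing
+sfwi 500      ## forced scheduler wake-up interval of 500ms
+A 64          ## 64 async threads

## Approximate erl defaults that reverting would imply
+scl true
+sub false
+sfwi 0
+A 1           ## the default async pool is small (1 in recent OTP releases)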

Caution is required if the leveldb or bitcask backends are used, as they do not necessarily use "dirty" schedulers. Likewise if standard (non-tictac) kv_index_hashtree AAE is used (which will use a leveldb backend).

When testing with Erlang defaults rather than Basho defaults, on a leveled/tictacaae system, our standard 24-hour test saw a 1.2% throughput improvement.

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

By default Riak will start with one "standard" scheduler per vCPU, one "dirty" CPU scheduler per vCPU, 64 async threads and ten "dirty IO" schedulers.

This means there are many more schedulers than there are vCPUs.
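These counts can be confirmed on a running node (e.g. via riak remote_console); a minimal sketch using standard erlang:system_info/1 keys:

%% Scheduler and thread counts on a running node
erlang:system_info(schedulers_online).           %% "standard" schedulers
erlang:system_info(dirty_cpu_schedulers_online). %% dirty CPU schedulers
erlang:system_info(dirty_io_schedulers).         %% dirty IO schedulers (10 by default)
erlang:system_info(thread_pool_size).            %% async threads
erlang:system_info(logical_processors_online).   %% vCPUs visible to the VM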

There are two different changes to this which have been tried initially (a sketch of the corresponding flags follows the list):

1- Use a lower percentage of online schedulers (100%/75% for standard schedulers, 50%/25% for dirty CPU schedulers, 4 async threads). This reduces the ratio of online schedulers to vCPUs, bringing it closer to 1:1.
2- Use the default bind type to keep the default number of schedulers, but bind them directly to the vCPUs.
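As a sketch, these two variants can be expressed with emulator flags along the following lines (flag names taken from the erl documentation; the precise values used in the tests are those described above):

## Variant 1 - reduced online schedulers (expressed as percentages of vCPUs)
+SP 100:75     ## standard schedulers: 100% total, 75% online
+SDPcpu 50:25  ## dirty CPU schedulers: 50% total, 25% online
+A 4           ## 4 async threads

## Variant 2 - keep default counts, but bind schedulers to logical processors
+sbt db        ## default_bind binding type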

The comparative throughput between tests with these different configurations can be seen here:

ThroughputComparison_24hr_LinearScale

On a log scale to show the delta at the end of the test (when the test is a more realistic representation of performance in a loaded cluster):

ThroughputComparison_24hr_LogScale

The test is a 24-hour test which is split into two phases. The first phase has a relatively higher proportion of PUTs, the second a relatively higher proportion of GETs. The number of deletes and 2i queries remains constant. The updates are mixed between 2KB inserts and 8KB+ inserts/updates.

The test is conducted on an 8-node cluster, with each node having 96GB RAM, a series of spinning disks in RAID 10, 2 hex core CPUs (with each core showing as 2 vCPUs), and flash-backed write-cache. The backend is leveled, flushing to disk is left under the control of the operating system.

The basho_bench config file is:

{mode, max}.
{duration, 1440}.
{report_interval, 10}.
{node_name, testnode1}.
{concurrent, 100}.
{driver, basho_bench_driver_nhs}.

{key_generator, {eightytwenty_int, 100000000}}.
{value_generator, {semi_compressible, 8000, 2000, 10, 0.1}}.

{alwaysget, {1000000, 300000, key_order}}.
{unique, {2000, skew_order}}.

{operations, [{alwaysget_pb, 421}, {alwaysget_updatewith2i, 130}, 
                {put_unique, 290}, {get_unique, 130}, {delete_unique, 25},
                {postcodequery_http, 3}, {dobquery_http, 1}]}.

The x-axis of the chart shows the accumulated updates at this point of the test. The y-axis shows the transactions per second (GET, PUT, 2i query, DELETE combined). Each point represents a throughput reading measured over a 10s period; measurements are taken every 10s throughout the test.

At the 250M update point, the relative throughput improvements when compared to the same Riak test with Basho default settings are:

  • reduced scheduler count + 22.5%
  • bound schedulers + 15.8%

This is a very large delta, much bigger than any single throughput improvement that has been delivered recently on Riak.

@martinsumner
Contributor Author

However, is this throughput improvement likely to be true across other hardware configurations? Is it likely to exist with different test loads? It would be useful to dig deeper into why throughput improves under these configurations, to understand if this is a general improvement, or one specific to this configuration.

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

The test is designed to be primarily CPU limited at first, and then when the load switches to be more GET biased it is expected to be CPU heavy but primarily constrained by Disk I/O.

Looking at total CPU used across the nodes (sys, wait and user), comparing the three test runs:

CPUTime_24hr

So the indication is that with a reduced scheduler count, more work is being done, but with fewer or similar CPU cycles. With binding of schedulers, more work is being done, but with more use of CPU (although throughput per CPU cycle is still better than with the default setting towards the end of the test).

One obvious difference in how the CPU is being used is the number of context switches:

ContextSwitches_24hr

Either binding schedulers to CPU cores, or reducing the number of schedulers, makes a dramatic difference to the number of context switches required.
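For anyone wanting to reproduce this observation, context-switch rates can be sampled on Linux with standard tools - a sketch, assuming the sysstat package is installed and a single beam.smp process per node:

# System-wide context switches per second (the "cs" column)
vmstat 10

# Voluntary/involuntary context switches per second for the Riak VM
pidstat -w -p $(pgrep -o beam.smp) 10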

But what is the cost of a context switch? This seems a little unclear: although there is some data available, it is difficult to know if this data is relevant to modern OS/CPU architectures - in particular the efficiency savings related to (non-)flushing of TLBs, and how those savings are impacted by Meltdown mitigations.

It might be reasonable to assume there are two basic costs:

  • A small cost for each switch in terms of CPU time, on the order of 1 microsecond.
  • Some cost in terms of L1/L2 CPU cache efficiency, particularly where schedulers are being switched across cores or processor boundaries.

Perhaps there are other costs related to TLB shootdowns, particularly if a scheduler is moved between CPU cores to get access to CPU time rather than being switched in and out on the same core. The "cost of a context switch" is almost certainly a non-trivial question to answer, and will depend on a lot of factors.

@martinsumner
Contributor Author

As well as increased throughput, one of the biggest headline gains in performance across the different tests is the improvement in GET latency:

MeanGETTime_24hr

There are multiple different parts of the GET process where speed has improved.

The slot_fetch_time is the time within an SST file in the penciller to read the HEAD of an object out of a block. The block must be read from disk (which will almost certainly be a read from the VFS page cache), and go through a binary_to_term conversion, which will include a zlib decompression, followed by a lists:nth/2 call on the small (normally 32-item) list that has been converted. The average time for slot_fetch through the test (here in microseconds) is dramatically different between the configurations:

SSTSlotFetchTimeSST12US_24hr
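As a rough illustration of the per-fetch work described above (this is not leveled's actual code - the function and variable names are illustrative only, and the real block format differs in detail):

%% Illustrative sketch of the slot_fetch work: pread a block (usually served
%% from the page cache), binary_to_term it (including zlib decompression of the
%% compressed term), then pick the Nth entry from the resulting list.
fetch_from_block(FileHandle, BlockPos, BlockLen, N) ->
    {ok, BlockBin} = file:pread(FileHandle, BlockPos, BlockLen),
    BlockList = binary_to_term(BlockBin),
    lists:nth(N, BlockList).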

The slot_fetch is the key part of the HEAD stage of the GET process. The actual GET (which will occur on 1 not 3 vnodes) has 2 parts in the cdb file. The first part is an index lookup, which requires a calculation of the D. J. Bernstein hash of the key/SQN, and then a positioned file read of the integers at that position in the index (which will almost certainly be in the page cache). The average time for the index fetch (here in microseconds) varies significantly between the configurations.

CDBIndexFetchTimeCDB19US_24hr
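For context, the classic Bernstein hash used by the CDB format looks roughly like the following (a sketch of the general algorithm only - leveled_cdb's exact implementation may differ in detail):

%% Classic CDB-style Bernstein hash: h = 5381, then h = ((h bsl 5) + h) bxor byte,
%% truncated to 32 bits.
cdb_hash(KeyBin) when is_binary(KeyBin) ->
    lists:foldl(
        fun(Byte, Hash) ->
            (((Hash bsl 5) + Hash) bxor Byte) band 16#FFFFFFFF
        end,
        5381,
        binary_to_list(KeyBin)).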

The final part is the actual reading of the object from the disk (which may or may not be in the page cache). This again varies in line with the headline latency changes (measured here in milliseconds):

CDBObjFetchTimeCDB19MS_24hr

@martinsumner
Contributor Author

All of these processes involve both some CPU-related activity and some interaction with the virtual file system (and in some cases possibly the underlying disk).

If we compare these timings to the timings of larger CPU-only tasks, a difference emerges.

Firstly, looking at the time needed to create the slot_hashlist when writing a new SST file. This is a CPU-only activity, but a background task, i.e. not directly related to any end-user latency. The timings of these in milliseconds:

SSTSlotHashListSST13MS_24hr

Intriguingly, once into the latter part of the test, there is no improvement in the time for this task between reduced and default scheduler counts. However, there is a clear improvement when the schedulers are bound to CPUs.

Secondly, looking at the hashtree computation (milliseconds) in the leveled Inker (an infrequent background process) we can see a small gain, but only through scheduler binding:

CDBHashtreeComputeCDB07MS_24hr

@martinsumner
Contributor Author

On the write side, when an object is PUT there is a process to update the Bookie's memory, which is a CPU-only process. Timings here in microseconds:

BookieMemWriteTimeB0015US_24hr

This shows the greatest latency improvement with scheduler binding.

The second, and slower, phase is the writing of the object within the Inker, which includes some CPU work but also an interaction with the page cache. Timings here again in microseconds:

InkerWriteTimeB0015US_24hr

Now with the I/O interaction the big improvement is related to the reduction in scheduler counts.

@martinsumner
Contributor Author

Overall there seems to be a pattern. Pure CPU work is generally made faster by binding schedulers to CPUs. Work that interacts with the VFS is made faster/more efficient by reducing the count of schedulers.

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

But why would reducing the scheduler count have this impact?

If we look at the actual underlying volume of data being written to and read from the disk, it would be reasonable to expect that this will vary between the test runs in line with the throughput. Looking at write KB per second (to disk):

WritesKBpersec_24hr

The alignment between write volume and throughput appears to be roughly present as expected.

However, looking at read KBs per second (from disk):

ReadsKBpersec_24hr

Now the alignment doesn't appear to exist, especially with reduced scheduler counts. With reduced scheduler counts, there must be more reading from the VFS (as there is more throughput), but this is achieved with less reading from the actual disk. This is also true (but to a lesser extent) with scheduler binding.

This would imply that the VFS page cache is being used much more in these cases. But how are scheduler changes making the VFS page cache more efficient?

Looking at the reported memory deployed by the VFS page cache - there is not an obvious difference here:

CacheMemory_24hr

So the apparent improved VFS page cache efficiency with reduced schedulers does not have an obvious or easy explanation.

@martinsumner
Contributor Author

In summary, there is a performance test where we improve throughput by 22.5% by reducing the scheduler count, and get an improvement of 15.8% by binding schedulers to CPUs.

The reason why seems relatively obvious and expected for scheduler binding: reduced context switching improves the efficiency of CPU-centric tasks.

The reason why seems strange and mysterious for the reduced scheduler count. Yes, there are some CPU efficiency improvements potentially related to reduced context switching, but the biggest improvement appears to be in the efficiency of the VFS page cache on both reads and writes.

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

Scheduler binding is a known way to improve performance in Erlang systems, and is advertised as such within the Erlang documentation.

In RabbitMQ, it is now the default to use scheduler binding.

However, the side effects associated with other activity on the same node can be severe. In RabbitMQ this is mitigated by stipulating that nodes used for RabbitMQ are dedicated to that purpose.

The same mitigation could be stipulated for Riak. However, even if no application workloads co-exist on the same node, all Riak nodes will have operational software (e.g. monitoring, security, backups etc). That operational software may have a limited throughput when working correctly - but may also have error conditions where it works in unexpected ways.

Overall, the potential side effects of scheduler binding seem out-of-step with the primary aim of Riak to be reliable in exceptional circumstances (as a priority over reduced latency in the median case).

@martinsumner
Contributor Author

Reducing scheduler counts appears to be a safer way to improve performance, but without a full understanding of why it is improving performance, it doesn't seem correct to make this a default setting. This is especially true as what might be an improvement with a fully beam-based backend like leveled may not be true with NIF-based backends like eleveldb/bitcask (where dirty scheduler improvements have not been made).

@martinsumner
Contributor Author

One interesting aside is - if the throughput improvements for scheduler binding and scheduler reductions are not tightly correlated, what would happen if the two changes were combined? More throughput improvements?

See below for the log-scale throughput chart, this time with a green line showing the combination of the two changes:

ThroughputComparison_24hr_LogScale_withCombo

In the initial part of the test - there is an improvement over and above what can be achieved through either setting. However, towards the end of the test, the combined setting has lower throughput than either individual setting. At the 250M update mark, the throughput improvement is just 4.7% more than the default.

This is hard to explain, but it is noticeable that the I/O and page-cache-related improvements appear to reverse when reduced schedulers are combined with bound schedulers.

The throughput towards the end of the test is much more important than at the beginning - most Riak clusters spend most of their time processing requests with a large set of data already present. So combining the two changes does not seem to be as good as making either one of the changes.

@martinsumner
Contributor Author

One thing of potential interest to emerge in this thread is the sbwt setting.

The reasons described by @seancribbs for changing this relate specifically to containerised environments, but perhaps it is worth adding this to the riak.conf tuneable parameters, and experimenting further to see if there is any influence on context switching through the reduction of busy waiting.
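For reference, a sketch of the relevant flag and its dirty-scheduler equivalents (added in more recent OTP releases); as this is not currently a riak.conf tuneable, it would need to go into vm.args or equivalent:

+sbwt none       ## disable busy-wait on the standard schedulers
+sbwtdcpu none   ## dirty CPU schedulers
+sbwtdio none    ## dirty IO schedulers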

It was noticeable during testing of the combined binding/reduced setup that the "Other" count was the dominant form of busyness on the standard schedulers:

       Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
       aux( 1)    1.43%    0.50%    0.00%    0.00%    0.17%    0.00%   97.90%
dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 4)    0.00%    0.00%    0.00%    1.74%    0.09%    0.00%   98.17%
dirty_cpu_( 5)    0.00%    0.00%    0.02%    6.09%    0.45%    0.00%   93.44%
dirty_cpu_( 6)    0.00%    0.00%    0.01%    2.70%    0.20%    0.00%   97.10%
dirty_cpu_( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(11)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 1)    0.00%    0.00%   75.61%    0.00%    5.27%    0.00%   19.12%
dirty_io_s( 2)    0.00%    0.00%   74.64%    0.00%    5.65%    0.00%   19.71%
dirty_io_s( 3)    0.00%    0.00%   74.86%    0.00%    4.95%    0.00%   20.19%
dirty_io_s( 4)    0.00%    0.00%   76.37%    0.00%    5.07%    0.00%   18.56%
dirty_io_s( 5)    0.00%    0.00%   71.89%    0.00%    5.48%    0.00%   22.63%
dirty_io_s( 6)    0.00%    0.00%   73.23%    0.00%    5.41%    0.00%   21.36%
dirty_io_s( 7)    0.00%    0.00%   72.72%    0.00%    5.23%    0.00%   22.05%
dirty_io_s( 8)    0.00%    0.00%   76.20%    0.00%    5.07%    0.00%   18.73%
dirty_io_s( 9)    0.00%    0.00%   72.99%    0.00%    5.40%    0.00%   21.60%
dirty_io_s(10)    0.00%    0.00%   75.00%    0.00%    5.50%    0.00%   19.50%
      poll( 0)    0.00%    0.61%    0.00%    0.00%    0.00%    0.00%   99.39%
 scheduler( 1)    1.78%    0.31%   24.80%    2.31%   29.83%    1.39%   39.58%
 scheduler( 2)    1.75%    0.32%   23.67%    2.45%   30.19%    1.43%   40.19%
 scheduler( 3)    1.79%    0.34%   25.18%    2.34%   30.60%    1.48%   38.27%
 scheduler( 4)    1.72%    0.33%   23.94%    2.30%   29.71%    1.49%   40.51%
 scheduler( 5)    1.74%    0.31%   24.15%    2.37%   29.96%    1.44%   40.03%
 scheduler( 6)    1.78%    0.30%   23.26%    2.19%   29.81%    1.38%   41.28%
 scheduler( 7)    1.91%    0.35%   26.51%    2.58%   29.54%    1.71%   37.40%
 scheduler( 8)    1.95%    0.34%   27.67%    2.75%   29.31%    1.69%   36.29%
 scheduler( 9)    1.86%    0.35%   26.29%    2.55%   29.39%    1.71%   37.85%
 scheduler(10)    1.94%    0.36%   27.61%    2.76%   29.44%    1.76%   36.12%
 scheduler(11)    1.92%    0.35%   27.22%    2.71%   29.06%    1.74%   37.00%
 scheduler(12)    1.90%    0.37%   27.15%    2.65%   29.53%    1.84%   36.56%
 scheduler(13)    1.74%    0.32%   23.98%    2.30%   29.97%    1.45%   40.25%
 scheduler(14)    1.76%    0.32%   23.72%    2.31%   29.97%    1.46%   40.46%
 scheduler(15)    1.77%    0.34%   24.33%    2.30%   30.32%    1.50%   39.46%
 scheduler(16)    1.79%    0.33%   25.17%    2.14%   29.90%    1.49%   39.19%
 scheduler(17)    1.81%    0.32%   25.50%    2.27%   30.29%    1.54%   38.27%
 scheduler(18)    1.59%    0.29%   21.21%    1.92%   28.39%    1.26%   45.34%
 scheduler(19)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(20)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(21)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(22)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(23)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(24)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    1.43%    0.50%    0.00%    0.00%    0.17%    0.00%   97.90%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.88%    0.06%    0.00%   99.06%
dirty_io_sched    0.00%    0.00%   74.35%    0.00%    5.30%    0.00%   20.35%
          poll    0.00%    0.61%    0.00%    0.00%    0.00%    0.00%   99.39%
     scheduler    1.35%    0.25%   18.81%    1.80%   22.30%    1.16%   54.33%
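
The table above is in the format produced by microstate accounting; it can be gathered on a running node using the msacc module from runtime_tools (a sketch, taking a 10-second sample):

%% Collect microstate accounting for 10 seconds, then print the per-thread table
msacc:start(10000).
msacc:print().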

More information on this potential change.

@martinsumner martinsumner self-assigned this May 3, 2022
@martinsumner
Contributor Author

A further note on the combination of reduced scheduler counts and binding. When this combined setting is in place, the CPU-bound activity gets faster than with just the normal number of schedulers bound. However, any improvement related to interaction with the page cache is entirely negated.

Hence we see in the curve that throughput is improved while the cluster is primarily CPU bound, but worse when the cluster is more disk bound.

@martinsumner
Contributor Author

martinsumner commented May 17, 2022

The tests have been re-run with some subtle changes to the load, and over a longer test (48 hours). Towards the end of this test, disk busyness is almost totally dominant in determining throughput.

There are three variants from defaults tested:

  • Reduced scheduler counts
  • Bound schedulers
  • Disabled busy-wait threshold

Looking at the results with the different VM/scheduler settings - improvements in throughput in the mid-part of the test are seen with all variants:

ThroughputComparison_48h

What we can see in all variants is an improvement in response times, especially GET:

GetTime_48h

What is noticeable is the difference in CPU utilisation between the settings. The test is run on a cluster of 12-core (24 vCPU) servers. Disabling the busy wait gives throughput improvements whilst using 3.4 fewer vCPUs on average compared to the VM defaults. This is a significant improvement in efficiency:

CPUutilisation_48h

@martinsumner
Contributor Author

Note RabbitMQ has gained improvement through scheduler binding - rabbitmq/rabbitmq-server#612.

However, what is the impact of this change if there are other processes running on the same machine? (At the NHS there was an incident related to scheduler binding on RabbitMQ - although in that case RabbitMQ was not being run on a dedicated machine as Rabbit guidance recommends.) Riak guidance is also to run Riak on dedicated machines, but even if the machine is "dedicated" there will still be other processes for operational reasons (e.g. monitoring), and it cannot be guaranteed that such processes will always behave as expected. So I'm wary of scheduler binding as a default approach.

My preference would be to disable busy-wait as the recommended change, primarily because of the CPU efficiency improvements we see. However, there is a specific warning in the docs that this flag may not be available in future releases.
