
Scheduler Settings #1826

Closed
martinsumner opened this issue May 3, 2022 · 17 comments

@martinsumner
Contributor

martinsumner commented May 3, 2022

As an offshoot of #1820 ... a thread related to testing different scheduler settings

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

Firstly, of note, there are some changes from the standard erl flag defaults in Riak:

Load compaction

  • this is disabled by default in Riak 3.0.9

Utilisation balancing

  • this is enabled by default in Riak 3.0.9

Force wake-up

  • this is set to 500ms in Riak 3.0.9

Async threads

  • this is set to 64 in Riak 3.0.9

The first three items are changed from the defaults, I believe, due to issues with scheduler collapse in Riak on earlier OTP versions. When using the leveled backend, for Riak 3.0.10, it makes sense to revert these non-default settings, as almost all I/O now goes through the inbuilt dirty I/O NIFs. It is expected that scheduler collapse should no longer be an issue.

Likewise the async threads are now redundant when using leveled.
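For reference, a minimal sketch of what these settings look like as emulator flags in a vm.args file, alongside the approximate erl defaults (flag names are taken from the erl documentation; the exact values shipped by any given Riak package should be checked against its vm.args):

## Riak 3.0.9 settings as described above
+scl false     ## disable scheduler compaction of load
+sub true      ## enable scheduler utilisation balancing
+sfwi 500      ## forced scheduler wake-up interval of 500ms
+A 64          ## 64 async threads

## Approximate erl defaults that reverting would imply
+scl true
+sub false
+sfwi 0
+A 1           ## the default async pool is small (1 in recent OTP releases)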

Caution is required if the leveldb or bitcask backends are used, as they do not necessarily use "dirty" schedulers. Likewise if standard (non-tictac) kv_index_hashtree AAE is used (which will use a leveldb backend).

When testing with Erlang defaults rather than Basho defaults, on a leveled/tictacaae system, our standard 24-hour test saw a 1.2% throughput improvement.

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

By default Riak will start with one "standard" scheduler per vCPU, one "dirty" CPU scheduler per vCPU, 64 async threads and ten "dirty IO" schedulers.

This means there are many more schedulers than there are vCPUs.
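These counts can be confirmed on a running node (e.g. via riak remote_console); a minimal sketch using standard erlang:system_info/1 keys:

%% Scheduler and thread counts on a running node
erlang:system_info(schedulers_online).           %% "standard" schedulers
erlang:system_info(dirty_cpu_schedulers_online). %% dirty CPU schedulers
erlang:system_info(dirty_io_schedulers).         %% dirty IO schedulers (10 by default)
erlang:system_info(thread_pool_size).            %% async threads
erlang:system_info(logical_processors_online).   %% vCPUs visible to the VM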

There are two different changes to this which have been tried initially (a sketch of the corresponding flags follows the list):

1- Use a lower percentage of online schedulers (100%/75% for standard schedulers, 50%/25% for dirty CPU schedulers, 4 async threads). This reduces the ratio of online schedulers to vCPUs, bringing it closer to 1:1.
2- Use the default bind type to keep the default number of schedulers, but bind them directly to the vCPUs.
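As a sketch, these two variants can be expressed with emulator flags along the following lines (flag names taken from the erl documentation; the precise values used in the tests are those described above):

## Variant 1 - reduced online schedulers (expressed as percentages of vCPUs)
+SP 100:75     ## standard schedulers: 100% total, 75% online
+SDPcpu 50:25  ## dirty CPU schedulers: 50% total, 25% online
+A 4           ## 4 async threads

## Variant 2 - keep default counts, but bind schedulers to logical processors
+sbt db        ## default_bind binding type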

The comparative throughput between tests with these different configurations can be seen here:

ThroughputComparison_24hr_LinearScale

On a log scale to show the delta at the end of the test (when the test is a more realistic representation of performance in a loaded cluster):

ThroughputComparison_24hr_LogScale

The test is a 24-hour test which is split into two phases. The first phase has a relatively higher proportion of PUTs, the second a relatively higher proportion of GETs. The number of deletes and 2i queries remains constant. The updates are mixed between 2KB inserts and 8KB+ inserts/updates.

The test is conducted on an 8-node cluster, with each node having 96GB RAM, a series of spinning disks in RAID 10, 2 hex core CPUs (with each core showing as 2 vCPUs), and flash-backed write-cache. The backend is leveled, flushing to disk is left under the control of the operating system.

The basho_bench config file is:

{mode, max}.
{duration, 1440}.
{report_interval, 10}.
{node_name, testnode1}.
{concurrent, 100}.
{driver, basho_bench_driver_nhs}.

{key_generator, {eightytwenty_int, 100000000}}.
{value_generator, {semi_compressible, 8000, 2000, 10, 0.1}}.

{alwaysget, {1000000, 300000, key_order}}.
{unique, {2000, skew_order}}.

{operations, [{alwaysget_pb, 421}, {alwaysget_updatewith2i, 130}, 
                {put_unique, 290}, {get_unique, 130}, {delete_unique, 25},
                {postcodequery_http, 3}, {dobquery_http, 1}]}.

The x-axis of the chart shows the accumulated updates at this point of the test. The y-axis shows the transactions per second (GET, PUT, 2i query, DELETE combined). Each point represents a throughput reading measured over a 10s period; measurements are taken every 10s throughout the test.

At the 250M update point, the relative throughput improvements when compared to the same Riak test with Basho default settings are:

  • reduced scheduler count + 22.5%
  • bound schedulers + 15.8%

This is a very large delta, much bigger than any single throughput improvement that has been delivered recently on Riak.

@martinsumner
Contributor Author

However, is this throughput improvement likely to be true across other hardware configurations? Is it likely to exist with different test loads? It would be useful to dig deeper into why throughput improves under these configurations, to understand if this is a general improvement, or one specific to this configuration.

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

The test is designed to be primarily CPU limited at first, and then when the load switches to be more GET biased it is expected to be CPU heavy but primarily constrained by Disk I/O.

Looking at total CPU used across the nodes (sys, wait and user), comparing the three test runs:

CPUTime_24hr

So the indication is that with a reduced scheduler count, more work is being done, but with fewer or similar CPU cycles. With binding of schedulers, more work is being done, but with more use of CPU (although throughput per CPU cycle is still better than with the default setting towards the end of the test).

One obvious difference in how the CPU is being used is the number of context switches:

ContextSwitches_24hr

Either binding schedulers to CPU cores, or reducing the number of schedulers, makes a dramatic difference to the number of context switches required.
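For anyone wanting to reproduce this observation, context-switch rates can be sampled on Linux with standard tools - a sketch, assuming the sysstat package is installed and a single beam.smp process per node:

# System-wide context switches per second (the "cs" column)
vmstat 10

# Voluntary/involuntary context switches per second for the Riak VM
pidstat -w -p $(pgrep -o beam.smp) 10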

But what is the cost of a context switch? This seems a little unclear: although there is some data available, it is difficult to know if this data is relevant to modern OS/CPU architectures - in particular the efficiency savings related to (non-)flushing of TLBs, and how those savings are impacted by Meltdown mitigations.

It might be reasonable to assume there are two basic costs:

  • A small cost for each switch in terms of CPU time, on the order of 1 microsecond.
  • Some cost in terms of L1/L2 CPU cache efficiency, particularly where schedulers are being switched across cores or processor boundaries.

Perhaps there are other costs related to TLB shootdowns, particularly if a scheduler is moved between CPU cores to get access to CPU time rather than being switched in and out on the same core. The "cost of a context switch" is almost certainly a non-trivial question to answer, and will depend on a lot of factors.

@martinsumner
Contributor Author

As well as increased throughput, one of the biggest headline gains in performance across the different tests is the improvement in GET latency:

MeanGETTime_24hr

There are multiple different parts of the GET process where speed has improved.

The slot_fetch_time is the time within an SST file in the penciller to read the HEAD of an object out of a block. The block must be read from disk (which will almost certainly be a read from the VFS page cache), and go through a binary_to_term conversion, which will include a zlib decompression, followed by a lists:nth/2 call on the small (normally 32-item) list that has been converted. The average time for slot_fetch through the test (here in microseconds) is dramatically different between the configurations:

SSTSlotFetchTimeSST12US_24hr
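As a rough illustration of the per-fetch work described above (this is not leveled's actual code - the function and variable names are illustrative only, and the real block format differs in detail):

%% Illustrative sketch of the slot_fetch work: pread a block (usually served
%% from the page cache), binary_to_term it (including zlib decompression of the
%% compressed term), then pick the Nth entry from the resulting list.
fetch_from_block(FileHandle, BlockPos, BlockLen, N) ->
    {ok, BlockBin} = file:pread(FileHandle, BlockPos, BlockLen),
    BlockList = binary_to_term(BlockBin),
    lists:nth(N, BlockList).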

The slot_fetch is the key part of the HEAD stage of the GET process. The actual GET (which will occur on 1 not 3 vnodes) has 2 parts in the cdb file. The first part is an index lookup, which requires a calculation of the D. J. Bernstein hash of the key/SQN, and then a positioned file read of the integers at that position in the index (which will almost certainly be in the page cache). The average time for the index fetch (here in microseconds) varies significantly between the configurations.

CDBIndexFetchTimeCDB19US_24hr
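For context, the classic Bernstein hash used by the CDB format looks roughly like the following (a sketch of the general algorithm only - leveled_cdb's exact implementation may differ in detail):

%% Classic CDB-style Bernstein hash: h = 5381, then h = ((h bsl 5) + h) bxor byte,
%% truncated to 32 bits.
cdb_hash(KeyBin) when is_binary(KeyBin) ->
    lists:foldl(
        fun(Byte, Hash) ->
            (((Hash bsl 5) + Hash) bxor Byte) band 16#FFFFFFFF
        end,
        5381,
        binary_to_list(KeyBin)).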

The final part is the actual reading of the object from the disk (which may or may not be in the page cache). This again varies in line with the headline latency changes (measured here in milliseconds):

CDBObjFetchTimeCDB19MS_24hr

@martinsumner
Contributor Author

All of these processes involve both some CPU-related activity and some interaction with the virtual file system (and in some cases possibly the underlying disk).

If we compare these timings to the timings of larger CPU-only tasks, a difference emerges.

Firstly, looking at the time needed to create the slot_hashlist when writing a new SST file. This is a CPU-only activity, but a background task, i.e. not directly related to any end-user latency. The timings of these in milliseconds:

SSTSlotHashListSST13MS_24hr

Intriguingly, once into the latter part of the test, there is no improvement in the time for this task between reduced and default scheduler counts. However, there is a clear improvement when the schedulers are bound to CPUs.

Secondly, looking at the hashtree computation (milliseconds) in the leveled Inker (an infrequent background process) we can see a small gain, but only through scheduler binding:

CDBHashtreeComputeCDB07MS_24hr

@martinsumner
Contributor Author

On the write side, when an object is PUT there is a process to update the Bookie's memory, which is a CPU-only process. Timings here in microseconds:

BookieMemWriteTimeB0015US_24hr

This shows the greatest latency improvement with scheduler binding.

The second, and slower, phase is the writing of the object within the Inker, which includes some CPU work but also an interaction with the page cache. Timings here again in microseconds:

InkerWriteTimeB0015US_24hr

Now with the I/O interaction the big improvement is related to the reduction in scheduler counts.

@martinsumner
Contributor Author

Overall there seems to be a pattern. Pure CPU work is generally made faster by binding schedulers to CPUs. Work that interacts with the VFS is made faster/more efficient by reducing the count of schedulers.

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

But why would reducing the scheduler count have this impact?

If we look at the actual underlying volume of data being written to and read from the disk, it would be reasonable to expect that this will vary between the test runs in line with the throughput. Looking at write KB per second (to disk):

WritesKBpersec_24hr

The alignment between write volume and throughput appears to be roughly present as expected.

However, looking at read KBs per second (from disk):

ReadsKBpersec_24hr

Now the alignment doesn't appear to exist, especially with reduced scheduler counts. With reduced scheduler counts, there must be more reading from the VFS (as there is more throughput), but this is achieved with less reading from the actual disk. This is also true (but to a lesser extent) with scheduler binding.

This would imply that the VFS page cache is being used much more in these cases. But how are scheduler changes making the VFS page cache more efficient?

Looking at the reported memory deployed by the VFS page cache - there is not an obvious difference here:

CacheMemory_24hr

So the apparent improved VFS page cache efficiency with reduced schedulers does not have an obvious or easy explanation.

@martinsumner
Contributor Author

In summary, there is a performance test where we improve throughput by 22.5% by reducing the scheduler count, and get an improvement of 15.8% by binding schedulers to CPUs.

The reason why seems relatively obvious and expected for scheduler binding: reduced context switching improves the efficiency of CPU-centric tasks.

The reason why seems strange and mysterious for the reduced scheduler count. Yes, there are some CPU efficiency improvements potentially related to reduced context switching, but the biggest improvement appears to be in the efficiency of the VFS page cache on both reads and writes.

@martinsumner
Contributor Author

martinsumner commented May 3, 2022

Scheduler binding is a known way to improve performance in Erlang systems, and is advertised as such within the Erlang documentation.

In RabbitMQ, it is now the default to use scheduler binding.

However, the side effects associated with other activity on the same node can be severe. In RabbitMQ this is mitigated by stipulating that nodes used for RabbitMQ are dedicated to that purpose.

The same mitigation could be stipulated for Riak. However, even if no application workloads co-exist on the same node, all Riak nodes will have operational software (e.g. monitoring, security, backups etc). That operational software may have a limited throughput when working correctly - but may also have error conditions where it works in unexpected ways.

Overall, the potential side effects of scheduler binding seem out-of-step with the primary aim of Riak to be reliable in exceptional circumstances (as a priority over reduced latency in the median case).

@martinsumner
Contributor Author

Reducing scheduler counts appears to be a safer way to improve performance, but without a full understanding of why it is improving performance, it doesn't seem correct to make this a default setting. This is especially true as what might be an improvement with a fully beam-based backend like leveled may not be true with NIF-based backends like eleveldb/bitcask (where dirty scheduler improvements have not been made).

@martinsumner
Contributor Author

One interesting aside is - if the throughput improvements for scheduler binding and scheduler reductions are not tightly correlated, what would happen if the two changes were combined? More throughput improvements?

See below for the log-scale throughput chart, this time with a green line showing the combination of the two changes:

ThroughputComparison_24hr_LogScale_withCombo

In the initial part of the test - there is an improvement over and above what can be achieved through either setting. However, towards the end of the test, the combined setting has lower throughput than either individual setting. At the 250M update mark, the throughput improvement is just 4.7% more than the default.

This is hard to explain, but it is noticeable that the I/O and page-cache-related improvements appear to reverse when reduced schedulers are combined with bound schedulers.

The throughput towards the end of the test is much more important than at the beginning - most Riak clusters spend most of their time processing requests with a large set of data already present. So combining the two changes does not seem to be as good as making either one of the changes.

@martinsumner
Contributor Author

One thing of potential interest to emerge in this thread is the sbwt setting.

The reasons described by @seancribbs for changing this relate specifically to containerised environments, but perhaps it is worth adding this to the riak.conf tuneable parameters, and experimenting further to see if there is any influence on context switching through the reduction of busy waiting.
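For reference, a sketch of the relevant flag and its dirty-scheduler equivalents (added in more recent OTP releases); as this is not currently a riak.conf tuneable, it would need to go into vm.args or equivalent:

+sbwt none       ## disable busy-wait on the standard schedulers
+sbwtdcpu none   ## dirty CPU schedulers
+sbwtdio none    ## dirty IO schedulers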

It was noticeable during testing of the combined binding/reduced setup that the "Other" count was the dominant form of busyness on the standard schedulers:

       Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
       aux( 1)    1.43%    0.50%    0.00%    0.00%    0.17%    0.00%   97.90%
dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 4)    0.00%    0.00%    0.00%    1.74%    0.09%    0.00%   98.17%
dirty_cpu_( 5)    0.00%    0.00%    0.02%    6.09%    0.45%    0.00%   93.44%
dirty_cpu_( 6)    0.00%    0.00%    0.01%    2.70%    0.20%    0.00%   97.10%
dirty_cpu_( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(11)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 1)    0.00%    0.00%   75.61%    0.00%    5.27%    0.00%   19.12%
dirty_io_s( 2)    0.00%    0.00%   74.64%    0.00%    5.65%    0.00%   19.71%
dirty_io_s( 3)    0.00%    0.00%   74.86%    0.00%    4.95%    0.00%   20.19%
dirty_io_s( 4)    0.00%    0.00%   76.37%    0.00%    5.07%    0.00%   18.56%
dirty_io_s( 5)    0.00%    0.00%   71.89%    0.00%    5.48%    0.00%   22.63%
dirty_io_s( 6)    0.00%    0.00%   73.23%    0.00%    5.41%    0.00%   21.36%
dirty_io_s( 7)    0.00%    0.00%   72.72%    0.00%    5.23%    0.00%   22.05%
dirty_io_s( 8)    0.00%    0.00%   76.20%    0.00%    5.07%    0.00%   18.73%
dirty_io_s( 9)    0.00%    0.00%   72.99%    0.00%    5.40%    0.00%   21.60%
dirty_io_s(10)    0.00%    0.00%   75.00%    0.00%    5.50%    0.00%   19.50%
      poll( 0)    0.00%    0.61%    0.00%    0.00%    0.00%    0.00%   99.39%
 scheduler( 1)    1.78%    0.31%   24.80%    2.31%   29.83%    1.39%   39.58%
 scheduler( 2)    1.75%    0.32%   23.67%    2.45%   30.19%    1.43%   40.19%
 scheduler( 3)    1.79%    0.34%   25.18%    2.34%   30.60%    1.48%   38.27%
 scheduler( 4)    1.72%    0.33%   23.94%    2.30%   29.71%    1.49%   40.51%
 scheduler( 5)    1.74%    0.31%   24.15%    2.37%   29.96%    1.44%   40.03%
 scheduler( 6)    1.78%    0.30%   23.26%    2.19%   29.81%    1.38%   41.28%
 scheduler( 7)    1.91%    0.35%   26.51%    2.58%   29.54%    1.71%   37.40%
 scheduler( 8)    1.95%    0.34%   27.67%    2.75%   29.31%    1.69%   36.29%
 scheduler( 9)    1.86%    0.35%   26.29%    2.55%   29.39%    1.71%   37.85%
 scheduler(10)    1.94%    0.36%   27.61%    2.76%   29.44%    1.76%   36.12%
 scheduler(11)    1.92%    0.35%   27.22%    2.71%   29.06%    1.74%   37.00%
 scheduler(12)    1.90%    0.37%   27.15%    2.65%   29.53%    1.84%   36.56%
 scheduler(13)    1.74%    0.32%   23.98%    2.30%   29.97%    1.45%   40.25%
 scheduler(14)    1.76%    0.32%   23.72%    2.31%   29.97%    1.46%   40.46%
 scheduler(15)    1.77%    0.34%   24.33%    2.30%   30.32%    1.50%   39.46%
 scheduler(16)    1.79%    0.33%   25.17%    2.14%   29.90%    1.49%   39.19%
 scheduler(17)    1.81%    0.32%   25.50%    2.27%   30.29%    1.54%   38.27%
 scheduler(18)    1.59%    0.29%   21.21%    1.92%   28.39%    1.26%   45.34%
 scheduler(19)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(20)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(21)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(22)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(23)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(24)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    1.43%    0.50%    0.00%    0.00%    0.17%    0.00%   97.90%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.88%    0.06%    0.00%   99.06%
dirty_io_sched    0.00%    0.00%   74.35%    0.00%    5.30%    0.00%   20.35%
          poll    0.00%    0.61%    0.00%    0.00%    0.00%    0.00%   99.39%
     scheduler    1.35%    0.25%   18.81%    1.80%   22.30%    1.16%   54.33%
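
The table above is in the format produced by microstate accounting; it can be gathered on a running node using the msacc module from runtime_tools (a sketch, taking a 10-second sample):

%% Collect microstate accounting for 10 seconds, then print the per-thread table
msacc:start(10000).
msacc:print().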

More information on this potential change.

@martinsumner martinsumner self-assigned this May 3, 2022
@martinsumner
Contributor Author

A further note on the combination of reduced scheduler counts and binding. When this combined setting is in place, the CPU-bound activity gets faster than with just the normal number of schedulers bound. However, any improvement related to interaction with the page cache is entirely negated.

Hence we see in the curve that throughput is improved while the cluster is primarily CPU bound, but worse when the cluster is more disk bound.

@martinsumner
Contributor Author

martinsumner commented May 17, 2022

The tests have been re-run with some subtle changes to the load, and over a longer test (48 hours). Towards the end of this test, disk busyness is almost totally dominant in determining throughput.

There are three variants from defaults tested:

  • Reduced scheduler counts
  • Bound schedulers
  • Disabled busy-wait threshold

Looking at the results with the different VM/scheduler settings - improvements in throughput in the mid-part of the test are seen with all variants:

ThroughputComparison_48h

What we can see in all variants is an improvement in response times, especially GET:

GetTime_48h

What is noticeable is the difference in CPU utilisation between the settings. The test is run on a cluster of 12-core (24 vCPU) servers. Disabling the busy wait gives throughput improvements whilst using 3.4 fewer vCPUs on average compared to the VM defaults. This is a significant improvement in efficiency:

CPUutilisation_48h

@martinsumner
Contributor Author

Note RabbitMQ has gained improvement through scheduler binding - rabbitmq/rabbitmq-server#612.

However, what is the impact of this change if there are other processes running on the same machine? (At the NHS there was an incident related to scheduler binding on RabbitMQ - although in that case RabbitMQ was not being run on a dedicated machine as Rabbit guidance recommends.) Riak guidance is also to run Riak on dedicated machines, but even if the machine is "dedicated" there will still be other processes for operational reasons (e.g. monitoring), and it cannot be guaranteed that such processes will always behave as expected. So I'm wary of scheduler binding as a default approach.

My preference would be to disable busy-wait as the recommended change, primarily because of the CPU efficiency improvements we see. However, there is a specific warning in the docs that this flag may not be available in future releases.
