admission: add support for disk bandwidth as a bottleneck resource
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into which reads missed the OS page cache.
  That is, we don't know how much of the read bandwidth was due to incoming
  reads (that we don't shape) and how much was due to compactions.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy-to-understand additive increase and multiplicative
  decrease (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).
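
As a rough illustration of the additive-increase/multiplicative-decrease
idea described above, here is a minimal sketch. The enum values, the
function name, the step size, and the floor are all assumptions made for
this example and are not taken from disk_bandwidth.go, whose actual policy
may differ.

```go
package main

import "fmt"

// diskLoadLevel is a hypothetical stand-in for the load-level enum
// mentioned in the commit message.
type diskLoadLevel int

const (
	loadLow diskLoadLevel = iota
	loadModerate
	loadHigh
	loadOverload
)

// adjustElasticTokens applies a simple AIMD rule to the elastic token
// budget (in bytes per interval). All constants are assumed values.
func adjustElasticTokens(tokens int64, level diskLoadLevel) int64 {
	const (
		additiveStep = 1 << 20 // assumed: grow by 1 MiB when load is low
		minTokens    = 1 << 20 // assumed: never shrink below 1 MiB
	)
	switch level {
	case loadLow:
		tokens += additiveStep // additive increase
	case loadModerate:
		// Hold steady.
	case loadHigh, loadOverload:
		tokens /= 2 // multiplicative decrease
	}
	if tokens < minTokens {
		tokens = minTokens
	}
	return tokens
}

func main() {
	tokens := int64(8 << 20)
	for _, l := range []diskLoadLevel{loadLow, loadLow, loadOverload, loadModerate} {
		tokens = adjustElasticTokens(tokens, l)
		fmt.Println(tokens)
	}
}
```

Because the limiter sees only the enum in this sketch, the way the load
level is computed can change without touching the adjustment rule.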

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on the number of regular concurrent compactions, and if it needs
  more it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in #82813 had code for it).
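
Implying a work-class from the priority, as the prototype does, could look
roughly like the sketch below. The type names, constant names, and numeric
priority values are hypothetical stand-ins for this example; they are not
claimed to match admissionpb.

```go
package main

import "fmt"

// workPriority is a hypothetical stand-in for admissionpb.WorkPriority.
type workPriority int8

// Assumed priority values, chosen only for illustration.
const (
	lowPri    workPriority = -128
	normalPri workPriority = 0
	highPri   workPriority = 127
)

// workClass distinguishes regular work from elastic work.
type workClass int

const (
	regularWorkClass workClass = iota
	elasticWorkClass
)

// workClassFromPriority treats any priority below normalPri as elastic.
func workClassFromPriority(pri workPriority) workClass {
	if pri < normalPri {
		return elasticWorkClass
	}
	return regularWorkClass
}

func main() {
	for _, pri := range []workPriority{lowPri, normalPri, highPri} {
		fmt.Println(pri, workClassFromPriority(pri) == elasticWorkClass)
	}
}
```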

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It uses the load level computed by diskLoadWatcher
  to limit write tokens for elastic writes and in the future will also
  limit compactions.

There are significant refactoring and other changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvStoreTokenGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).
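
The accounting described above can be sketched as follows. This is a
simplified illustration under stated assumptions: the struct, field, and
method names are hypothetical, work classes are ignored, and the real
granter in granter.go is more involved.

```go
package main

import "fmt"

// storeTokens is a hypothetical holder for the two token pools.
type storeTokens struct {
	l0Tokens     int64 // tokens for bytes into L0
	diskBWTokens int64 // tokens for all bytes into the LSM
}

// tryAdmit deducts the request's total incoming bytes from both pools,
// since at admission time we do not yet know how many bytes go into L0.
func (s *storeTokens) tryAdmit(requestBytes int64) bool {
	if s.l0Tokens <= 0 || s.diskBWTokens <= 0 {
		return false
	}
	s.l0Tokens -= requestBytes
	s.diskBWTokens -= requestBytes
	return true
}

// workDone is called at completion with the bytes that actually went into
// L0. The difference is returned to the L0 pool; the disk bandwidth pool
// keeps the full deduction because all incoming bytes consume bandwidth.
func (s *storeTokens) workDone(requestBytes, l0Bytes int64) {
	s.l0Tokens += requestBytes - l0Bytes
}

func main() {
	s := &storeTokens{l0Tokens: 100, diskBWTokens: 1000}
	if s.tryAdmit(40) {
		s.workDone(40, 10) // only 10 of the 40 bytes landed in L0
	}
	fmt.Println(s.l0Tokens, s.diskBWTokens)
}
```

Keeping a single requested quantity (total bytes) and correcting the L0
pool at completion avoids a multi-dimensional granter-requester interface.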

This is a cleaned-up version of the prototype in #82813, which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs #82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola committed Aug 11, 2022
1 parent 6c24f35 commit b02940e
Showing 16 changed files with 1,866 additions and 596 deletions.
1 change: 1 addition & 0 deletions docs/generated/settings/settings.html
@@ -1,6 +1,7 @@
<table>
<thead><tr><th>Setting</th><th>Type</th><th>Default</th><th>Description</th></tr></thead>
<tbody>
<tr><td><code>admission.disk_bandwidth_tokens.elastic.enabled</code></td><td>boolean</td><td><code>true</code></td><td>when true, and provisioned bandwidth for the disk corresponding to a store is configured, tokens for elastic work will be limited if disk bandwidth becomes a bottleneck</td></tr>
<tr><td><code>admission.epoch_lifo.enabled</code></td><td>boolean</td><td><code>false</code></td><td>when true, epoch-LIFO behavior is enabled when there is significant delay in admission</td></tr>
<tr><td><code>admission.epoch_lifo.epoch_closing_delta_duration</code></td><td>duration</td><td><code>5ms</code></td><td>the delta duration before closing an epoch, for epoch-LIFO admission control ordering</td></tr>
<tr><td><code>admission.epoch_lifo.epoch_duration</code></td><td>duration</td><td><code>100ms</code></td><td>the duration of an epoch, for epoch-LIFO admission control ordering</td></tr>
2 changes: 2 additions & 0 deletions pkg/util/admission/BUILD.bazel
@@ -4,6 +4,7 @@ load("@io_bazel_rules_go//go:def.bzl", "go_library", "go_test")
go_library(
name = "admission",
srcs = [
"disk_bandwidth.go",
"doc.go",
"granter.go",
"store_token_estimation.go",
@@ -32,6 +33,7 @@ go_library(
go_test(
name = "admission_test",
srcs = [
"disk_bandwidth_test.go",
"granter_test.go",
"store_token_estimation_test.go",
"work_queue_test.go",
335 changes: 335 additions & 0 deletions pkg/util/admission/disk_bandwidth.go

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions pkg/util/admission/disk_bandwidth_test.go
@@ -0,0 +1,11 @@
// Copyright 2022 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

package admission
2 changes: 2 additions & 0 deletions pkg/util/admission/doc.go
@@ -48,6 +48,8 @@
// either in a comment here or a separate RFC.
//

// TODO(sumeer): update with all the recent changes.

// Internal organization:
//
// The package is mostly structured as a set of interfaces that are meant to