[WIP] admission: add support for disk bandwidth as a bottleneck resource #82813
Conversation
In addition to byte tokens for writes computed based on the compaction rate out of L0, we now compute byte tokens based on how fast the system can flush memtables into L0. The motivation is that writing to the memtable, or creating memtables, faster than the system can flush results in write stalls due to memtables, which create a latency hiccup for all write traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate at which the flush loop can operate. This relies on the recently added pebble.InternalIntervalMetrics.
- The peak rate causes 100% utilization of the single flush thread, and that is potentially too high to prevent write stalls (depending on how long it takes to do a single flush). So we multiply the smoothed peak rate by a utilization-target-fraction, which is dynamically adjusted and by default is constrained to the interval [0.5, 1.5]. There is additive increase and decrease of this fraction:
  - High usage of tokens and no write stalls cause an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used if there are multiple write stalls, so that the probing settles in the region where there are no write stalls.

Note that this probing scheme cannot eliminate all write stalls. For now we are ok with a reduction in write stalls.

For convenience, and some additional justification mentioned in a code comment, the scheme uses the minimum of the flush and compaction tokens for writes to L0. This means that sstable ingestion into L0 is also subject to such tokens. The periodic token computation continues to be done at 15s intervals. However, instead of giving out these tokens at 1s intervals, we now give them out at 250ms intervals. This is to reduce burstiness, since burstiness can cause write stalls.

There is a new metric, storage.write-stall-nanos, that measures the cumulative duration of write stalls; it gives a more intuitive feel for how the system is behaving than a write stall count. The scheme can be disabled by increasing the cluster setting admission.min_flush_util_percent, which defaults to 50% (corresponding to the 0.5 lower bound mentioned earlier), to a high value, say 1000%.

The scheme was evaluated using a single-node cluster, with the node having a high CPU count such that CPU was not a bottleneck, even with max compaction concurrency set to 8. A kv0 workload with high concurrency and 4KB writes was used to overload the store. Due to the high compaction concurrency, L0 stayed below the unhealthy thresholds, and the resource bottleneck became the total bandwidth provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data and the compaction backlog was not very heavy, so there was less queueing for the limited disk bandwidth (it was still usually saturated).
- later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of write stalls by > 5x. For the early-life case, ~750ms per second was spent in a write stall with flush tokens off. The later-life case had ~200ms per second of write stalls with flush tokens off. The lower value of the latter is paradoxically due to the worse bandwidth saturation: fsync latency rose from 2-4ms with flush tokens on, to 11-20ms with flush tokens off. This increase imposed a natural backpressure on writes due to the need to sync the WAL. In contrast, the fsync latency was low in the early-life case, though it did increase from 0.125ms to 0.25ms when flush tokens were turned off. In both cases, the admission throughput did not increase when turning off flush tokens. That is, the system cannot sustain more throughput; but by turning on flush tokens, we shift queueing from the disk layer to the admission control layer (where we have the capability to reorder work).

Fixes cockroachdb#77357

Release note (ops change): The cluster setting admission.min_flush_util_percent can be used to disable or tune flush-throughput-based admission tokens for writes to a store. Tokens based on flush throughput attempt to reduce storage-layer write stalls.
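To make the flush-token mechanics concrete, here is a minimal Go sketch of the utilization-target-fraction adjustment described above. All names and constants (flushTokenEstimator, additiveStep, the smoothing factor) are illustrative stand-ins, not the actual pkg/util/admission identifiers or tuning.

```go
// Sketch of flush-token computation: smooth the peak flush rate, adjust a
// utilization target fraction with additive increase/decrease, and emit
// byte tokens for the next interval.
package main

import "fmt"

type flushTokenEstimator struct {
	smoothedPeakFlushRate float64 // bytes/s the flush loop can do at 100% utilization
	utilTargetFraction    float64 // dynamically adjusted within [0.5, 1.5]
	numConsecutiveStalls  int
}

const (
	minUtilFraction = 0.5 // corresponds to admission.min_flush_util_percent = 50%
	maxUtilFraction = 1.5
	additiveStep    = 0.025
)

// update runs at each 15s token-computation interval and returns the byte
// tokens for the next interval (which would then be handed out in 250ms
// slices to reduce burstiness).
func (e *flushTokenEstimator) update(peakFlushRate float64, hadWriteStall, highTokenUsage bool) float64 {
	// Exponentially smooth the measured peak flush rate.
	const alpha = 0.5
	e.smoothedPeakFlushRate = alpha*peakFlushRate + (1-alpha)*e.smoothedPeakFlushRate

	switch {
	case hadWriteStall:
		// Additive decrease, scaled up for repeated stalls so that probing
		// settles in the stall-free region.
		e.numConsecutiveStalls++
		step := additiveStep * float64(e.numConsecutiveStalls)
		if e.utilTargetFraction-step > minUtilFraction {
			e.utilTargetFraction -= step
		} else {
			e.utilTargetFraction = minUtilFraction
		}
	case highTokenUsage:
		// Additive increase while tokens are heavily used and no stalls occur.
		e.numConsecutiveStalls = 0
		if e.utilTargetFraction+additiveStep < maxUtilFraction {
			e.utilTargetFraction += additiveStep
		} else {
			e.utilTargetFraction = maxUtilFraction
		}
	default:
		e.numConsecutiveStalls = 0
	}
	return 15 * e.smoothedPeakFlushRate * e.utilTargetFraction
}

func main() {
	e := &flushTokenEstimator{smoothedPeakFlushRate: 100 << 20, utilTargetFraction: 1.0}
	fmt.Printf("flush tokens for next 15s: %.0f bytes\n", e.update(120<<20, false, true))
}
```

The actual scheme then takes the minimum of these flush tokens and the compaction-based tokens for writes to L0.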
The first commit is from #82440.

We assume that:
- There is a known, provisioned limit on the sum of read and write bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into which reads missed the OS page cache. That is, we don't know how much of the read bandwidth was due to incoming reads (which we don't shape) and how much was due to compaction reads.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit wildly different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design and strong abstraction boundaries:
- The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently.
- The approach uses easy-to-understand additive increase and multiplicative decrease (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": this can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter.
- Optional compactions: we assume that the LSM store is configured with a ceiling on the number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble/issues/1329 for motivation.

The reader should start with disk_bandwidth.go, consisting of:
- diskLoadWatcher: computes load levels.
- compactionLimiter: tracks all compaction slots and limits optional compactions.
- diskBandwidthLimiter: composes the previous two objects and uses load information to limit write tokens for elastic writes and to limit compactions.

There is significant refactoring and change in granter.go and work_queue.go, driven by the fact that:
- Previously the tokens were only for L0; now we need to support both tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter).
- Elastic work is in a different WorkQueue than regular work, but they compete for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across multiple requesters, via multiple child granters. A number of interfaces are adjusted to make this viable. In general, the GrantCoordinator is now slightly dumber and some of that logic has moved into the granters.

For the former (two kinds of tokens), I considered adding multiple resource dimensions to the granter-requester interaction but found it too complicated. Instead we rely on the observation that we can request tokens based on the total incoming bytes of the request (not just L0), and when the request is completed, can tell the granter how many bytes went into L0. The latter allows us to return tokens to L0.

There was also the (unrelated) realization that we can use the information about the size of the batch in the call to AdmittedWorkDone, and fix estimation that we previously had to make pre-evaluation. This resulted in a bunch of changes to how we do estimation to adjust the tokens consumed: we now estimate how much we need to compensate what is being asked for at (a) admission time, (b) work-done time, for the bytes added to the LSM, and (c) work-done time, for the bytes added to L0. Since we are asking for tokens at admission time based on the full incoming bytes, the estimation of what fraction of an ingest goes into L0 is eliminated. This had the consequence of simplifying some of the estimation logic that was distinguishing writes from ingests.

There are no tests, so this code is probably littered with bugs. Next steps:
- Unit tests
- Pebble changes for IntervalCompactionInfo
- CockroachDB changes for IntervalDiskLoadInfo
- Experimental evaluation and tuning
- Separate into multiple PRs for review
- KV and storage package plumbing for properly populating StoreWriteWorkInfo.{WriteBytes,IngestRequest} for ingestions and StoreWorkDoneInfo.{ActualBytes,ActualBytesIntoL0} for writes and ingestions.

Release note: None
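The two-token accounting can be sketched as follows. storeGranter, tryAdmit, and workDone are hypothetical stand-ins for the granter interfaces touched by this change, shown only to illustrate asking for tokens on total incoming bytes at admission and correcting the L0 tokens on completion.

```go
// Sketch: deduct both L0 and LSM-wide tokens up front, then return the
// portion of L0 tokens that the work turned out not to need.
package main

import "fmt"

type storeGranter struct {
	l0Tokens   int64 // tokens for bytes flushed/ingested into L0
	lsmTokens  int64 // tokens for all bytes written into the LSM
}

// tryAdmit deducts both token kinds based on the declared request size.
// At admission time we pessimistically assume all bytes may reach L0.
func (g *storeGranter) tryAdmit(requestBytes int64) bool {
	if g.l0Tokens <= 0 || g.lsmTokens <= 0 {
		return false
	}
	g.l0Tokens -= requestBytes
	g.lsmTokens -= requestBytes
	return true
}

// workDone fixes up the estimate: bytes that bypassed L0 are returned to
// the L0 bucket (this mirrors AdmittedWorkDone-style reporting).
func (g *storeGranter) workDone(requestBytes, actualBytesIntoL0 int64) {
	g.l0Tokens += requestBytes - actualBytesIntoL0
}

func main() {
	g := &storeGranter{l0Tokens: 10 << 20, lsmTokens: 100 << 20}
	if g.tryAdmit(4 << 20) {
		// Suppose only 1 MiB of the 4 MiB ingest landed in L0.
		g.workDone(4<<20, 1<<20)
	}
	fmt.Println(g.l0Tokens, g.lsmTokens) // 9 MiB and 96 MiB remain
}
```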
Force-pushed from e51e900 to f8a4c4d.
Now running a mix of regular and elastic traffic.
Then added elastic traffic with high concurrency (1024); this is more than enough to blow past the provisioned limit if there were no disk bandwidth control. The throughput of regular traffic stays stable.
Logs before adding elastic traffic
Then after adding elastic traffic, we first start increasing the elastic tokens:
We then stabilize:
The challenge is the sharp transition from a utilization of 0.7 or less to > 0.95. This is all because of compactions: there is a lag from writes to their full implication in terms of write amplification. Also, when we start cutting tokens there is a sharp fall from > 0.95; that is partly because of our multiplicative decrease, but we have tried to dampen the multiplicative decrease and to start growing quickly again, since otherwise we would fall too much.

I220712 18:12:04.770141 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 289 diskLoadWatcher: rb: 0 B, wb: 80 MiB, pb: 95 MiB, util: 0.84
I220712 18:12:19.770363 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 300 diskLoadWatcher: rb: 273 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:12:34.770694 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 312 diskLoadWatcher: rb: 0 B, wb: 115 MiB, pb: 95 MiB, util: 1.21
I220712 18:12:49.770632 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 324 diskLoadWatcher: rb: 0 B, wb: 102 MiB, pb: 95 MiB, util: 1.07
I220712 18:13:04.769926 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 335 diskLoadWatcher: rb: 0 B, wb: 80 MiB, pb: 95 MiB, util: 0.84
I220712 18:13:19.770618 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 347 diskLoadWatcher: rb: 0 B, wb: 33 MiB, pb: 95 MiB, util: 0.35
I220712 18:13:34.770323 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 358 diskLoadWatcher: rb: 0 B, wb: 11 MiB, pb: 95 MiB, util: 0.11
I220712 18:13:49.770645 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 370 diskLoadWatcher: rb: 0 B, wb: 2.6 MiB, pb: 95 MiB, util: 0.03
I220712 18:14:04.769960 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 383 diskLoadWatcher: rb: 0 B, wb: 266 MiB, pb: 95 MiB, util: 2.79
I220712 18:14:19.770059 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 394 diskLoadWatcher: rb: 819 B, wb: 250 MiB, pb: 95 MiB, util: 2.63
I220712 18:14:34.769914 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 406 diskLoadWatcher: rb: 546 B, wb: 243 MiB, pb: 95 MiB, util: 2.54
I220712 18:14:49.770237 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 419 diskLoadWatcher: rb: 0 B, wb: 76 MiB, pb: 95 MiB, util: 0.80
I220712 18:15:04.770697 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 431 diskLoadWatcher: rb: 0 B, wb: 2.1 MiB, pb: 95 MiB, util: 0.02
I220712 18:15:19.770365 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 442 diskLoadWatcher: rb: 273 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:15:34.770506 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 453 diskLoadWatcher: rb: 0 B, wb: 39 MiB, pb: 95 MiB, util: 0.41
I220712 18:15:49.771073 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 465 diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:16:04.770788 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 476 diskLoadWatcher: rb: 0 B, wb: 105 MiB, pb: 95 MiB, util: 1.10
I220712 18:16:19.769824 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 487 diskLoadWatcher: rb: 0 B, wb: 42 MiB, pb: 95 MiB, util: 0.44
I220712 18:16:34.770666 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 498 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:16:49.770379 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 510 diskLoadWatcher: rb: 0 B, wb: 70 MiB, pb: 95 MiB, util: 0.73
I220712 18:17:04.770687 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 521 diskLoadWatcher: rb: 0 B, wb: 77 MiB, pb: 95 MiB, util: 0.80
I220712 18:17:19.770664 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 532 diskLoadWatcher: rb: 0 B, wb: 118 MiB, pb: 95 MiB, util: 1.24
I220712 18:17:34.770083 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 543 diskLoadWatcher: rb: 0 B, wb: 3.0 MiB, pb: 95 MiB, util: 0.03
I220712 18:17:49.770806 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 555 diskLoadWatcher: rb: 0 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:18:04.770748 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 566 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:18:19.770290 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 578 diskLoadWatcher: rb: 0 B, wb: 67 MiB, pb: 95 MiB, util: 0.70
I220712 18:18:34.770280 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 589 diskLoadWatcher: rb: 0 B, wb: 104 MiB, pb: 95 MiB, util: 1.10
I220712 18:18:49.769979 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 600 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:19:04.770342 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 612 diskLoadWatcher: rb: 0 B, wb: 17 MiB, pb: 95 MiB, util: 0.18
I220712 18:19:19.771061 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 623 diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:19:34.770318 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 636 diskLoadWatcher: rb: 0 B, wb: 96 MiB, pb: 95 MiB, util: 1.01
I220712 18:19:49.769739 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 650 diskLoadWatcher: rb: 0 B, wb: 13 MiB, pb: 95 MiB, util: 0.14
I220712 18:20:04.769936 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 663 diskLoadWatcher: rb: 0 B, wb: 42 MiB, pb: 95 MiB, util: 0.44
I220712 18:20:19.770775 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 674 diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.54
I220712 18:20:34.775699 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 685 diskLoadWatcher: rb: 273 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:20:49.770837 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 696 diskLoadWatcher: rb: 273 B, wb: 103 MiB, pb: 95 MiB, util: 1.08
I220712 18:21:04.770360 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 708 diskLoadWatcher: rb: 0 B, wb: 9.4 MiB, pb: 95 MiB, util: 0.10
I220712 18:21:19.771030 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 719 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:21:34.769898 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 730 diskLoadWatcher: rb: 273 B, wb: 59 MiB, pb: 95 MiB, util: 0.62
I220712 18:21:49.770729 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 742 diskLoadWatcher: rb: 0 B, wb: 40 MiB, pb: 95 MiB, util: 0.42
I220712 18:22:04.769814 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 753 diskLoadWatcher: rb: 273 B, wb: 62 MiB, pb: 95 MiB, util: 0.65
I220712 18:22:19.770621 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 764 diskLoadWatcher: rb: 0 B, wb: 71 MiB, pb: 95 MiB, util: 0.75
I220712 18:22:34.769902 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 775 diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:22:49.769792 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 787 diskLoadWatcher: rb: 0 B, wb: 84 MiB, pb: 95 MiB, util: 0.88
I220712 18:23:04.770131 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 798 diskLoadWatcher: rb: 273 B, wb: 74 MiB, pb: 95 MiB, util: 0.78
I220712 18:23:19.770370 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 809 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:23:34.770599 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 820 diskLoadWatcher: rb: 273 B, wb: 121 MiB, pb: 95 MiB, util: 1.27
I220712 18:23:49.771022 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 831 diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.51
I220712 18:24:04.770034 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 843 diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.49
I220712 18:24:19.770685 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 854 diskLoadWatcher: rb: 273 B, wb: 90 MiB, pb: 95 MiB, util: 0.95
I220712 18:24:34.770236 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 867 diskLoadWatcher: rb: 273 B, wb: 96 MiB, pb: 95 MiB, util: 1.00
I220712 18:24:49.770619 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 881 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:25:04.769913 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 892 diskLoadWatcher: rb: 273 B, wb: 51 MiB, pb: 95 MiB, util: 0.53
I220712 18:25:19.770673 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 903 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.55
I220712 18:25:34.770651 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 914 diskLoadWatcher: rb: 273 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:25:49.770871 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 926 diskLoadWatcher: rb: 273 B, wb: 122 MiB, pb: 95 MiB, util: 1.28
I220712 18:26:04.770125 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 938 diskLoadWatcher: rb: 0 B, wb: 69 MiB, pb: 95 MiB, util: 0.72
I220712 18:26:19.770632 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 949 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:26:34.770592 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 960 diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.58
I220712 18:26:49.770778 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 972 diskLoadWatcher: rb: 0 B, wb: 62 MiB, pb: 95 MiB, util: 0.65
I220712 18:27:04.770242 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 983 diskLoadWatcher: rb: 0 B, wb: 117 MiB, pb: 95 MiB, util: 1.23
I220712 18:27:19.770137 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 994 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:27:34.770353 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1005 diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.49
I220712 18:27:49.770114 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1017 diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:28:04.769837 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1029 diskLoadWatcher: rb: 0 B, wb: 116 MiB, pb: 95 MiB, util: 1.22
I220712 18:28:19.770701 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1040 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.55
I220712 18:28:34.770075 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1051 diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.50
I220712 18:28:49.769819 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1062 diskLoadWatcher: rb: 0 B, wb: 131 MiB, pb: 95 MiB, util: 1.37
I220712 18:29:04.770145 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1074 diskLoadWatcher: rb: 0 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:29:19.770112 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1085 diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.52
I220712 18:29:34.770037 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1098 diskLoadWatcher: rb: 0 B, wb: 69 MiB, pb: 95 MiB, util: 0.72
I220712 18:29:49.770141 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1112 diskLoadWatcher: rb: 0 B, wb: 116 MiB, pb: 95 MiB, util: 1.22
I220712 18:30:04.770340 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1125 diskLoadWatcher: rb: 0 B, wb: 58 MiB, pb: 95 MiB, util: 0.60
I220712 18:30:19.770347 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1136 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:30:34.770577 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1147 diskLoadWatcher: rb: 0 B, wb: 128 MiB, pb: 95 MiB, util: 1.34
I220712 18:30:49.770405 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1158 diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.54
I220712 18:31:04.770181 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1170 diskLoadWatcher: rb: 0 B, wb: 61 MiB, pb: 95 MiB, util: 0.64
I220712 18:31:19.770070 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1181 diskLoadWatcher: rb: 0 B, wb: 59 MiB, pb: 95 MiB, util: 0.61
I220712 18:31:34.770327 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1192 diskLoadWatcher: rb: 0 B, wb: 121 MiB, pb: 95 MiB, util: 1.27
I220712 18:31:49.771027 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1203 diskLoadWatcher: rb: 0 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:32:04.770572 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1214 diskLoadWatcher: rb: 0 B, wb: 91 MiB, pb: 95 MiB, util: 0.96
I220712 18:32:19.770161 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1226 diskLoadWatcher: rb: 0 B, wb: 31 MiB, pb: 95 MiB, util: 0.32
I220712 18:32:34.770428 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1237 diskLoadWatcher: rb: 0 B, wb: 57 MiB, pb: 95 MiB, util: 0.60
I220712 18:32:49.770396 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1248 diskLoadWatcher: rb: 0 B, wb: 56 MiB, pb: 95 MiB, util: 0.58
I220712 18:33:04.770595 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1260 diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:33:19.770179 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1271 diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.51
I220712 18:33:34.770001 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1282 diskLoadWatcher: rb: 0 B, wb: 77 MiB, pb: 95 MiB, util: 0.81
I220712 18:33:49.770413 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1293 diskLoadWatcher: rb: 0 B, wb: 94 MiB, pb: 95 MiB, util: 0.98
I220712 18:34:04.770672 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1304 diskLoadWatcher: rb: 0 B, wb: 2.6 MiB, pb: 95 MiB, util: 0.03
I220712 18:34:19.770153 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1316 diskLoadWatcher: rb: 0 B, wb: 51 MiB, pb: 95 MiB, util: 0.53
I220712 18:34:34.770660 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1329 diskLoadWatcher: rb: 0 B, wb: 58 MiB, pb: 95 MiB, util: 0.61
I220712 18:34:49.770319 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1342 diskLoadWatcher: rb: 0 B, wb: 50 MiB, pb: 95 MiB, util: 0.53
I220712 18:35:04.770335 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1354 diskLoadWatcher: rb: 273 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:35:19.771075 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1365 diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.57
I220712 18:35:34.769749 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1376 diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:35:49.769878 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1387 diskLoadWatcher: rb: 273 B, wb: 95 MiB, pb: 95 MiB, util: 1.00
I220712 18:36:04.770526 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1399 diskLoadWatcher: rb: 0 B, wb: 18 MiB, pb: 95 MiB, util: 0.19
I220712 18:36:19.769831 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1410 diskLoadWatcher: rb: 0 B, wb: 45 MiB, pb: 95 MiB, util: 0.47
I220712 18:36:34.769896 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1421 diskLoadWatcher: rb: 273 B, wb: 50 MiB, pb: 95 MiB, util: 0.52
I220712 18:36:49.770154 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1432 diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.58
I220712 18:37:04.770233 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1443 diskLoadWatcher: rb: 273 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:37:19.770668 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1455 diskLoadWatcher: rb: 0 B, wb: 57 MiB, pb: 95 MiB, util: 0.60
I220712 18:37:34.770360 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1466 diskLoadWatcher: rb: 273 B, wb: 114 MiB, pb: 95 MiB, util: 1.20
I220712 18:37:49.770815 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1478 diskLoadWatcher: rb: 273 B, wb: 40 MiB, pb: 95 MiB, util: 0.42
I220712 18:38:04.769966 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1490 diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:38:19.770167 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1501 diskLoadWatcher: rb: 273 B, wb: 89 MiB, pb: 95 MiB, util: 0.94
I220712 18:38:34.770339 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1512 diskLoadWatcher: rb: 0 B, wb: 82 MiB, pb: 95 MiB, util: 0.86
I220712 18:38:49.770747 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1523 diskLoadWatcher: rb: 273 B, wb: 98 MiB, pb: 95 MiB, util: 1.03
I220712 18:39:04.769737 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1534 diskLoadWatcher: rb: 0 B, wb: 2.9 MiB, pb: 95 MiB, util: 0.03
I220712 18:39:19.770772 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1546 diskLoadWatcher: rb: 273 B, wb: 61 MiB, pb: 95 MiB, util: 0.64
I220712 18:39:34.769759 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1559 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.62
I220712 18:39:49.770282 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1578 diskLoadWatcher: rb: 273 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:40:04.770348 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1590 diskLoadWatcher: rb: 273 B, wb: 119 MiB, pb: 95 MiB, util: 1.25
I220712 18:40:19.770883 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1602 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:40:34.770312 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1613 diskLoadWatcher: rb: 0 B, wb: 20 MiB, pb: 95 MiB, util: 0.21
I220712 18:40:49.770157 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1624 diskLoadWatcher: rb: 273 B, wb: 70 MiB, pb: 95 MiB, util: 0.73

Release note: None
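Each log line reports the interval's read bandwidth (rb), write bandwidth (wb), provisioned bandwidth (pb), and their ratio (util). As a minimal sketch of how a diskLoadWatcher-style classifier might bucket these readings; the thresholds and enum values here are illustrative, not the ones in disk_bandwidth.go:

```go
// Sketch: classify disk load into an enum from the same quantities the
// log lines above report.
package main

import "fmt"

type diskLoadLevel int

const (
	diskLoadLow diskLoadLevel = iota
	diskLoadModerate
	diskLoadHigh
	diskLoadOverload
)

// classify buckets the interval's utilization of provisioned bandwidth.
func classify(readBytes, writeBytes, provisionedBytes int64) diskLoadLevel {
	util := float64(readBytes+writeBytes) / float64(provisionedBytes)
	switch {
	case util < 0.3:
		return diskLoadLow
	case util < 0.7:
		return diskLoadModerate
	case util < 0.95:
		// The sharp transition discussed above happens in this band:
		// compaction debt can push util from <= 0.7 straight past 0.95.
		return diskLoadHigh
	default:
		return diskLoadOverload
	}
}

func main() {
	// From the first log line: rb: 0 B, wb: 80 MiB, pb: 95 MiB => util 0.84.
	fmt.Println(classify(0, 80<<20, 95<<20)) // 2 (diskLoadHigh)
}
```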
We assume that: - There is a provisioned known limit on the sum of read and write bandwidth. This limit is allowed to change. - Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes. There are multiple challenges: - We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into what reads missed the OS page cache. That is, we don't know how much of the reads were due to incoming reads (that we don't shape) and how much due to compaction read bandwidth. - We don't shape incoming reads. - There can be a large time lag between the shaping of incoming writes, and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store. - Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency. Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design, and strong abstraction boundaries. - The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently. - The approach uses easy to understand small multiplicative increase and large multiplicative decrease, (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates). Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes: - Incoming writes that are deemed "elastic": This can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter. - Optional compactions: We assume that the LSM store is configured with a ceiling on number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble/issues/1329 for motivation. This control on compactions is not currently implemented and is future work (though the prototype in cockroachdb#82813 had code for it). The reader should start with disk_bandwidth.go, consisting of - diskLoadWatcher: which computes load levels. - diskBandwidthLimiter: It used the load level computed by diskLoadWatcher to limit write tokens for elastic writes and in the future will also limit compactions. There is significant refactoring and changes in granter.go and work_queue.go. This is driven by the fact that: - Previously the tokens were for L0 and now we need to support tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter). - Elastic work is in a different WorkQueue than regular work, but they are competing for the same tokens. A different WorkQueue is needed to prevent a situation where elastic work for one tenant is queued ahead of regualar work from another tenant, and stops the latter from making progress due to lack of elastic tokens. The latter is handled by allowing kvSlotGranter to multiplex across multiple requesters, via multiple child granters. A number of interfaces are adjusted to make this viable. 
In general, the GrantCoordinator is now slightly dumber and some of that logic is moved into the granters. For the former (handling two kinds of tokens), I considered adding multiple resource dimensions to the granter-requester interaction but found it too complicated. Instead we rely on the observation that we request tokens based on the total incoming bytes of the request (not just L0), and when the request is completed, tell the granter how many bytes went into L0. The latter allows us to return tokens to L0. So at the time the request is completed, we can account separately for the L0 tokens and these new tokens for all incoming bytes (which we are calling disk bandwidth tokens, since they are constrained based on disk bandwidth). This is a cleaned up version of the prototype in cockroachdb#82813 which contains the experimental results. The plumbing from the KV layer to populate the disk reads, writes and provisioned bandwidth is absent in this PR, and will be added in a subsequent PR. Disk bandwidth bottlenecks are considered only if both the following are true: - DiskStats.ProvisionedBandwidth is non-zero. - The cluster setting admission.disk_bandwidth_tokens.elastic.enabled is true (defaults to true). Informs cockroachdb#82898 Release note: None (the cluster setting mentioned earlier is useless since the integration with CockroachDB will be in a future PR).
85722: admission: add support for disk bandwidth as a bottleneck resource r=tbg,irfansharif a=sumeerbhola

We assume that:
- There is a provisioned known limit on the sum of read and write bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into what reads missed the OS page cache. That is, we don't know how much of the reads were due to incoming reads (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design, and strong abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently.
- The approach uses easy to understand small multiplicative increase and large multiplicative decrease, (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a ceiling on number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble#1329 for motivation. This control on compactions is not currently implemented and is future work (though the prototype in #82813 had code for it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: uses the load level computed by diskLoadWatcher to limit write tokens for elastic writes and, in the future, will also limit compactions.

There is significant refactoring and changes in granter.go and work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter).
- Elastic work is in a different WorkQueue than regular work, but they are competing for the same tokens. A different WorkQueue is needed to prevent a situation where elastic work for one tenant is queued ahead of regular work from another tenant, and stops the latter from making progress due to lack of elastic tokens.

The latter is handled by allowing kvSlotGranter to multiplex across multiple requesters, via multiple child granters. A number of interfaces are adjusted to make this viable. In general, the GrantCoordinator is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple resource dimensions to the granter-requester interaction but found it too complicated. Instead we rely on the observation that we request tokens based on the total incoming bytes of the request (not just L0), and when the request is completed, tell the granter how many bytes went into L0. The latter allows us to return tokens to L0. So at the time the request is completed, we can account separately for the L0 tokens and these new tokens for all incoming bytes (which we are calling disk bandwidth tokens, since they are constrained based on disk bandwidth). A sketch of this accounting follows below.

This is a cleaned up version of the prototype in #82813 which contains the experimental results. The plumbing from the KV layer to populate the disk reads, writes and provisioned bandwidth is absent in this PR, and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled is true (defaults to true).

Informs #82898

Release note: None (the cluster setting mentioned earlier has no effect yet, since the integration with CockroachDB will be in a future PR).
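To make the dual accounting concrete, here is a minimal Go sketch. All names are hypothetical, not the actual granter.go types; the real code is more involved (burst handling, negative balances, grant chains).

```go
package main

import "fmt"

// twoTokenBuckets sketches the dual accounting described above: one bucket
// for bytes into L0, one for all bytes into the LSM (disk bandwidth tokens).
type twoTokenBuckets struct {
	l0Tokens   int64 // tokens for bytes added to L0
	diskTokens int64 // tokens for all incoming bytes (disk bandwidth)
}

// admit deducts from both buckets using the request's total incoming bytes,
// since at admission time we do not yet know how many bytes will land in L0.
func (b *twoTokenBuckets) admit(totalBytes int64) {
	b.l0Tokens -= totalBytes
	b.diskTokens -= totalBytes
}

// workDone is told how many bytes actually went into L0, and returns the
// difference to the L0 bucket. The disk bandwidth tokens stay consumed,
// since all the bytes were written to the LSM.
func (b *twoTokenBuckets) workDone(totalBytes, bytesIntoL0 int64) {
	b.l0Tokens += totalBytes - bytesIntoL0
}

func main() {
	b := &twoTokenBuckets{l0Tokens: 1 << 20, diskTokens: 4 << 20}
	b.admit(1000)         // e.g. an ingest request of 1000 bytes
	b.workDone(1000, 100) // only 100 bytes landed in L0: 900 L0 tokens return
	fmt.Println(b.l0Tokens, b.diskTokens)
}
```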
85786: sql: support UDFs with named args, strictness, and volatility r=mgartner a=mgartner

#### sql: UDF with empty result should evaluate to NULL

If the last statement in a UDF returns no rows, the UDF will evaluate to NULL. Prior to this commit the evaluation of the UDF would panic.

Release note: None

#### sql: support UDFs with named arguments

UDFs with named arguments can now be evaluated. During query planning, statements in the function body are built with a scope that includes the named arguments for the function as columns. This allows references to arguments to be resolved as variables. During evaluation, the input expressions are first evaluated into datums. When a plan is built for each statement in the UDF, the argument columns in the expression are replaced with the input datums before the expression is optimized. Note that anonymous arguments and integer references to arguments (e.g., `$1`) are not yet supported. Also, the formatting of `UDFExpr`s has been improved to show argument columns and input expressions.

Release note: None

#### sql: do not evaluate strict UDFs if any input values are NULL

A UDF can have one of two behaviors when it is invoked with NULL inputs:
1. If the UDF is `CALLED ON NULL INPUT` (the default) then the function is evaluated regardless of whether or not any of the input values are NULL.
2. If the UDF `RETURNS NULL ON NULL INPUT` or is `STRICT` then the function is not evaluated if any of the input values are NULL. Instead, the function directly results in NULL.

This commit implements these two behaviors. In the future, we can add a normalization rule that folds a strict UDF if any of its inputs are constant NULL values.

Release note: None

#### sql: make mutations visible to volatile UDFs

The volatility of a UDF affects the visibility of mutations made by the statement calling the function. A volatile function will see these mutations.
Also, statements within a volatile function's body will see changes made by previous statements in the function body (note that this is left untested in this commit because we do not currently support mutations within UDF bodies). In contrast, a stable, immutable, or leakproof function will see a snapshot of the data as of the start of the statement calling the function.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Integration test for disk bandwidth tokens, copying over what we ran in cockroachdb#82813. Part of cockroachdb#86857

Release note: None
The first commit is from #82440.
We assume that:
- There is a provisioned known limit on the sum of read and write bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes.
There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into what reads missed the OS page cache. That is, we don't know how much of the reads were due to incoming reads (that we don't shape) and how much due to compaction read bandwidth.
- There can be a large time lag between the shaping of incoming writes, and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit wildly different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency.
Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design, and strong abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently.
- The approach uses easy to understand small multiplicative increase and large multiplicative decrease, (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates). A sketch of this adjustment follows this list.
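A minimal Go sketch of that increase/decrease scheme, under assumed constants (the enum values, the +10% step, the halving, and the floor are illustrative, not the tuned behavior of the actual diskBandwidthLimiter):

```go
package main

import "fmt"

// diskLoadLevel is a stand-in for the enum produced by diskLoadWatcher.
type diskLoadLevel int

const (
	loadLow diskLoadLevel = iota
	loadModerate
	loadHigh
	loadOverload
)

// adjustElasticTokens applies a small multiplicative increase while the
// disk is underloaded, holds steady at moderate load, and applies a large
// multiplicative decrease on high load or overload, with a floor so that
// elastic work is never fully starved.
func adjustElasticTokens(cur int64, level diskLoadLevel) int64 {
	const minTokens = 1 << 20 // floor: 1MiB of tokens per interval
	switch level {
	case loadLow:
		cur += cur / 10 // small multiplicative increase: +10%
	case loadModerate:
		// Hold: utilization is in the target range.
	case loadHigh, loadOverload:
		cur /= 2 // large multiplicative decrease: shed load quickly
	}
	if cur < minTokens {
		cur = minTokens
	}
	return cur
}

func main() {
	tokens := int64(64 << 20)
	for _, l := range []diskLoadLevel{loadLow, loadLow, loadOverload} {
		tokens = adjustElasticTokens(tokens, l)
		fmt.Println(tokens)
	}
}
```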
Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter (see the sketch after this list).
- Optional compactions: We assume that the LSM store is configured with a ceiling on number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble#1329 for motivation.
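A minimal sketch of implying the work-class from the priority. The type names mirror the admissionpb idea but are local, hypothetical definitions, not the real constants:

```go
package main

import "fmt"

// WorkPriority mirrors the idea of admissionpb.WorkPriority; the values
// here are illustrative.
type WorkPriority int8

const NormalPri WorkPriority = 0

// workClass separates regular from elastic work.
type workClass int

const (
	regularWorkClass workClass = iota
	elasticWorkClass
)

// workClassFromPri implies the work-class from the priority, as the
// prototype does: priorities below NormalPri are deemed elastic.
func workClassFromPri(pri WorkPriority) workClass {
	if pri < NormalPri {
		return elasticWorkClass
	}
	return regularWorkClass
}

func main() {
	fmt.Println(workClassFromPri(-10)) // 1: elasticWorkClass
	fmt.Println(workClassFromPri(5))   // 0: regularWorkClass
}
```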
The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels, for both incoming writes and optional compactions (see the sketch after this list).
- diskBandwidthLimiter: uses load information to limit write tokens for elastic writes and limit compactions.
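For illustration, a sketch of how a diskLoadWatcher-style classification might map interval disk throughput to a load level. The thresholds are assumptions (0.7 echoes the 70% target mentioned in the experiments below), not the real implementation, which also smooths measurements and considers trends:

```go
package main

import "fmt"

// Illustrative load levels; the real enum lives in disk_bandwidth.go.
type diskLoadLevel int

const (
	diskLoadLow diskLoadLevel = iota
	diskLoadModerate
	diskLoadHigh
	diskLoadOverload
)

// loadLevel classifies an interval by comparing observed read+write bytes
// against the bandwidth provisioned for that interval.
func loadLevel(readBytes, writeBytes, provisionedBytes int64) diskLoadLevel {
	util := float64(readBytes+writeBytes) / float64(provisionedBytes)
	switch {
	case util < 0.3:
		return diskLoadLow
	case util < 0.7:
		return diskLoadModerate
	case util < 0.95:
		return diskLoadHigh
	default:
		return diskLoadOverload
	}
}

func main() {
	// 60MiB of read+write against 95MiB provisioned: ~0.63 utilization.
	fmt.Println(loadLevel(20<<20, 40<<20, 95<<20)) // 1: diskLoadModerate
}
```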
There is significant refactoring and changes in granter.go and work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter).
- Elastic work is in a different WorkQueue than regular work, but they are competing for the same tokens.
The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.
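The multiplexing can be pictured with a small sketch: a shared token pool with one child granter per WorkQueue, where only the elastic child is additionally gated on disk bandwidth tokens. Names are hypothetical, not the actual granter.go types:

```go
package main

import "fmt"

// sharedTokens is the pool the parent granter owns; both children draw
// from it.
type sharedTokens struct {
	l0Tokens   int64
	diskTokens int64
}

// childGranter fronts one requester (a regular or elastic WorkQueue).
type childGranter struct {
	pool    *sharedTokens
	elastic bool
}

// tryGet grants n tokens if available for this work class: all work is
// subject to L0 tokens, but only elastic work is also gated on disk
// bandwidth tokens.
func (g *childGranter) tryGet(n int64) bool {
	if g.pool.l0Tokens < n {
		return false
	}
	if g.elastic && g.pool.diskTokens < n {
		return false
	}
	g.pool.l0Tokens -= n
	g.pool.diskTokens -= n // disk writes happen for regular work too
	return true
}

func main() {
	pool := &sharedTokens{l0Tokens: 10, diskTokens: 3}
	regular := &childGranter{pool: pool}
	elastic := &childGranter{pool: pool, elastic: true}
	fmt.Println(elastic.tryGet(5)) // false: not enough disk tokens
	fmt.Println(regular.tryGet(5)) // true: regular ignores disk tokens
}
```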
For the former (two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we can request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, can tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. There was
also the (unrelated) realization that we can use the size of the batch in the call to AdmittedWorkDone to fix the estimates that we previously had to make pre-evaluation. This resulted in a bunch of changes to how we do estimation to adjust the tokens consumed: we now estimate how much we need to compensate what is being asked for at (a) admission time, (b) work done time, for the bytes added to the LSM, and (c) work done time, for the bytes added to L0. Since we are asking for tokens at admission time based on the full incoming bytes, the estimation of what fraction of an ingest goes into L0 is eliminated. This had the consequence of simplifying some of the estimation logic that was distinguishing writes from ingests. A sketch of these corrections follows.
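A sketch of the three correction points, with hypothetical names (the real estimation code tracks smoothed per-work statistics): at admission we inflate the claimed bytes by a correction multiplier, and at work-done time we settle against the actual LSM and L0 bytes.

```go
package main

import "fmt"

// storeWorkEstimates is a hypothetical stand-in for the smoothed
// estimation state.
type storeWorkEstimates struct {
	admissionMultiplier float64 // (a) compensates the bytes claimed at admission
}

type workDoneInfo struct {
	actualBytes       int64 // bytes added to the LSM
	actualBytesIntoL0 int64 // bytes added to L0
}

// admissionTokens: tokens deducted up front, from the claimed batch size.
func (e storeWorkEstimates) admissionTokens(claimedBytes int64) int64 {
	return int64(float64(claimedBytes) * e.admissionMultiplier)
}

// settle returns the extra (possibly negative) tokens to deduct at
// work-done time: (b) for the LSM/disk bucket and (c) for the L0 bucket.
func settle(admitted int64, done workDoneInfo) (extraLSM, extraL0 int64) {
	return done.actualBytes - admitted, done.actualBytesIntoL0 - admitted
}

func main() {
	e := storeWorkEstimates{admissionMultiplier: 1.5}
	admitted := e.admissionTokens(1000) // 1500 tokens deducted up front
	extraLSM, extraL0 := settle(admitted, workDoneInfo{
		actualBytes: 2000, actualBytesIntoL0: 400,
	})
	// extraLSM = +500 more disk tokens to deduct; extraL0 = -1100, i.e.
	// 1100 L0 tokens are returned.
	fmt.Println(admitted, extraLSM, extraL0)
}
```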
There are no tests yet (and this change breaks existing tests), so this code is probably littered with bugs.
Next steps:
- Plumb the information needed from the rest of the system: StoreWriteWorkInfo.{WriteBytes,IngestRequest} for ingestions and StoreWorkDoneInfo.{ActualBytes,ActualBytesIntoL0} for writes and ingestions.
Some experimental results, with an artificially set provisioned bandwidth limit of 95MiB/s and a kv0 workload with 4KB writes that are all considered elastic traffic. There were 4 runs: the first one has no provisioned bandwidth limit and the subsequent ones are iterations on the heuristics. The last one is the latest code: it is tuned to not increase load once we have reached 70% of provisioned bandwidth.
The challenge in doing better is the sharp transition from < 0.7 fraction of bandwidth utilization to > 0.95, due to the lag in compactions.
Release note: None