[WIP] admission: add support for disk bandwidth as a bottleneck resource #82813
Conversation
In addition to byte tokens for writes computed based on the compaction rate out of L0, we now compute byte tokens based on how fast the system can flush memtables into L0. The motivation is that writing to the memtable, or creating memtables, faster than the system can flush results in write stalls due to memtables, which create a latency hiccup for all write traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate at which the flush loop can operate. This relies on the recently added pebble.InternalIntervalMetrics.
- The peak rate causes 100% utilization of the single flush thread, and that is potentially too high to prevent write stalls (depending on how long it takes to do a single flush). So we multiply the smoothed peak rate by a utilization-target-fraction, which is dynamically adjusted and by default is constrained to the interval [0.5, 1.5]. There is additive increase and decrease of this fraction:
  - High usage of tokens and no write stalls cause an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used if there are multiple write stalls, so that the probing settles in the region where there are no write stalls.

Note that this probing scheme cannot eliminate all write stalls. For now we are ok with a reduction in write stalls.

For convenience, and some additional justification mentioned in a code comment, the scheme uses the minimum of the flush and compaction tokens for writes to L0. This means that sstable ingestion into L0 is also subject to such tokens. The periodic token computation continues to be done at 15s intervals. However, instead of giving out these tokens at 1s intervals, we now give them out at 250ms intervals. This is to reduce burstiness, since burstiness can cause write stalls.

There is a new metric, storage.write-stall-nanos, that measures the cumulative duration of write stalls; it gives a more intuitive feel for how the system is behaving than a write stall count. The scheme can be disabled by increasing the cluster setting admission.min_flush_util_percent, which defaults to 50% (corresponding to the 0.5 lower bound mentioned earlier), to a high value, say 1000%.

The scheme was evaluated using a single-node cluster, with the node having a high CPU count such that CPU was not a bottleneck, even with max compaction concurrency set to 8. A kv0 workload with high concurrency and 4KB writes was used to overload the store. Due to the high compaction concurrency, L0 stayed below the unhealthy thresholds, and the resource bottleneck became the total bandwidth provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data and the compaction backlog was not very heavy, so there was less queueing for the limited disk bandwidth (it was still usually saturated).
- later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of write stalls by > 5x. For the early-life case, ~750ms per second was spent in a write stall with flush tokens off. The later-life case had ~200ms per second of write stalls with flush tokens off. The lower value of the latter is paradoxically due to the worse bandwidth saturation: fsync latency rose from 2-4ms with flush tokens on, to 11-20ms with flush tokens off. This increase imposed a natural backpressure on writes due to the need to sync the WAL. In contrast, the fsync latency was low in the early-life case, though it did increase from 0.125ms to 0.25ms when flush tokens were turned off. In both cases, the admission throughput did not increase when turning off flush tokens. That is, the system cannot sustain more throughput; but by turning on flush tokens, we shift queueing from the disk layer to the admission control layer (where we have the capability to reorder work).

Fixes cockroachdb#77357

Release note (ops change): The cluster setting admission.min_flush_util_percent can be used to disable or tune flush-throughput-based admission tokens for writes to a store. Tokens based on flush throughput attempt to reduce storage-layer write stalls.
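To make the flush-token mechanics concrete, here is a minimal Go sketch of the utilization-target-fraction adjustment described above. All names and constants (flushTokenEstimator, additiveStep, the smoothing factor) are illustrative stand-ins, not the actual pkg/util/admission identifiers or tuning.

```go
// Sketch of flush-token computation: smooth the peak flush rate, adjust a
// utilization target fraction with additive increase/decrease, and emit
// byte tokens for the next interval.
package main

import "fmt"

type flushTokenEstimator struct {
	smoothedPeakFlushRate float64 // bytes/s the flush loop can do at 100% utilization
	utilTargetFraction    float64 // dynamically adjusted within [0.5, 1.5]
	numConsecutiveStalls  int
}

const (
	minUtilFraction = 0.5 // corresponds to admission.min_flush_util_percent = 50%
	maxUtilFraction = 1.5
	additiveStep    = 0.025
)

// update runs at each 15s token-computation interval and returns the byte
// tokens for the next interval (which would then be handed out in 250ms
// slices to reduce burstiness).
func (e *flushTokenEstimator) update(peakFlushRate float64, hadWriteStall, highTokenUsage bool) float64 {
	// Exponentially smooth the measured peak flush rate.
	const alpha = 0.5
	e.smoothedPeakFlushRate = alpha*peakFlushRate + (1-alpha)*e.smoothedPeakFlushRate

	switch {
	case hadWriteStall:
		// Additive decrease, scaled up for repeated stalls so that probing
		// settles in the stall-free region.
		e.numConsecutiveStalls++
		step := additiveStep * float64(e.numConsecutiveStalls)
		if e.utilTargetFraction-step > minUtilFraction {
			e.utilTargetFraction -= step
		} else {
			e.utilTargetFraction = minUtilFraction
		}
	case highTokenUsage:
		// Additive increase while tokens are heavily used and no stalls occur.
		e.numConsecutiveStalls = 0
		if e.utilTargetFraction+additiveStep < maxUtilFraction {
			e.utilTargetFraction += additiveStep
		} else {
			e.utilTargetFraction = maxUtilFraction
		}
	default:
		e.numConsecutiveStalls = 0
	}
	return 15 * e.smoothedPeakFlushRate * e.utilTargetFraction
}

func main() {
	e := &flushTokenEstimator{smoothedPeakFlushRate: 100 << 20, utilTargetFraction: 1.0}
	fmt.Printf("flush tokens for next 15s: %.0f bytes\n", e.update(120<<20, false, true))
}
```

The actual scheme then takes the minimum of these flush tokens and the compaction-based tokens for writes to L0.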
The first commit is from #82440.

We assume that:
- There is a known, provisioned limit on the sum of read and write bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into which reads missed the OS page cache. That is, we don't know how much of the read bandwidth was due to incoming reads (which we don't shape) and how much was due to compaction reads.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit wildly different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design and strong abstraction boundaries:
- The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently.
- The approach uses easy-to-understand additive increase and multiplicative decrease (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": this can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter.
- Optional compactions: we assume that the LSM store is configured with a ceiling on the number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble/issues/1329 for motivation.

The reader should start with disk_bandwidth.go, consisting of:
- diskLoadWatcher: computes load levels.
- compactionLimiter: tracks all compaction slots and limits optional compactions.
- diskBandwidthLimiter: composes the previous two objects and uses load information to limit write tokens for elastic writes and to limit compactions.

There is significant refactoring and change in granter.go and work_queue.go, driven by the fact that:
- Previously the tokens were only for L0; now we need to support both tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter).
- Elastic work is in a different WorkQueue than regular work, but they compete for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across multiple requesters, via multiple child granters. A number of interfaces are adjusted to make this viable. In general, the GrantCoordinator is now slightly dumber and some of that logic has moved into the granters.

For the former (two kinds of tokens), I considered adding multiple resource dimensions to the granter-requester interaction but found it too complicated. Instead we rely on the observation that we can request tokens based on the total incoming bytes of the request (not just L0), and when the request is completed, can tell the granter how many bytes went into L0. The latter allows us to return tokens to L0.

There was also the (unrelated) realization that we can use the information about the size of the batch in the call to AdmittedWorkDone, and fix estimation that we previously had to make pre-evaluation. This resulted in a bunch of changes to how we do estimation to adjust the tokens consumed: we now estimate how much we need to compensate what is being asked for at (a) admission time, (b) work-done time, for the bytes added to the LSM, and (c) work-done time, for the bytes added to L0. Since we are asking for tokens at admission time based on the full incoming bytes, the estimation of what fraction of an ingest goes into L0 is eliminated. This had the consequence of simplifying some of the estimation logic that was distinguishing writes from ingests.

There are no tests, so this code is probably littered with bugs. Next steps:
- Unit tests
- Pebble changes for IntervalCompactionInfo
- CockroachDB changes for IntervalDiskLoadInfo
- Experimental evaluation and tuning
- Separate into multiple PRs for review
- KV and storage package plumbing for properly populating StoreWriteWorkInfo.{WriteBytes,IngestRequest} for ingestions and StoreWorkDoneInfo.{ActualBytes,ActualBytesIntoL0} for writes and ingestions.

Release note: None
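The two-token accounting can be sketched as follows. storeGranter, tryAdmit, and workDone are hypothetical stand-ins for the granter interfaces touched by this change, shown only to illustrate asking for tokens on total incoming bytes at admission and correcting the L0 tokens on completion.

```go
// Sketch: deduct both L0 and LSM-wide tokens up front, then return the
// portion of L0 tokens that the work turned out not to need.
package main

import "fmt"

type storeGranter struct {
	l0Tokens   int64 // tokens for bytes flushed/ingested into L0
	lsmTokens  int64 // tokens for all bytes written into the LSM
}

// tryAdmit deducts both token kinds based on the declared request size.
// At admission time we pessimistically assume all bytes may reach L0.
func (g *storeGranter) tryAdmit(requestBytes int64) bool {
	if g.l0Tokens <= 0 || g.lsmTokens <= 0 {
		return false
	}
	g.l0Tokens -= requestBytes
	g.lsmTokens -= requestBytes
	return true
}

// workDone fixes up the estimate: bytes that bypassed L0 are returned to
// the L0 bucket (this mirrors AdmittedWorkDone-style reporting).
func (g *storeGranter) workDone(requestBytes, actualBytesIntoL0 int64) {
	g.l0Tokens += requestBytes - actualBytesIntoL0
}

func main() {
	g := &storeGranter{l0Tokens: 10 << 20, lsmTokens: 100 << 20}
	if g.tryAdmit(4 << 20) {
		// Suppose only 1 MiB of the 4 MiB ingest landed in L0.
		g.workDone(4<<20, 1<<20)
	}
	fmt.Println(g.l0Tokens, g.lsmTokens) // 9 MiB and 96 MiB remain
}
```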
Force-pushed from e51e900 to f8a4c4d.
Now running a mix of regular and elastic traffic.
Then added elastic traffic with high concurrency (1024); this is more than enough to blow past the provisioned limit if there were no disk bandwidth control. The throughput of regular traffic stays stable.
Logs before adding elastic traffic
Then after adding elastic traffic, we first start increasing the elastic tokens:
We then stabilize:
The challenge is the sharp transition from a utilization of 0.7 or less to > 0.95. This is all because of compactions: there is a lag from writes to their full implication in terms of write amplification. Also, when we start cutting tokens there is a sharp fall from > 0.95; that is partly because of our multiplicative decrease, but we have tried to dampen the multiplicative decrease and to start growing quickly again, since otherwise we would fall too much.

I220712 18:12:04.770141 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 289 diskLoadWatcher: rb: 0 B, wb: 80 MiB, pb: 95 MiB, util: 0.84
I220712 18:12:19.770363 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 300 diskLoadWatcher: rb: 273 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:12:34.770694 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 312 diskLoadWatcher: rb: 0 B, wb: 115 MiB, pb: 95 MiB, util: 1.21
I220712 18:12:49.770632 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 324 diskLoadWatcher: rb: 0 B, wb: 102 MiB, pb: 95 MiB, util: 1.07
I220712 18:13:04.769926 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 335 diskLoadWatcher: rb: 0 B, wb: 80 MiB, pb: 95 MiB, util: 0.84
I220712 18:13:19.770618 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 347 diskLoadWatcher: rb: 0 B, wb: 33 MiB, pb: 95 MiB, util: 0.35
I220712 18:13:34.770323 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 358 diskLoadWatcher: rb: 0 B, wb: 11 MiB, pb: 95 MiB, util: 0.11
I220712 18:13:49.770645 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 370 diskLoadWatcher: rb: 0 B, wb: 2.6 MiB, pb: 95 MiB, util: 0.03
I220712 18:14:04.769960 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 383 diskLoadWatcher: rb: 0 B, wb: 266 MiB, pb: 95 MiB, util: 2.79
I220712 18:14:19.770059 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 394 diskLoadWatcher: rb: 819 B, wb: 250 MiB, pb: 95 MiB, util: 2.63
I220712 18:14:34.769914 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 406 diskLoadWatcher: rb: 546 B, wb: 243 MiB, pb: 95 MiB, util: 2.54
I220712 18:14:49.770237 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 419 diskLoadWatcher: rb: 0 B, wb: 76 MiB, pb: 95 MiB, util: 0.80
I220712 18:15:04.770697 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 431 diskLoadWatcher: rb: 0 B, wb: 2.1 MiB, pb: 95 MiB, util: 0.02
I220712 18:15:19.770365 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 442 diskLoadWatcher: rb: 273 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:15:34.770506 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 453 diskLoadWatcher: rb: 0 B, wb: 39 MiB, pb: 95 MiB, util: 0.41
I220712 18:15:49.771073 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 465 diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:16:04.770788 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 476 diskLoadWatcher: rb: 0 B, wb: 105 MiB, pb: 95 MiB, util: 1.10
I220712 18:16:19.769824 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 487 diskLoadWatcher: rb: 0 B, wb: 42 MiB, pb: 95 MiB, util: 0.44
I220712 18:16:34.770666 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 498 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:16:49.770379 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 510 diskLoadWatcher: rb: 0 B, wb: 70 MiB, pb: 95 MiB, util: 0.73
I220712 18:17:04.770687 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 521 diskLoadWatcher: rb: 0 B, wb: 77 MiB, pb: 95 MiB, util: 0.80
I220712 18:17:19.770664 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 532 diskLoadWatcher: rb: 0 B, wb: 118 MiB, pb: 95 MiB, util: 1.24
I220712 18:17:34.770083 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 543 diskLoadWatcher: rb: 0 B, wb: 3.0 MiB, pb: 95 MiB, util: 0.03
I220712 18:17:49.770806 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 555 diskLoadWatcher: rb: 0 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:18:04.770748 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 566 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:18:19.770290 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 578 diskLoadWatcher: rb: 0 B, wb: 67 MiB, pb: 95 MiB, util: 0.70
I220712 18:18:34.770280 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 589 diskLoadWatcher: rb: 0 B, wb: 104 MiB, pb: 95 MiB, util: 1.10
I220712 18:18:49.769979 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 600 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:19:04.770342 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 612 diskLoadWatcher: rb: 0 B, wb: 17 MiB, pb: 95 MiB, util: 0.18
I220712 18:19:19.771061 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 623 diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:19:34.770318 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 636 diskLoadWatcher: rb: 0 B, wb: 96 MiB, pb: 95 MiB, util: 1.01
I220712 18:19:49.769739 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 650 diskLoadWatcher: rb: 0 B, wb: 13 MiB, pb: 95 MiB, util: 0.14
I220712 18:20:04.769936 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 663 diskLoadWatcher: rb: 0 B, wb: 42 MiB, pb: 95 MiB, util: 0.44
I220712 18:20:19.770775 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 674 diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.54
I220712 18:20:34.775699 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 685 diskLoadWatcher: rb: 273 B, wb: 54 MiB, pb: 95 MiB, util: 0.57
I220712 18:20:49.770837 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 696 diskLoadWatcher: rb: 273 B, wb: 103 MiB, pb: 95 MiB, util: 1.08
I220712 18:21:04.770360 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 708 diskLoadWatcher: rb: 0 B, wb: 9.4 MiB, pb: 95 MiB, util: 0.10
I220712 18:21:19.771030 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 719 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:21:34.769898 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 730 diskLoadWatcher: rb: 273 B, wb: 59 MiB, pb: 95 MiB, util: 0.62
I220712 18:21:49.770729 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 742 diskLoadWatcher: rb: 0 B, wb: 40 MiB, pb: 95 MiB, util: 0.42
I220712 18:22:04.769814 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 753 diskLoadWatcher: rb: 273 B, wb: 62 MiB, pb: 95 MiB, util: 0.65
I220712 18:22:19.770621 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 764 diskLoadWatcher: rb: 0 B, wb: 71 MiB, pb: 95 MiB, util: 0.75
I220712 18:22:34.769902 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 775 diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:22:49.769792 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 787 diskLoadWatcher: rb: 0 B, wb: 84 MiB, pb: 95 MiB, util: 0.88
I220712 18:23:04.770131 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 798 diskLoadWatcher: rb: 273 B, wb: 74 MiB, pb: 95 MiB, util: 0.78
I220712 18:23:19.770370 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 809 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:23:34.770599 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 820 diskLoadWatcher: rb: 273 B, wb: 121 MiB, pb: 95 MiB, util: 1.27
I220712 18:23:49.771022 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 831 diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.51
I220712 18:24:04.770034 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 843 diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.49
I220712 18:24:19.770685 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 854 diskLoadWatcher: rb: 273 B, wb: 90 MiB, pb: 95 MiB, util: 0.95
I220712 18:24:34.770236 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 867 diskLoadWatcher: rb: 273 B, wb: 96 MiB, pb: 95 MiB, util: 1.00
I220712 18:24:49.770619 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 881 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:25:04.769913 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 892 diskLoadWatcher: rb: 273 B, wb: 51 MiB, pb: 95 MiB, util: 0.53
I220712 18:25:19.770673 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 903 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.55
I220712 18:25:34.770651 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 914 diskLoadWatcher: rb: 273 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:25:49.770871 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 926 diskLoadWatcher: rb: 273 B, wb: 122 MiB, pb: 95 MiB, util: 1.28
I220712 18:26:04.770125 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 938 diskLoadWatcher: rb: 0 B, wb: 69 MiB, pb: 95 MiB, util: 0.72
I220712 18:26:19.770632 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 949 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:26:34.770592 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 960 diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.58
I220712 18:26:49.770778 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 972 diskLoadWatcher: rb: 0 B, wb: 62 MiB, pb: 95 MiB, util: 0.65
I220712 18:27:04.770242 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 983 diskLoadWatcher: rb: 0 B, wb: 117 MiB, pb: 95 MiB, util: 1.23
I220712 18:27:19.770137 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 994 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:27:34.770353 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1005 diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.49
I220712 18:27:49.770114 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1017 diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:28:04.769837 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1029 diskLoadWatcher: rb: 0 B, wb: 116 MiB, pb: 95 MiB, util: 1.22
I220712 18:28:19.770701 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1040 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.55
I220712 18:28:34.770075 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1051 diskLoadWatcher: rb: 0 B, wb: 47 MiB, pb: 95 MiB, util: 0.50
I220712 18:28:49.769819 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1062 diskLoadWatcher: rb: 0 B, wb: 131 MiB, pb: 95 MiB, util: 1.37
I220712 18:29:04.770145 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1074 diskLoadWatcher: rb: 0 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:29:19.770112 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1085 diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.52
I220712 18:29:34.770037 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1098 diskLoadWatcher: rb: 0 B, wb: 69 MiB, pb: 95 MiB, util: 0.72
I220712 18:29:49.770141 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1112 diskLoadWatcher: rb: 0 B, wb: 116 MiB, pb: 95 MiB, util: 1.22
I220712 18:30:04.770340 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1125 diskLoadWatcher: rb: 0 B, wb: 58 MiB, pb: 95 MiB, util: 0.60
I220712 18:30:19.770347 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1136 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.63
I220712 18:30:34.770577 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1147 diskLoadWatcher: rb: 0 B, wb: 128 MiB, pb: 95 MiB, util: 1.34
I220712 18:30:49.770405 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1158 diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.54
I220712 18:31:04.770181 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1170 diskLoadWatcher: rb: 0 B, wb: 61 MiB, pb: 95 MiB, util: 0.64
I220712 18:31:19.770070 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1181 diskLoadWatcher: rb: 0 B, wb: 59 MiB, pb: 95 MiB, util: 0.61
I220712 18:31:34.770327 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1192 diskLoadWatcher: rb: 0 B, wb: 121 MiB, pb: 95 MiB, util: 1.27
I220712 18:31:49.771027 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1203 diskLoadWatcher: rb: 0 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:32:04.770572 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1214 diskLoadWatcher: rb: 0 B, wb: 91 MiB, pb: 95 MiB, util: 0.96
I220712 18:32:19.770161 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1226 diskLoadWatcher: rb: 0 B, wb: 31 MiB, pb: 95 MiB, util: 0.32
I220712 18:32:34.770428 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1237 diskLoadWatcher: rb: 0 B, wb: 57 MiB, pb: 95 MiB, util: 0.60
I220712 18:32:49.770396 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1248 diskLoadWatcher: rb: 0 B, wb: 56 MiB, pb: 95 MiB, util: 0.58
I220712 18:33:04.770595 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1260 diskLoadWatcher: rb: 0 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:33:19.770179 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1271 diskLoadWatcher: rb: 0 B, wb: 49 MiB, pb: 95 MiB, util: 0.51
I220712 18:33:34.770001 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1282 diskLoadWatcher: rb: 0 B, wb: 77 MiB, pb: 95 MiB, util: 0.81
I220712 18:33:49.770413 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1293 diskLoadWatcher: rb: 0 B, wb: 94 MiB, pb: 95 MiB, util: 0.98
I220712 18:34:04.770672 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1304 diskLoadWatcher: rb: 0 B, wb: 2.6 MiB, pb: 95 MiB, util: 0.03
I220712 18:34:19.770153 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1316 diskLoadWatcher: rb: 0 B, wb: 51 MiB, pb: 95 MiB, util: 0.53
I220712 18:34:34.770660 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1329 diskLoadWatcher: rb: 0 B, wb: 58 MiB, pb: 95 MiB, util: 0.61
I220712 18:34:49.770319 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1342 diskLoadWatcher: rb: 0 B, wb: 50 MiB, pb: 95 MiB, util: 0.53
I220712 18:35:04.770335 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1354 diskLoadWatcher: rb: 273 B, wb: 52 MiB, pb: 95 MiB, util: 0.55
I220712 18:35:19.771075 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1365 diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.57
I220712 18:35:34.769749 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1376 diskLoadWatcher: rb: 273 B, wb: 71 MiB, pb: 95 MiB, util: 0.74
I220712 18:35:49.769878 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1387 diskLoadWatcher: rb: 273 B, wb: 95 MiB, pb: 95 MiB, util: 1.00
I220712 18:36:04.770526 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1399 diskLoadWatcher: rb: 0 B, wb: 18 MiB, pb: 95 MiB, util: 0.19
I220712 18:36:19.769831 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1410 diskLoadWatcher: rb: 0 B, wb: 45 MiB, pb: 95 MiB, util: 0.47
I220712 18:36:34.769896 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1421 diskLoadWatcher: rb: 273 B, wb: 50 MiB, pb: 95 MiB, util: 0.52
I220712 18:36:49.770154 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1432 diskLoadWatcher: rb: 0 B, wb: 55 MiB, pb: 95 MiB, util: 0.58
I220712 18:37:04.770233 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1443 diskLoadWatcher: rb: 273 B, wb: 63 MiB, pb: 95 MiB, util: 0.66
I220712 18:37:19.770668 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1455 diskLoadWatcher: rb: 0 B, wb: 57 MiB, pb: 95 MiB, util: 0.60
I220712 18:37:34.770360 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1466 diskLoadWatcher: rb: 273 B, wb: 114 MiB, pb: 95 MiB, util: 1.20
I220712 18:37:49.770815 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1478 diskLoadWatcher: rb: 273 B, wb: 40 MiB, pb: 95 MiB, util: 0.42
I220712 18:38:04.769966 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1490 diskLoadWatcher: rb: 0 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:38:19.770167 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1501 diskLoadWatcher: rb: 273 B, wb: 89 MiB, pb: 95 MiB, util: 0.94
I220712 18:38:34.770339 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1512 diskLoadWatcher: rb: 0 B, wb: 82 MiB, pb: 95 MiB, util: 0.86
I220712 18:38:49.770747 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1523 diskLoadWatcher: rb: 273 B, wb: 98 MiB, pb: 95 MiB, util: 1.03
I220712 18:39:04.769737 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1534 diskLoadWatcher: rb: 0 B, wb: 2.9 MiB, pb: 95 MiB, util: 0.03
I220712 18:39:19.770772 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1546 diskLoadWatcher: rb: 273 B, wb: 61 MiB, pb: 95 MiB, util: 0.64
I220712 18:39:34.769759 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1559 diskLoadWatcher: rb: 0 B, wb: 60 MiB, pb: 95 MiB, util: 0.62
I220712 18:39:49.770282 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1578 diskLoadWatcher: rb: 273 B, wb: 66 MiB, pb: 95 MiB, util: 0.69
I220712 18:40:04.770348 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1590 diskLoadWatcher: rb: 273 B, wb: 119 MiB, pb: 95 MiB, util: 1.25
I220712 18:40:19.770883 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1602 diskLoadWatcher: rb: 0 B, wb: 53 MiB, pb: 95 MiB, util: 0.56
I220712 18:40:34.770312 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1613 diskLoadWatcher: rb: 0 B, wb: 20 MiB, pb: 95 MiB, util: 0.21
I220712 18:40:49.770157 541 util/admission/disk_bandwidth.go:110 ⋮ [-] 1624 diskLoadWatcher: rb: 273 B, wb: 70 MiB, pb: 95 MiB, util: 0.73

Release note: None
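Each log line reports the interval's read bandwidth (rb), write bandwidth (wb), provisioned bandwidth (pb), and their ratio (util). As a minimal sketch of how a diskLoadWatcher-style classifier might bucket these readings; the thresholds and enum values here are illustrative, not the ones in disk_bandwidth.go:

```go
// Sketch: classify disk load into an enum from the same quantities the
// log lines above report.
package main

import "fmt"

type diskLoadLevel int

const (
	diskLoadLow diskLoadLevel = iota
	diskLoadModerate
	diskLoadHigh
	diskLoadOverload
)

// classify buckets the interval's utilization of provisioned bandwidth.
func classify(readBytes, writeBytes, provisionedBytes int64) diskLoadLevel {
	util := float64(readBytes+writeBytes) / float64(provisionedBytes)
	switch {
	case util < 0.3:
		return diskLoadLow
	case util < 0.7:
		return diskLoadModerate
	case util < 0.95:
		// The sharp transition discussed above happens in this band:
		// compaction debt can push util from <= 0.7 straight past 0.95.
		return diskLoadHigh
	default:
		return diskLoadOverload
	}
}

func main() {
	// From the first log line: rb: 0 B, wb: 80 MiB, pb: 95 MiB => util 0.84.
	fmt.Println(classify(0, 80<<20, 95<<20)) // 2 (diskLoadHigh)
}
```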
We assume that: - There is a provisioned known limit on the sum of read and write bandwidth. This limit is allowed to change. - Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes. There are multiple challenges: - We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into what reads missed the OS page cache. That is, we don't know how much of the reads were due to incoming reads (that we don't shape) and how much due to compaction read bandwidth. - We don't shape incoming reads. - There can be a large time lag between the shaping of incoming writes, and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store. - Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency. Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design, and strong abstraction boundaries. - The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently. - The approach uses easy to understand small multiplicative increase and large multiplicative decrease, (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates). Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes: - Incoming writes that are deemed "elastic": This can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter. - Optional compactions: We assume that the LSM store is configured with a ceiling on number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble/issues/1329 for motivation. This control on compactions is not currently implemented and is future work (though the prototype in cockroachdb#82813 had code for it). The reader should start with disk_bandwidth.go, consisting of - diskLoadWatcher: which computes load levels. - diskBandwidthLimiter: It used the load level computed by diskLoadWatcher to limit write tokens for elastic writes and in the future will also limit compactions. There is significant refactoring and changes in granter.go and work_queue.go. This is driven by the fact that: - Previously the tokens were for L0 and now we need to support tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter). - Elastic work is in a different WorkQueue than regular work, but they are competing for the same tokens. A different WorkQueue is needed to prevent a situation where elastic work for one tenant is queued ahead of regualar work from another tenant, and stops the latter from making progress due to lack of elastic tokens. The latter is handled by allowing kvSlotGranter to multiplex across multiple requesters, via multiple child granters. A number of interfaces are adjusted to make this viable. 
In general, the GrantCoordinator is now slightly dumber and some of that logic is moved into the granters. For the former (handling two kinds of tokens), I considered adding multiple resource dimensions to the granter-requester interaction but found it too complicated. Instead we rely on the observation that we request tokens based on the total incoming bytes of the request (not just L0), and when the request is completed, tell the granter how many bytes went into L0. The latter allows us to return tokens to L0. So at the time the request is completed, we can account separately for the L0 tokens and these new tokens for all incoming bytes (which we are calling disk bandwidth tokens, since they are constrained based on disk bandwidth). This is a cleaned up version of the prototype in cockroachdb#82813 which contains the experimental results. The plumbing from the KV layer to populate the disk reads, writes and provisioned bandwidth is absent in this PR, and will be added in a subsequent PR. Disk bandwidth bottlenecks are considered only if both the following are true: - DiskStats.ProvisionedBandwidth is non-zero. - The cluster setting admission.disk_bandwidth_tokens.elastic.enabled is true (defaults to true). Informs cockroachdb#82898 Release note: None (the cluster setting mentioned earlier is useless since the integration with CockroachDB will be in a future PR).
85722: admission: add support for disk bandwidth as a bottleneck resource r=tbg,irfansharif a=sumeerbhola

We assume that:
- There is a provisioned known limit on the sum of read and write bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into what reads missed the OS page cache. That is, we don't know how much of the reads were due to incoming reads (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design, and strong abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently.
- The approach uses easy to understand small multiplicative increase and large multiplicative decrease, (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a ceiling on number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble#1329 for motivation. This control on compactions is not currently implemented and is future work (though the prototype in #82813 had code for it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: uses the load level computed by diskLoadWatcher to limit write tokens for elastic writes and, in the future, will also limit compactions.

There is significant refactoring and changes in granter.go and work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter).
- Elastic work is in a different WorkQueue than regular work, but they are competing for the same tokens. A different WorkQueue is needed to prevent a situation where elastic work for one tenant is queued ahead of regular work from another tenant, and stops the latter from making progress due to lack of elastic tokens.

The latter is handled by allowing kvSlotGranter to multiplex across multiple requesters, via multiple child granters. A number of interfaces are adjusted to make this viable. In general, the GrantCoordinator is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple resource dimensions to the granter-requester interaction but found it too complicated. Instead we rely on the observation that we request tokens based on the total incoming bytes of the request (not just L0), and when the request is completed, tell the granter how many bytes went into L0. The latter allows us to return tokens to L0. So at the time the request is completed, we can account separately for the L0 tokens and these new tokens for all incoming bytes (which we are calling disk bandwidth tokens, since they are constrained based on disk bandwidth). A sketch of this accounting follows below.

This is a cleaned up version of the prototype in #82813 which contains the experimental results. The plumbing from the KV layer to populate the disk reads, writes and provisioned bandwidth is absent in this PR, and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled is true (defaults to true).

Informs #82898

Release note: None (the cluster setting mentioned earlier has no effect yet, since the integration with CockroachDB will be in a future PR).
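To make the dual accounting concrete, here is a minimal Go sketch. All names are hypothetical, not the actual granter.go types; the real code is more involved (burst handling, negative balances, grant chains).

```go
package main

import "fmt"

// twoTokenBuckets sketches the dual accounting described above: one bucket
// for bytes into L0, one for all bytes into the LSM (disk bandwidth tokens).
type twoTokenBuckets struct {
	l0Tokens   int64 // tokens for bytes added to L0
	diskTokens int64 // tokens for all incoming bytes (disk bandwidth)
}

// admit deducts from both buckets using the request's total incoming bytes,
// since at admission time we do not yet know how many bytes will land in L0.
func (b *twoTokenBuckets) admit(totalBytes int64) {
	b.l0Tokens -= totalBytes
	b.diskTokens -= totalBytes
}

// workDone is told how many bytes actually went into L0, and returns the
// difference to the L0 bucket. The disk bandwidth tokens stay consumed,
// since all the bytes were written to the LSM.
func (b *twoTokenBuckets) workDone(totalBytes, bytesIntoL0 int64) {
	b.l0Tokens += totalBytes - bytesIntoL0
}

func main() {
	b := &twoTokenBuckets{l0Tokens: 1 << 20, diskTokens: 4 << 20}
	b.admit(1000)         // e.g. an ingest request of 1000 bytes
	b.workDone(1000, 100) // only 100 bytes landed in L0: 900 L0 tokens return
	fmt.Println(b.l0Tokens, b.diskTokens)
}
```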
85786: sql: support UDFs with named args, strictness, and volatility r=mgartner a=mgartner

#### sql: UDF with empty result should evaluate to NULL

If the last statement in a UDF returns no rows, the UDF will evaluate to NULL. Prior to this commit the evaluation of the UDF would panic.

Release note: None

#### sql: support UDFs with named arguments

UDFs with named arguments can now be evaluated. During query planning, statements in the function body are built with a scope that includes the named arguments for the function as columns. This allows references to arguments to be resolved as variables. During evaluation, the input expressions are first evaluated into datums. When a plan is built for each statement in the UDF, the argument columns in the expression are replaced with the input datums before the expression is optimized. Note that anonymous arguments and integer references to arguments (e.g., `$1`) are not yet supported. Also, the formatting of `UDFExpr`s has been improved to show argument columns and input expressions.

Release note: None

#### sql: do not evaluate strict UDFs if any input values are NULL

A UDF can have one of two behaviors when it is invoked with NULL inputs:
1. If the UDF is `CALLED ON NULL INPUT` (the default) then the function is evaluated regardless of whether or not any of the input values are NULL.
2. If the UDF `RETURNS NULL ON NULL INPUT` or is `STRICT` then the function is not evaluated if any of the input values are NULL. Instead, the function directly results in NULL.

This commit implements these two behaviors. In the future, we can add a normalization rule that folds a strict UDF if any of its inputs are constant NULL values.

Release note: None

#### sql: make mutations visible to volatile UDFs

The volatility of a UDF affects the visibility of mutations made by the statement calling the function. A volatile function will see these mutations.
Also, statements within a volatile function's body will see changes made by previous statements in the function body (note that this is left untested in this commit because we do not currently support mutations within UDF bodies). In contrast, a stable, immutable, or leakproof function will see a snapshot of the data as of the start of the statement calling the function.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Integration test for disk bandwidth tokens, copying over what we ran in cockroachdb#82813. Part of cockroachdb#86857

Release note: None
The first commit is from #82440.
We assume that:
- There is a provisioned known limit on the sum of read and write bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes.
There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into what reads missed the OS page cache. That is, we don't know how much of the reads were due to incoming reads (that we don't shape) and how much due to compaction read bandwidth.
- There can be a large time lag between the shaping of incoming writes, and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit wildly different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency.
Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design, and strong abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently.
- The approach uses easy to understand small multiplicative increase and large multiplicative decrease, (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates). A sketch of this adjustment follows this list.
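A minimal Go sketch of that increase/decrease scheme, under assumed constants (the enum values, the +10% step, the halving, and the floor are illustrative, not the tuned behavior of the actual diskBandwidthLimiter):

```go
package main

import "fmt"

// diskLoadLevel is a stand-in for the enum produced by diskLoadWatcher.
type diskLoadLevel int

const (
	loadLow diskLoadLevel = iota
	loadModerate
	loadHigh
	loadOverload
)

// adjustElasticTokens applies a small multiplicative increase while the
// disk is underloaded, holds steady at moderate load, and applies a large
// multiplicative decrease on high load or overload, with a floor so that
// elastic work is never fully starved.
func adjustElasticTokens(cur int64, level diskLoadLevel) int64 {
	const minTokens = 1 << 20 // floor: 1MiB of tokens per interval
	switch level {
	case loadLow:
		cur += cur / 10 // small multiplicative increase: +10%
	case loadModerate:
		// Hold: utilization is in the target range.
	case loadHigh, loadOverload:
		cur /= 2 // large multiplicative decrease: shed load quickly
	}
	if cur < minTokens {
		cur = minTokens
	}
	return cur
}

func main() {
	tokens := int64(64 << 20)
	for _, l := range []diskLoadLevel{loadLow, loadLow, loadOverload} {
		tokens = adjustElasticTokens(tokens, l)
		fmt.Println(tokens)
	}
}
```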
Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter (see the sketch after this list).
- Optional compactions: We assume that the LSM store is configured with a ceiling on number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble#1329 for motivation.
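A minimal sketch of implying the work-class from the priority. The type names mirror the admissionpb idea but are local, hypothetical definitions, not the real constants:

```go
package main

import "fmt"

// WorkPriority mirrors the idea of admissionpb.WorkPriority; the values
// here are illustrative.
type WorkPriority int8

const NormalPri WorkPriority = 0

// workClass separates regular from elastic work.
type workClass int

const (
	regularWorkClass workClass = iota
	elasticWorkClass
)

// workClassFromPri implies the work-class from the priority, as the
// prototype does: priorities below NormalPri are deemed elastic.
func workClassFromPri(pri WorkPriority) workClass {
	if pri < NormalPri {
		return elasticWorkClass
	}
	return regularWorkClass
}

func main() {
	fmt.Println(workClassFromPri(-10)) // 1: elasticWorkClass
	fmt.Println(workClassFromPri(5))   // 0: regularWorkClass
}
```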
The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels, for both incoming writes and optional compactions (see the sketch after this list).
- diskBandwidthLimiter: uses load information to limit write tokens for elastic writes and limit compactions.
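For illustration, a sketch of how a diskLoadWatcher-style classification might map interval disk throughput to a load level. The thresholds are assumptions (0.7 echoes the 70% target mentioned in the experiments below), not the real implementation, which also smooths measurements and considers trends:

```go
package main

import "fmt"

// Illustrative load levels; the real enum lives in disk_bandwidth.go.
type diskLoadLevel int

const (
	diskLoadLow diskLoadLevel = iota
	diskLoadModerate
	diskLoadHigh
	diskLoadOverload
)

// loadLevel classifies an interval by comparing observed read+write bytes
// against the bandwidth provisioned for that interval.
func loadLevel(readBytes, writeBytes, provisionedBytes int64) diskLoadLevel {
	util := float64(readBytes+writeBytes) / float64(provisionedBytes)
	switch {
	case util < 0.3:
		return diskLoadLow
	case util < 0.7:
		return diskLoadModerate
	case util < 0.95:
		return diskLoadHigh
	default:
		return diskLoadOverload
	}
}

func main() {
	// 60MiB of read+write against 95MiB provisioned: ~0.63 utilization.
	fmt.Println(loadLevel(20<<20, 40<<20, 95<<20)) // 1: diskLoadModerate
}
```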
There is significant refactoring and changes in granter.go and work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter).
- Elastic work is in a different WorkQueue than regular work, but they are competing for the same tokens.
The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.
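The multiplexing can be pictured with a small sketch: a shared token pool with one child granter per WorkQueue, where only the elastic child is additionally gated on disk bandwidth tokens. Names are hypothetical, not the actual granter.go types:

```go
package main

import "fmt"

// sharedTokens is the pool the parent granter owns; both children draw
// from it.
type sharedTokens struct {
	l0Tokens   int64
	diskTokens int64
}

// childGranter fronts one requester (a regular or elastic WorkQueue).
type childGranter struct {
	pool    *sharedTokens
	elastic bool
}

// tryGet grants n tokens if available for this work class: all work is
// subject to L0 tokens, but only elastic work is also gated on disk
// bandwidth tokens.
func (g *childGranter) tryGet(n int64) bool {
	if g.pool.l0Tokens < n {
		return false
	}
	if g.elastic && g.pool.diskTokens < n {
		return false
	}
	g.pool.l0Tokens -= n
	g.pool.diskTokens -= n // disk writes happen for regular work too
	return true
}

func main() {
	pool := &sharedTokens{l0Tokens: 10, diskTokens: 3}
	regular := &childGranter{pool: pool}
	elastic := &childGranter{pool: pool, elastic: true}
	fmt.Println(elastic.tryGet(5)) // false: not enough disk tokens
	fmt.Println(regular.tryGet(5)) // true: regular ignores disk tokens
}
```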
For the former (two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we can request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, can tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. There was
also the (unrelated) realization that we can use the size of the batch in the call to AdmittedWorkDone to fix the estimates that we previously had to make pre-evaluation. This resulted in a bunch of changes to how we do estimation to adjust the tokens consumed: we now estimate how much we need to compensate what is being asked for at (a) admission time, (b) work done time, for the bytes added to the LSM, and (c) work done time, for the bytes added to L0. Since we are asking for tokens at admission time based on the full incoming bytes, the estimation of what fraction of an ingest goes into L0 is eliminated. This had the consequence of simplifying some of the estimation logic that was distinguishing writes from ingests. A sketch of these corrections follows.
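A sketch of the three correction points, with hypothetical names (the real estimation code tracks smoothed per-work statistics): at admission we inflate the claimed bytes by a correction multiplier, and at work-done time we settle against the actual LSM and L0 bytes.

```go
package main

import "fmt"

// storeWorkEstimates is a hypothetical stand-in for the smoothed
// estimation state.
type storeWorkEstimates struct {
	admissionMultiplier float64 // (a) compensates the bytes claimed at admission
}

type workDoneInfo struct {
	actualBytes       int64 // bytes added to the LSM
	actualBytesIntoL0 int64 // bytes added to L0
}

// admissionTokens: tokens deducted up front, from the claimed batch size.
func (e storeWorkEstimates) admissionTokens(claimedBytes int64) int64 {
	return int64(float64(claimedBytes) * e.admissionMultiplier)
}

// settle returns the extra (possibly negative) tokens to deduct at
// work-done time: (b) for the LSM/disk bucket and (c) for the L0 bucket.
func settle(admitted int64, done workDoneInfo) (extraLSM, extraL0 int64) {
	return done.actualBytes - admitted, done.actualBytesIntoL0 - admitted
}

func main() {
	e := storeWorkEstimates{admissionMultiplier: 1.5}
	admitted := e.admissionTokens(1000) // 1500 tokens deducted up front
	extraLSM, extraL0 := settle(admitted, workDoneInfo{
		actualBytes: 2000, actualBytesIntoL0: 400,
	})
	// extraLSM = +500 more disk tokens to deduct; extraL0 = -1100, i.e.
	// 1100 L0 tokens are returned.
	fmt.Println(admitted, extraLSM, extraL0)
}
```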
There are no tests yet (and this change breaks existing tests), so this code is probably littered with bugs.
Next steps:
- Plumb the information needed from the rest of the system: StoreWriteWorkInfo.{WriteBytes,IngestRequest} for ingestions and StoreWorkDoneInfo.{ActualBytes,ActualBytesIntoL0} for writes and ingestions.
Some experimental results, with an artificially set provisioned bandwidth limit of 95MiB/s and a kv0 workload with 4KB writes that are all considered elastic traffic. There were 4 runs: the first one has no provisioned bandwidth limit and the subsequent ones are iterations on the heuristics. The last one is the latest code: it is tuned to not increase load once we have reached 70% of provisioned bandwidth.
The challenge in doing better is the sharp transition from < 0.7 fraction of bandwidth utilization to > 0.95, due to the lag in compactions.
Release note: None