admission,storage: introduce flush tokens to constrain write admission
In addition to byte tokens for writes computed based on the compaction rate out of L0, we now compute byte tokens based on how fast the system can flush memtables into L0. The motivation is that writing to the memtable, or creating memtables, faster than the system can flush them results in memtable-induced write stalls, which create a latency hiccup for all write traffic. We have observed write stalls lasting > 100ms.

The approach taken here for flush tokens is straightforward (there is justification based on experiments, mentioned in code comments):

- Measure and smooth the peak rate at which the flush loop can operate. This relies on the recently added pebble.InternalIntervalMetrics.
- The peak rate corresponds to 100% utilization of the single flush thread, which is potentially too high to prevent write stalls (depending on how long a single flush takes). So we multiply the smoothed peak rate by a utilization-target-fraction, which is dynamically adjusted and by default constrained to the interval [0.5, 1.5]. There is additive increase and decrease of this fraction (sketched at the end of this message):
  - High usage of tokens and no write stalls causes an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used if there are multiple write stalls, so that the probing stays more in the region where there are no write stalls.

Note that this probing scheme cannot eliminate all write stalls. For now we are ok with a reduction in write stalls.

For convenience, and for additional justification mentioned in a code comment, the scheme uses the minimum of the flush and compaction tokens for writes to L0. This means that sstable ingestion into L0 is also subject to these tokens. The periodic token computation continues to be done at 15s intervals. However, instead of handing out these tokens at 1s intervals, we now hand them out at 250ms intervals. This reduces burstiness, which can itself cause write stalls.

There is a new metric, storage.write-stall-nanos, that measures the cumulative duration of write stalls, since it gives a more intuitive feel for how the system is behaving than a write stall count does.

The scheme can be disabled by increasing the cluster setting admission.min_flush_util_fraction, which defaults to 0.5 (corresponding to the 0.5 lower bound mentioned earlier), to a high value, say 10.

The scheme was evaluated using a single node cluster where the node had a high CPU count, such that CPU was not a bottleneck even with max compaction concurrency set to 8. A kv0 workload with high concurrency and 4KB writes was used to overload the store. Due to the high compaction concurrency, L0 stayed below the unhealthy thresholds, and the resource bottleneck became the total bandwidth provisioned for the disk. This setup was evaluated under both:

- early-life: when the store had 10-20GB of data and the compaction backlog was not very heavy, so there was less queueing for the limited disk bandwidth (it was still usually saturated).
- later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of write stalls by > 5x. For the early-life case, ~750ms per second was spent in a write stall with flush-tokens off. The later-life case had ~200ms per second of write stalls with flush-tokens off. The lower value of the latter is paradoxically due to the worse bandwidth saturation: fsync latency rose from 2-4ms with flush-tokens on to 11-20ms with flush-tokens off.
This increase imposed a natural backpressure on writes due to the need to sync the WAL. In contrast, the fsync latency was low in the early-life case, though it did increase from 0.125ms to 0.25ms when flush-tokens were turned off. In both cases, the admission throughput did not increase when turning off flush-tokens. That is, the system cannot sustain more throughput, but by turning on flush tokens, we shift queueing from the disk layer to the admission control layer (where we have the capability to reorder work).

Fixes #77357

Release note (ops change): I/O admission control now reduces the likelihood of storage layer write stalls, which can be caused by memtable flushes becoming a bottleneck. This is done by limiting write tokens based on flush throughput, so as to reduce storage layer write stalls. Consequently, write tokens are now limited both by flush throughput and by compaction throughput out of L0. This behavior is enabled by default. The new cluster setting admission.min_flush_util_fraction, defaulting to 0.5, can be used to disable or tune flush throughput based admission tokens. Setting it to a value much greater than 1, say 10, will disable flush based tokens. Tuning the behavior, without disabling it, should be done only on the recommendation of a domain expert.
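The following is a minimal sketch of the flush-token computation described above: multiply the smoothed peak flush rate by the utilization-target-fraction, take the minimum with the compaction-based tokens, and hand the result out in 250ms slices. All identifiers (computeL0WriteTokens, smoothedPeakFlushBytesPerSec, etc.) are illustrative, not the actual names in the admission package.

```go
// Minimal sketch, not the actual implementation; all names are illustrative.
package flushtokens

import "time"

const (
	adjustmentInterval = 15 * time.Second // period at which tokens are recomputed
	ticksPerInterval   = 60               // tokens are handed out in 250ms slices
)

// computeL0WriteTokens returns the byte tokens available for writes to L0 over
// the next 15s interval, and the slice granted every 250ms. Writes (including
// sstable ingestion into L0) are limited by the minimum of the flush-based and
// compaction-based tokens.
func computeL0WriteTokens(
	smoothedPeakFlushBytesPerSec float64, // smoothed peak flush loop throughput
	utilTargetFraction float64, // dynamically adjusted, by default in [0.5, 1.5]
	compactionTokens int64, // byte tokens based on compaction rate out of L0
) (intervalTokens, perTickTokens int64) {
	flushTokens := int64(smoothedPeakFlushBytesPerSec *
		utilTargetFraction * adjustmentInterval.Seconds())
	intervalTokens = flushTokens
	if compactionTokens < intervalTokens {
		intervalTokens = compactionTokens
	}
	// Granting tokens every 250ms instead of every 1s reduces burstiness,
	// which can itself cause write stalls.
	perTickTokens = intervalTokens / ticksPerInterval
	return intervalTokens, perTickTokens
}
```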
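Next, a sketch of the additive increase/decrease of the utilization-target-fraction. The step size and the multiplier applied when there are multiple write stalls are made-up values for illustration, not the tuned constants in the actual code.

```go
// Minimal sketch, assuming hypothetical constants; not the actual implementation.
package flushtokens

const (
	minUtilFraction = 0.5   // lower bound, set by admission.min_flush_util_fraction
	maxUtilFraction = 1.5   // default upper bound
	additiveStep    = 0.025 // illustrative additive adjustment step
)

// adjustUtilTargetFraction is called once per adjustment interval.
func adjustUtilTargetFraction(cur float64, highTokenUsage bool, writeStalls int) float64 {
	switch {
	case writeStalls > 0:
		// Write stalls cause an additive decrease. A small multiplier is used
		// when there are multiple write stalls, so that probing stays more in
		// the region where there are no write stalls.
		mult := 1.0
		if writeStalls > 1 {
			mult = 2.0
		}
		cur -= mult * additiveStep
	case highTokenUsage:
		// High usage of tokens and no write stalls cause an additive increase.
		cur += additiveStep
	}
	if cur < minUtilFraction {
		cur = minUtilFraction
	}
	if cur > maxUtilFraction {
		cur = maxUtilFraction
	}
	return cur
}
```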
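Finally, a sketch of why setting admission.min_flush_util_fraction much greater than 1 disables flush-based tokens: the setting acts as a floor on the utilization-target-fraction, and with a floor of, say, 10 the flush-derived token count is roughly 10x the smoothed peak flush rate over the interval, so the min() in computeL0WriteTokens always resolves to the compaction-based tokens. The helper below is hypothetical.

```go
// Minimal sketch; effectiveUtilFraction is a hypothetical helper, not a real API.
package flushtokens

// effectiveUtilFraction floors the dynamically adjusted fraction at the value
// of the admission.min_flush_util_fraction cluster setting. A floor such as 10
// yields flush tokens far larger than the compaction tokens, so flush-based
// limiting never binds.
func effectiveUtilFraction(adjusted, minFlushUtilFraction float64) float64 {
	if adjusted < minFlushUtilFraction {
		return minFlushUtilFraction
	}
	return adjusted
}
```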