admission control: consider write stalls due to memtable count #77357

Closed
sumeerbhola opened this issue Mar 3, 2022 · 0 comments · Fixed by #82440


sumeerbhola commented Mar 3, 2022

This came up in the investigation of #72375 (comment)

The write overload protection for a store looks at the number of files and sub-levels in L0. In the aforementioned roachtest we have seen write overload resulting in write stalls due to high memtable counts. This results in high latency after a request is admitted.

Doing something for this is tricky: we need to know how fast the system can flush, since that is needed for the dynamic token computation. Using tokens that are based on bytes, rather than 1 token per request, would also help here. Our current compaction-based token calculation is more forgiving: it allows more margin for error, since there is no sudden hiccup in writing if the calculation admits too many requests -- we can correct it in the next interval.
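
To make the byte-token idea concrete, here is a minimal hypothetical sketch in Go (not CockroachDB's actual admission code; all names are invented for illustration). Tokens are denominated in bytes and refilled each interval from a measured drain rate, so a large write consumes proportionally more of the budget than a small one, and a miscalculated budget is corrected at the next refill:

```go
package main

import (
	"fmt"
	"sync"
)

// byteTokenBucket admits work in byte units. Unlike a 1-token-per-request
// scheme, a 10MB write costs 10,000x more than a 1KB write.
type byteTokenBucket struct {
	mu        sync.Mutex
	available int64 // bytes admissible in the current interval
}

// tryAdmit deducts the write's size, refusing when the budget is exhausted.
func (b *byteTokenBucket) tryAdmit(writeBytes int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.available < writeBytes {
		return false
	}
	b.available -= writeBytes
	return true
}

// refill replaces the budget at the start of each interval, using an
// estimate of how fast the LSM can drain writes (flush/compaction rate).
// Over-admission in one interval is corrected by a smaller next budget.
func (b *byteTokenBucket) refill(budgetBytes int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.available = budgetBytes
}

func main() {
	bucket := &byteTokenBucket{}
	bucket.refill(1 << 20)                  // assume ~1MiB of drain capacity this interval
	fmt.Println(bucket.tryAdmit(512 << 10)) // true: 512KiB fits
	fmt.Println(bucket.tryAdmit(768 << 10)) // false: budget exhausted
}
```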

Jira issue: CRDB-13548

Epic CRDB-14607

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. A-admission-control T-storage Storage Team labels Mar 3, 2022
@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Mar 21, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 25, 2022
This switch to byte tokens will result in better accounting for
large writes, including ingests, based on whether their bytes land
in L0 or elsewhere. It is also a precursor to taking into account
flush capacity (in bytes).

The store write admission control path now uses a StoreWorkQueue
which wraps a WorkQueue and provides additional functionality:
- Work can specify WriteBytes and whether it is an IngestRequest.
  This is used to decide how many byte tokens to consume.
- Done work specifies how many bytes were ingested into L0, so
  token consumption can be adjusted.

The main framework change is that a single work item can consume
multiple (byte) tokens, which ripples through the various
interfaces including requester, granter. There is associated
cleanup: kvGranter that was handling both slots and tokens is
eliminated since in practice it was only doing one or the other.
Instead, for the slot case the slotGranter is reused. For the token
case there is a new kvStoreTokenGranter.

The main logic change is in ioLoadListener which computes byte
tokens and various estimates. The change is (mostly) neutral if
no write provides WriteBytes, since the usual estimation will
take over. The integration changes in this PR are superficial in
that the requests don't provide WriteBytes. Improvements to the
integration, along with experimental results, will happen in
future PRs.

Informs cockroachdb#79092
Informs cockroachdb#77357

Release note: None
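
As a rough sketch of the accounting flow this commit describes (StoreWorkQueue, WriteBytes, and kvStoreTokenGranter are names from the commit, but these simplified types and signatures are assumptions, not the real pkg/util/admission interfaces):

```go
package admission

// storeWriteInfo captures what work declares up front, per the commit:
// its write size and whether it is an ingest. Field names are illustrative.
type storeWriteInfo struct {
	writeBytes    int64
	ingestRequest bool
}

// tokenGranter stands in for the commit's kvStoreTokenGranter: a granter
// whose unit is byte tokens rather than slots.
type tokenGranter struct {
	availableByteTokens int64
}

func (g *tokenGranter) take(n int64)     { g.availableByteTokens -= n }
func (g *tokenGranter) giveBack(n int64) { g.availableByteTokens += n }

// admit consumes tokens for the declared write size; a single work item
// can consume many byte tokens, which is the framework change the commit
// threads through the requester and granter interfaces.
func admit(g *tokenGranter, info storeWriteInfo) {
	g.take(info.writeBytes)
}

// done reconciles with the bytes that actually landed in L0: e.g., an
// ingest whose sstables went to lower levels returns the unused tokens.
func done(g *tokenGranter, info storeWriteInfo, actualL0Bytes int64) {
	if delta := info.writeBytes - actualL0Bytes; delta > 0 {
		g.giveBack(delta)
	}
}
```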
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 28, 2022
craig bot pushed a commit that referenced this issue May 3, 2022
80480: admission: change store write admission control to use byte tokens r=ajwerner a=sumeerbhola

80892: hlc: move ParseHLC / DecimalToHLC to util/hlc r=ajwerner a=otan

util/hlc needs util/log, but ParseHLC/DecimalToHLC doesn't need to be
in tree. To remove the tree dependency on util/log, move the
mentioned functions.

Release note: None

80898: util/cache: remove dependency on util/log r=ajwerner a=otan

Use a hook in IntervalCache to log any errors instead.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 4, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 10, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 14, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 6, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 13, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 13, 2022
craig bot pushed a commit that referenced this issue Jul 14, 2022
82440: admission,storage: introduce flush tokens to constrain write admission r=tbg,irfansharif a=sumeerbhola

In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables, faster than the system can flush them results in
write stalls due to memtable count, which create a latency hiccup for
all write traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate at which the flush loop can operate.
  This relies on the recently added pebble.InternalIntervalMetrics.
- Running at the peak rate implies 100% utilization of the single flush
  thread, which is potentially too high to prevent write stalls (depending
  on how long it takes to do a single flush). So we multiply the
  smoothed peak rate by a utilization-target-fraction which is
  dynamically adjusted and by default is constrained to the interval
  [0.5, 1.5]. There is additive increase and decrease of this
  fraction:
  - High usage of tokens and no write stalls cause an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used
    if there are multiple write stalls, so that the probing falls
    more in the region where there are no write stalls.

Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.
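
The probing described above is an additive-increase/additive-decrease loop on the utilization target. A minimal sketch under assumed names and step sizes (the real logic lives in ioLoadListener and is more involved):

```go
package admission

const (
	minUtilFraction = 0.5   // default lower bound on the target fraction
	maxUtilFraction = 1.5   // default upper bound
	additiveStep    = 0.025 // assumed step; the actual constant may differ
)

// flushTokenState tracks the smoothed peak flush rate and the dynamically
// adjusted utilization-target-fraction described in the PR.
type flushTokenState struct {
	smoothedPeakFlushBytesPerSec float64
	utilTargetFraction           float64
}

// adjust probes: high token usage with no write stalls moves the target up
// additively; write stalls move it down, with a small multiplier when
// stalls repeat so the target settles in the stall-free region.
func (s *flushTokenState) adjust(highTokenUsage bool, writeStalls int) {
	switch {
	case writeStalls > 1:
		s.utilTargetFraction -= 2 * additiveStep // assumed multiplier of 2
	case writeStalls == 1:
		s.utilTargetFraction -= additiveStep
	case highTokenUsage:
		s.utilTargetFraction += additiveStep
	}
	if s.utilTargetFraction < minUtilFraction {
		s.utilTargetFraction = minUtilFraction
	}
	if s.utilTargetFraction > maxUtilFraction {
		s.utilTargetFraction = maxUtilFraction
	}
}

// flushTokens converts the smoothed peak rate into a byte budget for the
// 15s token computation interval.
func (s *flushTokenState) flushTokens() float64 {
	return s.smoothedPeakFlushBytesPerSec * s.utilTargetFraction * 15
}
```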

For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.
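
A toy illustration of that cadence (assumed names, not the actual granter code): the budget computed every 15s, the minimum of flush and compaction tokens, is released in 250ms slices.

```go
package admission

import "time"

const (
	computeInterval = 15 * time.Second
	grantInterval   = 250 * time.Millisecond
)

// distributeTokens releases a 15s byte budget in 250ms slices (60 of them),
// so admitted work cannot burst through the whole interval's tokens at once.
func distributeTokens(intervalTokens int64, grant func(bytes int64)) {
	slices := int64(computeInterval / grantInterval) // 60 slices
	perSlice := intervalTokens / slices
	ticker := time.NewTicker(grantInterval)
	defer ticker.Stop()
	for i := int64(0); i < slices; i++ {
		<-ticker.C
		grant(perSlice)
	}
}
```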

There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.

The scheme can be disabled by raising the cluster setting
admission.min_flush_util_fraction, which defaults to 0.5 (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
10.

The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data and the compaction
  backlog was not very heavy, so there was less queueing for the
  limited disk bandwidth (though it was still usually saturated).
- later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.

In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer to
the admission control layer (where we have the capability to reorder
work).

Screenshots for early-life: Flush tokens were turned off at 22:32:30. Prior to that the flush utilization-target-fraction was 0.625.
<img width="655" alt="Screen Shot 2022-06-03 at 6 35 14 PM" src="https://user-images.githubusercontent.com/54990988/171970564-ba833e1f-b6e2-4fcd-9ee2-25228341065c.png">
<img width="663" alt="Screen Shot 2022-06-03 at 6 35 28 PM" src="https://user-images.githubusercontent.com/54990988/171970574-13e6114a-2467-48e2-a238-3b01ea32a5d6.png">

Screenshots for later-life: Flush tokens were turned off at 22:03:20. Prior to that the flush utilization-target-fraction was 0.875.
<img width="665" alt="Screen Shot 2022-06-03 at 6 07 50 PM" src="https://user-images.githubusercontent.com/54990988/171970732-09b60827-7687-46de-964e-a9f97388c4fc.png">
<img width="658" alt="Screen Shot 2022-06-03 at 6 08 03 PM" src="https://user-images.githubusercontent.com/54990988/171970738-efe7a1fd-cbfd-450d-a3ac-06f681b1d190.png">

These results were produced by running
```
roachprod create -n 2 --clouds aws  --aws-machine-type=c5d.9xlarge --local-ssd=false --aws-ebs-volume-type=gp2 sumeer-io
roachprod put sumeer-io:1 cockroach ./cockroach
roachprod put sumeer-io:2 workload ./workload
roachprod start sumeer-io --env "COCKROACH_ROCKSDB_CONCURRENCY=8"
roachprod run sumeer-io:2 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=1024 --splits=1000 --duration=30m0s --read-percent=0 --min-block-bytes=4096 --max-block-bytes=4096 {pgurl:1-1}
```

Fixes #77357

Release note (ops change): Write tokens are now also limited based on
flush throughput, so as to reduce storage layer write stalls. This
behavior is enabled by default. The cluster setting
admission.min_flush_util_fraction, defaulting to 0.5, can be used to
disable or tune flush throughput based admission tokens, for writes
to a store. Setting to a value much greater than 1, say 10, will
disable flush based tokens.

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
craig bot closed this as completed in 6cad2d5 Jul 14, 2022