admission control: consider write stalls due to memtable count #77357

Closed
sumeerbhola opened this issue Mar 3, 2022 · 0 comments · Fixed by #82440


sumeerbhola commented Mar 3, 2022

This came up in the investigation of #72375 (comment)

The write overload protection for a store looks at the number of files and sub-levels in L0. In the aforementioned roachtest we have seen write overload resulting in write stalls due to high memtable counts. This results in high latency after a request is admitted.

Doing something for this is tricky: we need to know how fast the system can flush, since that is needed for the dynamic token computation. Using tokens that are based on bytes, rather than 1 token per request, would also help here. Our current compaction-based token calculation is more forgiving: it allows more margin for error, since there is no sudden hiccup in writing if the calculation admits too many requests -- we can correct it in the next interval.
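
To make the byte-token idea concrete, here is a minimal hypothetical sketch in Go (not CockroachDB's actual admission code; all names are invented for illustration). Tokens are denominated in bytes and refilled each interval from a measured drain rate, so a large write consumes proportionally more of the budget than a small one, and a miscalculated budget is corrected at the next refill:

```go
package main

import (
	"fmt"
	"sync"
)

// byteTokenBucket admits work in byte units. Unlike a 1-token-per-request
// scheme, a 10MB write costs 10,000x more than a 1KB write.
type byteTokenBucket struct {
	mu        sync.Mutex
	available int64 // bytes admissible in the current interval
}

// tryAdmit deducts the write's size, refusing when the budget is exhausted.
func (b *byteTokenBucket) tryAdmit(writeBytes int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.available < writeBytes {
		return false
	}
	b.available -= writeBytes
	return true
}

// refill replaces the budget at the start of each interval, using an
// estimate of how fast the LSM can drain writes (flush/compaction rate).
// Over-admission in one interval is corrected by a smaller next budget.
func (b *byteTokenBucket) refill(budgetBytes int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.available = budgetBytes
}

func main() {
	bucket := &byteTokenBucket{}
	bucket.refill(1 << 20)                  // assume ~1MiB of drain capacity this interval
	fmt.Println(bucket.tryAdmit(512 << 10)) // true: 512KiB fits
	fmt.Println(bucket.tryAdmit(768 << 10)) // false: budget exhausted
}
```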

Jira issue: CRDB-13548

Epic CRDB-14607

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. A-admission-control T-storage Storage Team labels Mar 3, 2022
@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Mar 21, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 25, 2022
This switch to byte tokens will result in better accounting for
large writes, including ingests, based on whether their bytes land
in L0 or elsewhere. It is also a precursor to taking into account
flush capacity (in bytes).

The store write admission control path now uses a StoreWorkQueue
which wraps a WorkQueue and provides additional functionality:
- Work can specify WriteBytes and whether it is an IngestRequest.
  This is used to decide how many byte tokens to consume.
- Done work specifies how many bytes were ingested into L0, so
  token consumption can be adjusted.

The main framework change is that a single work item can consume
multiple (byte) tokens, which ripples through the various
interfaces including requester, granter. There is associated
cleanup: kvGranter that was handling both slots and tokens is
eliminated since in practice it was only doing one or the other.
Instead, for the slot case the slotGranter is reused. For the token
case there is a new kvStoreTokenGranter.

The main logic change is in ioLoadListener which computes byte
tokens and various estimates. The change is (mostly) neutral if
no write provides WriteBytes, since the usual estimation will
take over. The integration changes in this PR are superficial in
that the requests don't provide WriteBytes. Improvements to the
integration, along with experimental results, will happen in
future PRs.

Informs cockroachdb#79092
Informs cockroachdb#77357

Release note: None
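
As a rough sketch of the accounting flow this commit describes (StoreWorkQueue, WriteBytes, and kvStoreTokenGranter are names from the commit, but these simplified types and signatures are assumptions, not the real pkg/util/admission interfaces):

```go
package admission

// storeWriteInfo captures what work declares up front, per the commit:
// its write size and whether it is an ingest. Field names are illustrative.
type storeWriteInfo struct {
	writeBytes    int64
	ingestRequest bool
}

// tokenGranter stands in for the commit's kvStoreTokenGranter: a granter
// whose unit is byte tokens rather than slots.
type tokenGranter struct {
	availableByteTokens int64
}

func (g *tokenGranter) take(n int64)     { g.availableByteTokens -= n }
func (g *tokenGranter) giveBack(n int64) { g.availableByteTokens += n }

// admit consumes tokens for the declared write size; a single work item
// can consume many byte tokens, which is the framework change the commit
// threads through the requester and granter interfaces.
func admit(g *tokenGranter, info storeWriteInfo) {
	g.take(info.writeBytes)
}

// done reconciles with the bytes that actually landed in L0: e.g., an
// ingest whose sstables went to lower levels returns the unused tokens.
func done(g *tokenGranter, info storeWriteInfo, actualL0Bytes int64) {
	if delta := info.writeBytes - actualL0Bytes; delta > 0 {
		g.giveBack(delta)
	}
}
```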
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 28, 2022
craig bot pushed a commit that referenced this issue May 3, 2022
80480: admission: change store write admission control to use byte tokens r=ajwerner a=sumeerbhola

80892: hlc: move ParseHLC / DecimalToHLC to util/hlc r=ajwerner a=otan

util/hlc needs util/log, but ParseHLC/DecimalToHLC doesn't need to be
in tree. To remove the tree dependency on util/log, move the
mentioned functions.

Release note: None

80898: util/cache: remove dependency on util/log r=ajwerner a=otan

Use a hook in IntervalCache to log any errors instead.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 4, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 10, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 14, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 6, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 13, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 13, 2022
craig bot pushed a commit that referenced this issue Jul 14, 2022
82440: admission,storage: introduce flush tokens to constrain write admission r=tbg,irfansharif a=sumeerbhola

In addition to byte tokens for writes computed based on compaction rate
out of L0, we now compute byte tokens based on how fast the system can
flush memtables into L0. The motivation is that writing to the memtable,
or creating memtables, faster than the system can flush them results in
write stalls due to memtable count, which create a latency hiccup for
all write traffic. We have observed write stalls that lasted > 100ms.

The approach taken here for flush tokens is straightforward (there is
justification based on experiments, mentioned in code comments):
- Measure and smooth the peak rate at which the flush loop can operate.
  This relies on the recently added pebble.InternalIntervalMetrics.
- Running at the peak rate implies 100% utilization of the single flush
  thread, which is potentially too high to prevent write stalls (depending
  on how long it takes to do a single flush). So we multiply the
  smoothed peak rate by a utilization-target-fraction which is
  dynamically adjusted and by default is constrained to the interval
  [0.5, 1.5]. There is additive increase and decrease of this
  fraction:
  - High usage of tokens and no write stalls cause an additive increase.
  - Write stalls cause an additive decrease. A small multiplier is used
    if there are multiple write stalls, so that the probing falls
    more in the region where there are no write stalls.

Note that this probing scheme cannot eliminate all write stalls. For
now we are ok with a reduction in write stalls.
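
The probing described above is an additive-increase/additive-decrease loop on the utilization target. A minimal sketch under assumed names and step sizes (the real logic lives in ioLoadListener and is more involved):

```go
package admission

const (
	minUtilFraction = 0.5   // default lower bound on the target fraction
	maxUtilFraction = 1.5   // default upper bound
	additiveStep    = 0.025 // assumed step; the actual constant may differ
)

// flushTokenState tracks the smoothed peak flush rate and the dynamically
// adjusted utilization-target-fraction described in the PR.
type flushTokenState struct {
	smoothedPeakFlushBytesPerSec float64
	utilTargetFraction           float64
}

// adjust probes: high token usage with no write stalls moves the target up
// additively; write stalls move it down, with a small multiplier when
// stalls repeat so the target settles in the stall-free region.
func (s *flushTokenState) adjust(highTokenUsage bool, writeStalls int) {
	switch {
	case writeStalls > 1:
		s.utilTargetFraction -= 2 * additiveStep // assumed multiplier of 2
	case writeStalls == 1:
		s.utilTargetFraction -= additiveStep
	case highTokenUsage:
		s.utilTargetFraction += additiveStep
	}
	if s.utilTargetFraction < minUtilFraction {
		s.utilTargetFraction = minUtilFraction
	}
	if s.utilTargetFraction > maxUtilFraction {
		s.utilTargetFraction = maxUtilFraction
	}
}

// flushTokens converts the smoothed peak rate into a byte budget for the
// 15s token computation interval.
func (s *flushTokenState) flushTokens() float64 {
	return s.smoothedPeakFlushBytesPerSec * s.utilTargetFraction * 15
}
```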

For convenience, and some additional justification mentioned in a code
comment, the scheme uses the minimum of the flush and compaction tokens
for writes to L0. This means that sstable ingestion into L0 is also
subject to such tokens. The periodic token computation continues to be
done at 15s intervals. However, instead of giving out these tokens at
1s intervals, we now give them out at 250ms intervals. This is to
reduce the burstiness, since that can cause write stalls.
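
A toy illustration of that cadence (assumed names, not the actual granter code): the budget computed every 15s, the minimum of flush and compaction tokens, is released in 250ms slices.

```go
package admission

import "time"

const (
	computeInterval = 15 * time.Second
	grantInterval   = 250 * time.Millisecond
)

// distributeTokens releases a 15s byte budget in 250ms slices (60 of them),
// so admitted work cannot burst through the whole interval's tokens at once.
func distributeTokens(intervalTokens int64, grant func(bytes int64)) {
	slices := int64(computeInterval / grantInterval) // 60 slices
	perSlice := intervalTokens / slices
	ticker := time.NewTicker(grantInterval)
	defer ticker.Stop()
	for i := int64(0); i < slices; i++ {
		<-ticker.C
		grant(perSlice)
	}
}
```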

There is a new metric, storage.write-stall-nanos, that measures the
cumulative duration of write stalls, since it gives a more intuitive
feel for how the system is behaving, compared to a write stall count.

The scheme can be disabled by raising the cluster setting
admission.min_flush_util_fraction, which defaults to 0.5 (corresponding
to the 0.5 lower bound mentioned earlier), to a high value, say
10.

The scheme was evaluated using a single node cluster with the node
having a high CPU count, such that CPU was not a bottleneck, even
with max compaction concurrency set to 8. A kv0 workload with high
concurrency and 4KB writes was used to overload the store. Due
to the high compaction concurrency, L0 stayed below the unhealthy
thresholds, and the resource bottleneck became the total bandwidth
provisioned for the disk. This setup was evaluated under both:
- early-life: when the store had 10-20GB of data and the compaction
  backlog was not very heavy, so there was less queueing for the
  limited disk bandwidth (though it was still usually saturated).
- later-life: when the store had around 150GB of data.

In both cases, turning off flush tokens increased the duration of
write stalls by > 5x. For the early-life case, ~750ms per second was
spent in a write stall with flush-tokens off. The later-life case had
~200ms per second of write stalls with flush-tokens off. The lower
value of the latter is paradoxically due to the worse bandwidth
saturation: fsync latency rose from 2-4ms with flush-tokens on, to
11-20ms with flush-tokens off. This increase imposed a natural
backpressure on writes due to the need to sync the WAL. In contrast
the fsync latency was low in the early-life case, though it did
increase from 0.125ms to 0.25ms when flush-tokens were turned off.

In both cases, the admission throughput did not increase when turning
off flush-tokens. That is, the system cannot sustain more throughput,
but by turning on flush tokens, we shift queueing from the disk layer to
the admission control layer (where we have the capability to reorder
work).

Screenshots for early-life: Flush tokens were turned off at 22:32:30. Prior to that the flush utilization-target-fraction was 0.625.
<img width="655" alt="Screen Shot 2022-06-03 at 6 35 14 PM" src="https://user-images.githubusercontent.com/54990988/171970564-ba833e1f-b6e2-4fcd-9ee2-25228341065c.png">
<img width="663" alt="Screen Shot 2022-06-03 at 6 35 28 PM" src="https://user-images.githubusercontent.com/54990988/171970574-13e6114a-2467-48e2-a238-3b01ea32a5d6.png">

Screenshots for later-life: Flush tokens were turned off at 22:03:20. Prior to that the flush utilization-target-fraction was 0.875.
<img width="665" alt="Screen Shot 2022-06-03 at 6 07 50 PM" src="https://user-images.githubusercontent.com/54990988/171970732-09b60827-7687-46de-964e-a9f97388c4fc.png">
<img width="658" alt="Screen Shot 2022-06-03 at 6 08 03 PM" src="https://user-images.githubusercontent.com/54990988/171970738-efe7a1fd-cbfd-450d-a3ac-06f681b1d190.png">

These results were produced by running
```
roachprod create -n 2 --clouds aws  --aws-machine-type=c5d.9xlarge --local-ssd=false --aws-ebs-volume-type=gp2 sumeer-io
roachprod put sumeer-io:1 cockroach ./cockroach
roachprod put sumeer-io:2 workload ./workload
roachprod start sumeer-io --env "COCKROACH_ROCKSDB_CONCURRENCY=8"
roachprod run sumeer-io:2 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=1024 --splits=1000 --duration=30m0s --read-percent=0 --min-block-bytes=4096 --max-block-bytes=4096 {pgurl:1-1}
```

Fixes #77357

Release note (ops change): Write tokens are now also limited based on
flush throughput, so as to reduce storage layer write stalls. This
behavior is enabled by default. The cluster setting
admission.min_flush_util_fraction, defaulting to 0.5, can be used to
disable or tune flush throughput based admission tokens, for writes
to a store. Setting to a value much greater than 1, say 10, will
disable flush based tokens.

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
craig bot closed this as completed in 6cad2d5 Jul 14, 2022