
db: add DB.ApplyNoSyncWait for asynchronous apply #2117

Merged · 1 commit merged into cockroachdb:master on Jan 11, 2023

Conversation

sumeerbhola
Collaborator

@sumeerbhola sumeerbhola commented Nov 9, 2022

ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues
the Batch to the WAL, adds it to the memtable, waits until the batch is
visible in the memtable, and then returns to the caller. The caller is
responsible for calling Batch.SyncWait to wait until the write to the
WAL is fsynced.
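
For orientation, a minimal sketch of how a caller might drive this pair of calls (the database path and error handling are illustrative; the contract is as described above):

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Path and options are illustrative.
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	b := db.NewBatch()
	if err := b.Set([]byte("k"), []byte("v"), nil); err != nil {
		log.Fatal(err)
	}

	// Enqueue to the WAL and apply to the memtable; returns once the batch
	// is visible, without waiting for the WAL fsync. Only valid with Sync.
	if err := db.ApplyNoSyncWait(b, pebble.Sync); err != nil {
		log.Fatal(err)
	}

	// Reads issued here can already observe the batch's writes.

	// The caller owns waiting for durability before closing the batch.
	if err := b.SyncWait(); err != nil {
		log.Fatal(err)
	}
	if err := b.Close(); err != nil {
		log.Fatal(err)
	}
}
```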

This change required splitting the WaitGroup in the Batch into two
WaitGroups, so that waiting for visibility can happen separately from
waiting for the WAL write. Additionally, the channel used as a semaphore
for reserving space in the two lock-free queues is split into two channels,
since dequeuing from these queues can happen in arbitrary order.
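
The semaphore split can be pictured with a small sketch of the channel-as-semaphore pattern (illustrative; this is not the commit pipeline code itself):

```go
// A buffered channel acting as a counting semaphore: sends reserve a slot,
// receives free one. Before this change a single channel guarded both
// lock-free queues; since entries can be dequeued from the two queues in
// arbitrary order, each queue now gets its own channel.
type semaphore chan struct{}

func newSemaphore(capacity int) semaphore { return make(semaphore, capacity) }

// reserve blocks until a slot is free, bounding how much can be queued.
func (s semaphore) reserve() { s <- struct{}{} }

// release frees a slot after the corresponding entry has been dequeued.
func (s semaphore) release() { <-s }
```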

Benchmarks indicate that the overhead of pushing and popping on an extra
channel is tolerable. The benchmarks were run on a MacBook Pro -- note these are
not doing an actual sync since they use io.Discard, and are only benchmarking
the commit pipeline.

Sync wait on master (old) vs this branch (new):

name                                               old time/op    new time/op    delta
CommitPipeline/no-sync-wait=false/parallel=1-10      1.09µs ± 6%    1.15µs ± 9%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10      1.53µs ± 4%    1.54µs ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10      1.54µs ± 1%    1.59µs ± 1%  +2.87%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10      1.52µs ± 1%    1.55µs ± 1%  +2.43%  (p=0.008 n=5+5)

name                                               old speed      new speed      delta
CommitPipeline/no-sync-wait=false/parallel=1-10    14.7MB/s ± 5%  13.9MB/s ±10%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    10.5MB/s ± 4%  10.4MB/s ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    10.4MB/s ± 1%  10.1MB/s ± 1%  -2.78%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    10.5MB/s ± 1%  10.3MB/s ± 1%  -2.35%  (p=0.008 n=5+5)

name                                               old alloc/op   new alloc/op   delta
CommitPipeline/no-sync-wait=false/parallel=1-10      1.37kB ± 0%    1.40kB ± 0%  +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10      1.37kB ± 0%    1.40kB ± 0%  +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10      1.37kB ± 0%    1.40kB ± 0%  +2.15%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10      1.37kB ± 0%    1.40kB ± 0%  +2.34%  (p=0.008 n=5+5)

name                                               old allocs/op  new allocs/op  delta
CommitPipeline/no-sync-wait=false/parallel=1-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=2-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=4-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=8-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)

Sync wait on this branch (old) vs async wait on this branch (new):

name                            old time/op    new time/op    delta
CommitPipeline/parallel=1-10      1.15µs ± 9%    1.20µs ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10      1.54µs ± 2%    1.59µs ± 2%   +3.50%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10      1.59µs ± 1%    1.58µs ± 1%     ~     (p=0.802 n=5+5)
CommitPipeline/parallel=8-10      1.55µs ± 1%    1.56µs ± 1%     ~     (p=0.452 n=5+5)

name                            old speed      new speed      delta
CommitPipeline/parallel=1-10    13.9MB/s ±10%  13.3MB/s ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10    10.4MB/s ± 2%  10.1MB/s ± 2%   -3.36%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10    10.1MB/s ± 1%  10.1MB/s ± 1%     ~     (p=0.786 n=5+5)
CommitPipeline/parallel=8-10    10.3MB/s ± 1%  10.3MB/s ± 1%     ~     (p=0.452 n=5+5)

name                            old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.651 n=5+5)
CommitPipeline/parallel=2-10      1.40kB ± 0%    1.39kB ± 0%   -0.21%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.706 n=5+5)
CommitPipeline/parallel=8-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.587 n=5+5)

name                            old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=2-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=4-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=8-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)

@cockroach-teamcity (Member) commented: This change is Reviewable

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @jbowens and @nvanbenschoten)


db.go line 738 at r1 (raw file):

// first calling Batch.SyncWait.
// RECOMMENDATION: Prefer using Apply unless you really understand why you
// need ApplyNoSyncWait.

@nvanbenschoten this is the interface.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: 0 of 8 files reviewed, 2 unresolved discussions (waiting on @jbowens and @nvanbenschoten)


commit.go line 275 at r1 (raw file):

	// TODO(sumeer): will block here if SyncConcurrency is insufficient. Monitor
	// and increase if needed.

@nvanbenschoten the potential bottleneck

@sumeerbhola sumeerbhola force-pushed the async_commit branch 2 times, most recently from 48c0da3 to b14cf8c on November 10, 2022 at 22:32
Member

@nvanbenschoten nvanbenschoten left a comment


Thanks for building this Sumeer. I've integrated it into a pair of different prototypes that extend from cockroachdb/cockroach#87050 (review). The first splits off a pool of raft log appender goroutines that use this API to avoid waiting in lockstep with one another. The second simply calls this new API directly from the existing raft state machine handler goroutine and pushes the callback handling onto a separate goroutine. Both have demonstrated the expected latency improvement and show a near-identical throughput-latency curve.
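
A rough shape of that second prototype, as a sketch only (the actual raft plumbing lives in CockroachDB; the onDurable callback is an illustrative stand-in for routing the response messages):

```go
// The raft handler goroutine applies the batch without waiting on the fsync
// and hands the durability wait to a worker goroutine.
func appendAndHandOff(db *pebble.DB, b *pebble.Batch, onDurable func(error)) error {
	// Returns once the batch is visible in the memtable.
	if err := db.ApplyNoSyncWait(b, pebble.Sync); err != nil {
		return err
	}
	go func() {
		// Wait for WAL durability off the state machine loop, then deliver
		// the result to whatever routes the responses.
		err := b.SyncWait()
		_ = b.Close()
		onDurable(err)
	}()
	return nil
}
```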

I'll be posting a set of results and recommendations on cockroachdb/cockroach#17500 either today or tomorrow. Regardless of the approach that we take in 23.1, I think we'll want to use this API.

Reviewed 5 of 8 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens and @sumeerbhola)


commit.go line 227 at r3 (raw file):

	// These lock-free queues have a fixed capacity. And since they are
	// lock-free, we cannot do blocking waits when pushing onto these queues,
	// when case they are full. Additionally, adding to these queues happens

s/when case/in case/


commit.go line 254 at r3 (raw file):

		// memtable and publishing. Then the LogWriter.flusher.syncQ could be
		// sized as 0.5*5 of the commitPipeling.pending queue. So we could size
		// these based on typical experimental results.

How do you imagine that such experiments would work? Do we have the observability in place to understand when a workload is being throttled by these bounded queue sizes? It doesn't look like it. Would you recommend that I run with a patched version of Pebble that detects when sends on either of these channels block? Or perhaps a blocking profile could be used to observe the same thing.

Related, should we make SyncConcurrency configurable so that a user of pebble who finds that their desired amount of concurrency causes these queues to reach their capacity can resolve this?
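
(One way to capture such a blocking profile with the standard library -- a sketch; where and how long the workload runs is illustrative:)

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Record every blocking event; a rate of 1 is expensive, so sample more
	// coarsely outside of experiments.
	runtime.SetBlockProfileRate(1)

	// ... drive the write workload against Pebble here ...

	// Dump the block profile; sends that blocked on a full semaphore channel
	// show up attributed to the commit pipeline.
	_ = pprof.Lookup("block").WriteTo(os.Stdout, 1)
}
```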


db.go line 736 at r3 (raw file):

The caller must not Close the batch without first calling Batch.SyncWait.

Does this contract cause the Batch to hold on to resources for longer than strictly necessary?

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Both have demonstrated the expected latency improvement and show a near-identical throughput-latency curve.

Excellent!

Regardless of the approach that we take in 23.1, I think we'll want to use this API.

I'll need to microbenchmark the change here. Worst case, we could time things such that the change here, plus the corresponding CockroachDB changes for async raft log writes, land in the same release so that any microbenchmarking loss here is made up for with the other benefits.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens and @nvanbenschoten)


commit.go line 254 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How do you imagine that such experiments would work? Do we have the observability in place to understand when a workload is being throttled by these bounded queue sizes? It doesn't look like it. Would you recommend that I run with a patched version of Pebble that detects when sends on either of these channels block? Or perhaps a blocking profile could be used to observe the same thing.

Related, should we make SyncConcurrency configurable so that a user of pebble who finds that their desired amount of concurrency causes these queues to reach their capacity can resolve this?

We have some LogWriterMetrics in pebble/record/log_writer.go (lines 735 to 736 at b9289d7):

    PendingBufferLen base.GaugeSampleMetric
    SyncQueueLen     base.GaugeSampleMetric

These are exposed in pebble's Metrics via record.LogWriterMetrics. One could export them as metrics from kvserver.StoreMetrics.updateEngineMetrics. They are cumulative samples of the queue length, so one needs to take a delta and then compute the mean -- we can only tell if the mean is high (see the sketch below). We could also count in LogWriter whenever one of these queues is full -- it knows it is full when dequeuing.
Which reminds me that the maximum number of pending blocks (CapAllocatedBlocks) can also be a bottleneck (that is what the pending metric measures -- not commitPipeline.pending).

I am hopeful we don't need to make these sizes configurable, once we measure. We have arrays with these sizes and I am wary about changing them to slices.
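
The delta-then-mean computation might look like this (a sketch; the struct merely models a cumulative gauge sample and its field names are placeholders, not the actual base.GaugeSampleMetric accessors):

```go
// gaugeSample models a cumulative gauge sample: the running sum of sampled
// queue lengths plus the number of samples taken.
type gaugeSample struct {
	SampleSum   int64
	SampleCount int64
}

// meanQueueLen computes the mean queue length over the interval between two
// cumulative snapshots, e.g. successive scrapes of pebble's Metrics.
func meanQueueLen(prev, cur gaugeSample) float64 {
	dCount := cur.SampleCount - prev.SampleCount
	if dCount <= 0 {
		return 0
	}
	return float64(cur.SampleSum-prev.SampleSum) / float64(dCount)
}
```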


db.go line 736 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The caller must not Close the batch without first calling Batch.SyncWait.

Does this contract cause the Batch to hold on to resources for longer than strictly necessary?

Yes. It was the minimum viable thing to do. Do you think there will be enough bytes in the "pipeline" that this will be an issue? One could separate out the 2 fields in Batch that are needed for the LogWriter and discard the rest.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 6, 2023
Fixes cockroachdb#17500.
Waiting on github.com/cockroachdb/pebble/pull/2117.

This commit integrates with the `AsyncStorageWrites` functionality that
we added to Raft in github.com/etcd-io/raft/pull/8.

## Approach

The commit makes the minimal changes needed to integrate with async
storage writes and pull fsyncs out of the raft state machine loop. It
does not make an effort to extract the non-durable portion of raft log
writes or raft log application onto separate goroutine pools, as was
described in cockroachdb#17500. Those changes will also be impactful, but they're
non trivial and bump into a pipelining vs. batching trade-off, so they
are left as future work items (TODO(nvanbenschoten): open new issues).

With this change, asynchronous Raft log syncs are enabled by the new
`DB.ApplyNoSyncWait` Pebble API introduced in github.com/cockroachdb/pebble/pull/2117.
The `handleRaftReady` state machine loop continues to initiate Raft log
writes, but it uses the Pebble API to offload waiting on durability to a
separate goroutine. This separate goroutine then sends the corresponding
`MsgStorageAppend`'s response messages where they need to go (locally
and/or to the Raft leader) when the fsync completes. The async storage
writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft
proposals. As a result, it reduces end-to-end commit latency.

github.com/etcd-io/raft/pull/8 presented a collection of benchmark
results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average
and tail latency. However, it doesn't provide the throughput
improvements at the top end because log appends and state machine
application have not yet been extracted into separate goroutine pools,
which would facilitate increased opportunity for batching.

TODO: add images

----

Release note (performance improvement): The Raft proposal pipeline
has been optimized to reduce interference between Raft proposals.
This improves average and tail write latency at high concurrency.
@sumeerbhola sumeerbhola changed the title from "[DNM] db: add DB.ApplyNoSyncWait for asynchronous apply" to "db: add DB.ApplyNoSyncWait for asynchronous apply" on Jan 9, 2023
@sumeerbhola sumeerbhola force-pushed the async_commit branch 2 times, most recently from 46e6408 to 35e19bf on January 9, 2023 at 19:45
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens and @nvanbenschoten)


commit.go line 275 at r1 (raw file):

Previously, sumeerbhola wrote…

@nvanbenschoten the potential bottleneck

removed this TODO comment since the earlier comment subsumes it.


commit.go line 227 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/when case/in case/

Done


commit.go line 254 at r3 (raw file):

Previously, sumeerbhola wrote…


Updated the code comment.


db.go line 736 at r3 (raw file):

Previously, sumeerbhola wrote…

Yes. It was the minimum viable thing to do. Do you think there will be enough bytes in the "pipeline" that this will be an issue? One could separate out the 2 fields in Batch that are needed for the LogWriter and discard the rest.

I've added a TODO in batch.go to optimize the memory in the future, if it becomes an issue. I am not eager to do it now since having another struct is likely to cause some cpu performance regression.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Added microbenchmark. Looks ok to me.
And improved some testing.

This is ready for a proper review.

Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens, @nicktrav, and @nvanbenschoten)

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm: thanks!

Someone from Storage should also give this a pass.

Reviewed 2 of 8 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jbowens, @nicktrav, and @sumeerbhola)


-- commits line 24 at r5:
5 iterations per variant is the minimum that benchstat will consider to be statistically significant, but it's still not very many and may be hiding smaller changes in performance. Consider increasing the iteration count. Or at least do so in the future.

Also, in case you aren't familiar, benchdiff makes this kind of microbenchmark comparison quite easy to run. This would just be (when run from this branch):

benchdiff --count=20 --run='CommitPipeline/(1|2|4|8)' .

Collaborator

@jbowens jbowens left a comment


:lgtm:

Reviewed 2 of 8 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


batch.go line 270 at r5 (raw file):

	// into a separate struct that is allocated separately (using another
	// sync.Pool), and only that struct needs to outlive Batch.Close (which
	// could then be called immediately after ApplyNoSncWait).

nit: s/ApplyNoSncWait/ApplyNoSyncWait/


commit.go line 418 at r5 (raw file):

	// commitCount represents the waiting needed for publish, and optionally the
	// waiting needed for the WAL sync.
	b.commit.Add(commitCount)

nit: maybe now it'd read more clearly to use an explicit constant in each .Add call up above. something like

switch {
case !syncWAL:
    // Only need to wait for the publish.
    b.commit.Add(1)
case syncWAL && noSyncWait:
    syncErr = &b.commitErr
    syncWG = &b.fsyncWait
    // Only need to wait synchronously for the publish; Asynchronously the user
    // will wait on the batch's fsyncWait.
    b.commit.Add(1)
    b.fsyncWait.Add(1)
case syncWAL && !noSyncWait:
    syncErr = &b.commitErr
    syncWG = &b.commit
    // Must wait for both the publish and the synchronous WAL fsync.
    b.commit.Add(2)
}

?

@sumeerbhola sumeerbhola force-pushed the async_commit branch 2 times, most recently from 16be9da to 7c85119 on January 10, 2023 at 20:02
Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTRs!

Reviewable status: 5 of 8 files reviewed, 1 unresolved discussion (waiting on @jbowens, @nicktrav, and @nvanbenschoten)


-- commits line 24 at r5:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

5 iterations per variant is the minimum that benchstat will consider to be statistically significant, but it's still not very many and may be hiding smaller changes in performance. Consider increasing the iteration count. Or at least do so in the future.

Also, in case you aren't familiar, benchdiff makes this kind of microbenchmark comparison quite easy to run. This would just be (when run from this branch):

benchdiff --count=20 --run='CommitPipeline/(1|2|4|8)' .

Thanks for the benchdiff pointer.

I increased to 20, and ran it on gceworker. Oddly, this branch comes out ahead of master. I'm going to interpret this as no difference.

synchronous: old is master and new is this branch

name                          old time/op    new time/op    delta
CommitPipeline/parallel=1-16    1.45µs ± 3%    1.43µs ± 2%   -1.76%  (p=0.000 n=20+19)
CommitPipeline/parallel=2-16    1.39µs ± 1%    1.37µs ± 4%   -2.10%  (p=0.000 n=20+20)
CommitPipeline/parallel=4-16    1.38µs ± 2%    1.34µs ± 3%   -2.95%  (p=0.000 n=20+20)
CommitPipeline/parallel=8-16    1.52µs ±27%    1.34µs ± 4%  -11.57%  (p=0.000 n=20+20)

name                          old speed      new speed      delta
CommitPipeline/parallel=1-16  11.0MB/s ± 3%  11.2MB/s ± 2%   +1.78%  (p=0.000 n=20+19)
CommitPipeline/parallel=2-16  11.5MB/s ± 1%  11.7MB/s ± 4%   +2.14%  (p=0.000 n=20+20)
CommitPipeline/parallel=4-16  11.6MB/s ± 2%  12.0MB/s ± 3%   +3.05%  (p=0.000 n=20+20)
CommitPipeline/parallel=8-16  10.7MB/s ±22%  11.9MB/s ± 4%  +11.31%  (p=0.000 n=20+20)

name                          old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-16    1.37kB ± 0%    1.40kB ± 0%   +2.29%  (p=0.000 n=20+20)
CommitPipeline/parallel=2-16    1.37kB ± 0%    1.40kB ± 0%   +2.31%  (p=0.000 n=20+20)
CommitPipeline/parallel=4-16    1.37kB ± 0%    1.40kB ± 0%   +2.27%  (p=0.000 n=20+20)
CommitPipeline/parallel=8-16    1.37kB ± 0%    1.40kB ± 0%   +2.30%  (p=0.000 n=20+20)

name                          old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-16      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=2-16      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=4-16      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=8-16      2.00 ± 0%      2.00 ± 0%     ~     (all equal)

old is branch with synchronous and new is async:

name                          old time/op    new time/op    delta
CommitPipeline/parallel=1-16    1.43µs ± 2%    1.42µs ± 3%  -0.80%  (p=0.029 n=19+19)
CommitPipeline/parallel=2-16    1.37µs ± 4%    1.37µs ± 2%    ~     (p=0.545 n=20+19)
CommitPipeline/parallel=4-16    1.34µs ± 3%    1.36µs ± 3%  +1.37%  (p=0.008 n=20+19)
CommitPipeline/parallel=8-16    1.34µs ± 4%    1.36µs ± 3%  +1.10%  (p=0.018 n=20+20)

name                          old speed      new speed      delta
CommitPipeline/parallel=1-16  11.2MB/s ± 2%  11.3MB/s ± 2%  +0.81%  (p=0.022 n=19+19)
CommitPipeline/parallel=2-16  11.7MB/s ± 4%  11.7MB/s ± 2%    ~     (p=0.545 n=20+19)
CommitPipeline/parallel=4-16  12.0MB/s ± 3%  11.8MB/s ± 3%  -1.37%  (p=0.009 n=20+19)
CommitPipeline/parallel=8-16  11.9MB/s ± 4%  11.8MB/s ± 3%  -1.09%  (p=0.018 n=20+20)

name                          old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-16    1.40kB ± 0%    1.40kB ± 0%    ~     (p=0.085 n=20+20)
CommitPipeline/parallel=2-16    1.40kB ± 0%    1.40kB ± 0%    ~     (p=0.076 n=20+20)
CommitPipeline/parallel=4-16    1.40kB ± 0%    1.40kB ± 0%    ~     (p=0.233 n=20+20)
CommitPipeline/parallel=8-16    1.40kB ± 0%    1.40kB ± 0%    ~     (p=0.228 n=20+20)

name                          old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-16      2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/parallel=2-16      2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/parallel=4-16      2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/parallel=8-16      2.00 ± 0%      2.00 ± 0%    ~     (all equal)

batch.go line 270 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/ApplyNoSncWait/ApplyNoSyncWait/

Done


commit.go line 418 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: maybe now it'd read more clearly to use an explicit constant in each .Add call up above (see the switch suggestion above).

Done

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: 5 of 10 files reviewed, 2 unresolved discussions (waiting on @jbowens, @nicktrav, and @nvanbenschoten)


internal/metamorphic/ops.go line 76 at r6 (raw file):

	w := t.getWriter(o.writerID)
	var err error
	if o.writerID.tag() == dbTag && t.testOpts.asyncApplyToDB && t.writeOpts.Sync {

added metamorphic test support

@sumeerbhola sumeerbhola merged commit 06fbe6a into cockroachdb:master Jan 11, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 25, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 26, 2023
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Jan 26, 2023
95862: storage: add CommitNoSyncWait and SyncWait to Batch interface r=sumeerbhola a=nvanbenschoten

Extracted from #94165.

Picks up cockroachdb/pebble/pull/2117.

Release note: None
Epic: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 31, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 2, 2023
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Feb 3, 2023
94165: kv: integrate raft async storage writes r=nvanbenschoten a=nvanbenschoten

Fixes #17500.
Epic: CRDB-22644

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in etcd-io/raft/pull/8. 

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in #17500. Those changes will also be impactful, but they're non trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items. See #94853 and #94854.

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in cockroachdb/pebble/pull/2117.  The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate an increased opportunity for batching.

To visualize the impact on latency, consider the following test. The experiment uses a 3-node GCP cluster with n2-standard-32 instances spread across three availability zones. It runs kv0 (write-only) against the cluster with 64-byte values. It then ramps up concurrency to compare throughput vs. average and tail latency.

_NOTE: log scales on x and y axes_

![Throughput vs  average latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210719-bec842f6-1093-48cd-8be7-05a3d79c2a71.svg)

![Throughput vs  tail latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210777-670a4d25-9516-41a2-b7e7-86b402004536.svg)

Async storage writes impacts latency by different amounts at different throughputs, ranging from an improvement of 20% to 40% when the system is "well utilized". However, it increases latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.

| Throughput (B/s) | Throughput (qps) | Avg. Latency Δ | p99 Latency Δ |
| ---------------- | ---------------- | -------------- | ------------- |
| 63  KB/s         | 1,000            | -10.5%         | -8.8%         |
| 125 KB/s         | 2,000            | -7.1%          | -10.4%        |
| 250 KB/s         | 4,000            | -20%           | -11.2%        |
| 500 KB/s         | 8,000            | -16.6%         | -25.3%        |
| 1 MB/s           | 16,000           | -30.8%         | -44.0%        |
| 2 MB/s           | 32,000           | -38.2%         | -30.9%        |
| 4 MB/s           | 64,000           | +5.9%          | +9.4%         |

### Other benchmark results
```bash
name                   old ops/s    new ops/s    delta
# 50% read, 50% update
ycsb/A/nodes=3          16.0k ± 5%   16.2k ± 4%     ~     (p=0.353 n=10+10)
ycsb/A/nodes=3/cpu=32   28.7k ± 5%   33.8k ± 2%  +17.57%  (p=0.000 n=9+9)
# 95% read, 5% update
ycsb/B/nodes=3          29.9k ± 3%   30.2k ± 3%     ~     (p=0.278 n=9+10)
ycsb/B/nodes=3/cpu=32    101k ± 1%    100k ± 3%     ~     (p=0.274 n=8+10)
# 100% read
ycsb/C/nodes=3          40.4k ± 3%   40.0k ± 3%     ~     (p=0.190 n=10+10)
ycsb/C/nodes=3/cpu=32    135k ± 1%    137k ± 1%   +0.87%  (p=0.011 n=9+9)
# 95% read, 5% insert
ycsb/D/nodes=3          33.6k ± 3%   33.8k ± 3%     ~     (p=0.315 n=10+10)
ycsb/D/nodes=3/cpu=32    108k ± 1%    106k ± 6%     ~     (p=0.739 n=10+10)
# 95% scan, 5% insert
ycsb/E/nodes=3          3.79k ± 1%   3.73k ± 1%   -1.42%  (p=0.000 n=9+9)
ycsb/E/nodes=3/cpu=32   6.31k ± 5%   6.48k ± 6%     ~     (p=0.123 n=10+10)
# 50% read, 50% read-modify-write
ycsb/F/nodes=3          7.68k ± 2%   7.99k ± 2%   +4.11%  (p=0.000 n=10+10)
ycsb/F/nodes=3/cpu=32   15.6k ± 4%   18.1k ± 3%  +16.14%  (p=0.000 n=8+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.01 ± 5%    5.95 ± 4%     ~     (p=0.460 n=10+10)
ycsb/A/nodes=3/cpu=32    5.01 ± 4%    4.25 ± 4%  -15.19%  (p=0.000 n=9+10)
ycsb/B/nodes=3           4.80 ± 0%    4.77 ± 4%     ~     (p=0.586 n=7+10)
ycsb/B/nodes=3/cpu=32    1.90 ± 0%    1.90 ± 0%     ~     (all equal)
ycsb/C/nodes=3           3.56 ± 2%    3.61 ± 3%     ~     (p=0.180 n=10+10)
ycsb/C/nodes=3/cpu=32    1.40 ± 0%    1.40 ± 0%     ~     (all equal)
ycsb/D/nodes=3           2.87 ± 2%    2.85 ± 2%     ~     (p=0.650 n=10+10)
ycsb/D/nodes=3/cpu=32    1.30 ± 0%    1.34 ± 4%     ~     (p=0.087 n=10+10)
ycsb/E/nodes=3           25.3 ± 0%    25.7 ± 1%   +1.38%  (p=0.000 n=8+8)
ycsb/E/nodes=3/cpu=32    22.9 ± 5%    22.2 ± 6%     ~     (p=0.109 n=10+10)
ycsb/F/nodes=3           12.5 ± 2%    12.0 ± 1%   -3.72%  (p=0.000 n=10+9)
ycsb/F/nodes=3/cpu=32    9.27 ± 4%    7.98 ± 3%  -13.96%  (p=0.000 n=8+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           45.7 ±15%    35.7 ± 6%  -21.90%  (p=0.000 n=10+8)
ycsb/A/nodes=3/cpu=32    67.6 ±13%    55.3 ± 5%  -18.10%  (p=0.000 n=9+10)
ycsb/B/nodes=3           30.5 ±24%    29.4 ± 7%     ~     (p=0.589 n=10+10)
ycsb/B/nodes=3/cpu=32    12.8 ± 2%    13.3 ± 7%     ~     (p=0.052 n=10+8)
ycsb/C/nodes=3           14.0 ± 3%    14.2 ± 0%     ~     (p=0.294 n=10+8)
ycsb/C/nodes=3/cpu=32    5.80 ± 0%    5.70 ± 5%     ~     (p=0.233 n=7+10)
ycsb/D/nodes=3           12.4 ± 2%    11.7 ± 3%   -5.32%  (p=0.001 n=10+10)
ycsb/D/nodes=3/cpu=32    6.30 ± 0%    5.96 ± 6%   -5.40%  (p=0.001 n=10+10)
ycsb/E/nodes=3           81.0 ± 4%    83.9 ± 0%   +3.63%  (p=0.012 n=10+7)
ycsb/E/nodes=3/cpu=32     139 ±19%     119 ±12%  -14.46%  (p=0.021 n=10+10)
ycsb/F/nodes=3            122 ±17%     103 ±10%  -15.48%  (p=0.002 n=10+8)
ycsb/F/nodes=3/cpu=32     146 ±20%     133 ± 7%   -8.89%  (p=0.025 n=10+10)
```

The way to interpret these results is that async raft storage writes reduce latency and, because of the closed-loop nature of the workload, also increase throughput for the YCSB variants that perform writes and aren't already CPU saturated. Variants that are read-only are unaffected. Variants that are CPU-saturated do not benefit from the change because they are already bottlenecked on CPU resources and cannot push any more load (see above).

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.

96458: sql: fixes statement contention count metric r=j82w a=j82w

Fixes a bug introduced in #94750 where the metric was counting transactions
that hit contention events instead of counting statements.

closes: #96429

Release note: none

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>