
db: add DB.ApplyNoSyncWait for asynchronous apply #2117

Merged · 1 commit merged into cockroachdb:master on Jan 11, 2023

Conversation

sumeerbhola
Collaborator

@sumeerbhola sumeerbhola commented Nov 9, 2022

ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues
the Batch to the WAL, adds it to the memtable, waits until the batch is
visible in the memtable, and then returns to the caller. The caller is
responsible for calling Batch.SyncWait to wait until the write to the
WAL is fsynced.
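
For orientation, a minimal sketch of how a caller might drive this pair of calls (the database path and error handling are illustrative; the contract is as described above):

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Path and options are illustrative.
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	b := db.NewBatch()
	if err := b.Set([]byte("k"), []byte("v"), nil); err != nil {
		log.Fatal(err)
	}

	// Enqueue to the WAL and apply to the memtable; returns once the batch
	// is visible, without waiting for the WAL fsync. Only valid with Sync.
	if err := db.ApplyNoSyncWait(b, pebble.Sync); err != nil {
		log.Fatal(err)
	}

	// Reads issued here can already observe the batch's writes.

	// The caller owns waiting for durability before closing the batch.
	if err := b.SyncWait(); err != nil {
		log.Fatal(err)
	}
	if err := b.Close(); err != nil {
		log.Fatal(err)
	}
}
```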

This change required splitting the WaitGroup in the Batch into two
WaitGroups, so that waiting for visibility can happen separately from
waiting for the WAL write. Additionally, the channel used as a semaphore
for reserving space in the two lock-free queues is split into two channels,
since dequeuing from these queues can happen in arbitrary order.
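
The semaphore split can be pictured with a small sketch of the channel-as-semaphore pattern (illustrative; this is not the commit pipeline code itself):

```go
// A buffered channel acting as a counting semaphore: sends reserve a slot,
// receives free one. Before this change a single channel guarded both
// lock-free queues; since entries can be dequeued from the two queues in
// arbitrary order, each queue now gets its own channel.
type semaphore chan struct{}

func newSemaphore(capacity int) semaphore { return make(semaphore, capacity) }

// reserve blocks until a slot is free, bounding how much can be queued.
func (s semaphore) reserve() { s <- struct{}{} }

// release frees a slot after the corresponding entry has been dequeued.
func (s semaphore) release() { <-s }
```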

Benchmarks indicate that the overhead of pushing and popping on an extra
channel is tolerable. The benchmarks were run on a MacBook Pro -- note these are
not doing an actual sync since they use io.Discard, and are only benchmarking
the commit pipeline.

Sync wait on master (old) vs this branch (new):

name                                               old time/op    new time/op    delta
CommitPipeline/no-sync-wait=false/parallel=1-10      1.09µs ± 6%    1.15µs ± 9%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10      1.53µs ± 4%    1.54µs ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10      1.54µs ± 1%    1.59µs ± 1%  +2.87%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10      1.52µs ± 1%    1.55µs ± 1%  +2.43%  (p=0.008 n=5+5)

name                                               old speed      new speed      delta
CommitPipeline/no-sync-wait=false/parallel=1-10    14.7MB/s ± 5%  13.9MB/s ±10%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    10.5MB/s ± 4%  10.4MB/s ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    10.4MB/s ± 1%  10.1MB/s ± 1%  -2.78%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    10.5MB/s ± 1%  10.3MB/s ± 1%  -2.35%  (p=0.008 n=5+5)

name                                               old alloc/op   new alloc/op   delta
CommitPipeline/no-sync-wait=false/parallel=1-10      1.37kB ± 0%    1.40kB ± 0%  +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10      1.37kB ± 0%    1.40kB ± 0%  +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10      1.37kB ± 0%    1.40kB ± 0%  +2.15%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10      1.37kB ± 0%    1.40kB ± 0%  +2.34%  (p=0.008 n=5+5)

name                                               old allocs/op  new allocs/op  delta
CommitPipeline/no-sync-wait=false/parallel=1-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=2-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=4-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=8-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)

Sync wait on this branch (old) vs async wait on this branch (new):

name                            old time/op    new time/op    delta
CommitPipeline/parallel=1-10      1.15µs ± 9%    1.20µs ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10      1.54µs ± 2%    1.59µs ± 2%   +3.50%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10      1.59µs ± 1%    1.58µs ± 1%     ~     (p=0.802 n=5+5)
CommitPipeline/parallel=8-10      1.55µs ± 1%    1.56µs ± 1%     ~     (p=0.452 n=5+5)

name                            old speed      new speed      delta
CommitPipeline/parallel=1-10    13.9MB/s ±10%  13.3MB/s ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10    10.4MB/s ± 2%  10.1MB/s ± 2%   -3.36%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10    10.1MB/s ± 1%  10.1MB/s ± 1%     ~     (p=0.786 n=5+5)
CommitPipeline/parallel=8-10    10.3MB/s ± 1%  10.3MB/s ± 1%     ~     (p=0.452 n=5+5)

name                            old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.651 n=5+5)
CommitPipeline/parallel=2-10      1.40kB ± 0%    1.39kB ± 0%   -0.21%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.706 n=5+5)
CommitPipeline/parallel=8-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.587 n=5+5)

name                            old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=2-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=4-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=8-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)

@cockroach-teamcity (Member) commented: This change is Reviewable

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @jbowens and @nvanbenschoten)


db.go line 738 at r1 (raw file):

// first calling Batch.SyncWait.
// RECOMMENDATION: Prefer using Apply unless you really understand why you
// need ApplyNoSyncWait.

@nvanbenschoten this is the interface.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: 0 of 8 files reviewed, 2 unresolved discussions (waiting on @jbowens and @nvanbenschoten)


commit.go line 275 at r1 (raw file):

	// TODO(sumeer): will block here if SyncConcurrency is insufficient. Monitor
	// and increase if needed.

@nvanbenschoten the potential bottleneck

@sumeerbhola sumeerbhola force-pushed the async_commit branch 2 times, most recently from 48c0da3 to b14cf8c on November 10, 2022 at 22:32
Member

@nvanbenschoten nvanbenschoten left a comment


Thanks for building this Sumeer. I've integrated it into a pair of different prototypes that extend from cockroachdb/cockroach#87050 (review). The first splits off a pool of raft log appender goroutines that use this API to avoid waiting in lockstep with one another. The second simply calls this new API directly from the existing raft state machine handler goroutine and pushes the callback handling onto a separate goroutine. Both have demonstrated the expected latency improvement and show a near-identical throughput-latency curve.
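
A rough shape of that second prototype, as a sketch only (the actual raft plumbing lives in CockroachDB; the onDurable callback is an illustrative stand-in for routing the response messages):

```go
// The raft handler goroutine applies the batch without waiting on the fsync
// and hands the durability wait to a worker goroutine.
func appendAndHandOff(db *pebble.DB, b *pebble.Batch, onDurable func(error)) error {
	// Returns once the batch is visible in the memtable.
	if err := db.ApplyNoSyncWait(b, pebble.Sync); err != nil {
		return err
	}
	go func() {
		// Wait for WAL durability off the state machine loop, then deliver
		// the result to whatever routes the responses.
		err := b.SyncWait()
		_ = b.Close()
		onDurable(err)
	}()
	return nil
}
```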

I'll be posting a set of results and recommendations on cockroachdb/cockroach#17500 either today or tomorrow. Regardless of the approach that we take in 23.1, I think we'll want to use this API.

Reviewed 5 of 8 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens and @sumeerbhola)


commit.go line 227 at r3 (raw file):

	// These lock-free queues have a fixed capacity. And since they are
	// lock-free, we cannot do blocking waits when pushing onto these queues,
	// when case they are full. Additionally, adding to these queues happens

s/when case/in case/


commit.go line 254 at r3 (raw file):

		// memtable and publishing. Then the LogWriter.flusher.syncQ could be
		// sized as 0.5*5 of the commitPipeling.pending queue. So we could size
		// these based on typical experimental results.

How do you imagine that such experiments would work? Do we have the observability in place to understand when a workload is being throttled by these bounded queue sizes? It doesn't look like it. Would you recommend that I run with a patched version of Pebble that detects when sends on either of these channels block? Or perhaps a blocking profile could be used to observe the same thing.

Related, should we make SyncConcurrency configurable so that a user of pebble who finds that their desired amount of concurrency causes these queues to reach their capacity can resolve this?
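
(One way to capture such a blocking profile with the standard library -- a sketch; where and how long the workload runs is illustrative:)

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Record every blocking event; a rate of 1 is expensive, so sample more
	// coarsely outside of experiments.
	runtime.SetBlockProfileRate(1)

	// ... drive the write workload against Pebble here ...

	// Dump the block profile; sends that blocked on a full semaphore channel
	// show up attributed to the commit pipeline.
	_ = pprof.Lookup("block").WriteTo(os.Stdout, 1)
}
```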


db.go line 736 at r3 (raw file):

The caller must not Close the batch without first calling Batch.SyncWait.

Does this contract cause the Batch to hold on to resources for longer than strictly necessary?

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Both have demonstrated the expected latency improvement and show a near-identical throughput-latency curve.

Excellent!

Regardless of the approach that we take in 23.1, I think we'll want to use this API.

I'll need to microbenchmark the change here. Worst case, we could time things such that the change here, plus the corresponding CockroachDB changes for async raft log writes, land in the same release so that any microbenchmarking loss here is made up for with the other benefits.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens and @nvanbenschoten)


commit.go line 254 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How do you imagine that such experiments would work? Do we have the observability in place to understand when a workload is being throttled by these bounded queue sizes? It doesn't look like it. Would you recommend that I run with a patched version of Pebble that detects when sends on either of these channels block? Or perhaps a blocking profile could be used to observe the same thing.

Related, should we make SyncConcurrency configurable so that a user of pebble who finds that their desired amount of concurrency causes these queues to reach their capacity can resolve this?

We have some LogWriterMetrics in pebble/record/log_writer.go (lines 735 to 736 at b9289d7):

    PendingBufferLen base.GaugeSampleMetric
    SyncQueueLen     base.GaugeSampleMetric

These are exposed in pebble's Metrics via record.LogWriterMetrics. One could export them as metrics from kvserver.StoreMetrics.updateEngineMetrics. They are cumulative samples of the queue length, so one needs to take a delta and then compute the mean -- we can only tell if the mean is high (see the sketch below). We could also count in LogWriter whenever one of these queues is full -- it knows it is full when dequeuing.
Which reminds me that the maximum number of pending blocks (CapAllocatedBlocks) can also be a bottleneck (that is what the pending metric measures -- not commitPipeline.pending).

I am hopeful we don't need to make these sizes configurable, once we measure. We have arrays with these sizes and I am wary about changing them to slices.
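
The delta-then-mean computation might look like this (a sketch; the struct merely models a cumulative gauge sample and its field names are placeholders, not the actual base.GaugeSampleMetric accessors):

```go
// gaugeSample models a cumulative gauge sample: the running sum of sampled
// queue lengths plus the number of samples taken.
type gaugeSample struct {
	SampleSum   int64
	SampleCount int64
}

// meanQueueLen computes the mean queue length over the interval between two
// cumulative snapshots, e.g. successive scrapes of pebble's Metrics.
func meanQueueLen(prev, cur gaugeSample) float64 {
	dCount := cur.SampleCount - prev.SampleCount
	if dCount <= 0 {
		return 0
	}
	return float64(cur.SampleSum-prev.SampleSum) / float64(dCount)
}
```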


db.go line 736 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The caller must not Close the batch without first calling Batch.SyncWait.

Does this contract cause the Batch to hold on to resources for longer than strictly necessary?

Yes. It was the minimum viable thing to do. Do you think there will be enough bytes in the "pipeline" that this will be an issue? One could separate out the 2 fields in Batch that are needed for the LogWriter and discard the rest.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 6, 2023
Fixes cockroachdb#17500.
Waiting on github.com/cockroachdb/pebble/pull/2117.

This commit integrates with the `AsyncStorageWrites` functionality that
we added to Raft in github.com/etcd-io/raft/pull/8.

## Approach

The commit makes the minimal changes needed to integrate with async
storage writes and pull fsyncs out of the raft state machine loop. It
does not make an effort to extract the non-durable portion of raft log
writes or raft log application onto separate goroutine pools, as was
described in cockroachdb#17500. Those changes will also be impactful, but they're
non trivial and bump into a pipelining vs. batching trade-off, so they
are left as future work items (TODO(nvanbenschoten): open new issues).

With this change, asynchronous Raft log syncs are enabled by the new
`DB.ApplyNoSyncWait` Pebble API introduced in github.com/cockroachdb/pebble/pull/2117.
The `handleRaftReady` state machine loop continues to initiate Raft log
writes, but it uses the Pebble API to offload waiting on durability to a
separate goroutine. This separate goroutine then sends the corresponding
`MsgStorageAppend`'s response messages where they need to go (locally
and/or to the Raft leader) when the fsync completes. The async storage
writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft
proposals. As a result, it reduces end-to-end commit latency.

github.com/etcd-io/raft/pull/8 presented a collection of benchmark
results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average
and tail latency. However, it doesn't provide the throughput
improvements at the top end because log appends and state machine
application have not yet been extracted into separate goroutine pools,
which would facilitate increased opportunity for batching.

TODO: add images

----

Release note (performance improvement): The Raft proposal pipeline
has been optimized to reduce interference between Raft proposals.
This improves average and tail write latency at high concurrency.
@sumeerbhola sumeerbhola changed the title from "[DNM] db: add DB.ApplyNoSyncWait for asynchronous apply" to "db: add DB.ApplyNoSyncWait for asynchronous apply" on Jan 9, 2023
@sumeerbhola sumeerbhola force-pushed the async_commit branch 2 times, most recently from 46e6408 to 35e19bf on January 9, 2023 at 19:45
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens and @nvanbenschoten)


commit.go line 275 at r1 (raw file):

Previously, sumeerbhola wrote…

@nvanbenschoten the potential bottleneck

removed this TODO comment since the earlier comment subsumes it.


commit.go line 227 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/when case/in case/

Done


commit.go line 254 at r3 (raw file):

Previously, sumeerbhola wrote…


Updated the code comment.


db.go line 736 at r3 (raw file):

Previously, sumeerbhola wrote…

Yes. It was the minimum viable thing to do. Do you think there will be enough bytes in the "pipeline" that this will be an issue? One could separate out the 2 fields in Batch that are needed for the LogWriter and discard the rest.

I've added a TODO in batch.go to optimize the memory in the future, if it becomes an issue. I am not eager to do it now since having another struct is likely to cause some cpu performance regression.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Added microbenchmark. Looks ok to me.
And improved some testing.

This is ready for a proper review.

Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens, @nicktrav, and @nvanbenschoten)

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm: thanks!

Someone from Storage should also give this a pass.

Reviewed 2 of 8 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jbowens, @nicktrav, and @sumeerbhola)


-- commits line 24 at r5:
5 iterations per variant is the minimum that benchstat will consider to be statistically significant, but it's still not very many and may be hiding smaller changes in performance. Consider increasing the iteration count. Or at least do so in the future.

Also, in case you aren't familiar, benchdiff makes this kind of microbenchmark comparison quite easy to run. This would just be (when run from this branch):

benchdiff --count=20 --run='CommitPipeline/(1|2|4|8)' .

Collaborator

@jbowens jbowens left a comment


:lgtm:

Reviewed 2 of 8 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


batch.go line 270 at r5 (raw file):

	// into a separate struct that is allocated separately (using another
	// sync.Pool), and only that struct needs to outlive Batch.Close (which
	// could then be called immediately after ApplyNoSncWait).

nit: s/ApplyNoSncWait/ApplyNoSyncWait/


commit.go line 418 at r5 (raw file):

	// commitCount represents the waiting needed for publish, and optionally the
	// waiting needed for the WAL sync.
	b.commit.Add(commitCount)

nit: maybe now it'd read more clearly to use an explicit constant in each .Add call up above. something like

switch {
case !syncWAL:
    // Only need to wait for the publish.
    b.commit.Add(1)
case syncWAL && noSyncWait:
    syncErr = &b.commitErr
    syncWG = &b.fsyncWait
    // Only need to wait synchronously for the publish; Asynchronously the user
    // will wait on the batch's fsyncWait.
    b.commit.Add(1)
    b.fsyncWait.Add(1)
case syncWAL && !noSyncWait:
    syncErr = &b.commitErr
    syncWG = &b.commit
    // Must wait for both the publish and the synchronous WAL fsync.
    b.commit.Add(2)
}

?

@sumeerbhola sumeerbhola force-pushed the async_commit branch 2 times, most recently from 16be9da to 7c85119 on January 10, 2023 at 20:02
Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTRs!

Reviewable status: 5 of 8 files reviewed, 1 unresolved discussion (waiting on @jbowens, @nicktrav, and @nvanbenschoten)


-- commits line 24 at r5:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

5 iterations per variant is the minimum that benchstat will consider to be statistically significant, but it's still not very many and may be hiding smaller changes in performance. Consider increasing the iteration count. Or at least do so in the future.

Also, in case you aren't familiar, benchdiff makes this kind of microbenchmark comparison quite easy to run. This would just be (when run from this branch):

benchdiff --count=20 --run='CommitPipeline/(1|2|4|8)' .

Thanks for the benchdiff pointer.

I increased to 20, and ran it on gceworker. Oddly, this branch comes out ahead of master. I'm going to interpret this as no difference.

synchronous: old is master and new is this branch

name                          old time/op    new time/op    delta
CommitPipeline/parallel=1-16    1.45µs ± 3%    1.43µs ± 2%   -1.76%  (p=0.000 n=20+19)
CommitPipeline/parallel=2-16    1.39µs ± 1%    1.37µs ± 4%   -2.10%  (p=0.000 n=20+20)
CommitPipeline/parallel=4-16    1.38µs ± 2%    1.34µs ± 3%   -2.95%  (p=0.000 n=20+20)
CommitPipeline/parallel=8-16    1.52µs ±27%    1.34µs ± 4%  -11.57%  (p=0.000 n=20+20)

name                          old speed      new speed      delta
CommitPipeline/parallel=1-16  11.0MB/s ± 3%  11.2MB/s ± 2%   +1.78%  (p=0.000 n=20+19)
CommitPipeline/parallel=2-16  11.5MB/s ± 1%  11.7MB/s ± 4%   +2.14%  (p=0.000 n=20+20)
CommitPipeline/parallel=4-16  11.6MB/s ± 2%  12.0MB/s ± 3%   +3.05%  (p=0.000 n=20+20)
CommitPipeline/parallel=8-16  10.7MB/s ±22%  11.9MB/s ± 4%  +11.31%  (p=0.000 n=20+20)

name                          old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-16    1.37kB ± 0%    1.40kB ± 0%   +2.29%  (p=0.000 n=20+20)
CommitPipeline/parallel=2-16    1.37kB ± 0%    1.40kB ± 0%   +2.31%  (p=0.000 n=20+20)
CommitPipeline/parallel=4-16    1.37kB ± 0%    1.40kB ± 0%   +2.27%  (p=0.000 n=20+20)
CommitPipeline/parallel=8-16    1.37kB ± 0%    1.40kB ± 0%   +2.30%  (p=0.000 n=20+20)

name                          old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-16      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=2-16      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=4-16      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=8-16      2.00 ± 0%      2.00 ± 0%     ~     (all equal)

old is branch with synchronous and new is async:

name                          old time/op    new time/op    delta
CommitPipeline/parallel=1-16    1.43µs ± 2%    1.42µs ± 3%  -0.80%  (p=0.029 n=19+19)
CommitPipeline/parallel=2-16    1.37µs ± 4%    1.37µs ± 2%    ~     (p=0.545 n=20+19)
CommitPipeline/parallel=4-16    1.34µs ± 3%    1.36µs ± 3%  +1.37%  (p=0.008 n=20+19)
CommitPipeline/parallel=8-16    1.34µs ± 4%    1.36µs ± 3%  +1.10%  (p=0.018 n=20+20)

name                          old speed      new speed      delta
CommitPipeline/parallel=1-16  11.2MB/s ± 2%  11.3MB/s ± 2%  +0.81%  (p=0.022 n=19+19)
CommitPipeline/parallel=2-16  11.7MB/s ± 4%  11.7MB/s ± 2%    ~     (p=0.545 n=20+19)
CommitPipeline/parallel=4-16  12.0MB/s ± 3%  11.8MB/s ± 3%  -1.37%  (p=0.009 n=20+19)
CommitPipeline/parallel=8-16  11.9MB/s ± 4%  11.8MB/s ± 3%  -1.09%  (p=0.018 n=20+20)

name                          old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-16    1.40kB ± 0%    1.40kB ± 0%    ~     (p=0.085 n=20+20)
CommitPipeline/parallel=2-16    1.40kB ± 0%    1.40kB ± 0%    ~     (p=0.076 n=20+20)
CommitPipeline/parallel=4-16    1.40kB ± 0%    1.40kB ± 0%    ~     (p=0.233 n=20+20)
CommitPipeline/parallel=8-16    1.40kB ± 0%    1.40kB ± 0%    ~     (p=0.228 n=20+20)

name                          old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-16      2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/parallel=2-16      2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/parallel=4-16      2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/parallel=8-16      2.00 ± 0%      2.00 ± 0%    ~     (all equal)

batch.go line 270 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/ApplyNoSncWait/ApplyNoSyncWait/

Done


commit.go line 418 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: maybe now it'd read more clearly to use an explicit constant in each .Add call up above (see the switch suggestion above).

Done

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: 5 of 10 files reviewed, 2 unresolved discussions (waiting on @jbowens, @nicktrav, and @nvanbenschoten)


internal/metamorphic/ops.go line 76 at r6 (raw file):

	w := t.getWriter(o.writerID)
	var err error
	if o.writerID.tag() == dbTag && t.testOpts.asyncApplyToDB && t.writeOpts.Sync {

added metamorphic test support

@sumeerbhola sumeerbhola merged commit 06fbe6a into cockroachdb:master Jan 11, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 25, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 26, 2023
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Jan 26, 2023
95862: storage: add CommitNoSyncWait and SyncWait to Batch interface r=sumeerbhola a=nvanbenschoten

Extracted from #94165.

Picks up cockroachdb/pebble/pull/2117.

Release note: None
Epic: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 31, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 2, 2023
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Feb 3, 2023
94165: kv: integrate raft async storage writes r=nvanbenschoten a=nvanbenschoten

Fixes #17500.
Epic: CRDB-22644

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in etcd-io/raft/pull/8. 

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in #17500. Those changes will also be impactful, but they're non trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items. See #94853 and #94854.

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in cockroachdb/pebble/pull/2117.  The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate an increased opportunity for batching.

To visualize the impact on latency, consider the following test. The experiment uses a 3-node GCP cluster with n2-standard-32 instances spread across three availability zones. It runs kv0 (write-only) against the cluster with 64-byte values. It then ramps up concurrency to compare throughput vs. average and tail latency.

_NOTE: log scales on x and y axes_

![Throughput vs  average latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210719-bec842f6-1093-48cd-8be7-05a3d79c2a71.svg)

![Throughput vs  tail latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210777-670a4d25-9516-41a2-b7e7-86b402004536.svg)

Async storage writes impacts latency by different amounts at different throughputs, ranging from an improvement of 20% to 40% when the system is "well utilized". However, it increases latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.

| Throughput (B/s) | Throughput (qps) | Avg. Latency Δ | p99 Latency Δ |
| ---------------- | ---------------- | -------------- | ------------- |
| 63  KB/s         | 1,000            | -10.5%         | -8.8%         |
| 125 KB/s         | 2,000            | -7.1%          | -10.4%        |
| 250 KB/s         | 4,000            | -20%           | -11.2%        |
| 500 KB/s         | 8,000            | -16.6%         | -25.3%        |
| 1 MB/s           | 16,000           | -30.8%         | -44.0%        |
| 2 MB/s           | 32,000           | -38.2%         | -30.9%        |
| 4 MB/s           | 64,000           | +5.9%          | +9.4%         |

### Other benchmark results
```bash
name                   old ops/s    new ops/s    delta
# 50% read, 50% update
ycsb/A/nodes=3          16.0k ± 5%   16.2k ± 4%     ~     (p=0.353 n=10+10)
ycsb/A/nodes=3/cpu=32   28.7k ± 5%   33.8k ± 2%  +17.57%  (p=0.000 n=9+9)
# 95% read, 5% update
ycsb/B/nodes=3          29.9k ± 3%   30.2k ± 3%     ~     (p=0.278 n=9+10)
ycsb/B/nodes=3/cpu=32    101k ± 1%    100k ± 3%     ~     (p=0.274 n=8+10)
# 100% read
ycsb/C/nodes=3          40.4k ± 3%   40.0k ± 3%     ~     (p=0.190 n=10+10)
ycsb/C/nodes=3/cpu=32    135k ± 1%    137k ± 1%   +0.87%  (p=0.011 n=9+9)
# 95% read, 5% insert
ycsb/D/nodes=3          33.6k ± 3%   33.8k ± 3%     ~     (p=0.315 n=10+10)
ycsb/D/nodes=3/cpu=32    108k ± 1%    106k ± 6%     ~     (p=0.739 n=10+10)
# 95% scan, 5% insert
ycsb/E/nodes=3          3.79k ± 1%   3.73k ± 1%   -1.42%  (p=0.000 n=9+9)
ycsb/E/nodes=3/cpu=32   6.31k ± 5%   6.48k ± 6%     ~     (p=0.123 n=10+10)
# 50% read, 50% read-modify-write
ycsb/F/nodes=3          7.68k ± 2%   7.99k ± 2%   +4.11%  (p=0.000 n=10+10)
ycsb/F/nodes=3/cpu=32   15.6k ± 4%   18.1k ± 3%  +16.14%  (p=0.000 n=8+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.01 ± 5%    5.95 ± 4%     ~     (p=0.460 n=10+10)
ycsb/A/nodes=3/cpu=32    5.01 ± 4%    4.25 ± 4%  -15.19%  (p=0.000 n=9+10)
ycsb/B/nodes=3           4.80 ± 0%    4.77 ± 4%     ~     (p=0.586 n=7+10)
ycsb/B/nodes=3/cpu=32    1.90 ± 0%    1.90 ± 0%     ~     (all equal)
ycsb/C/nodes=3           3.56 ± 2%    3.61 ± 3%     ~     (p=0.180 n=10+10)
ycsb/C/nodes=3/cpu=32    1.40 ± 0%    1.40 ± 0%     ~     (all equal)
ycsb/D/nodes=3           2.87 ± 2%    2.85 ± 2%     ~     (p=0.650 n=10+10)
ycsb/D/nodes=3/cpu=32    1.30 ± 0%    1.34 ± 4%     ~     (p=0.087 n=10+10)
ycsb/E/nodes=3           25.3 ± 0%    25.7 ± 1%   +1.38%  (p=0.000 n=8+8)
ycsb/E/nodes=3/cpu=32    22.9 ± 5%    22.2 ± 6%     ~     (p=0.109 n=10+10)
ycsb/F/nodes=3           12.5 ± 2%    12.0 ± 1%   -3.72%  (p=0.000 n=10+9)
ycsb/F/nodes=3/cpu=32    9.27 ± 4%    7.98 ± 3%  -13.96%  (p=0.000 n=8+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           45.7 ±15%    35.7 ± 6%  -21.90%  (p=0.000 n=10+8)
ycsb/A/nodes=3/cpu=32    67.6 ±13%    55.3 ± 5%  -18.10%  (p=0.000 n=9+10)
ycsb/B/nodes=3           30.5 ±24%    29.4 ± 7%     ~     (p=0.589 n=10+10)
ycsb/B/nodes=3/cpu=32    12.8 ± 2%    13.3 ± 7%     ~     (p=0.052 n=10+8)
ycsb/C/nodes=3           14.0 ± 3%    14.2 ± 0%     ~     (p=0.294 n=10+8)
ycsb/C/nodes=3/cpu=32    5.80 ± 0%    5.70 ± 5%     ~     (p=0.233 n=7+10)
ycsb/D/nodes=3           12.4 ± 2%    11.7 ± 3%   -5.32%  (p=0.001 n=10+10)
ycsb/D/nodes=3/cpu=32    6.30 ± 0%    5.96 ± 6%   -5.40%  (p=0.001 n=10+10)
ycsb/E/nodes=3           81.0 ± 4%    83.9 ± 0%   +3.63%  (p=0.012 n=10+7)
ycsb/E/nodes=3/cpu=32     139 ±19%     119 ±12%  -14.46%  (p=0.021 n=10+10)
ycsb/F/nodes=3            122 ±17%     103 ±10%  -15.48%  (p=0.002 n=10+8)
ycsb/F/nodes=3/cpu=32     146 ±20%     133 ± 7%   -8.89%  (p=0.025 n=10+10)
```

The way to interpret these results is that async raft storage writes reduce latency and, because of the closed-loop nature of the workload, also increase throughput for the YCSB variants that perform writes and aren't already CPU saturated. Variants that are read-only are unaffected. Variants that are CPU-saturated do not benefit from the change because they are already bottlenecked on CPU resources and cannot push any more load (see above).

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.

96458: sql: fixes statement contention count metric r=j82w a=j82w

Fixes a bug introduced in #94750 where the metric was counting transactions
that hit contention events instead of counting statements.

closes: #96429

Release note: none

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>