db: add DB.ApplyNoSyncWait for asynchronous apply #2117
Conversation
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @jbowens and @nvanbenschoten)
db.go
line 738 at r1 (raw file):
// first calling Batch.SyncWait.
// RECOMMENDATION: Prefer using Apply unless you really understand why you
// need ApplyNoSyncWait.
@nvanbenschoten this is the interface.
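To make the contract concrete, here is a minimal sketch of the intended calling pattern, assuming the signatures introduced in this PR (`DB.ApplyNoSyncWait(*Batch, *WriteOptions)` and `Batch.SyncWait()`); error handling and the directory name are illustrative only:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	b := db.NewBatch()
	if err := b.Set([]byte("k"), []byte("v"), nil); err != nil {
		log.Fatal(err)
	}

	// Returns once the batch is visible in the memtable, before the WAL fsync.
	// WriteOptions.Sync must be true when using ApplyNoSyncWait.
	if err := db.ApplyNoSyncWait(b, pebble.Sync); err != nil {
		log.Fatal(err)
	}

	done := make(chan struct{})
	go func() {
		defer close(done)
		// Wait for the WAL fsync; only after SyncWait returns is it legal to
		// Close the batch.
		if err := b.SyncWait(); err != nil {
			log.Print(err)
		}
		b.Close()
	}()
	<-done
}
```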
Reviewable status: 0 of 8 files reviewed, 2 unresolved discussions (waiting on @jbowens and @nvanbenschoten)
commit.go
line 275 at r1 (raw file):
// TODO(sumeer): will block here if SyncConcurrency is insufficient. Monitor
// and increase if needed.
@nvanbenschoten the potential bottleneck
Force-pushed from 48c0da3 to b14cf8c
Thanks for building this Sumeer. I've integrated it into a pair of different prototypes that extend from cockroachdb/cockroach#87050 (review). The first splits off a pool of raft log appender goroutines that use this API to avoid waiting in lockstep with one another. The second simply calls this new API directly from the existing raft state machine handler goroutine and pushes the callback handling onto a separate goroutine. Both have demonstrated the expected latency improvement and show a near-identical throughput-latency curve.
I'll be posting a set of results and recommendations on cockroachdb/cockroach#17500 either today or tomorrow. Regardless of the approach that we take in 23.1, I think we'll want to use this API.
Reviewed 5 of 8 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens and @sumeerbhola)
commit.go
line 227 at r3 (raw file):
// These lock-free queues have a fixed capacity. And since they are
// lock-free, we cannot do blocking waits when pushing onto these queues,
// when case they are full. Additionally, adding to these queues happens
s/when case/in case/
commit.go
line 254 at r3 (raw file):
// memtable and publishing. Then the LogWriter.flusher.syncQ could be
// sized as 0.5*5 of the commitPipeling.pending queue. So we could size
// these based on typical experimental results.
How do you imagine that such experiments would work? Do we have the observability in place to understand when a workload is being throttled by these bounded queue sizes? It doesn't look like it. Would you recommend that I run with a patched version of Pebble that detects when sends on either of these channels block? Or perhaps a blocking profile could be used to observe the same thing.
Related, should we make SyncConcurrency configurable so that a user of pebble who finds that their desired amount of concurrency causes these queues to reach their capacity can resolve this?
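In case it's useful, a minimal sketch of the blocking-profile route mentioned above (plain Go runtime/pprof, nothing Pebble-specific; the sampling rate and address are arbitrary choices):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
	"runtime"
)

func main() {
	// Record every blocking event (channel sends/receives, mutex waits).
	// A rate of 1 is expensive; a coarser sampling rate may be preferable
	// for a long-running workload.
	runtime.SetBlockProfileRate(1)

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... run the write workload here ...

	// Then inspect with:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	select {}
}
```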
db.go
line 736 at r3 (raw file):
The caller must not Close the batch without first calling Batch.SyncWait.
Does this contract cause the Batch to hold on to resources for longer than strictly necessary?
Both have demonstrated the expected latency improvement and show a near-identical throughput-latency curve.
Excellent!
Regardless of the approach that we take in 23.1, I think we'll want to use this API.
I'll need to microbenchmark the change here. Worst case, we could time things such that the change here, plus the corresponding CockroachDB changes for async raft log writes, land in the same release so that any microbenchmarking loss here is made up for with the other benefits.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens and @nvanbenschoten)
commit.go
line 254 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How do you imagine that such experiments would work? Do we have the observability in place to understand when a workload is being throttled by these bounded queue sizes? It doesn't look like it. Would you recommend that I run with a patched version of Pebble that detects when sends on either of these channels block? Or perhaps a blocking profile could be used to observe the same thing.
Related, should we make SyncConcurrency configurable so that a user of pebble who finds that their desired amount of concurrency causes these queues to reach their capacity can resolve this?
We have some LogWriterMetrics here (Lines 735 to 736 in b9289d7):

    PendingBufferLen base.GaugeSampleMetric
    SyncQueueLen     base.GaugeSampleMetric

that are exposed in pebble Metrics here (Line 235 in b9289d7):

    record.LogWriterMetrics

One could export these as metrics from kvserver.StoreMetrics.updateEngineMetrics. These are cumulative samples of the length, so one needs to take a delta and then compute the mean -- we can only tell if the mean is high. We could also count in LogWriter whenever one of these queues is full -- it knows when it is full when dequeuing.

Which reminds me that the maximum number of pending blocks (CapAllocatedBlocks) can also be a bottleneck (which is what this pending metric is measuring -- not the commitPipeline.pending).

I am hopeful we don't need to make these sizes configurable, once we measure. We have arrays with these sizes and I am wary about changing them to slices.
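To illustrate the "take a delta and compute the mean" step, a rough sketch follows; the `count`/`sum` fields are assumptions standing in for whatever the cumulative gauge sample metric records, not pebble's actual base.GaugeSampleMetric API:

```go
package queuestats

// gaugeSample is a stand-in for a cumulative gauge sample metric: each time
// the queue length is observed, the length is added to sum and count is
// incremented. (Assumed shape; not pebble's real type.)
type gaugeSample struct {
	count int64
	sum   int64
}

// meanQueueLen returns the mean queue length over the interval between two
// cumulative snapshots, prev and cur.
func meanQueueLen(prev, cur gaugeSample) float64 {
	dCount := cur.count - prev.count
	if dCount <= 0 {
		return 0
	}
	return float64(cur.sum-prev.sum) / float64(dCount)
}
```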
db.go
line 736 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The caller must not Close the batch without first calling Batch.SyncWait.
Does this contract cause the Batch to hold on to resources for longer than strictly necessary?
Yes. It was the minimum viable thing to do. Do you think there will be enough bytes in the "pipeline" that this will be an issue? One could separate out the 2 fields in Batch that are needed for the LogWriter and discard the rest.
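If that separation were ever pursued, it might look roughly like the following sketch; the type and function names are hypothetical and this is not what the PR implements:

```go
package sketch

import "sync"

// walSyncState holds only what the WAL writer needs in order to signal fsync
// completion, so the rest of the Batch could be recycled immediately after
// ApplyNoSyncWait. Hypothetical sketch, not the pebble implementation.
type walSyncState struct {
	fsyncWait sync.WaitGroup
	commitErr error
}

var walSyncStatePool = sync.Pool{
	New: func() interface{} { return &walSyncState{} },
}

// detachSyncState would be called by the commit pipeline before the Batch is
// returned to its pool; the returned state is released only after the caller
// has observed SyncWait.
func detachSyncState() *walSyncState {
	s := walSyncStatePool.Get().(*walSyncState)
	s.commitErr = nil
	return s
}
```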
Fixes cockroachdb#17500. Waiting on github.com/cockroachdb/pebble/pull/2117.

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in github.com/etcd-io/raft/pull/8.

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in cockroachdb#17500. Those changes will also be impactful, but they're non trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items (TODO(nvanbenschoten): open new issues).

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in github.com/cockroachdb/pebble/pull/2117. The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

github.com/etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy. When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate increased opportunity for batching.

TODO: add images

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.
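The handoff described in this commit message can be sketched roughly as follows; `syncItem`, `runSyncWaiter`, and the channel-based queue are invented for illustration and are not the actual CockroachDB code:

```go
package sketch

import "github.com/cockroachdb/pebble"

// syncItem pairs a batch whose WAL fsync is still pending with a callback to
// run once the write is durable (e.g. delivering MsgStorageAppend responses).
type syncItem struct {
	b        *pebble.Batch
	onSynced func(err error)
}

// runSyncWaiter is the separate goroutine that waits on durability so that
// the raft state machine loop does not have to.
func runSyncWaiter(ch <-chan syncItem) {
	for it := range ch {
		err := it.b.SyncWait() // blocks until the WAL write is fsynced
		it.b.Close()           // per the contract, Close only after SyncWait
		it.onSynced(err)
	}
}
```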
Force-pushed from b14cf8c to c12ba6a
Force-pushed from 46e6408 to 35e19bf
Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens and @nvanbenschoten)
commit.go
line 275 at r1 (raw file):
Previously, sumeerbhola wrote…
@nvanbenschoten the potential bottleneck
removed this TODO comment since the earlier comment subsumes it.
commit.go
line 227 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/when case/in case/
Done
commit.go
line 254 at r3 (raw file):
Previously, sumeerbhola wrote…
We have some LogWriterMetrics here (Lines 735 to 736 in b9289d7):

    PendingBufferLen base.GaugeSampleMetric
    SyncQueueLen     base.GaugeSampleMetric

that are exposed in pebble Metrics here (Line 235 in b9289d7):

    record.LogWriterMetrics

One could export these as metrics from kvserver.StoreMetrics.updateEngineMetrics. These are cumulative samples of the length, so one needs to take a delta and then compute the mean -- we can only tell if the mean is high. We could also count in LogWriter whenever one of these queues is full -- it knows when it is full when dequeuing.

Which reminds me that the maximum number of pending blocks (CapAllocatedBlocks) can also be a bottleneck (which is what this pending metric is measuring -- not the commitPipeline.pending). I am hopeful we don't need to make these sizes configurable, once we measure. We have arrays with these sizes and I am wary about changing them to slices.
Updated the code comment.
db.go
line 736 at r3 (raw file):
Previously, sumeerbhola wrote…
Yes. It was the minimum viable thing to do. Do you think there will be enough bytes in the "pipeline" that this will be an issue? One could separate out the 2 fields in Batch that are needed for the LogWriter and discard the rest.
I've added a TODO in batch.go to optimize the memory in the future, if it becomes an issue. I am not eager to do it now since having another struct is likely to cause some cpu performance regression.
Added microbenchmark. Looks ok to me.
And improved some testing.
This is ready for a proper review.
Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens, @nicktrav, and @nvanbenschoten)
Force-pushed from 35e19bf to 6d20eb4
Someone from Storage should also give this a pass.
Reviewed 2 of 8 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jbowens, @nicktrav, and @sumeerbhola)
-- commits
line 24 at r5:
5 iterations per variant is the minimum that benchstat will consider to be statistically significant, but it's still not very many and may be hiding smaller changes in performance. Consider increasing the iteration count. Or at least do so in the future.
Also, in case you aren't familiar, benchdiff makes this kind of microbenchmark comparison quite easy to run. This would just be (when run from this branch):
benchdiff --count=20 --run='CommitPipeline/(1|2|4|8)' .
Reviewed 2 of 8 files at r4, 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @nicktrav and @sumeerbhola)
batch.go
line 270 at r5 (raw file):
// into a separate struct that is allocated separately (using another
// sync.Pool), and only that struct needs to outlive Batch.Close (which
// could then be called immediately after ApplyNoSncWait).
nit: s/ApplyNoSncWait/ApplyNoSyncWait/
commit.go
line 418 at r5 (raw file):
// commitCount represents the waiting needed for publish, and optionally the
// waiting needed for the WAL sync.
b.commit.Add(commitCount)
nit: maybe now it'd read more clearly to use an explicit constant in each .Add call up above. something like
switch {
case !syncWAL:
	// Only need to wait for the publish.
	b.commit.Add(1)
case syncWAL && noSyncWait:
	syncErr = &b.commitErr
	syncWG = &b.fsyncWait
	// Only need to wait synchronously for the publish; Asynchronously the user
	// will wait on the batch's fsyncWait.
	b.commit.Add(1)
	b.fsyncWait.Add(1)
case syncWAL && !noSyncWait:
	syncErr = &b.commitErr
	syncWG = &b.commit
	// Must wait for both the publish and the synchronous WAL fsync.
	b.commit.Add(2)
}
?
Force-pushed from 16be9da to 7c85119
TFTRs!
Reviewable status: 5 of 8 files reviewed, 1 unresolved discussion (waiting on @jbowens, @nicktrav, and @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
5 iterations per variant is the minimum that benchstat will consider to be statistically significant, but it's still not very many and may be hiding smaller changes in performance. Consider increasing the iteration count. Or at least do so in the future.
Also, in case you aren't familiar, benchdiff makes this kind of microbenchmark comparison quite easy to run. This would just be (when run from this branch):
benchdiff --count=20 --run='CommitPipeline/(1|2|4|8)' .
Thanks for the benchdiff pointer.
I increased to 20, and ran it on gceworker. Oddly, this branch comes out ahead of master. I'm going to interpret this as no difference.
synchronous: old is master and new is this branch
name old time/op new time/op delta
CommitPipeline/parallel=1-16 1.45µs ± 3% 1.43µs ± 2% -1.76% (p=0.000 n=20+19)
CommitPipeline/parallel=2-16 1.39µs ± 1% 1.37µs ± 4% -2.10% (p=0.000 n=20+20)
CommitPipeline/parallel=4-16 1.38µs ± 2% 1.34µs ± 3% -2.95% (p=0.000 n=20+20)
CommitPipeline/parallel=8-16 1.52µs ±27% 1.34µs ± 4% -11.57% (p=0.000 n=20+20)
name old speed new speed delta
CommitPipeline/parallel=1-16 11.0MB/s ± 3% 11.2MB/s ± 2% +1.78% (p=0.000 n=20+19)
CommitPipeline/parallel=2-16 11.5MB/s ± 1% 11.7MB/s ± 4% +2.14% (p=0.000 n=20+20)
CommitPipeline/parallel=4-16 11.6MB/s ± 2% 12.0MB/s ± 3% +3.05% (p=0.000 n=20+20)
CommitPipeline/parallel=8-16 10.7MB/s ±22% 11.9MB/s ± 4% +11.31% (p=0.000 n=20+20)
name old alloc/op new alloc/op delta
CommitPipeline/parallel=1-16 1.37kB ± 0% 1.40kB ± 0% +2.29% (p=0.000 n=20+20)
CommitPipeline/parallel=2-16 1.37kB ± 0% 1.40kB ± 0% +2.31% (p=0.000 n=20+20)
CommitPipeline/parallel=4-16 1.37kB ± 0% 1.40kB ± 0% +2.27% (p=0.000 n=20+20)
CommitPipeline/parallel=8-16 1.37kB ± 0% 1.40kB ± 0% +2.30% (p=0.000 n=20+20)
name old allocs/op new allocs/op delta
CommitPipeline/parallel=1-16 2.00 ± 0% 2.00 ± 0% ~ (all equal)
CommitPipeline/parallel=2-16 2.00 ± 0% 2.00 ± 0% ~ (all equal)
CommitPipeline/parallel=4-16 2.00 ± 0% 2.00 ± 0% ~ (all equal)
CommitPipeline/parallel=8-16 2.00 ± 0% 2.00 ± 0% ~ (all equal)
old is branch with synchronous and new is async:
name old time/op new time/op delta
CommitPipeline/parallel=1-16 1.43µs ± 2% 1.42µs ± 3% -0.80% (p=0.029 n=19+19)
CommitPipeline/parallel=2-16 1.37µs ± 4% 1.37µs ± 2% ~ (p=0.545 n=20+19)
CommitPipeline/parallel=4-16 1.34µs ± 3% 1.36µs ± 3% +1.37% (p=0.008 n=20+19)
CommitPipeline/parallel=8-16 1.34µs ± 4% 1.36µs ± 3% +1.10% (p=0.018 n=20+20)
name old speed new speed delta
CommitPipeline/parallel=1-16 11.2MB/s ± 2% 11.3MB/s ± 2% +0.81% (p=0.022 n=19+19)
CommitPipeline/parallel=2-16 11.7MB/s ± 4% 11.7MB/s ± 2% ~ (p=0.545 n=20+19)
CommitPipeline/parallel=4-16 12.0MB/s ± 3% 11.8MB/s ± 3% -1.37% (p=0.009 n=20+19)
CommitPipeline/parallel=8-16 11.9MB/s ± 4% 11.8MB/s ± 3% -1.09% (p=0.018 n=20+20)
name old alloc/op new alloc/op delta
CommitPipeline/parallel=1-16 1.40kB ± 0% 1.40kB ± 0% ~ (p=0.085 n=20+20)
CommitPipeline/parallel=2-16 1.40kB ± 0% 1.40kB ± 0% ~ (p=0.076 n=20+20)
CommitPipeline/parallel=4-16 1.40kB ± 0% 1.40kB ± 0% ~ (p=0.233 n=20+20)
CommitPipeline/parallel=8-16 1.40kB ± 0% 1.40kB ± 0% ~ (p=0.228 n=20+20)
name old allocs/op new allocs/op delta
CommitPipeline/parallel=1-16 2.00 ± 0% 2.00 ± 0% ~ (all equal)
CommitPipeline/parallel=2-16 2.00 ± 0% 2.00 ± 0% ~ (all equal)
CommitPipeline/parallel=4-16 2.00 ± 0% 2.00 ± 0% ~ (all equal)
CommitPipeline/parallel=8-16 2.00 ± 0% 2.00 ± 0% ~ (all equal)
batch.go
line 270 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: s/ApplyNoSncWait/ApplyNoSyncWait/
Done
commit.go
line 418 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: maybe now it'd read more clearly to use an explicit constant in each .Add call up above. something like

switch {
case !syncWAL:
	// Only need to wait for the publish.
	b.commit.Add(1)
case syncWAL && noSyncWait:
	syncErr = &b.commitErr
	syncWG = &b.fsyncWait
	// Only need to wait synchronously for the publish; Asynchronously the user
	// will wait on the batch's fsyncWait.
	b.commit.Add(1)
	b.fsyncWait.Add(1)
case syncWAL && !noSyncWait:
	syncErr = &b.commitErr
	syncWG = &b.commit
	// Must wait for both the publish and the synchronous WAL fsync.
	b.commit.Add(2)
}

?
Done
Force-pushed from 7c85119 to ac7ea76
Reviewable status: 5 of 10 files reviewed, 2 unresolved discussions (waiting on @jbowens, @nicktrav, and @nvanbenschoten)
internal/metamorphic/ops.go
line 76 at r6 (raw file):
w := t.getWriter(o.writerID)
var err error
if o.writerID.tag() == dbTag && t.testOpts.asyncApplyToDB && t.writeOpts.Sync {
added metamorphic test support
Force-pushed from ac7ea76 to 82206bd
ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues the Batch to the WAL, adds to the memtable, and waits until the batch is visible in the memtable, and then returns to the caller. The caller is responsible for calling Batch.SyncWait to wait until the write to the WAL is fsynced.

This change required splitting the WaitGroup in the Batch into two WaitGroups, so waiting for the visibility can happen separately from waiting for the WAL write. Additionally, the channel used as a semaphore for reserving space in the two lock-free queues is split into two channels, since dequeueing from these queues can happen in arbitrary order.

Benchmarks indicate that the overhead of pushing and popping on an extra channel is tolerable. Benchmarks were run on a macbook pro -- note these are not doing an actual sync since they use io.Discard, and are only benchmarking the commit pipeline.

Sync wait on master (old) vs this branch (new):

name                                              old time/op    new time/op    delta
CommitPipeline/no-sync-wait=false/parallel=1-10   1.09µs ± 6%    1.15µs ± 9%      ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10   1.53µs ± 4%    1.54µs ± 2%      ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10   1.54µs ± 1%    1.59µs ± 1%    +2.87%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10   1.52µs ± 1%    1.55µs ± 1%    +2.43%  (p=0.008 n=5+5)

name                                              old speed      new speed      delta
CommitPipeline/no-sync-wait=false/parallel=1-10   14.7MB/s ± 5%  13.9MB/s ±10%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10   10.5MB/s ± 4%  10.4MB/s ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10   10.4MB/s ± 1%  10.1MB/s ± 1%  -2.78%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10   10.5MB/s ± 1%  10.3MB/s ± 1%  -2.35%  (p=0.008 n=5+5)

name                                              old alloc/op   new alloc/op   delta
CommitPipeline/no-sync-wait=false/parallel=1-10   1.37kB ± 0%    1.40kB ± 0%    +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10   1.37kB ± 0%    1.40kB ± 0%    +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10   1.37kB ± 0%    1.40kB ± 0%    +2.15%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10   1.37kB ± 0%    1.40kB ± 0%    +2.34%  (p=0.008 n=5+5)

name                                              old allocs/op  new allocs/op  delta
CommitPipeline/no-sync-wait=false/parallel=1-10   2.00 ± 0%      2.00 ± 0%        ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=2-10   2.00 ± 0%      2.00 ± 0%        ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=4-10   2.00 ± 0%      2.00 ± 0%        ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=8-10   2.00 ± 0%      2.00 ± 0%        ~     (all equal)

Sync wait on this branch (old) vs async wait on this branch (new):

name                           old time/op    new time/op    delta
CommitPipeline/parallel=1-10   1.15µs ± 9%    1.20µs ± 7%      ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10   1.54µs ± 2%    1.59µs ± 2%    +3.50%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10   1.59µs ± 1%    1.58µs ± 1%      ~     (p=0.802 n=5+5)
CommitPipeline/parallel=8-10   1.55µs ± 1%    1.56µs ± 1%      ~     (p=0.452 n=5+5)

name                           old speed      new speed      delta
CommitPipeline/parallel=1-10   13.9MB/s ±10%  13.3MB/s ± 7%    ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10   10.4MB/s ± 2%  10.1MB/s ± 2%  -3.36%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10   10.1MB/s ± 1%  10.1MB/s ± 1%    ~     (p=0.786 n=5+5)
CommitPipeline/parallel=8-10   10.3MB/s ± 1%  10.3MB/s ± 1%    ~     (p=0.452 n=5+5)

name                           old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-10   1.40kB ± 0%    1.40kB ± 0%      ~     (p=0.651 n=5+5)
CommitPipeline/parallel=2-10   1.40kB ± 0%    1.39kB ± 0%    -0.21%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10   1.40kB ± 0%    1.40kB ± 0%      ~     (p=0.706 n=5+5)
CommitPipeline/parallel=8-10   1.40kB ± 0%    1.40kB ± 0%      ~     (p=0.587 n=5+5)

name                           old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-10   2.00 ± 0%      2.00 ± 0%        ~     (all equal)
CommitPipeline/parallel=2-10   2.00 ± 0%      2.00 ± 0%        ~     (all equal)
CommitPipeline/parallel=4-10   2.00 ± 0%      2.00 ± 0%        ~     (all equal)
CommitPipeline/parallel=8-10   2.00 ± 0%      2.00 ± 0%        ~     (all equal)

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)
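A minimal illustration of the WaitGroup split described in the commit message above; the surrounding type is invented and only the field roles mirror the description (`commit` for memtable visibility, `fsyncWait` for WAL durability):

```go
package sketch

import "sync"

// batch mirrors, in miniature, the split described above: one WaitGroup for
// memtable visibility and one for WAL durability.
type batch struct {
	commit    sync.WaitGroup // done when the batch is visible in the memtable
	fsyncWait sync.WaitGroup // done when the WAL write has been fsynced
	commitErr error
}

// SyncWait blocks until the WAL fsync has completed for this batch.
func (b *batch) SyncWait() error {
	b.fsyncWait.Wait()
	return b.commitErr
}
```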
Force-pushed from 82206bd to f0f2a16
Extracted from cockroachdb#94165. Picks up github.com/cockroachdb/pebble/pull/2117.

Release note: None
Epic: None
95862: storage: add CommitNoSyncWait and SyncWait to Batch interface r=sumeerbhola a=nvanbenschoten

Extracted from #94165. Picks up cockroachdb/pebble/pull/2117.

Release note: None
Epic: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
94165: kv: integrate raft async storage writes r=nvanbenschoten a=nvanbenschoten

Fixes #17500.
Epic: CRDB-22644

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in etcd-io/raft/pull/8.

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in #17500. Those changes will also be impactful, but they're non trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items. See #94853 and #94854.

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in cockroachdb/pebble/pull/2117. The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy. When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate an increased opportunity for batching.

To visualize the impact on latency, consider the following test. The experiment uses a 3-node GCP cluster with n2-standard-32 instances spread across three availability zones. It runs kv0 (write-only) against the cluster with 64-byte values. It then ramps up concurrency to compare throughput vs. average and tail latency.

_NOTE: log scales on x and y axes_

![Throughput vs average latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210719-bec842f6-1093-48cd-8be7-05a3d79c2a71.svg)

![Throughput vs tail latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210777-670a4d25-9516-41a2-b7e7-86b402004536.svg)

Async storage writes impacts latency by different amounts at different throughputs, ranging from an improvement of 20% to 40% when the system is "well utilized". However, it increases latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.

| Throughput (B/s) | Throughput (qps) | Avg. Latency Δ | p99 Latency Δ |
| ---------------- | ---------------- | -------------- | ------------- |
| 63 KB/s          | 1,000            | -10.5%         | -8.8%         |
| 125 KB/s         | 2,000            | -7.1%          | -10.4%        |
| 250 KB/s         | 4,000            | -20%           | -11.2%        |
| 500 KB/s         | 8,000            | -16.6%         | -25.3%        |
| 1 MB/s           | 16,000           | -30.8%         | -44.0%        |
| 2 MB/s           | 32,000           | -38.2%         | -30.9%        |
| 4 MB/s           | 64,000           | 5.9%           | 9.4%          |

### Other benchmark results

```bash
name                     old ops/s   new ops/s   delta
# 50% read, 50% update
ycsb/A/nodes=3           16.0k ± 5%  16.2k ± 4%     ~      (p=0.353 n=10+10)
ycsb/A/nodes=3/cpu=32    28.7k ± 5%  33.8k ± 2%  +17.57%   (p=0.000 n=9+9)
# 95% read, 5% update
ycsb/B/nodes=3           29.9k ± 3%  30.2k ± 3%     ~      (p=0.278 n=9+10)
ycsb/B/nodes=3/cpu=32     101k ± 1%   100k ± 3%     ~      (p=0.274 n=8+10)
# 100% read
ycsb/C/nodes=3           40.4k ± 3%  40.0k ± 3%     ~      (p=0.190 n=10+10)
ycsb/C/nodes=3/cpu=32     135k ± 1%   137k ± 1%   +0.87%   (p=0.011 n=9+9)
# 95% read, 5% insert
ycsb/D/nodes=3           33.6k ± 3%  33.8k ± 3%     ~      (p=0.315 n=10+10)
ycsb/D/nodes=3/cpu=32     108k ± 1%   106k ± 6%     ~      (p=0.739 n=10+10)
# 95% scan, 5% insert
ycsb/E/nodes=3           3.79k ± 1%  3.73k ± 1%   -1.42%   (p=0.000 n=9+9)
ycsb/E/nodes=3/cpu=32    6.31k ± 5%  6.48k ± 6%     ~      (p=0.123 n=10+10)
# 50% read, 50% read-modify-write
ycsb/F/nodes=3           7.68k ± 2%  7.99k ± 2%   +4.11%   (p=0.000 n=10+10)
ycsb/F/nodes=3/cpu=32    15.6k ± 4%  18.1k ± 3%  +16.14%   (p=0.000 n=8+10)

name                     old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3            6.01 ± 5%    5.95 ± 4%     ~     (p=0.460 n=10+10)
ycsb/A/nodes=3/cpu=32     5.01 ± 4%    4.25 ± 4%  -15.19%  (p=0.000 n=9+10)
ycsb/B/nodes=3            4.80 ± 0%    4.77 ± 4%     ~     (p=0.586 n=7+10)
ycsb/B/nodes=3/cpu=32     1.90 ± 0%    1.90 ± 0%     ~     (all equal)
ycsb/C/nodes=3            3.56 ± 2%    3.61 ± 3%     ~     (p=0.180 n=10+10)
ycsb/C/nodes=3/cpu=32     1.40 ± 0%    1.40 ± 0%     ~     (all equal)
ycsb/D/nodes=3            2.87 ± 2%    2.85 ± 2%     ~     (p=0.650 n=10+10)
ycsb/D/nodes=3/cpu=32     1.30 ± 0%    1.34 ± 4%     ~     (p=0.087 n=10+10)
ycsb/E/nodes=3            25.3 ± 0%    25.7 ± 1%   +1.38%  (p=0.000 n=8+8)
ycsb/E/nodes=3/cpu=32     22.9 ± 5%    22.2 ± 6%     ~     (p=0.109 n=10+10)
ycsb/F/nodes=3            12.5 ± 2%    12.0 ± 1%   -3.72%  (p=0.000 n=10+9)
ycsb/F/nodes=3/cpu=32     9.27 ± 4%    7.98 ± 3%  -13.96%  (p=0.000 n=8+10)

name                     old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3            45.7 ±15%    35.7 ± 6%  -21.90%  (p=0.000 n=10+8)
ycsb/A/nodes=3/cpu=32     67.6 ±13%    55.3 ± 5%  -18.10%  (p=0.000 n=9+10)
ycsb/B/nodes=3            30.5 ±24%    29.4 ± 7%     ~     (p=0.589 n=10+10)
ycsb/B/nodes=3/cpu=32     12.8 ± 2%    13.3 ± 7%     ~     (p=0.052 n=10+8)
ycsb/C/nodes=3            14.0 ± 3%    14.2 ± 0%     ~     (p=0.294 n=10+8)
ycsb/C/nodes=3/cpu=32     5.80 ± 0%    5.70 ± 5%     ~     (p=0.233 n=7+10)
ycsb/D/nodes=3            12.4 ± 2%    11.7 ± 3%   -5.32%  (p=0.001 n=10+10)
ycsb/D/nodes=3/cpu=32     6.30 ± 0%    5.96 ± 6%   -5.40%  (p=0.001 n=10+10)
ycsb/E/nodes=3            81.0 ± 4%    83.9 ± 0%   +3.63%  (p=0.012 n=10+7)
ycsb/E/nodes=3/cpu=32      139 ±19%     119 ±12%  -14.46%  (p=0.021 n=10+10)
ycsb/F/nodes=3             122 ±17%     103 ±10%  -15.48%  (p=0.002 n=10+8)
ycsb/F/nodes=3/cpu=32      146 ±20%     133 ± 7%   -8.89%  (p=0.025 n=10+10)
```

The way to interpret these results is that async raft storage writes reduce latency and, as a result of the closed loop natured workload, also increase throughput for the YCSB variants that perform writes and aren't already CPU saturated. Variants that are read-only are unaffected. Variants that are CPU-saturated do not benefit from the change because they are already bottlenecked on CPU resources and cannot push any more load (see above).

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.
96458: sql: fixes statement contention count metric r=j82w a=j82w

Fixes a bug introduced in #94750 where the metric count was counting transactions that hit contention events instead of the statement count.

closes: #96429

Release note: none

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>