kv: make disk I/O asynchronous with respect to Raft state machine #17500

nvanbenschoten · 2017-08-07T23:15:24Z

⚠️ EDIT: The original optimization proposed here was implemented in #38954. See #17500 (comment) for the remainder of this issue.

Raft, along with most consensus protocols in the Paxos family, distinguishes committed entries from executed entries. An entry is committed when the Raft protocol has completed and the entry has been sufficiently replicated and persisted in the Raft log. The entry is then executed when it is applied to a given replica's state machine. This distinction is important because in cases where the execution of an entry onto the state machine does not produce a result, it is not necessary to wait for execution before sending a commit notification to clients.

Currently in Cockroach, the proposing replica waits until command execution before responding to clients. However, changes made for PropEval KV ensured that all MVCC-related logic runs upstream of Raft and that by the time we reach entry execution, we're simply applying a WriteBatch to RocksDB. While some of these execution steps can create ReplicaCorruptionErrors, I don't think it's necessary or possibly even correct that we attach these errors to the proposal response itself. This is because the entry has already been replicated through Raft, so a local ReplicaCorruptionError doesn't mean that all replicas are corrupted or that the command failed. In fact, after looking at the code I don't think that proposalResult needs anything populated at execution time for correctness. Because of this, I have a suspicion that all of the time spent in applyRaftCommand, including writing to RocksDB and performing stats computations, is unnecessary latency.
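To make the committed/executed distinction concrete, here is a minimal sketch of acknowledging a client at commit time and applying afterwards. The `entry` and `replica` types and the `processCommitted` method are hypothetical stand-ins for illustration, not CockroachDB code, and the sketch assumes the entry's effect was fully computed upstream of Raft.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a committed Raft entry whose effect
// (a write batch) was fully computed upstream of Raft.
type entry struct {
	index uint64
	batch string
}

// replica records the order in which it acknowledges and applies entries.
type replica struct {
	events  []string
	applied uint64
}

// processCommitted acknowledges the client as soon as the entry is known to
// be committed and then applies the write batch. The acknowledgement does not
// depend on the outcome of the apply step.
func (r *replica) processCommitted(e entry) {
	r.events = append(r.events, fmt.Sprintf("ack %d", e.index))
	r.applied = e.index // stands in for writing the batch to the engine
	r.events = append(r.events, fmt.Sprintf("apply %d", e.index))
}

func main() {
	r := &replica{}
	r.processCommitted(entry{index: 1, batch: "put a=1"})
	fmt.Println(r.events)
}
```

The only point of the sketch is the ordering: the time spent in the apply step no longer sits between commit and the client response.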

Preliminary results from a (very) rough draft of the change show a 2-3% improvement on average latency for the KV benchmark with a --batch size of 4:

Without change (batch=4):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s        0         597164          995.3     32.2     32.5     41.9     65.0    939.5

BenchmarkBlocks	  597164	   1004753.2 ns/op

With change (batch=4):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s        0         612652         1021.1     31.4     30.4     41.9     58.7    604.0

BenchmarkBlocks	  612652	    979351.4 ns/op

With a --batch size of 40, the results were even more pronounced:

Without change (batch=40):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.0s        0         616040         2053.4    134.0     54.5    151.0   2415.9  10200.5

BenchmarkBlocks	  616040	    486990.5 ns/op

With change (batch=40):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.0s        0         708040         2360.1    114.7     54.5    130.0   1140.9  10200.5

BenchmarkBlocks	  708040	    423713.8 ns/op
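For reference, the relative changes implied by the tables above can be recomputed with a few lines of Go. The `pctChange` helper is introduced here purely for illustration; the inputs are the avg(ms) and ops/sec(cum) columns copied from the runs above.

```go
package main

import "fmt"

// pctChange returns the relative change from before to after, in percent.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Inputs copied from the benchmark tables in this comment.
	fmt.Printf("batch=4  avg latency: %+.1f%%\n", pctChange(32.2, 31.4))
	fmt.Printf("batch=4  throughput:  %+.1f%%\n", pctChange(995.3, 1021.1))
	fmt.Printf("batch=40 avg latency: %+.1f%%\n", pctChange(134.0, 114.7))
	fmt.Printf("batch=40 throughput:  %+.1f%%\n", pctChange(2053.4, 2360.1))
}
```

This works out to roughly -2.5% average latency and +2.6% throughput at batch=4, and roughly -14.4% average latency and +14.9% throughput at batch=40.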

Note that while I think we can respond to clients sooner, I don't think we can pop the command out of the CommandQueue until after it has actually been executed because of how proposer evaluated kv works upstream of Raft.
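A small sketch of that constraint, under heavy simplification: the client is acknowledged at commit time, but the key stays in the queue until the write has been applied, so a later command on the same key cannot be evaluated against a stale engine. The `commandQueue` and `store` types and their methods are hypothetical and only model a single key per command.

```go
package main

import "fmt"

// commandQueue is a hypothetical, heavily simplified stand-in for the
// CommandQueue: it only tracks which keys have a command in flight.
type commandQueue struct {
	inflight map[string]bool
}

// store pairs the queue with a toy engine and a record of client acks.
type store struct {
	cq     commandQueue
	engine map[string]string
	acked  []string
}

func newStore() *store {
	return &store{
		cq:     commandQueue{inflight: map[string]bool{}},
		engine: map[string]string{},
	}
}

// propose registers the command in the queue before it goes through Raft.
func (s *store) propose(key string) { s.cq.inflight[key] = true }

// onCommit acknowledges the client but keeps the key in the queue.
func (s *store) onCommit(key string) { s.acked = append(s.acked, key) }

// onApply writes to the engine and only then releases the key.
func (s *store) onApply(key, value string) {
	s.engine[key] = value
	delete(s.cq.inflight, key)
}

// canEvaluate reports whether a new command on key may be evaluated now.
func (s *store) canEvaluate(key string) bool { return !s.cq.inflight[key] }

func main() {
	s := newStore()
	s.propose("a")
	s.onCommit("a")
	fmt.Println("acked:", s.acked, "can evaluate:", s.canEvaluate("a"))
	s.onApply("a", "1")
	fmt.Println("applied:", s.engine["a"], "can evaluate:", s.canEvaluate("a"))
}
```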

@tschottdorf @bdarnell

Jira issue: CRDB-6037

Epic CRDB-22644

andreimatei · 2017-08-07T23:23:29Z

But there's a lease check in processRaftCommand (which is "downstream" of Raft). The result depends on that check...

nvanbenschoten · 2017-08-07T23:39:00Z

The lease check happens before applyRaftCommand is called though, which is what I'm trying to avoid waiting on. We can still check the lease before replying like we do now while avoiding the extra latency created by the majority of entry execution. Somewhat accidentally, this is exactly what my PoC branch does.

petermattis · 2017-08-08T00:34:57Z

I can't think of anything off-hand that would break with this change, but @tschottdorf and @bdarnell know this area of code the best.

bdarnell · 2017-08-08T01:21:37Z

I think it would be fine to respond to the client before applying the write batch. I'm not sure it's OK to do so before we update the local HLC with the command's timestamp, so I'd move it down a little further.
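The ordering discussed so far (lease check, then the local clock update, then the client response, then the write batch) could be sketched roughly as follows. All names here are hypothetical stand-ins; `clock` is a plain integer standing in for the HLC, and the sketch only records the order of the steps.

```go
package main

import (
	"errors"
	"fmt"
)

// command and replica are hypothetical stand-ins, not CockroachDB types.
type command struct {
	leaseSeq  int
	timestamp int64
}

type replica struct {
	leaseSeq int
	clock    int64    // stands in for the local HLC
	steps    []string // order in which the steps ran
}

var errLeaseMismatch = errors.New("command proposed under a different lease")

// process runs the steps in the order suggested in the thread: lease check,
// clock update, client response, and only then the write batch application.
func (r *replica) process(c command) error {
	if c.leaseSeq != r.leaseSeq {
		r.steps = append(r.steps, "reject")
		return errLeaseMismatch
	}
	if c.timestamp > r.clock {
		r.clock = c.timestamp
	}
	r.steps = append(r.steps, "update-clock", "respond", "apply-batch")
	return nil
}

func main() {
	r := &replica{leaseSeq: 3, clock: 10}
	fmt.Println(r.process(command{leaseSeq: 3, timestamp: 42}), r.clock, r.steps)
	fmt.Println(r.process(command{leaseSeq: 2, timestamp: 50}), r.clock)
}
```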

How does this interact with @irfansharif's #16624? Applying the write batch should be faster when we've decoupled it from the raft log and its synchronous writes, so the benefit of this change may be smaller.

petermattis · 2017-08-08T15:30:15Z

Mentioned this to @nvanbenschoten in person: a further optimization here would be to respond to all of the committed commands before applying any of the associated write batches. I have an abandoned PR that combined the Raft batches. Probably worth taking another look at it: #15648

nvanbenschoten · 2017-08-08T15:50:01Z

@petermattis the only thing to be careful with there is that none of the batches' corresponding commands can be removed from the CommandQueue until we apply the merged WriteBatch. I like the idea though and will investigate because there's some serious overlap here.
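As a rough illustration of the combined idea (respond to every committed command, apply one merged batch, and only then release the commands), consider the sketch below. The `cmd` type and `applyCommitted` function are hypothetical, the "engine" is a plain map, and the release step is only recorded as an event rather than touching a real CommandQueue.

```go
package main

import "fmt"

// cmd is a hypothetical committed command carrying its pre-computed writes.
type cmd struct {
	id     int
	writes map[string]string
}

// applyCommitted responds to every committed command first, then applies one
// merged write batch, and only afterwards releases the commands from the
// (simulated) CommandQueue. It returns the order of events for inspection.
func applyCommitted(cmds []cmd, engine map[string]string) []string {
	var events []string
	merged := map[string]string{}
	for _, c := range cmds {
		events = append(events, fmt.Sprintf("respond %d", c.id))
		for k, v := range c.writes {
			merged[k] = v // later commands overwrite earlier ones
		}
	}
	for k, v := range merged {
		engine[k] = v
	}
	events = append(events, "apply merged batch")
	for _, c := range cmds {
		events = append(events, fmt.Sprintf("release %d", c.id))
	}
	return events
}

func main() {
	engine := map[string]string{}
	events := applyCommitted([]cmd{
		{id: 1, writes: map[string]string{"a": "1"}},
		{id: 2, writes: map[string]string{"a": "2", "b": "3"}},
	}, engine)
	fmt.Println(events)
	fmt.Println(engine["a"], engine["b"])
}
```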

I'm not sure how this interacts with #16624, but I would also expect it to reduce the benefit we can gain from this change. Still, right now we're just leaving extra latency on the table, which could be especially detrimental to large Raft batches.

petermattis · 2017-08-08T15:54:39Z

Note that #16624 isn't making it into 1.1 due to the modest improvement combined with the stability concerns. We should keep #16624 in mind while making changes, but certainly not use it to block nearer term wins.

bdarnell · 2017-08-08T16:44:38Z

OK, if we've decided to punt #16624 into 1.2, this seems like a good idea.

petermattis · 2017-08-30T18:16:25Z

Another area for investigation is handling applying Raft commands differently on followers vs the leader. Followers need to keep track of the commit Raft log index, but they don't actually need to apply the commands until they become the leader. At the very least this suggests there are opportunities for batching of the command application.

bdarnell · 2017-08-30T18:24:27Z

> they don't actually need to apply the commands until they become the leader

They need to be caught up (at least with respect to all ChangeReplicas commits) to become a candidate, not just the leader. And I think even as a follower, falling too far behind on ChangeReplicas can be a bad thing. But yes, queueing up changes and batching them on followers can be helpful.

bdarnell · 2017-08-30T18:37:03Z

Another possibility: Currently, in Replica.handleRaftReady, we write all new log entries and the HardState to disk before sending any messages. This is conservative; some messages can be sent concurrently with the write to disk and this would improve latency by allowing the followers to start their disk syncs before the leader has completed its.

Specifically, a MsgApp can be sent before the Entries it contains have been synced to disk. However, MsgApp also contains the Commit index (stored in the HardState), and it cannot be sent until the corresponding HardState has been persisted (I think). In theory, most MsgApps either contain new Entries or an updated Commit index, but not both, so many MsgApp messages should be able to take advantage of this optimization.
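One way to read the rule above, as a sketch only: a MsgApp may go out before the local sync as long as the commit index it advertises is already covered by the persisted HardState. The `msgApp` type and `canSendBeforeSync` function are hypothetical, and the rule itself carries the same "(I think)" caveat as the comment.

```go
package main

import "fmt"

// msgApp is a hypothetical, reduced stand-in for a Raft MsgApp.
type msgApp struct {
	entries int    // number of log entries carried
	commit  uint64 // commit index advertised to the follower
}

// canSendBeforeSync encodes the rule as tentatively stated in the comment:
// unsynced entries are fine, but the advertised commit index must already be
// covered by the persisted HardState.
func canSendBeforeSync(m msgApp, persistedCommit uint64) bool {
	return m.commit <= persistedCommit
}

func main() {
	const persistedCommit = 10
	msgs := []msgApp{
		{entries: 3, commit: 10}, // new entries, old commit index
		{entries: 0, commit: 12}, // no entries, advanced commit index
	}
	for _, m := range msgs {
		fmt.Printf("%+v early=%v\n", m, canSendBeforeSync(m, persistedCommit))
	}
}
```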

nvanbenschoten · 2017-08-30T18:44:07Z

Batching the application of Raft commands would also be simpler on followers because followers don't need to deal with the proposal of future commands. Prop eval KV requires that a future proposal look at the MVCC/engine layer to compute its WriteBatch only after all prerequisite commands have been applied. This constrains us in the way I said before:

> Note that while I think we can respond to clients sooner, I don't think we can pop the command out of the CommandQueue until after it has actually been executed because of how proposer evaluated kv works upstream of Raft.

I foresee this constraint making it more difficult for any batching mechanism on the leaseholder, because batching might delay the proposal of future commands.

nvanbenschoten · 2017-08-30T19:58:12Z

@bdarnell you're referencing the optimized Raft pipeline from section 10.2.1 in the Raft thesis, right?

It certainly seems like a clear win, although I'm not sure the interface exposed by etcd/raft would be suitable for the full extent of the optimization:

> The leader may even commit an entry before it has been written to its own disk, if a majority of followers have written it to their disks; this is still safe

I doubt supporting that case is very important in practice anyway, though.

bdarnell · 2017-08-30T20:06:32Z

Yes, more or less, although that diagram doesn't quite work for the architecture of etcd/raft. And I agree that the edge case of a leader committing an entry without having it in its own log is not worth supporting.

…ecute

This change addresses the first optimization discussed in cockroachdb#17500.

The change seems to work and provides a modest performance boost. Unfortunately, I don't think we'll want to consider merging it at the moment. The problem is that while it is technically safe to respond to clients before performing the Raft command application, doing so is a nightmare for testing. Pretty much every test in the `storage` package expects to be able to perform an operation and then "reach beneath raft" immediately to operate on the result. This can range from inspecting Raft entries to working on the most up-to-date `Replica` state. To support this change, all of these tests would need to be updated to handle the now asynchronous operations performed in `handleEvalResultRaftMuLocked`.

I addressed this by adding a testing knob called `DisableRaftRespBeforeApplication` in this change. The problem is that I don't feel very comfortable with it because we basically need to use it for all tests (indirectly through `multiTestContext` and `LocalTestCluster`) which means that we probably aren't testing this optimization thoroughly. We could disable the optimization on a finer granularity but this would become a serious issue for maintainability and I'm not sure it would be worth it.

Perhaps there's some middle ground between returning to the client after performing in-memory state updates but before performing persistent state updates? Something like calling:

1. `handleEvalResultRaftMuLocked`
2. `maybeRespondToClient`
3. `applyRaftCommand`

This would solve a lot of the testing issues present here without the need to use the `DisableRaftRespBeforeApplication` knob, but I'm almost certain that wouldn't be safe to do.

I think cockroachdb#15648 will run into a similar issue to this. We'll either need to block clients while we combine Raft batches or we'll need to update tests which expect a client response to be an indication that the command has already been applied in all cases. Things might not be as bad in that case though because less is being done asynchronously.

Referenced in cockroachdb#17500.

This change implements the optimization in the Raft thesis under the section: 10.2.1 Writing to the leader’s disk in parallel. The optimization allows the leader to sync new entries to its disk after it has sent the corresponding `MsgApp` messages, instead of before.

Here, we invoke this optimization by:

1. sending all MsgApps.
2. syncing all entries and Raft state to disk.
3. sending all other messages.

Release note (performance improvement): Raft followers now write to their disks in parallel with the leader.
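The three steps in this commit message can be sketched as a small message-partitioning routine. This is only an illustration: `msgType`, `message`, and `handleReady` are hypothetical stand-ins rather than the etcd/raft or CockroachDB types, and the send and sync steps are injected as callbacks so the ordering is observable.

```go
package main

import "fmt"

// msgType and message are hypothetical stand-ins for raftpb types.
type msgType int

const (
	msgApp msgType = iota
	msgAppResp
	msgVoteResp
)

type message struct {
	typ msgType
	to  int
}

// handleReady follows the three steps from the commit message: send MsgApps,
// sync entries and Raft state, then send everything else.
func handleReady(msgs []message, send func(message), syncToDisk func()) {
	var rest []message
	for _, m := range msgs {
		if m.typ == msgApp {
			send(m)
		} else {
			rest = append(rest, m)
		}
	}
	syncToDisk()
	for _, m := range rest {
		send(m)
	}
}

func main() {
	var trace []string
	handleReady(
		[]message{{msgAppResp, 1}, {msgApp, 2}, {msgApp, 3}, {msgVoteResp, 2}},
		func(m message) { trace = append(trace, fmt.Sprintf("send type=%d to=%d", m.typ, m.to)) },
		func() { trace = append(trace, "sync") },
	)
	for _, t := range trace {
		fmt.Println(t)
	}
}
```

Followers can then begin their own disk syncs while the leader's sync is still in flight, which is where the latency win comes from.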

nvanbenschoten · 2018-02-12T17:05:38Z

Most of this was addressed in #19229. The original idea was tested in #18710, where it did not show a significant speedup. I may revisit that branch in the future.

There are a few other ideas here related to batching of disk writes/syncs beneath Raft. None of these will be addressed in the 2.0 timeframe.

ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues the Batch to the WAL, adds to the memtable, and waits until the batch is visible in the memtable, and then returns to the caller. The caller is responsible for calling Batch.SyncWait to wait until the write to the WAL is fsynced. This change required splitting the WaitGroup in the Batch into two WaitGroups, so waiting for the visibility can happen separately from waiting for the WAL write. Additionally, the channel used as a semaphore for reserving space in the two lock-free queues is split into two channels, since dequeueing from these queues can happen in arbitrary order. There may be some performance overhead of pushing and popping from two channels instead of one. Informs cockroachdb/cockroach#17500 See discussion thread cockroachdb/cockroach#87050 (review)
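The WaitGroup split described above can be illustrated with a toy model that uses only the standard library. It is not Pebble's implementation: the `batch` and `db` types, the lowercase `applyNoSyncWait` and `syncWait` methods, and the `fsyncDone` channel that simulates fsync completion are all assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// batch and db are hypothetical stand-ins; Pebble's real commit pipeline is
// considerably more involved.
type batch struct {
	data      string
	published sync.WaitGroup // done once the batch is visible in the memtable
	synced    sync.WaitGroup // done once the WAL write has been "fsynced"
}

type db struct {
	mu        sync.Mutex
	memtable  []string
	durable   []string
	fsyncDone chan struct{} // closed by the caller to let simulated fsyncs finish
}

// applyNoSyncWait returns once the batch is visible, without waiting for the
// simulated fsync.
func (d *db) applyNoSyncWait(b *batch) {
	b.published.Add(1)
	b.synced.Add(1)
	go func() {
		<-d.fsyncDone
		d.mu.Lock()
		d.durable = append(d.durable, b.data)
		d.mu.Unlock()
		b.synced.Done()
	}()
	d.mu.Lock()
	d.memtable = append(d.memtable, b.data)
	d.mu.Unlock()
	b.published.Done()
	b.published.Wait()
}

// syncWait blocks until the batch is durable.
func (b *batch) syncWait() { b.synced.Wait() }

// counts returns the number of visible and durable batches.
func (d *db) counts() (visible, durable int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	return len(d.memtable), len(d.durable)
}

func main() {
	d := &db{fsyncDone: make(chan struct{})}
	b := &batch{data: "k=v"}
	d.applyNoSyncWait(b)
	fmt.Println(d.counts()) // visible but not yet durable
	close(d.fsyncDone)
	b.syncWait()
	fmt.Println(d.counts())
}
```

Separating the two waits is what lets a caller continue as soon as the write is visible and pick up the durability signal later.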

Fixes #12257.

This change adds opt-in support to raft to perform local storage writes asynchronously from the raft state machine handling loop.

A new AsyncStorageWrites configuration instructs the raft node to write to its local storage (raft log and state machine) using a request/response message passing interface instead of the default `Ready`/`Advance` function call interface. Local storage messages can be pipelined and processed asynchronously (with respect to `Ready` iteration), facilitating reduced interference between Raft proposals and increased batching of log appends and state machine application. As a result, use of asynchronous storage writes can reduce end-to-end commit latency and increase maximum throughput.

When AsyncStorageWrites is enabled, the `Ready.Messages` slice will include new `MsgStorageAppend` and `MsgStorageApply` messages. The messages will target a `LocalAppendThread` and a `LocalApplyThread`, respectively. Messages to the same target must be reliably processed in order. In other words, they can't be dropped (like messages over the network) and those targeted at the same thread can't be reordered. Messages to different targets can be processed in any order.

`MsgStorageAppend` carries Raft log entries to append, election votes to persist, and snapshots to apply. All writes performed in response to a `MsgStorageAppend` are expected to be durable. The message assumes the role of the Entries, HardState, and Snapshot fields in Ready.

`MsgStorageApply` carries committed entries to apply. The message assumes the role of the CommittedEntries field in Ready.

Local messages each carry one or more response messages which should be delivered after the corresponding storage write has been completed. These responses may target the same node or may target other nodes. The storage threads are not responsible for understanding the response messages, only for delivering them to the correct target after performing the storage write.

## Design Considerations

- There must be no regression for existing users that do not enable `AsyncStorageWrites`. For instance, CommittedEntries must not wait on unstable entries to be stabilized in cases where a follower is given committed entries in a MsgApp.
- Asynchronous storage work should use a message passing interface, like the rest of this library.
- The Raft leader and followers should behave symmetrically. Both should be able to use asynchronous storage writes for log appends and entry application.
- The LocalAppendThread on a follower should be able to send MsgAppResp messages directly to the leader without passing back through the raft state machine handling loop.
- The `unstable` log should remain true to its name. It should hold entries until they are stable and should not rely on an intermediate reliable cache.
- Pseudo-targets should be assigned to messages that target the local storage systems to denote required ordering guarantees.
- Code should be maximally unified across `AsyncStorageWrites=false` and `AsyncStorageWrites=true`. `AsyncStorageWrites=false` should be a special case of `AsyncStorageWrites=true` where the library hides the possibility of asynchrony.
- It should be possible to apply snapshots asynchronously, even though a snapshot touches both the Raft log state and the state machine. The library should make this easy for users to handle by delaying all committed entries until after the snapshot has applied, so snapshot application can be handled by 1) flushing the apply thread, 2) sending the `MsgStorageAppend` that contains a snapshot to the `LocalAppendThread` to be applied.

## Usage

When asynchronous storage writes is enabled, the responsibility of code using the library is different from what is presented in raft/doc.go (which has been updated to include a section about async storage writes). Users still read from the Node.Ready() channel. However, they process the updates it contains in a different manner. Users no longer consult the HardState, Entries, and Snapshot fields (steps 1 and 3 in doc.go). They also no longer call Node.Advance() to indicate that they have processed all entries in the Ready (step 4 in doc.go). Instead, all local storage operations are also communicated through messages present in the Ready.Messages slice.

The local storage messages come in two flavors. The first flavor is log append messages, which target a LocalAppendThread and carry Entries, HardState, and a Snapshot. The second flavor is entry application messages, which target a LocalApplyThread and carry CommittedEntries. Messages to the same target must be reliably processed in order. Messages to different targets can be processed in any order. Each local storage message carries a slice of response messages that must be delivered after the corresponding storage write has been completed.

With Asynchronous Storage Writes enabled, the total state machine handling loop will look something like this:

```go
for {
	select {
	case <-s.Ticker:
		s.Node.Tick()
	case rd := <-s.Node.Ready():
		// Route each message by target: local storage threads or the network.
		for _, m := range rd.Messages {
			switch m.To {
			case raft.LocalAppendThread:
				toAppend <- m
			case raft.LocalApplyThread:
				toApply <- m
			default:
				sendOverNetwork(m)
			}
		}
	case <-s.done:
		return
	}
}
```

Usage of Asynchronous Storage Writes will typically also contain a pair of storage handler threads, one for log writes (append) and one for entry application to the local state machine (apply). Those will look something like:

```go
// append thread
go func() {
	for {
		select {
		case m := <-toAppend:
			saveToStorage(m.State, m.Entries, m.Snapshot)
			// Deliver responses only after the write is durable.
			send(m.Responses)
		case <-s.done:
			return
		}
	}
}()

// apply thread
go func() {
	for {
		select {
		case m := <-toApply:
			for _, entry := range m.CommittedEntries {
				process(entry)
				if entry.Type == raftpb.EntryConfChange {
					var cc raftpb.ConfChange
					cc.Unmarshal(entry.Data)
					s.Node.ApplyConfChange(cc)
				}
			}
			send(m.Responses)
		case <-s.done:
			return
		}
	}
}()
```

## Compatibility

The library remains backwards compatible with existing users and the change does not introduce any breaking changes. Users that do not set `AsyncStorageWrites` to true in the `Config` struct will not notice a difference with this change. This is despite the fact that the existing "synchronous storage writes" interface was adapted to share a majority of the same code. For instance, `Node.Advance` has been adapted to transparently acknowledge an asynchronous log append attempt and an asynchronous state machine application attempt, internally using the same message passing mechanism introduced in this change.

The change has no cross-version compatibility concerns. All changes are local to a process and nodes using asynchronous storage writes appear to behave no differently from the outside. Clusters are free to mix nodes running with and without asynchronous storage writes.

## Performance

The bulk of the performance evaluation of this functionality thus far has been done with [rafttoy](https://github.com/nvanbenschoten/rafttoy), a benchmarking harness developed to experiment with Raft proposal pipeline optimization. The harness can be used to run single-node benchmarks or multi-node benchmarks. It supports pluggable raft logs, storage engines, network transports, and pipeline implementations.

To evaluate this change, we fixed the raft log (`etcd/wal`), storage engine (`pebble`), and network transport (`grpc`). We then built (nvanbenschoten/rafttoy#3) a pipeline implementation on top of the new asynchronous storage writes functionality and compared it against two other pipeline implementations.

The three pipeline implementations we compared were:

- **basic** (P1): baseline stock raft usage, similar to the code in `doc.go`
- **parallel append + early ack** (P2): CockroachDB's current pipeline, which includes two significant variations to the basic pipeline. The first is that it sends MsgApp messages to followers before writing to local Raft log (see [commit](cockroachdb/cockroach@b67eb69) for explanation), allowing log appends to occur in parallel across replicas. The second is that it acknowledges committed log entries before applying them (see [commit](cockroachdb/cockroach@87aaea7) for explanation).
- **async append + async apply + early ack** (P3): A pipeline using asynchronous storage writes with a separate append thread and a separate apply thread. Also uses the same early acknowledgement optimization from above to ack committed entries before handing them to the apply thread.

All testing was performed on a 3 node AWS cluster of m5.4xlarge instances with gp3 EBS volumes (16000 IOPS, 1GB/s throughput).

![Throughput vs latency of Raft proposal pipeline implementations](https://user-images.githubusercontent.com/5438456/197925200-11352c09-569b-460c-ae42-effbf407c4e5.svg)

The comparison demonstrates two different benefits of asynchronous storage writes.

The first is that it reduces end-to-end latency of proposals by 20-25%. For instance, when serving 16MB/s of write traffic, P1's average latency was 13.2ms, P2's average latency was 7.3ms, and P3's average latency was 5.24ms. This is a reduction in average latency of 28% from the optimized pipeline that does not use asynchronous storage writes. This matches expectations outlined in cockroachdb/cockroach#17500.

The second is that it increases the maximum throughput at saturation. This is because asynchronous storage writes can improve batching for both log appends and log application. In this experiment, we saw the average append batch size under saturation increase from 928 to 1542, which is a similar ratio to the increase in peak throughput. We see a similar difference for apply batch sizes.

There is more benchmarking to do. For instance, we'll need to thoroughly verify that this change does not negatively impact the performance of users of this library that do not use asynchronous storage writes.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

Fixes cockroachdb#17500.

Waiting on github.com/cockroachdb/pebble/pull/2117.

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in github.com/etcd-io/raft/pull/8.

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in cockroachdb#17500. Those changes will also be impactful, but they're non-trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items (TODO(nvanbenschoten): open new issues).

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in github.com/cockroachdb/pebble/pull/2117. The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

github.com/etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate increased opportunity for batching.

TODO: add images

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.

nvanbenschoten · 2023-01-07T02:56:35Z

The core of this issue is going to be addressed by #94165. That PR integrates the raft library changes that we made in etcd-io/raft#8 to support asynchronous local storage writes (both raft log appends and state machine applications). The PR does so by pulling log append disk write syncs out of the raft state machine loop using the new DB.ApplyNoSyncWait pebble API. The benchmark results attached to that PR demonstrate the kind of latency improvements we had predicted earlier in this issue.
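The shape of that integration can be sketched without Pebble or etcd/raft: the loop performs the non-durable write inline and hands the durability wait to a goroutine, which delivers the response messages once the (simulated) fsync completes. Everything here is hypothetical: `storageAppend`, `raftLog`, `handleAppend`, and the `synced` channel stand in for the real message type, log, handler, and fsync signal.

```go
package main

import (
	"fmt"
	"sync"
)

// storageAppend is a hypothetical stand-in for a MsgStorageAppend: entries to
// write plus the responses to deliver once the write is durable.
type storageAppend struct {
	entries   []string
	responses []string
}

// raftLog simulates a log whose writes become durable when synced is closed.
type raftLog struct {
	mu        sync.Mutex
	entries   []string
	delivered []string
	synced    chan struct{}
	pending   sync.WaitGroup
}

// handleAppend is what the state machine loop would call: it performs the
// non-durable write inline and hands the durability wait to a goroutine,
// which delivers the responses after the simulated fsync completes.
func (l *raftLog) handleAppend(m storageAppend) {
	l.mu.Lock()
	l.entries = append(l.entries, m.entries...)
	l.mu.Unlock()

	l.pending.Add(1)
	go func() {
		defer l.pending.Done()
		<-l.synced
		l.mu.Lock()
		l.delivered = append(l.delivered, m.responses...)
		l.mu.Unlock()
	}()
}

// deliveredCount returns how many responses have been delivered so far.
func (l *raftLog) deliveredCount() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.delivered)
}

func main() {
	l := &raftLog{synced: make(chan struct{})}
	l.handleAppend(storageAppend{
		entries:   []string{"e1", "e2"},
		responses: []string{"response to leader", "response to local raft"},
	})
	fmt.Println("delivered before sync:", l.deliveredCount())
	close(l.synced)
	l.pending.Wait()
	fmt.Println("delivered after sync:", l.deliveredCount())
}
```

In this model `handleAppend` returns immediately, so the loop is free to process the next Ready while earlier writes are still waiting on durability.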

However, the PR does not make an effort to extract the non-durable portion of raft log writes or state machine application onto separate goroutine pools. Both of these changes could have additional benefits (faster state machine loop iteration => less interference between entries, larger append and apply batches => more efficient raft operations). I've opened #94853 and #94854 to track the remainder of this work.

ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues the Batch to the WAL, adds to the memtable, and waits until the batch is visible in the memtable, and then returns to the caller. The caller is responsible for calling Batch.SyncWait to wait until the write to the WAL is fsynced.

This change required splitting the WaitGroup in the Batch into two WaitGroups, so waiting for the visibility can happen separately from waiting for the WAL write. Additionally, the channel used as a semaphore for reserving space in the two lock-free queues is split into two channels, since dequeueing from these queues can happen in arbitrary order.

Benchmarks indicate that the overhead of pushing and popping on an extra channel is tolerable. Benchmarks were run on a macbook pro -- note these are not doing an actual sync since they use io.Discard, and are only benchmarking the commit pipeline.

Sync wait on master (old) vs this branch (new):

name                                               old time/op    new time/op    delta
CommitPipeline/no-sync-wait=false/parallel=1-10    1.09µs ± 6%    1.15µs ± 9%    ~        (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    1.53µs ± 4%    1.54µs ± 2%    ~        (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    1.54µs ± 1%    1.59µs ± 1%    +2.87%   (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    1.52µs ± 1%    1.55µs ± 1%    +2.43%   (p=0.008 n=5+5)

name                                               old speed      new speed      delta
CommitPipeline/no-sync-wait=false/parallel=1-10    14.7MB/s ± 5%  13.9MB/s ±10%  ~        (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    10.5MB/s ± 4%  10.4MB/s ± 2%  ~        (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    10.4MB/s ± 1%  10.1MB/s ± 1%  -2.78%   (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    10.5MB/s ± 1%  10.3MB/s ± 1%  -2.35%   (p=0.008 n=5+5)

name                                               old alloc/op   new alloc/op   delta
CommitPipeline/no-sync-wait=false/parallel=1-10    1.37kB ± 0%    1.40kB ± 0%    +2.31%   (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    1.37kB ± 0%    1.40kB ± 0%    +2.31%   (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    1.37kB ± 0%    1.40kB ± 0%    +2.15%   (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    1.37kB ± 0%    1.40kB ± 0%    +2.34%   (p=0.008 n=5+5)

name                                               old allocs/op  new allocs/op  delta
CommitPipeline/no-sync-wait=false/parallel=1-10    2.00 ± 0%      2.00 ± 0%      ~        (all equal)
CommitPipeline/no-sync-wait=false/parallel=2-10    2.00 ± 0%      2.00 ± 0%      ~        (all equal)
CommitPipeline/no-sync-wait=false/parallel=4-10    2.00 ± 0%      2.00 ± 0%      ~        (all equal)
CommitPipeline/no-sync-wait=false/parallel=8-10    2.00 ± 0%      2.00 ± 0%      ~        (all equal)

Sync wait on this branch (old) vs async wait on this branch (new):

name                             old time/op    new time/op    delta
CommitPipeline/parallel=1-10     1.15µs ± 9%    1.20µs ± 7%    ~        (p=0.421 n=5+5)
CommitPipeline/parallel=2-10     1.54µs ± 2%    1.59µs ± 2%    +3.50%   (p=0.008 n=5+5)
CommitPipeline/parallel=4-10     1.59µs ± 1%    1.58µs ± 1%    ~        (p=0.802 n=5+5)
CommitPipeline/parallel=8-10     1.55µs ± 1%    1.56µs ± 1%    ~        (p=0.452 n=5+5)

name                             old speed      new speed      delta
CommitPipeline/parallel=1-10     13.9MB/s ±10%  13.3MB/s ± 7%  ~        (p=0.421 n=5+5)
CommitPipeline/parallel=2-10     10.4MB/s ± 2%  10.1MB/s ± 2%  -3.36%   (p=0.008 n=5+5)
CommitPipeline/parallel=4-10     10.1MB/s ± 1%  10.1MB/s ± 1%  ~        (p=0.786 n=5+5)
CommitPipeline/parallel=8-10     10.3MB/s ± 1%  10.3MB/s ± 1%  ~        (p=0.452 n=5+5)

name                             old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-10     1.40kB ± 0%    1.40kB ± 0%    ~        (p=0.651 n=5+5)
CommitPipeline/parallel=2-10     1.40kB ± 0%    1.39kB ± 0%    -0.21%   (p=0.008 n=5+5)
CommitPipeline/parallel=4-10     1.40kB ± 0%    1.40kB ± 0%    ~        (p=0.706 n=5+5)
CommitPipeline/parallel=8-10     1.40kB ± 0%    1.40kB ± 0%    ~        (p=0.587 n=5+5)

name                             old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-10     2.00 ± 0%      2.00 ± 0%      ~        (all equal)
CommitPipeline/parallel=2-10     2.00 ± 0%      2.00 ± 0%      ~        (all equal)
CommitPipeline/parallel=4-10     2.00 ± 0%      2.00 ± 0%      ~        (all equal)
CommitPipeline/parallel=8-10     2.00 ± 0%      2.00 ± 0%      ~        (all equal)

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)

94165: kv: integrate raft async storage writes r=nvanbenschoten a=nvanbenschoten

Fixes #17500.

Epic: CRDB-22644

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in etcd-io/raft/pull/8.

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in #17500. Those changes will also be impactful, but they're non-trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items. See #94853 and #94854.

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in cockroachdb/pebble/pull/2117. The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate an increased opportunity for batching.

To visualize the impact on latency, consider the following test. The experiment uses a 3-node GCP cluster with n2-standard-32 instances spread across three availability zones. It runs kv0 (write-only) against the cluster with 64-byte values. It then ramps up concurrency to compare throughput vs. average and tail latency.

_NOTE: log scales on x and y axes_

![Throughput vs average latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210719-bec842f6-1093-48cd-8be7-05a3d79c2a71.svg)

![Throughput vs tail latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210777-670a4d25-9516-41a2-b7e7-86b402004536.svg)

Async storage writes impact latency by different amounts at different throughputs, ranging from an improvement of 20% to 40% when the system is "well utilized". However, they increase latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.

| Throughput (B/s) | Throughput (qps) | Avg. Latency Δ | p99 Latency Δ |
| ---------------- | ---------------- | -------------- | ------------- |
| 63 KB/s          | 1,000            | -10.5%         | -8.8%         |
| 125 KB/s         | 2,000            | -7.1%          | -10.4%        |
| 250 KB/s         | 4,000            | -20%           | -11.2%        |
| 500 KB/s         | 8,000            | -16.6%         | -25.3%        |
| 1 MB/s           | 16,000           | -30.8%         | -44.0%        |
| 2 MB/s           | 32,000           | -38.2%         | -30.9%        |
| 4 MB/s           | 64,000           | 5.9%           | 9.4%          |

### Other benchmark results

```bash
name                   old ops/s    new ops/s    delta
# 50% read, 50% update
ycsb/A/nodes=3          16.0k ± 5%   16.2k ± 4%     ~     (p=0.353 n=10+10)
ycsb/A/nodes=3/cpu=32   28.7k ± 5%   33.8k ± 2%  +17.57%  (p=0.000 n=9+9)
# 95% read, 5% update
ycsb/B/nodes=3          29.9k ± 3%   30.2k ± 3%     ~     (p=0.278 n=9+10)
ycsb/B/nodes=3/cpu=32    101k ± 1%    100k ± 3%     ~     (p=0.274 n=8+10)
# 100% read
ycsb/C/nodes=3          40.4k ± 3%   40.0k ± 3%     ~     (p=0.190 n=10+10)
ycsb/C/nodes=3/cpu=32    135k ± 1%    137k ± 1%   +0.87%  (p=0.011 n=9+9)
# 95% read, 5% insert
ycsb/D/nodes=3          33.6k ± 3%   33.8k ± 3%     ~     (p=0.315 n=10+10)
ycsb/D/nodes=3/cpu=32    108k ± 1%    106k ± 6%     ~     (p=0.739 n=10+10)
# 95% scan, 5% insert
ycsb/E/nodes=3          3.79k ± 1%   3.73k ± 1%   -1.42%  (p=0.000 n=9+9)
ycsb/E/nodes=3/cpu=32   6.31k ± 5%   6.48k ± 6%     ~     (p=0.123 n=10+10)
# 50% read, 50% read-modify-write
ycsb/F/nodes=3          7.68k ± 2%   7.99k ± 2%   +4.11%  (p=0.000 n=10+10)
ycsb/F/nodes=3/cpu=32   15.6k ± 4%   18.1k ± 3%  +16.14%  (p=0.000 n=8+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.01 ± 5%    5.95 ± 4%     ~     (p=0.460 n=10+10)
ycsb/A/nodes=3/cpu=32    5.01 ± 4%    4.25 ± 4%  -15.19%  (p=0.000 n=9+10)
ycsb/B/nodes=3           4.80 ± 0%    4.77 ± 4%     ~     (p=0.586 n=7+10)
ycsb/B/nodes=3/cpu=32    1.90 ± 0%    1.90 ± 0%     ~     (all equal)
ycsb/C/nodes=3           3.56 ± 2%    3.61 ± 3%     ~     (p=0.180 n=10+10)
ycsb/C/nodes=3/cpu=32    1.40 ± 0%    1.40 ± 0%     ~     (all equal)
ycsb/D/nodes=3           2.87 ± 2%    2.85 ± 2%     ~     (p=0.650 n=10+10)
ycsb/D/nodes=3/cpu=32    1.30 ± 0%    1.34 ± 4%     ~     (p=0.087 n=10+10)
ycsb/E/nodes=3           25.3 ± 0%    25.7 ± 1%   +1.38%  (p=0.000 n=8+8)
ycsb/E/nodes=3/cpu=32    22.9 ± 5%    22.2 ± 6%     ~     (p=0.109 n=10+10)
ycsb/F/nodes=3           12.5 ± 2%    12.0 ± 1%   -3.72%  (p=0.000 n=10+9)
ycsb/F/nodes=3/cpu=32    9.27 ± 4%    7.98 ± 3%  -13.96%  (p=0.000 n=8+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           45.7 ±15%    35.7 ± 6%  -21.90%  (p=0.000 n=10+8)
ycsb/A/nodes=3/cpu=32    67.6 ±13%    55.3 ± 5%  -18.10%  (p=0.000 n=9+10)
ycsb/B/nodes=3           30.5 ±24%    29.4 ± 7%     ~     (p=0.589 n=10+10)
ycsb/B/nodes=3/cpu=32    12.8 ± 2%    13.3 ± 7%     ~     (p=0.052 n=10+8)
ycsb/C/nodes=3           14.0 ± 3%    14.2 ± 0%     ~     (p=0.294 n=10+8)
ycsb/C/nodes=3/cpu=32    5.80 ± 0%    5.70 ± 5%     ~     (p=0.233 n=7+10)
ycsb/D/nodes=3           12.4 ± 2%    11.7 ± 3%   -5.32%  (p=0.001 n=10+10)
ycsb/D/nodes=3/cpu=32    6.30 ± 0%    5.96 ± 6%   -5.40%  (p=0.001 n=10+10)
ycsb/E/nodes=3           81.0 ± 4%    83.9 ± 0%   +3.63%  (p=0.012 n=10+7)
ycsb/E/nodes=3/cpu=32     139 ±19%     119 ±12%  -14.46%  (p=0.021 n=10+10)
ycsb/F/nodes=3            122 ±17%     103 ±10%  -15.48%  (p=0.002 n=10+8)
ycsb/F/nodes=3/cpu=32     146 ±20%     133 ± 7%   -8.89%  (p=0.025 n=10+10)
```

The way to interpret these results is that async raft storage writes reduce latency and, as a result of the closed-loop nature of the workload, also increase throughput for the YCSB variants that perform writes and aren't already CPU saturated. Variants that are read-only are unaffected. Variants that are CPU-saturated do not benefit from the change because they are already bottlenecked on CPU resources and cannot push any more load (see above).

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.

96458: sql: fixes statement contention count metric r=j82w a=j82w

Fixes a bug introduced in #94750 where the metric was counting transactions that hit contention events instead of counting statements.

closes: #96429

Release note: none

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>
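The closed-loop interpretation above can be sanity-checked with Little's law (concurrency = throughput × latency). The following is a rough sketch, assuming the YCSB runs use a fixed number of workers and that the reported average latency covers all operations; `impliedConcurrency` and `predictedThroughputGain` are hypothetical helper names introduced here, and the inputs are the `ycsb/A/nodes=3/cpu=32` rows from the results above.

```go
package main

import "fmt"

// impliedConcurrency applies Little's law, N = X * R, to estimate how many
// operations are in flight given throughput (ops/s) and average latency (ms).
func impliedConcurrency(opsPerSec, avgLatencyMs float64) float64 {
	return opsPerSec * avgLatencyMs / 1000
}

// predictedThroughputGain returns the fractional throughput change expected
// from a closed-loop workload with fixed concurrency when average latency
// moves from oldMs to newMs.
func predictedThroughputGain(oldMs, newMs float64) float64 {
	return oldMs/newMs - 1
}

func main() {
	// ycsb/A/nodes=3/cpu=32: 28.7k -> 33.8k ops/s, 5.01 -> 4.25 ms average.
	fmt.Printf("implied concurrency (old): %.1f\n", impliedConcurrency(28700, 5.01))
	fmt.Printf("implied concurrency (new): %.1f\n", impliedConcurrency(33800, 4.25))
	fmt.Printf("predicted throughput gain: %+.1f%%\n",
		100*predictedThroughputGain(5.01, 4.25))
}
```

Both runs imply roughly 144 operations in flight, and the latency drop alone predicts a throughput gain of about 18%, close to the measured +17.57%. That is consistent with the throughput increase being a side effect of lower latency at constant concurrency rather than of additional batching.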

nvanbenschoten self-assigned this Aug 7, 2017

nvanbenschoten added this to the 1.1 milestone Aug 9, 2017

petermattis modified the milestones: 1.2, 1.1 Aug 17, 2017

nvanbenschoten mentioned this issue Sep 23, 2017

[DNM] storage: respond to Raft proposals after entries commit, not execute #18710

Closed

nvanbenschoten mentioned this issue Oct 12, 2017

storage: sync entries to disk in parallel with followers #19229

Merged

nvanbenschoten mentioned this issue Dec 7, 2017

storage: remove unnecessary sideload inlining, add assertion #20573

Merged

nvanbenschoten removed this from the 2.0 milestone Feb 12, 2018

sumeerbhola mentioned this issue Nov 9, 2022

db: add DB.ApplyNoSyncWait for asynchronous apply cockroachdb/pebble#2117

Merged

nvanbenschoten mentioned this issue Dec 13, 2022

raft: support asynchronous storage writes etcd-io/raft#8

Merged

nvanbenschoten mentioned this issue Dec 22, 2022

kv: integrate raft async storage writes #94165

Merged

This was referenced Jan 6, 2023

kv: append log entries outside of raft state machine loop #94853

Open

kv: apply log entries outside of raft state machine loop #94854

Open

craig bot closed this as completed in 702ff6f Feb 3, 2023

nvanbenschoten mentioned this issue Jun 29, 2023

kv: make disk reads asynchronous with respect to Raft state machine #105850

Open

