
storage: sync entries to disk in parallel with followers #19229

Merged

Conversation

@nvanbenschoten (Member) commented Oct 12, 2017

Referenced in #17500.

This change implements the optimization in the Raft thesis under the
section: 10.2.1 Writing to the leader’s disk in parallel. The optimization
allows the leader to sync new entries to its disk after it has sent the
corresponding MsgApp messages, instead of before.


Here, we invoke this optimization by:

  1. sending all MsgApps.
  2. syncing all entries and Raft state to disk.
  3. sending all other messages.
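
In etcd/raft terms, the per-Ready handling loop ends up looking roughly like the sketch below. splitMsgApps is the helper added in this PR; handleReady, sendMessages, and syncToDisk are hypothetical stand-ins for the real transport and storage plumbing.

// Sketch only: assumes package storage and the etcd raft package.
func handleReady(rn raft.Node, rd raft.Ready) {
    // 1. Send the MsgApps first, so followers can start writing the new
    //    entries to their disks while the leader writes to its own.
    msgApps, otherMsgs := splitMsgApps(rd.Messages)
    sendMessages(msgApps)

    // 2. Sync the new entries and the Raft HardState to local disk.
    syncToDisk(rd.Entries, rd.HardState)

    // 3. Send everything else. Some of these messages (e.g. vote
    //    responses) must not be sent until the HardState is durable,
    //    which is why they wait for the sync.
    sendMessages(otherMsgs)

    rn.Advance()
}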

The write latency speedup we see from this change is promising. On a
write-only workload it demonstrates anywhere from a 5% to 15%
speedup in average latency.

For instance, when running:

./kv --read-percent=0 -max-block-bytes=512 -min-block-bytes=32

we get the following histograms (thanks @jordanlewis!):

[histogram zoomed on average latency]

[histogram zoomed on tail latency]


@nvanbenschoten (Member, Author)

I'd like to do some testing with this on real clusters, but I guess I'll have to wait until
we're out of the cluster drought.

@petermattis (Collaborator)

I think roachprod is a thing now. Talk to @mberhault.

@mberhault (Contributor)

It is, at least the initial version: https://github.com/cockroachlabs/roachprod
crl-prod (please pull latest) knows about machines created with it, so you can pull binaries etc... I'll either hook it up with roachperf or improve the tools to quickly setup/run a cockroach cluster.

@nvanbenschoten (Member, Author)

Great, thanks Marc!

@bdarnell (Contributor)

Quoting the commit message:

We're a little conservative about when we invoke this optimization, opting to only do so when all Entries in a raft.Ready struct support it. We could split the Entries into those that need to be synced before the messages are sent and those that can be synced after, but that has been seen to exhibit worse performance because we end up syncing to disk twice.

The commit message indicates that we decide whether to do this based on the entries, but the code makes the decision based on the messages (writing to disk asynchronously if all messages are MsgApps). I think this is the better approach and could be structured to avoid two calls to sync (instead of "maybe sync, send messages, maybe sync", it could be "maybe send messages, sync, maybe send messages").

The latter construction has the nice property that we could easily transform it into "send appends, sync, send other messages". I think that would be safe. Keeping MsgApps in order is an important optimization, but I don't think it matters whether we reorder MsgApps relative to other messages. The one thing I'm not sure about is whether it's OK to send out a commit index in MsgApps that we have not ourselves written in HardState.Commit (I assume this is the reason for the IsEmptyHardState check in this diff). I think that's safe - the raft thesis says that the commit index may be treated as volatile state, although etcd/raft keeps it in HardState and I can't remember the details of why. So maybe we keep the IsEmptyHardState check to decide whether we send appends before or after syncing.
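
A minimal sketch of that "maybe send messages, sync, maybe send messages" structure, again with hypothetical helpers (allMsgApps reports whether every message in the batch is a MsgApp; raft.IsEmptyHardState is the real etcd/raft predicate):

// Either way there is exactly one sync; the only question is whether
// the messages are sent ahead of it.
sendEarly := allMsgApps(rd.Messages) && raft.IsEmptyHardState(rd.HardState)
if sendEarly {
    sendMessages(rd.Messages)
}
syncToDisk(rd.Entries, rd.HardState)
if !sendEarly {
    sendMessages(rd.Messages)
}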



@a-robinson (Contributor)

How much work would it be to finish this off? I think it'll be a pretty nice win when disks are heavily contended.

@nvanbenschoten (Member, Author)

Not very much, a day or two of work at most. It's currently grouped into the list of "experiment done, implementation pending" perf work that's on my timeline!

@nvanbenschoten (Member, Author)

@bdarnell looking back at the history of etcd/raft, I haven't been able to uncover a specific reason why the Commit index is included in the HardState. It does look like at one point it wasn't persisted which may mean that it's just a historical artifact. I'm going to keep digging, but I'd be very interested to hear whether @xiang90 or @yichengq have any insight into this.

Outside of that question, I think I've also been able to convince myself that broadcasting a MsgApp with a larger commit index than that in our persisted HardState.Commit is still safe. The situation where this gets interesting is when the leader crashes after broadcasting the larger commit index but before persisting it itself. In this case, one of the followers with the entry will necessarily become the next leader, per the leader completeness property and the fact that all majorities intersect. Since only replicas who have seen this entry can become future leaders, the entry will eventually be committed regardless of whether the broadcasted MsgApp with the larger commit index reached some of them or not.

Like you mention, Section 3.8 of the Raft thesis states the following:

Other state variables are safe to lose on a restart, as they can all be recreated. The most interesting example is the commit index, which can safely be reinitialized to zero on a restart. Even if every server restarts at the same time, the commit index will only temporarily lag behind its true value. Once a leader is elected and is able to commit a new entry, its commit index will advance, and it will quickly propagate this commit index to its followers.

So unless I'm missing something there, the only thing that would stop us from unconditionally doing the

"send appends, sync, send other messages"

approach (regardless of the IsEmptyHardState check) would be some implementation detail specific to etcd/raft.

@nvanbenschoten (Member, Author)

The most interesting example is the commit index, which can safely be reinitialized to zero on a restart.

Thinking about this more, if it isn't necessary to sync the commit index to disk (in both Raft and etcd/raft) I wonder if we'd see any difference from clearing it from the HardState unconditionally. This would shrink the proto on disk and more importantly avoid the separate key write that we do for about 50% of all messages.

@bdarnell (Contributor) commented Dec 5, 2017

Polling the group to get the commit index (as described in section 3.8) only works if you know the members of the group, but the members of the group may be changed by committed log entries (described in chapter 4). You need an approximately up-to-date view of the commit index in order to identify the correct nodes to poll. This is mainly an issue when the state machine is transient (or has very infrequent checkpoints) and recovery relies more heavily on replaying the log. This is the reason (IIRC) that the commit index is persisted.

I think that since we persist our applied index (and truncate the raft log) frequently, there is little value to us in storing the commit index. I think that removing the commit index from our persisted HardState and instead setting it to the applied index at startup would be safe. (or more conservatively, persist HardState.Commit only when we commit a config change)
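
A sketch of that startup rule, assuming a hypothetical loadAppliedIndex helper that reads the durably persisted applied index (raftStorage implements the real raft.Storage interface):

// Treat HardState.Commit as volatile on restart: reinitialize it from
// the applied index, which never exceeds the true commit index.
hs, _, _ := raftStorage.InitialState()
if applied := loadAppliedIndex(); hs.Commit < applied {
    hs.Commit = applied
}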

@nvanbenschoten (Member, Author)

I think that removing the commit index from our persisted HardState and instead setting it to the applied index at startup would be safe.

I experimented with making this change, which allowed us to sync the HardState less frequently. I was able to get all tests passing but it didn't seem to provide any noticeable performance improvement on kv running a 100% write workload (BenchmarkLoadgenKV/readPercent=0/splits=10/concurrency=48).

@xiang90 (Contributor) commented Dec 6, 2017

For etcd, it was just convenient. And we do not really see any performance improvement if we skip commit index sync.

@nvanbenschoten (Member, Author)

@bdarnell I've updated this to perform the "maybe send messages, sync, maybe send messages" strategy you discussed above. PTAL.

I'll follow up with more benchmarks.

@nvanbenschoten nvanbenschoten changed the title [WIP] storage: sync entries to disk in parallel with followers storage: sync entries to disk in parallel with followers Dec 7, 2017
@bdarnell (Contributor)

:lgtm:




pkg/storage/replica.go, line 3690 at r1 (raw file):

// splitMsgApps splits the Raft message slice into two slices, one containing
// MsgApps and one containing all other message types. Each slice retains the
// relative ordering between messages in the original slice.
func splitMsgApps(msgs []raftpb.Message) (msgApps, otherMsgs []raftpb.Message) {

This could use a unittest to make sure we get the boundary conditions right (I'm not sure how common mixed batches are otherwise in our tests)



@nvanbenschoten (Member, Author)



pkg/storage/replica.go, line 3690 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This could use a unittest to make sure we get the boundary conditions right (I'm not sure how common mixed batches are otherwise in our tests)

Done.
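
For reference, a table-driven test of splitMsgApps might look something like the following sketch (not necessarily the exact test that was committed; assumes package storage, reflect, and raftpb).

func TestSplitMsgApps(t *testing.T) {
    app := raftpb.Message{Type: raftpb.MsgApp}
    heartbeat := raftpb.Message{Type: raftpb.MsgHeartbeat}
    testCases := []struct {
        msgs, expApps, expOthers []raftpb.Message
    }{
        // Boundary conditions: empty input, all MsgApps, no MsgApps.
        {msgs: nil, expApps: nil, expOthers: nil},
        {msgs: []raftpb.Message{app}, expApps: []raftpb.Message{app}, expOthers: nil},
        {msgs: []raftpb.Message{heartbeat}, expApps: nil, expOthers: []raftpb.Message{heartbeat}},
        // A mixed batch must preserve relative order within each half.
        {
            msgs:      []raftpb.Message{heartbeat, app, heartbeat},
            expApps:   []raftpb.Message{app},
            expOthers: []raftpb.Message{heartbeat, heartbeat},
        },
    }
    for _, tc := range testCases {
        apps, others := splitMsgApps(tc.msgs)
        if !reflect.DeepEqual(apps, tc.expApps) || !reflect.DeepEqual(others, tc.expOthers) {
            t.Errorf("splitMsgApps(%v) = %v, %v; want %v, %v",
                tc.msgs, apps, others, tc.expApps, tc.expOthers)
        }
    }
}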



The merged commit message:

Referenced in cockroachdb#17500.

This change implements the optimization in the Raft thesis under the
section: 10.2.1 Writing to the leader’s disk in parallel. The optimization
allows the leader to sync new entries to its disk after it has sent the
corresponding `MsgApp` messages, instead of before.

Here, we invoke this optimization by:
1. sending all MsgApps.
2. syncing all entries and Raft state to disk.
3. sending all other messages.

Release note (performance improvement): Raft followers now write to
their disks in parallel with the leader.
@nvanbenschoten nvanbenschoten merged commit 872656c into cockroachdb:master Dec 12, 2017
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/asyncRaft branch December 12, 2017 16:29
@xiang90 (Contributor) commented Dec 12, 2017

@nvanbenschoten do you have a performance improvement measurement for this optimization?

@nvanbenschoten (Member, Author)

@xiang90 it's certainly going to depend on your workload and your cluster topology, but I can share the benchmark results I recorded right before merging this.


Workload

./kv --read-percent=0 -max-block-bytes=5120 -min-block-bytes=320 --concurrency=64

(see loadgen/kv)

Cluster

  • 3 nodes
  • All in us-east1-b
  • n1-standard-4 (4 vCPUs, 15 GB memory) on GCE
  • Local SSDs

Results

# Without Change:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          66266         1104.4     57.9     56.6     88.1    134.2    234.9

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   66266	    905500 ns/op

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          66396         1106.5     57.8     56.6     88.1    134.2    234.9

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   66396	    903729 ns/op

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          66948         1115.7     57.4     56.6     88.1    134.2    285.2

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   66948	    896286 ns/op


# With Change:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          83230         1387.1     46.1     44.0     75.5    100.7    201.3

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   83230	    720930 ns/op

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          81610         1360.1     47.0     44.0     79.7    104.9    234.9

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   81610	    735263 ns/op

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          80888         1348.0     47.5     44.0     83.9    121.6    302.0

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   80888	    741821 ns/op

Performance Improvement

(averaged over three trials before and after the change)

Throughput: +23%
Avg Latency: -19%
95th Latency: -10%
99th Latency: -19%
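
(Working the throughput number from the runs above: (1104.4 + 1106.5 + 1115.7) / 3 ≈ 1108.9 ops/sec before versus (1387.1 + 1360.1 + 1348.0) / 3 ≈ 1365.1 ops/sec after, and 1365.1 / 1108.9 ≈ 1.23.)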

Keep in mind that this is a write-only workload with decently large writes (-max-block-bytes=5120 -min-block-bytes=320), which is the type of workload that should benefit most from this change.

@xiang90 (Contributor) commented Dec 12, 2017

@nvanbenschoten awesome! thank you!
