
storage: sync entries to disk in parallel with followers #19229

Merged

Conversation

@nvanbenschoten (Member) commented Oct 12, 2017

Referenced in #17500.

This change implements the optimization in the Raft thesis under the
section: 10.2.1 Writing to the leader’s disk in parallel. The optimization
allows the leader to sync new entries to its disk after it has sent the
corresponding MsgApp messages, instead of before.


Here, we invoke this optimization by:

  1. sending all MsgApps.
  2. syncing all entries and Raft state to disk.
  3. sending all other messages.
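
In etcd/raft terms, the per-Ready handling loop ends up looking roughly like the sketch below. splitMsgApps is the helper added in this PR; handleReady, sendMessages, and syncToDisk are hypothetical stand-ins for the real transport and storage plumbing.

// Sketch only: assumes package storage and the etcd raft package.
func handleReady(rn raft.Node, rd raft.Ready) {
    // 1. Send the MsgApps first, so followers can start writing the new
    //    entries to their disks while the leader writes to its own.
    msgApps, otherMsgs := splitMsgApps(rd.Messages)
    sendMessages(msgApps)

    // 2. Sync the new entries and the Raft HardState to local disk.
    syncToDisk(rd.Entries, rd.HardState)

    // 3. Send everything else. Some of these messages (e.g. vote
    //    responses) must not be sent until the HardState is durable,
    //    which is why they wait for the sync.
    sendMessages(otherMsgs)

    rn.Advance()
}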

The write latency speedup we see from this change is promising. On a
write-only workload it demonstrates anywhere from a 5% to 15%
speedup in average latency.

For instance, when running:

./kv --read-percent=0 -max-block-bytes=512 -min-block-bytes=32

we get the following histograms (thanks @jordanlewis!):

[histogram zoomed on average latency]

[histogram zoomed on tail latency]


@nvanbenschoten (Member, Author)

I'd like to do some testing with this on real clusters, but I guess I'll have to wait until
we're out of the cluster drought.

@petermattis (Collaborator)

I think roachprod is a thing now. Talk to @mberhault.

@mberhault (Contributor)

It is, at least the initial version: https://github.com/cockroachlabs/roachprod
crl-prod (please pull latest) knows about machines created with it, so you can pull binaries etc... I'll either hook it up with roachperf or improve the tools to quickly setup/run a cockroach cluster.

@nvanbenschoten (Member, Author)

Great, thanks Marc!

@bdarnell (Contributor)

Quoting the commit message:

We're a little conservative about when we invoke this optimization, opting to only do so when all Entries in a raft.Ready struct support it. We could split the Entries into those that need to be synced before the messages are sent and those that can be synced after, but that has been seen to exhibit worse performance because we end up syncing to disk twice.

The commit message indicates that we decide whether to do this based on the entries, but the code makes the decision based on the messages (writing to disk asynchronously if all messages are MsgApps). I think this is the better approach and could be structured to avoid two calls to sync (instead of "maybe sync, send messages, maybe sync", it could be "maybe send messages, sync, maybe send messages").

The latter construction has the nice property that we could easily transform it into "send appends, sync, send other messages". I think that would be safe. Keeping MsgApps in order is an important optimization, but I don't think it matters whether we reorder MsgApps relative to other messages. The one thing I'm not sure about is whether it's OK to send out a commit index in MsgApps that we have not ourselves written in HardState.Commit (I assume this is the reason for the IsEmptyHardState check in this diff). I think that's safe - the raft thesis says that the commit index may be treated as volatile state, although etcd/raft keeps it in HardState and I can't remember the details of why. So maybe we keep the IsEmptyHardState check to decide whether we send appends before or after syncing.
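
A minimal sketch of that "maybe send messages, sync, maybe send messages" structure, again with hypothetical helpers (allMsgApps reports whether every message in the batch is a MsgApp; raft.IsEmptyHardState is the real etcd/raft predicate):

// Either way there is exactly one sync; the only question is whether
// the messages are sent ahead of it.
sendEarly := allMsgApps(rd.Messages) && raft.IsEmptyHardState(rd.HardState)
if sendEarly {
    sendMessages(rd.Messages)
}
syncToDisk(rd.Entries, rd.HardState)
if !sendEarly {
    sendMessages(rd.Messages)
}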



@a-robinson (Contributor)

How much work would it be to finish this off? I think it'll be a pretty nice win when disks are heavily contended.

@nvanbenschoten (Member, Author)

Not very much, a day or two of work at most. It's currently grouped into the list of "experiment done, implementation pending" perf work that's on my timeline!

@nvanbenschoten (Member, Author)

@bdarnell looking back at the history of etcd/raft, I haven't been able to uncover a specific reason why the Commit index is included in the HardState. It does look like at one point it wasn't persisted which may mean that it's just a historical artifact. I'm going to keep digging, but I'd be very interested to hear whether @xiang90 or @yichengq have any insight into this.

Outside of that question, I think I've also been able to convince myself that broadcasting a MsgApp with a larger commit index than that in our persisted HardState.Commit is still safe. The situation where this gets interesting is when the leader crashes after broadcasting the larger commit index but before persisting it itself. In this case, one of the followers with the entry will necessarily become the next leader, per the leader completeness property and the fact that all majorities intersect. Since only replicas who have seen this entry can become future leaders, the entry will eventually be committed regardless of whether the broadcasted MsgApp with the larger commit index reached some of them or not.

Like you mention, Section 3.8 of the Raft thesis states the following:

Other state variables are safe to lose on a restart, as they can all be recreated. The most interesting example is the commit index, which can safely be reinitialized to zero on a restart. Even if every server restarts at the same time, the commit index will only temporarily lag behind its true value. Once a leader is elected and is able to commit a new entry, its commit index will advance, and it will quickly propagate this commit index to its followers.

So unless I'm missing something there, the only thing that would stop us from unconditionally doing the

"send appends, sync, send other messages"

approach (regardless of the IsEmptyHardState check) would be some implementation detail specific to etcd/raft.

@nvanbenschoten (Member, Author)

The most interesting example is the commit index, which can safely be reinitialized to zero on a restart.

Thinking about this more, if it isn't necessary to sync the commit index to disk (in both Raft and etcd/raft) I wonder if we'd see any difference from clearing it from the HardState unconditionally. This would shrink the proto on disk and more importantly avoid the separate key write that we do for about 50% of all messages.

@bdarnell (Contributor) commented Dec 5, 2017

Polling the group to get the commit index (as described in section 3.8) only works if you know the members of the group, but the members of the group may be changed by committed log entries (described in chapter 4). You need an approximately up-to-date view of the commit index in order to identify the correct nodes to poll. This is mainly an issue when the state machine is transient (or has very infrequent checkpoints) and recovery relies more heavily on replaying the log. This is the reason (IIRC) that the commit index is persisted.

I think that since we persist our applied index (and truncate the raft log) frequently, there is little value to us in storing the commit index. I think that removing the commit index from our persisted HardState and instead setting it to the applied index at startup would be safe. (or more conservatively, persist HardState.Commit only when we commit a config change)
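
A sketch of that startup rule, assuming a hypothetical loadAppliedIndex helper that reads the durably persisted applied index (raftStorage implements the real raft.Storage interface):

// Treat HardState.Commit as volatile on restart: reinitialize it from
// the applied index, which never exceeds the true commit index.
hs, _, _ := raftStorage.InitialState()
if applied := loadAppliedIndex(); hs.Commit < applied {
    hs.Commit = applied
}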

@nvanbenschoten (Member, Author)

I think that removing the commit index from our persisted HardState and instead setting it to the applied index at startup would be safe.

I experimented with making this change, which allowed us to sync the HardState less frequently. I was able to get all tests passing but it didn't seem to provide any noticeable performance improvement on kv running a 100% write workload (BenchmarkLoadgenKV/readPercent=0/splits=10/concurrency=48).

@xiang90 (Contributor) commented Dec 6, 2017

For etcd, it was just convenient. And we do not really see any performance improvement if we skip commit index sync.

@nvanbenschoten (Member, Author)

@bdarnell I've updated this to perform the "maybe send messages, sync, maybe send messages" strategy you discussed above. PTAL.

I'll follow up with more benchmarks.

@nvanbenschoten nvanbenschoten changed the title [WIP] storage: sync entries to disk in parallel with followers storage: sync entries to disk in parallel with followers Dec 7, 2017
@bdarnell (Contributor)

:lgtm:




pkg/storage/replica.go, line 3690 at r1 (raw file):

// splitMsgApps splits the Raft message slice into two slices, one containing
// MsgApps and one containing all other message types. Each slice retains the
// relative ordering between messages in the original slice.
func splitMsgApps(msgs []raftpb.Message) (msgApps, otherMsgs []raftpb.Message) {

This could use a unittest to make sure we get the boundary conditions right (I'm not sure how common mixed batches are otherwise in our tests)



@nvanbenschoten (Member, Author)



pkg/storage/replica.go, line 3690 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This could use a unittest to make sure we get the boundary conditions right (I'm not sure how common mixed batches are otherwise in our tests)

Done.
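
For reference, a table-driven test of splitMsgApps might look something like the following sketch (not necessarily the exact test that was committed; assumes package storage, reflect, and raftpb).

func TestSplitMsgApps(t *testing.T) {
    app := raftpb.Message{Type: raftpb.MsgApp}
    heartbeat := raftpb.Message{Type: raftpb.MsgHeartbeat}
    testCases := []struct {
        msgs, expApps, expOthers []raftpb.Message
    }{
        // Boundary conditions: empty input, all MsgApps, no MsgApps.
        {msgs: nil, expApps: nil, expOthers: nil},
        {msgs: []raftpb.Message{app}, expApps: []raftpb.Message{app}, expOthers: nil},
        {msgs: []raftpb.Message{heartbeat}, expApps: nil, expOthers: []raftpb.Message{heartbeat}},
        // A mixed batch must preserve relative order within each half.
        {
            msgs:      []raftpb.Message{heartbeat, app, heartbeat},
            expApps:   []raftpb.Message{app},
            expOthers: []raftpb.Message{heartbeat, heartbeat},
        },
    }
    for _, tc := range testCases {
        apps, others := splitMsgApps(tc.msgs)
        if !reflect.DeepEqual(apps, tc.expApps) || !reflect.DeepEqual(others, tc.expOthers) {
            t.Errorf("splitMsgApps(%v) = %v, %v; want %v, %v",
                tc.msgs, apps, others, tc.expApps, tc.expOthers)
        }
    }
}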



The merged commit message:

Referenced in cockroachdb#17500.

This change implements the optimization in the Raft thesis under the
section: 10.2.1 Writing to the leader’s disk in parallel. The optimization
allows the leader to sync new entries to its disk after it has sent the
corresponding `MsgApp` messages, instead of before.

Here, we invoke this optimization by:
1. sending all MsgApps.
2. syncing all entries and Raft state to disk.
3. sending all other messages.

Release note (performance improvement): Raft followers now write to
their disks in parallel with the leader.
@nvanbenschoten nvanbenschoten merged commit 872656c into cockroachdb:master Dec 12, 2017
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/asyncRaft branch December 12, 2017 16:29
@xiang90 (Contributor) commented Dec 12, 2017

@nvanbenschoten do you have a performance improvement measurement for this optimization?

@nvanbenschoten (Member, Author)

@xiang90 it's certainly going to depend on your workload and your cluster topology, but I can share the benchmark results I recorded right before merging this.


Workload

./kv --read-percent=0 -max-block-bytes=5120 -min-block-bytes=320 --concurrency=64

(see loadgen/kv)

Cluster

  • 3 nodes
  • All in us-east1-b
  • n1-standard-4 (4 vCPUs, 15 GB memory) on GCE
  • Local SSDs

Results

# Without Change:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          66266         1104.4     57.9     56.6     88.1    134.2    234.9

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   66266	    905500 ns/op

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          66396         1106.5     57.8     56.6     88.1    134.2    234.9

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   66396	    903729 ns/op

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          66948         1115.7     57.4     56.6     88.1    134.2    285.2

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   66948	    896286 ns/op


# With Change:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          83230         1387.1     46.1     44.0     75.5    100.7    201.3

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   83230	    720930 ns/op

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          81610         1360.1     47.0     44.0     79.7    104.9    234.9

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   81610	    735263 ns/op

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   60.0s        0          80888         1348.0     47.5     44.0     83.9    121.6    302.0

BenchmarkLoadgenKV/readPercent=0/splits=0/concurrency=64/duration=1m0s	   80888	    741821 ns/op

Performance Improvement

(averaged over three trials before and after the change)

Throughput: +23%
Avg Latency: -19%
95th Latency: -10%
99th Latency: -19%
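
(Working the throughput number from the runs above: (1104.4 + 1106.5 + 1115.7) / 3 ≈ 1108.9 ops/sec before versus (1387.1 + 1360.1 + 1348.0) / 3 ≈ 1365.1 ops/sec after, and 1365.1 / 1108.9 ≈ 1.23.)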

Keep in mind that this is a write-only workload with decently large writes (-max-block-bytes=5120 -min-block-bytes=320), which is the type of workload that should benefit most from this change.

@xiang90 (Contributor) commented Dec 12, 2017

@nvanbenschoten awesome! thank you!
