kv: integrate raft async storage writes #94165
Conversation
Force-pushed from 69c0ef7 to 9faec10
Force-pushed from 9faec10 to 7f77d1a
@tbg @sumeerbhola this is now ready for review! There's one outstanding issue around sideloaded SST cleanup, but we can address that concurrently with the rest of the review. @sumeerbhola this PR is going to require cockroachdb/pebble#2117. Were you planning to push that along in the coming weeks?
Do you have a / would you be able to run a flavor of this experiment that is I/O bound instead? One thing I hope can fall out of this change is more graceful degradation under I/O overload, since we won't be blocking the raft handling loop for extended periods of time.
Looks good! Need to better understand the leading raft entry cache issue and the sideloading truncation.
couple of drive-by nits
Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 3 of 10 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/logstore/logstore.go
line 130 at r3 (raw file):
// depending on the kv.raft_log.non_blocking_synchronization.enabled cluster
// setting. Either way, the effects of the log append will be immediately
// visible readers of the Engine.
nit: visible to readers ...
pkg/storage/engine.go
line 965 at r1 (raw file):
Commit(sync bool) error
// CommitNoSyncWait atomically applies any batched updates to the underlying
// engine and initiates a synchronous disk write, but does not wait for that
nit: initiates a disk write ...
(presumably it doesn't matter how that disk write is implemented -- async or sync)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/storage/engine.go
line 965 at r1 (raw file):
Previously, sumeerbhola wrote…
nit: initiates a disk write ...
(presumably it doesn't matter how that disk write is implemented -- async or sync)
Done.
Extracted from cockroachdb#94165. Picks up github.com/cockroachdb/pebble/pull/2117.

Release note: None

Epic: None
TFTRs!
Do you have a / would you be able to run a flavor of this experiment that is I/O bound instead?
Do you have any particular cases or test setups in mind? I'm happy to run more experiments so we can better understand the impact of this change.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/replica_raft.go
line 818 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
state.LastIndex can now contain non-durable entries? Seems sus
Good point. As of etcd-io/raft#8, we no longer allow application of unstable entries. Instead, we wait for them to stabilize locally before handing them in a Ready to be applied. I removed the handling of this case in a newly added commit.
pkg/kv/kvserver/replica_raft.go
line 949 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
what's the intended purpose of this message? Should we print it unconditionally on l.947?
The purpose is to add a trace event to the context attached to each committed entry, before applying. If we don't have any entries, then the call is a no-op. However, it does RLock the Replica either way, so we might as well make it conditional on there actually being entries to apply.
pkg/kv/kvserver/replica_raft.go
line 1459 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
These are the append thread (and maybe later apply) responding to async work, right? A little comment would be helpful.
Done.
pkg/kv/kvserver/replica_raft.go
line 1575 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
and?
Done.
pkg/kv/kvserver/logstore/logstore.go
line 130 at r3 (raw file):
Previously, sumeerbhola wrote…
nit: visible to readers ...
Done.
pkg/kv/kvserver/logstore/logstore.go
line 223 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe this dance between things we must sync and refusing to sync due to the disableSyncRaftLog setting could be a bit clearer if we renamed mustSync -> wantsSync and sync -> willSync. Just a thought, take it or leave it.
Done.
pkg/kv/kvserver/logstore/logstore.go
line 256 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
shouldn't this be mustSync? We are only recording to RaftLogCommitLatency if mustSync, and it makes sense for those two to be aligned. I think even when the unsafe setting is active, we should be recording to this histogram and "pretend" we've synced whenever it was requested. I.e. have the code be all the same, except we don't actually sync.
Done. I had come to a similar conclusion but decided not to change the behavior. But happy to switch.
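To pin down the outcome of those two threads, here is a minimal sketch of the decision, with hypothetical names (wantsSync/willSync follow the review suggestion; the setting and metric wiring are stand-ins, not the actual logstore code):

```go
package syncpolicysketch

// syncDecision captures the rename discussed above: wantsSync is what raft
// asked for, willSync is what we will actually do given the (unsafe)
// setting that disables raft log syncs. Illustrative names only.
type syncDecision struct {
	wantsSync bool // raft requires durability before acking this append
	willSync  bool // we will actually issue a sync
}

func decideSync(mustSync, disableSyncRaftLog bool) syncDecision {
	return syncDecision{
		wantsSync: mustSync,
		willSync:  mustSync && !disableSyncRaftLog,
	}
}

// recordCommitLatency illustrates the RaftLogCommitLatency point: the
// histogram is recorded whenever a sync was requested (wantsSync), even if
// the unsafe setting means we only "pretend" to have synced.
func recordCommitLatency(d syncDecision, observe func()) {
	if d.wantsSync {
		observe()
	}
}
```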
pkg/kv/kvserver/logstore/logstore.go
line 262 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe we should detect early in this method whether we're replacing a log tail, and if so disable the pipelining. It's not a common code path so not worth optimizing for.
Thanks, this was a good suggestion.
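A minimal sketch of that early check, with hypothetical names (useNonBlockingSync, entry, and the parameters are illustrative; the real logic lives in logstore.go):

```go
package appendsketch

// entry is a stand-in for a raft log entry.
type entry struct{ Index uint64 }

// useNonBlockingSync reports whether an append may take the non-blocking
// (pipelined) sync path. If the first appended entry is at or below the
// current last index, the append overwrites an existing log tail; that case
// is rare, so it falls back to the blocking path rather than complicating
// the pipelining.
func useNonBlockingSync(enabled bool, ents []entry, lastIndex uint64) bool {
	if !enabled || len(ents) == 0 {
		return false
	}
	overwritesLogTail := ents[0].Index <= lastIndex
	return !overwritesLogTail
}
```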
pkg/kv/kvserver/logstore/logstore.go
line 285 at r3 (raw file):
I added some words here, on replicaRaftStorage.Entries, and on r.mu.lastIndex. It is ok for the raft entry cache to lead the durable raft log, as long as it does not lead the pebble engine.
etcd/raft maintains its own tracking of entry durability by splitting its log into an unstable portion for entries that are not known to be durable and a stable portion for entries that are known to be durable. Its boundary of "stability" is defined by raftLog.unstable.offset, and it advances this boundary when entries become durable. It then overlays its unstable log over the raft.Storage interface, only ever querying raft.Storage below the unstable offset. See raftLog.firstIndex and raftLog.slice. So returning unstable entries from raft.Storage is allowed.
This is more than just happenstance. It's important that etcd/raft makes decisions by looking at the unstable log overlaid over the raft.Storage. That's because any entry beyond unstable.offset could be uncommitted, and therefore could be replaced by an in-progress append. Before async storage writes this was less common, but the etcd/raft library still allowed Stepping the raft state machine between a Ready() and an Advance() (though crdb didn't use this flexibility). For etcd/raft to properly synchronize log appends with other state transitions (e.g. campaigning), it needed to make sure that the other state transitions always saw the latest version of the local log, unstable or not.
The raft entry cache backs the raft.Storage implementation and raft assumes everything in it is durable.
Raft shouldn't be assuming that everything in it is durable. Even before this change, that wasn't the case, because replicaRaftStorage.Entries reads with StandardDurability instead of GuaranteedDurability. So it will read out of LSM memtables even before the corresponding LSM writes are stable. But again, this should be ok.
Couldn't this lead to situations in which a replica campaigns using a log that isn't actually durable yet, and may distribute log entries not actually stable yet believing them to be stable?
It can lead to this situation, but is this a problem? If the candidate is elected and then crashes before the suffix of its log (that it campaigned on) is stable, then it cannot have voted for itself in the election or have voted for any new log entries after getting raft leadership. So if it was able to get any log entries committed to the replicated log, a quorum of other replicas must have durably appended them to their log.
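To make the overlay concrete, here is a simplified sketch of the raftLog.slice shape described above. This is not the actual etcd/raft code (which also handles size limits, bounds checks, and error cases); Entry, Storage, and the helper names are stand-ins:

```go
package raftoverlaysketch

// Entry and Storage are stand-ins for raftpb.Entry and raft.Storage.
type Entry struct {
	Index uint64
	Data  []byte
}

type Storage interface {
	// Entries returns stable entries in [lo, hi).
	Entries(lo, hi uint64) ([]Entry, error)
}

// unstableLog holds entries that are not yet known to be durable.
// offset is the index of the first unstable entry.
type unstableLog struct {
	offset  uint64
	entries []Entry
}

func (u *unstableLog) slice(lo, hi uint64) []Entry {
	return u.entries[lo-u.offset : hi-u.offset]
}

type raftLog struct {
	storage  Storage // backed by the entry cache and Pebble in CockroachDB
	unstable unstableLog
}

// slice returns entries in [lo, hi). Storage is only ever consulted below
// the unstable offset; everything at or above it is served from the
// in-memory unstable log. This is why the entry cache (and memtable) may
// lead the durable raft log without confusing raft.
func (l *raftLog) slice(lo, hi uint64) ([]Entry, error) {
	var ents []Entry
	if lo < l.unstable.offset {
		stable, err := l.storage.Entries(lo, minIndex(hi, l.unstable.offset))
		if err != nil {
			return nil, err
		}
		ents = stable
	}
	if hi > l.unstable.offset {
		ents = append(ents, l.unstable.slice(maxIndex(lo, l.unstable.offset), hi)...)
	}
	return ents, nil
}

func minIndex(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

func maxIndex(a, b uint64) uint64 {
	if a > b {
		return a
	}
	return b
}
```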
pkg/kv/kvserver/logstore/sync_waiter.go
line 34 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mention the invariant that callbacks are notified in completion order, i.e. if a callback completes it is true that all previous operations have also succeeded.
Done, though I did not mention that the one callback implies the completion of all previous operations because users of this loop should not need to think about the relationship between different syncs. I think it's also implied by the invariant you suggested below.
pkg/kv/kvserver/logstore/sync_waiter.go
line 35 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Invariant: callbacks are invoked in order and without concurrency.
Done.
pkg/kv/kvserver/logstore/sync_waiter.go
line 78 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm, to avoid fsyncgate we'd have to hold on to a local var err error and make sure that if one caller gets an error, they all get an error (i.e. basically poisoning the loop). So maybe fataling is fine; if the disk is broken the disk is broken.
Yeah, I think Fatal-ing is the safest bet to handle errors here. In addition to what you said, errors returned from SyncWait should only be due to invariant violations and never due to unrecoverable Pebble states. See the discussion in https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1674678349690899 and the TODOs added in #95862.
pkg/kv/kvserver/logstore/sync_waiter.go
line 90 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// It may never be called in case the stopper stops.
Done.
pkg/kv/kvserver/logstore/sync_waiter.go
line 98 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
do you want to pool these?
I don't think there's a need. They shouldn't allocate because the channel has a chan syncBatch type. Notice that the contained type is concrete, so there's no indirection or boxing.
To demonstrate, I added BenchmarkSyncWaiterLoop, which allocates 0 times per iteration:
name time/op
SyncWaiterLoop-10 375ns ± 4%
name alloc/op
SyncWaiterLoop-10 0.00B
name allocs/op
SyncWaiterLoop-10 0.00
pkg/kv/kvserver/logstore/sync_waiter.go
line 103 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This needs a log.EveryN. Also, is this really a warning? Actual question, not a rhetorical one.
Done. It is a warning because it's unexpected. We size the enqueue channel with enough capacity to hold as many in-progress sync operations as Pebble allows (pebble/record.SyncConcurrency). I left a comment about this.
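Pulling the sync_waiter.go threads together, here is a minimal, self-contained sketch of the pattern under discussion. Names, the queue sizing, and the warning are illustrative; the real code uses the stopper, log.EveryN, and pebble/record.SyncConcurrency:

```go
package syncwaitersketch

import "log"

// syncer is a stand-in for the storage batch handle whose SyncWait blocks
// until a previously initiated fsync completes.
type syncer interface {
	SyncWait() error
	Close()
}

// syncBatch pairs a pending sync with the callback to run once it is
// durable. The channel element is a concrete struct, not an interface, so
// enqueueing does not box or allocate (cf. BenchmarkSyncWaiterLoop above).
type syncBatch struct {
	wg syncer
	cb func()
}

type syncWaiterLoop struct {
	q chan syncBatch
}

// newSyncWaiterLoop sizes the queue to the maximum number of in-flight
// syncs the engine allows (pebble/record.SyncConcurrency in the real code),
// so enqueueing should never block in practice.
func newSyncWaiterLoop(capacity int) *syncWaiterLoop {
	return &syncWaiterLoop{q: make(chan syncBatch, capacity)}
}

// enqueue hands a pending sync to the waiter goroutine. Blocking here is
// unexpected, which is why it is surfaced as a warning (rate-limited via
// log.EveryN in the real code).
func (w *syncWaiterLoop) enqueue(wg syncer, cb func()) {
	select {
	case w.q <- syncBatch{wg: wg, cb: cb}:
	default:
		log.Printf("warning: sync waiter queue is full; blocking on enqueue")
		w.q <- syncBatch{wg: wg, cb: cb}
	}
}

// run processes syncs in FIFO order, so callbacks are invoked in completion
// order and without concurrency. Errors from SyncWait are treated as fatal,
// matching the "if the disk is broken the disk is broken" stance above.
func (w *syncWaiterLoop) run() {
	for b := range w.q {
		if err := b.wg.SyncWait(); err != nil {
			log.Fatalf("error waiting for log sync: %v", err)
		}
		b.cb()
		b.wg.Close()
	}
}
```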
pkg/kv/kvserver/logstore/sync_waiter_test.go
line 48 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Wouldn't it not block anyway? You have a q channel that presumably has some capacity, so maybe this should be done 2*cap(chan) times?
Done.
Force-pushed from 6093246 to 1c2135f
I added two new commits that introduce randomness into message delivery order in tests to ensure that re-ordering of local raft message responses is tolerated and handled properly.
New commits LGTM, I still need to give parts of the main commit a second read since it's been a while.
Did you make any changes in it other than the rebase?
Reviewed 2 of 10 files at r3, 17 of 17 files at r4, 4 of 4 files at r5, 2 of 2 files at r6, 1 of 1 files at r7, all commit messages.
Dismissed @nvanbenschoten from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica.go
line 405 at r4 (raw file):
// will have to be retrieved from the Raft log entry. Use the
// invalidLastTerm constant for this case.
lastIndex, lastTerm uint64
This is a semantic change, right? This was previously durable. You've convinced me (in another comment thread) that it's fine that this is getting picked up by replicaRaftStorage. And when we read pebble, we'll always see the lastIndex entry with corresponding lastTerm as well, so I hope any other uses of this field would also be ok. I sleuthed a bit and there's
- raft log queue (should be ok since it lowers to committed index anyway)
- quiescing (ok)
It might still be worth renaming these to inflightLastIndex and inflightLastTerm (not in love with inflight, maybe you can think of something better) because I think there is the risk of someone using them like they were durable. I cede that this is likely an imaginary fear, but until we encapsulate these variables away from the "general public" I'd like to nurse that fear a little bit.
pkg/kv/kvserver/replica_raft.go
line 1617 at r7 (raw file):
}
// NB: we can reset messages in the localMsgs.recycled slice without holding
// the localMsgs mutex because no-one ever writes to localMsgs.recycled abd
if you catch another CI cycle anyway, abd
Fixes cockroachdb#17500.
Waiting on github.com/cockroachdb/pebble/pull/2117.

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in github.com/etcd-io/raft/pull/8.

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in cockroachdb#17500. Those changes will also be impactful, but they're non-trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items (TODO(nvanbenschoten): open new issues).

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in github.com/cockroachdb/pebble/pull/2117. The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

github.com/etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate increased opportunity for batching.

TODO: add images

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.
…previous LastIndex

This commit removes logic in `apply.Task.AckCommittedEntriesBeforeApplication` which was ensuring that we don't acknowledge committed Raft log entries before they were durable in a node's local Raft log. This is now ensured inside of the etcd/raft library when AsyncStorageWrites is enabled, as the comment added here describes.
This commit adds a testing facility to randomly disable non-blocking log syncs in tests. This ensures that we properly handle the case where non-blocking log sync is disabled while some log syncs are still in-progress.
This commit adds a testing facility to shuffle local raft messages before delivering them to the raft state machine. These are not required to be in order, so we shuffle them in tests to ensure that re-ordering is handled properly.
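For illustration, a minimal sketch of such a shuffle hook, assuming hypothetical names (message stands in for raftpb.Message, and the knob wiring differs in the real test infrastructure):

```go
package kvserversketch

import "math/rand"

// message is a stand-in for raftpb.Message.
type message struct {
	To   uint64
	Type uint64
}

// maybeShuffleLocalMsgs randomizes the delivery order of local raft
// responses when the testing knob is enabled. Local responses carry no
// ordering requirement, so shuffling exercises re-ordering tolerance.
func maybeShuffleLocalMsgs(rng *rand.Rand, shuffleEnabled bool, msgs []message) {
	if !shuffleEnabled {
		return
	}
	rng.Shuffle(len(msgs), func(i, j int) {
		msgs[i], msgs[j] = msgs[j], msgs[i]
	})
}
```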
Simple rename to make the semantics of these fields more clear and harder to misinterpret.
Force-pushed from 1c2135f to ebc33fa
Did you make any changes in it other than the rebase?
Just a rebase and addressing your comments. Nothing substantial beyond your suggested change to skip non-blocking log writes when a MsgStorageAppend is replacing existing log entries (in logstore.go).
I'm running this PR through the roachtest suite here to get a bit more confidence in it before we'll feel good about merging.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/replica.go
line 405 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is a semantic change, right? This was previously durable. You've convinced me (in another comment thread) that it's fine that this is getting picked up by replicaRaftStorage. And when we read pebble, we'll always see the lastIndex entry with corresponding lastTerm as well, so I hope any other uses of this field would also be ok. I sleuthed a bit and there's
- raft log queue (should be ok since it lowers to committed index anyway)
- quiescing (ok)
It might still be worth renaming these to inflightLastIndex and inflightLastTerm (not in love with inflight, maybe you can think of something better) because I think there is the risk of someone using them like they were durable. I cede that this is likely an imaginary fear, but until we encapsulate these variables away from the "general public" I'd like to nurse that fear a little bit.
That's all fair. It would be an issue if a new user of these fields misinterpreted their semantics and assumed that they imply durability. I've updated the names to lastIndexNotDurable and lastTermNotDurable.
pkg/kv/kvserver/replica_raft.go
line 1617 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
if you catch another CI cycle anyway, abd
Done.
We saw 14 individual roachtest failures, none of which look related and all of which have corresponding GitHub tracking issues on master (4 were fixed yesterday). CI and roachtests are all looking good for this change.
TFTR! bors r+
Essentially I'm hoping that by moving the log writes out of the critical path of the raft ready loop, we'll somehow do "better" under I/O overload - lower handle raft ready latencies, less build-up in the raft receive queues, and so on. An apples-to-apples comparison could be helpful here - run a workload where we eventually see LSM inversion under both sync and async log appends, and simply investigate the differences. It just so happens that @andrewbaptist is looking at such an experiment.
Build failed:
The test failure looks real, but I'm also able to hit it on bors r+
Build succeeded:
Fixes #17500.
Epic: CRDB-22644
This commit integrates with the AsyncStorageWrites functionality that we added to Raft in etcd-io/raft/pull/8.

Approach
The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in #17500. Those changes will also be impactful, but they're non-trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items. See #94853 and #94854.
With this change, asynchronous Raft log syncs are enabled by the new DB.ApplyNoSyncWait Pebble API introduced in cockroachdb/pebble/pull/2117. The handleRaftReady state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding MsgStorageAppend's response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.
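To sketch the hand-off described here (illustrative only: Batch stands in for the storage engine batch interface this PR extends, and enqueue stands in for the sync-waiter loop):

```go
package raftreadysketch

// Batch is a stand-in for the storage engine batch: CommitNoSyncWait
// applies the batch and initiates a disk write without waiting for it,
// and SyncWait blocks until that write is durable.
type Batch interface {
	CommitNoSyncWait() error
	SyncWait() error
	Close()
}

// appendAndAck is called from the raft ready loop for a MsgStorageAppend.
// It initiates the log write, then offloads the wait for durability so the
// loop can move on; onDurable delivers the MsgStorageAppend's response
// messages (locally and/or to the raft leader) once the fsync completes.
func appendAndAck(b Batch, enqueue func(Batch, func()), onDurable func()) error {
	if err := b.CommitNoSyncWait(); err != nil {
		return err
	}
	// The sync-waiter goroutine will call b.SyncWait(), then onDurable(),
	// then b.Close(), in completion order.
	enqueue(b, onDurable)
	return nil
}
```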
Benchmark Results
The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.
etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.
When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate an increased opportunity for batching.
To visualize the impact on latency, consider the following test. The experiment uses a 3-node GCP cluster with n2-standard-32 instances spread across three availability zones. It runs kv0 (write-only) against the cluster with 64-byte values. It then ramps up concurrency to compare throughput vs. average and tail latency.
NOTE: log scales on x and y axes
Async storage writes impacts latency by different amounts at different throughputs, ranging from an improvement of 20% to 40% when the system is "well utilized". However, it increases latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.
Other benchmark results
The way to interpret these results is that async raft storage writes reduce latency and, as a result of the closed-loop nature of the workload, also increase throughput for the YCSB variants that perform writes and aren't already CPU saturated. Variants that are read-only are unaffected. Variants that are CPU-saturated do not benefit from the change because they are already bottlenecked on CPU resources and cannot push any more load (see above).
Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.