kv: integrate raft async storage writes #94165
Conversation
Force-pushed from 69c0ef7 to 9faec10
Force-pushed from 9faec10 to 7f77d1a
@tbg @sumeerbhola this is now ready for review! There's one outstanding issue around sideloaded SST cleanup, but we can address that concurrently with the rest of the review. @sumeerbhola this PR is going to require cockroachdb/pebble#2117. Were you planning to push that along in the coming weeks?
Do you have a / would you be able to run a flavor of this experiment that is I/O bound instead? One thing I hope can fall out of this change is more graceful degradation under I/O overload, since we won't be blocking the raft handling loop for extended periods of time.
Looks good! Need to better understand the leading raft entry cache issue and the sideloading truncation.
couple of drive-by nits
Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 3 of 10 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/logstore/logstore.go
line 130 at r3 (raw file):
// depending on the kv.raft_log.non_blocking_synchronization.enabled cluster
// setting. Either way, the effects of the log append will be immediately
// visible readers of the Engine.
nit: visible to readers ...
pkg/storage/engine.go
line 965 at r1 (raw file):
Commit(sync bool) error
// CommitNoSyncWait atomically applies any batched updates to the underlying
// engine and initiates a synchronous disk write, but does not wait for that
nit: initiates a disk write ...
(presumably it doesn't matter how that disk write is implemented -- async or sync)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/storage/engine.go
line 965 at r1 (raw file):
Previously, sumeerbhola wrote…
nit: initiates a disk write ...
(presumably it doesn't matter how that disk write is implemented -- async or sync)
Done.
Extracted from cockroachdb#94165. Picks up github.com/cockroachdb/pebble/pull/2117.

Release note: None

Epic: None
TFTRs!
Do you have a / would you be able to run a flavor of this experiment that is I/O bound instead?
Do you have any particular cases or test setups in mind? I'm happy to run more experiments so we can better understand the impact of this change.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/replica_raft.go
line 818 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
state.LastIndex can now contain non-durable entries? Seems sus
Good point. As of etcd-io/raft#8, we no longer allow application of unstable entries. Instead, we wait for them to stabilize locally before handing them in a Ready to be applied. I removed the handling of this case in a newly added commit.
pkg/kv/kvserver/replica_raft.go
line 949 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
what's the intended purpose of this message? Should we print it unconditionally on l.947?
The purpose is to add a trace event to the context attached to each committed entry, before applying. If we don't have any entries, then the call is a no-op. However, it does RLock the Replica either way, so we might as well make it conditional on there actually being entries to apply.
pkg/kv/kvserver/replica_raft.go
line 1459 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
These are the append thread (and maybe later apply) responding to async work, right? A little comment would be helpful.
Done.
pkg/kv/kvserver/replica_raft.go
line 1575 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
and?
Done.
pkg/kv/kvserver/logstore/logstore.go
line 130 at r3 (raw file):
Previously, sumeerbhola wrote…
nit: visible to readers ...
Done.
pkg/kv/kvserver/logstore/logstore.go
line 223 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe this dance between things we must sync and refusing to sync due to the disableSyncRaftLog setting could be a bit clearer if we renamed mustSync -> wantsSync and sync -> willSync. Just a thought, take it or leave it.
Done.
pkg/kv/kvserver/logstore/logstore.go
line 256 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
shouldn't this be mustSync? We are only recording to RaftLogCommitLatency if mustSync, and it makes sense for those two to be aligned. I think even when the unsafe setting is active, we should be recording to this histogram and "pretend" we've synced whenever it was requested. I.e. have the code be all the same, except we don't actually sync.
Done. I had come to a similar conclusion but decided not to change the behavior. But happy to switch.
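To pin down the outcome of those two threads, here is a minimal sketch of the decision, with hypothetical names (wantsSync/willSync follow the review suggestion; the setting and metric wiring are stand-ins, not the actual logstore code):

```go
package syncpolicysketch

// syncDecision captures the rename discussed above: wantsSync is what raft
// asked for, willSync is what we will actually do given the (unsafe)
// setting that disables raft log syncs. Illustrative names only.
type syncDecision struct {
	wantsSync bool // raft requires durability before acking this append
	willSync  bool // we will actually issue a sync
}

func decideSync(mustSync, disableSyncRaftLog bool) syncDecision {
	return syncDecision{
		wantsSync: mustSync,
		willSync:  mustSync && !disableSyncRaftLog,
	}
}

// recordCommitLatency illustrates the RaftLogCommitLatency point: the
// histogram is recorded whenever a sync was requested (wantsSync), even if
// the unsafe setting means we only "pretend" to have synced.
func recordCommitLatency(d syncDecision, observe func()) {
	if d.wantsSync {
		observe()
	}
}
```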
pkg/kv/kvserver/logstore/logstore.go
line 262 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe we should detect early in this method whether we're replacing a log tail, and if so disable the pipelining. It's not a common code path so not worth optimizing for.
Thanks, this was a good suggestion.
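A minimal sketch of that early check, with hypothetical names (useNonBlockingSync, entry, and the parameters are illustrative; the real logic lives in logstore.go):

```go
package appendsketch

// entry is a stand-in for a raft log entry.
type entry struct{ Index uint64 }

// useNonBlockingSync reports whether an append may take the non-blocking
// (pipelined) sync path. If the first appended entry is at or below the
// current last index, the append overwrites an existing log tail; that case
// is rare, so it falls back to the blocking path rather than complicating
// the pipelining.
func useNonBlockingSync(enabled bool, ents []entry, lastIndex uint64) bool {
	if !enabled || len(ents) == 0 {
		return false
	}
	overwritesLogTail := ents[0].Index <= lastIndex
	return !overwritesLogTail
}
```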
pkg/kv/kvserver/logstore/logstore.go
line 285 at r3 (raw file):
I added some words here, on replicaRaftStorage.Entries, and on r.mu.lastIndex. It is ok for the raft entry cache to lead the durable raft log, as long as it does not lead the pebble engine.
etcd/raft maintains its own tracking of entry durability by splitting its log into an unstable portion for entries that are not known to be durable and a stable portion for entries that are known to be durable. Its boundary of "stability" is defined by raftLog.unstable.offset, and it advances this boundary when entries become durable. It then overlays its unstable log over the raft.Storage interface, only ever querying raft.Storage below the unstable offset. See raftLog.firstIndex and raftLog.slice. So returning unstable entries from raft.Storage is allowed.
This is more than just happenstance. It's important that etcd/raft makes decisions by looking at the unstable log overlaid over the raft.Storage. That's because any entry beyond unstable.offset could be uncommitted, and therefore could be replaced by an in-progress append. Before async storage writes this was less common, but the etcd/raft library still allowed Stepping the raft state machine between a Ready() and an Advance() (though crdb didn't use this flexibility). For etcd/raft to properly synchronize log appends with other state transitions (e.g. campaigning), it needed to make sure that the other state transitions always saw the latest version of the local log, unstable or not.
The raft entry cache backs the raft.Storage implementation and raft assumes everything in it is durable.
Raft shouldn't be assuming that everything in it is durable. Even before this change, that wasn't the case, because replicaRaftStorage.Entries reads with StandardDurability instead of GuaranteedDurability. So it will read out of LSM memtables even before the corresponding LSM writes are stable. But again, this should be ok.
Couldn't this lead to situations in which a replica campaigns using a log that isn't actually durable yet, and may distribute log entries not actually stable yet believing them to be stable?
It can lead to this situation, but is this a problem? If the candidate is elected and then crashes before the suffix of its log (that it campaigned on) is stable, then it cannot have voted for itself in the election or have voted for any new log entries after getting raft leadership. So if it was able to get any log entries committed to the replicated log, a quorum of other replicas must have durably appended them to their log.
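To make the overlay concrete, here is a simplified sketch of the raftLog.slice shape described above. This is not the actual etcd/raft code (which also handles size limits, bounds checks, and error cases); Entry, Storage, and the helper names are stand-ins:

```go
package raftoverlaysketch

// Entry and Storage are stand-ins for raftpb.Entry and raft.Storage.
type Entry struct {
	Index uint64
	Data  []byte
}

type Storage interface {
	// Entries returns stable entries in [lo, hi).
	Entries(lo, hi uint64) ([]Entry, error)
}

// unstableLog holds entries that are not yet known to be durable.
// offset is the index of the first unstable entry.
type unstableLog struct {
	offset  uint64
	entries []Entry
}

func (u *unstableLog) slice(lo, hi uint64) []Entry {
	return u.entries[lo-u.offset : hi-u.offset]
}

type raftLog struct {
	storage  Storage // backed by the entry cache and Pebble in CockroachDB
	unstable unstableLog
}

// slice returns entries in [lo, hi). Storage is only ever consulted below
// the unstable offset; everything at or above it is served from the
// in-memory unstable log. This is why the entry cache (and memtable) may
// lead the durable raft log without confusing raft.
func (l *raftLog) slice(lo, hi uint64) ([]Entry, error) {
	var ents []Entry
	if lo < l.unstable.offset {
		stable, err := l.storage.Entries(lo, minIndex(hi, l.unstable.offset))
		if err != nil {
			return nil, err
		}
		ents = stable
	}
	if hi > l.unstable.offset {
		ents = append(ents, l.unstable.slice(maxIndex(lo, l.unstable.offset), hi)...)
	}
	return ents, nil
}

func minIndex(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

func maxIndex(a, b uint64) uint64 {
	if a > b {
		return a
	}
	return b
}
```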
pkg/kv/kvserver/logstore/sync_waiter.go
line 34 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mention the invariant that callbacks are notified in completion order, i.e. if a callback completes it is true that all previous operations have also succeeded.
Done, though I did not mention that the one callback implies the completion of all previous operations because users of this loop should not need to think about the relationship between different syncs. I think it's also implied by the invariant you suggested below.
pkg/kv/kvserver/logstore/sync_waiter.go
line 35 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Invariant: callbacks are invoked in order and without concurrency.
Done.
pkg/kv/kvserver/logstore/sync_waiter.go
line 78 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm, to avoid fsyncgate we'd have to hold on to a local var err error and make sure that if one caller gets an error, they all get an error (i.e. basically poisoning the loop). So maybe fataling is fine; if the disk is broken the disk is broken.
Yeah, I think Fatal-ing is the safest bet to handle errors here. In addition to what you said, errors returned from SyncWait should only be due to invariant violations and never due to unrecoverable Pebble states. See the discussion in https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1674678349690899 and the TODOs added in #95862.
pkg/kv/kvserver/logstore/sync_waiter.go
line 90 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// It may never be called in case the stopper stops.
Done.
pkg/kv/kvserver/logstore/sync_waiter.go
line 98 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
do you want to pool these?
I don't think there's a need. They shouldn't allocate because the channel has a chan syncBatch type. Notice that the contained type is concrete, so there's no indirection or boxing.
To demonstrate, I added BenchmarkSyncWaiterLoop, which allocates 0 times per iteration:
name time/op
SyncWaiterLoop-10 375ns ± 4%
name alloc/op
SyncWaiterLoop-10 0.00B
name allocs/op
SyncWaiterLoop-10 0.00
pkg/kv/kvserver/logstore/sync_waiter.go
line 103 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This needs a log.EveryN. Also, is this really a warning? Actual question, not a rhetorical one.
Done. It is a warning because it's unexpected. We size the enqueue channel with enough capacity to hold as many in-progress sync operations as Pebble allows (pebble/record.SyncConcurrency). I left a comment about this.
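Pulling the sync_waiter.go threads together, here is a minimal, self-contained sketch of the pattern under discussion. Names, the queue sizing, and the warning are illustrative; the real code uses the stopper, log.EveryN, and pebble/record.SyncConcurrency:

```go
package syncwaitersketch

import "log"

// syncer is a stand-in for the storage batch handle whose SyncWait blocks
// until a previously initiated fsync completes.
type syncer interface {
	SyncWait() error
	Close()
}

// syncBatch pairs a pending sync with the callback to run once it is
// durable. The channel element is a concrete struct, not an interface, so
// enqueueing does not box or allocate (cf. BenchmarkSyncWaiterLoop above).
type syncBatch struct {
	wg syncer
	cb func()
}

type syncWaiterLoop struct {
	q chan syncBatch
}

// newSyncWaiterLoop sizes the queue to the maximum number of in-flight
// syncs the engine allows (pebble/record.SyncConcurrency in the real code),
// so enqueueing should never block in practice.
func newSyncWaiterLoop(capacity int) *syncWaiterLoop {
	return &syncWaiterLoop{q: make(chan syncBatch, capacity)}
}

// enqueue hands a pending sync to the waiter goroutine. Blocking here is
// unexpected, which is why it is surfaced as a warning (rate-limited via
// log.EveryN in the real code).
func (w *syncWaiterLoop) enqueue(wg syncer, cb func()) {
	select {
	case w.q <- syncBatch{wg: wg, cb: cb}:
	default:
		log.Printf("warning: sync waiter queue is full; blocking on enqueue")
		w.q <- syncBatch{wg: wg, cb: cb}
	}
}

// run processes syncs in FIFO order, so callbacks are invoked in completion
// order and without concurrency. Errors from SyncWait are treated as fatal,
// matching the "if the disk is broken the disk is broken" stance above.
func (w *syncWaiterLoop) run() {
	for b := range w.q {
		if err := b.wg.SyncWait(); err != nil {
			log.Fatalf("error waiting for log sync: %v", err)
		}
		b.cb()
		b.wg.Close()
	}
}
```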
pkg/kv/kvserver/logstore/sync_waiter_test.go
line 48 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Wouldn't it not block anyway? You have a q channel that presumably has some capacity, so maybe this should be done 2*cap(chan) times?
Done.
Force-pushed from 6093246 to 1c2135f
I added two new commits that introduce randomness into message delivery order in tests to ensure that re-ordering of local raft message responses is tolerated and handled properly.
New commits LGTM, I still need to give parts of the main commit a second read since it's been a while.
Did you make any changes in it other than the rebase?
Reviewed 2 of 10 files at r3, 17 of 17 files at r4, 4 of 4 files at r5, 2 of 2 files at r6, 1 of 1 files at r7, all commit messages.
Dismissed @nvanbenschoten from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica.go
line 405 at r4 (raw file):
// will have to be retrieved from the Raft log entry. Use the
// invalidLastTerm constant for this case.
lastIndex, lastTerm uint64
This is a semantic change, right? This was previously durable. You've convinced me (in another comment thread) that it's fine that this is getting picked up by replicaRaftStorage. And when we read pebble, we'll always see the lastIndex entry with corresponding lastTerm as well, so I hope any other uses of this field would also be ok. I sleuthed a bit and there's
- raft log queue (should be ok since it lowers to committed index anyway)
- quiescing (ok)
It might still be worth renaming these to inflightLastIndex and inflightLastTerm (not in love with inflight, maybe you can think of something better) because I think there is the risk of someone using them like they were durable. I cede that this is likely an imaginary fear, but until we encapsulate these variables away from the "general public" I'd like to nurse that fear a little bit.
pkg/kv/kvserver/replica_raft.go
line 1617 at r7 (raw file):
}
// NB: we can reset messages in the localMsgs.recycled slice without holding
// the localMsgs mutex because no-one ever writes to localMsgs.recycled abd
if you catch another CI cycle anyway, abd
Fixes cockroachdb#17500.
Waiting on github.com/cockroachdb/pebble/pull/2117.

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in github.com/etcd-io/raft/pull/8.

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in cockroachdb#17500. Those changes will also be impactful, but they're non-trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items (TODO(nvanbenschoten): open new issues).

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in github.com/cockroachdb/pebble/pull/2117. The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

github.com/etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate increased opportunity for batching.

TODO: add images

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.
…previous LastIndex

This commit removes logic in `apply.Task.AckCommittedEntriesBeforeApplication` which was ensuring that we don't acknowledge committed Raft log entries before they were durable in a node's local Raft log. This is now ensured inside of the etcd/raft library when AsyncStorageWrites is enabled, as the comment added here describes.
This commit adds a testing facility to randomly disable non-blocking log syncs in tests. This ensures that we properly handle the case where non-blocking log sync is disabled while some log syncs are still in-progress.
This commit adds a testing facility to shuffle local raft messages before delivering them to the raft state machine. These are not required to be in order, so we shuffle them in tests to ensure that re-ordering is handled properly.
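For illustration, a minimal sketch of such a shuffle hook, assuming hypothetical names (message stands in for raftpb.Message, and the knob wiring differs in the real test infrastructure):

```go
package kvserversketch

import "math/rand"

// message is a stand-in for raftpb.Message.
type message struct {
	To   uint64
	Type uint64
}

// maybeShuffleLocalMsgs randomizes the delivery order of local raft
// responses when the testing knob is enabled. Local responses carry no
// ordering requirement, so shuffling exercises re-ordering tolerance.
func maybeShuffleLocalMsgs(rng *rand.Rand, shuffleEnabled bool, msgs []message) {
	if !shuffleEnabled {
		return
	}
	rng.Shuffle(len(msgs), func(i, j int) {
		msgs[i], msgs[j] = msgs[j], msgs[i]
	})
}
```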
Simple rename to make the semantics of these fields more clear and harder to misinterpret.
Force-pushed from 1c2135f to ebc33fa
Did you make any changes in it other than the rebase?
Just a rebase and addressing your comments. Nothing substantial beyond your suggested change to skip non-blocking log writes when a MsgStorageAppend is replacing existing log entries (in logstore.go).
I'm running this PR through the roachtest suite here to get a bit more confidence in it before we'll feel good about merging.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/replica.go
line 405 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is a semantic change, right? This was previously durable. You've convinced me (in another comment thread) that it's fine that this is getting picked up by replicaRaftStorage. And when we read pebble, we'll always see the lastIndex entry with corresponding lastTerm as well, so I hope any other uses of this field would also be ok. I sleuthed a bit and there's
- raft log queue (should be ok since it lowers to committed index anyway)
- quiescing (ok)
It might still be worth renaming these to inflightLastIndex and inflightLastTerm (not in love with inflight, maybe you can think of something better) because I think there is the risk of someone using them like they were durable. I cede that this is likely an imaginary fear, but until we encapsulate these variables away from the "general public" I'd like to nurse that fear a little bit.
That's all fair. It would be an issue if a new user of these fields misinterpreted their semantics and assumed that they imply durability. I've updated the names to lastIndexNotDurable and lastTermNotDurable.
pkg/kv/kvserver/replica_raft.go
line 1617 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
if you catch another CI cycle anyway, abd
Done.
We saw 14 individual roachtest failures, none of which look related and all of which have corresponding GitHub tracking issues on master (4 were fixed yesterday). CI and roachtests are all looking good for this change.
TFTR! bors r+
Essentially I'm hoping that by moving the log writes out of the critical path of the raft ready loop, we'll somehow do "better" under I/O overload - lower handle raft ready latencies, less build-up in the raft receive queues, and so on. An apples-to-apples comparison could be helpful here - run a workload where we eventually see LSM inversion under both sync and async log appends, and simply investigate the differences. It just so happens that @andrewbaptist is looking at such an experiment.
Build failed:
The test failure looks real, but I'm also able to hit it on bors r+
Build succeeded:
Fixes #17500.
Epic: CRDB-22644
This commit integrates with the AsyncStorageWrites functionality that we added to Raft in etcd-io/raft/pull/8.

Approach
The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in #17500. Those changes will also be impactful, but they're non-trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items. See #94853 and #94854.
With this change, asynchronous Raft log syncs are enabled by the new DB.ApplyNoSyncWait Pebble API introduced in cockroachdb/pebble/pull/2117. The handleRaftReady state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding MsgStorageAppend's response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.
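To sketch the hand-off described here (illustrative only: Batch stands in for the storage engine batch interface this PR extends, and enqueue stands in for the sync-waiter loop):

```go
package raftreadysketch

// Batch is a stand-in for the storage engine batch: CommitNoSyncWait
// applies the batch and initiates a disk write without waiting for it,
// and SyncWait blocks until that write is durable.
type Batch interface {
	CommitNoSyncWait() error
	SyncWait() error
	Close()
}

// appendAndAck is called from the raft ready loop for a MsgStorageAppend.
// It initiates the log write, then offloads the wait for durability so the
// loop can move on; onDurable delivers the MsgStorageAppend's response
// messages (locally and/or to the raft leader) once the fsync completes.
func appendAndAck(b Batch, enqueue func(Batch, func()), onDurable func()) error {
	if err := b.CommitNoSyncWait(); err != nil {
		return err
	}
	// The sync-waiter goroutine will call b.SyncWait(), then onDurable(),
	// then b.Close(), in completion order.
	enqueue(b, onDurable)
	return nil
}
```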
Benchmark Results
The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.
etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.
When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate an increased opportunity for batching.
To visualize the impact on latency, consider the following test. The experiment uses a 3-node GCP cluster with n2-standard-32 instances spread across three availability zones. It runs kv0 (write-only) against the cluster with 64-byte values. It then ramps up concurrency to compare throughput vs. average and tail latency.
NOTE: log scales on x and y axes
Async storage writes impacts latency by different amounts at different throughputs, ranging from an improvement of 20% to 40% when the system is "well utilized". However, it increases latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.
Other benchmark results
The way to interpret these results is that async raft storage writes reduce latency and, as a result of the closed-loop nature of the workload, also increase throughput for the YCSB variants that perform writes and aren't already CPU saturated. Variants that are read-only are unaffected. Variants that are CPU-saturated do not benefit from the change because they are already bottlenecked on CPU resources and cannot push any more load (see above).
Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.