kv: make disk reads asynchronous with respect to Raft state machine #105850

Open
nvanbenschoten opened this issue Jun 29, 2023 · 8 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team

Comments

@nvanbenschoten
Member

nvanbenschoten commented Jun 29, 2023

This issue is the "disk read" counterpart to #17500, which was addressed by etcd-io/raft#8 and #94165. To contextualize this issue, it may help to re-familiarize yourself with those, optionally with this presentation.

The raft state machine loop (handleRaftReady) is responsible for writing raft entries to the durable raft log, applying committed log entries to the state machine, and sending messages to peers. This event loop is the heart of the raft protocol and each raft write traverses it multiple times between proposal time and ack time. It is therefore important to keep the latency of this loop down, so that a slow iteration does not block writes in the pipeline and create cross-write interference.
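
For orientation, here is a minimal sketch of one such Ready-handling iteration on top of the etcd-io/raft RawNode API. The helper functions are hypothetical placeholders for CockroachDB's actual log storage, transport, and apply code, not its real implementation.

```go
package replicasketch

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// Placeholder hooks standing in for CockroachDB's log storage, transport,
// and state machine application code.
func appendToLog(hs raftpb.HardState, ents []raftpb.Entry) {}
func sendMessages(msgs []raftpb.Message)                   {}
func applyEntries(ents []raftpb.Entry)                     {}

// handleRaftReady sketches one iteration of the event loop described above,
// built on the etcd-io/raft RawNode API.
func handleRaftReady(rn *raft.RawNode) {
	if !rn.HasReady() {
		return
	}
	rd := rn.Ready()

	// Write new entries and the HardState to the durable raft log.
	// (#94165 made this step non-blocking so slow fsyncs don't stall the loop.)
	appendToLog(rd.HardState, rd.Entries)

	// Send outbound raft messages to peers.
	sendMessages(rd.Messages)

	// Apply committed entries to the state machine. Materializing
	// rd.CommittedEntries is where a synchronous disk read can occur on a
	// raft entry cache miss, which is the subject of this issue.
	applyEntries(rd.CommittedEntries)

	rn.Advance(rd)
}
```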

To that end, #94165 made raft log writes non-blocking in this loop, so that slow log writes (which must fsync) do not block other raft proposals.

Another case where the event loop may synchronously touch disk is when constructing the list of committed entries to apply. In the common case, this pulls from the raft entry cache, so it is fast. However, on raft entry cache misses, this reads from pebble. Reads from pebble can be slow (relative to a cache hit), which can slow down the event loop because they are performed inline. The effect of this can be seen directly on raft scheduling tail latency.
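
As a rough illustration of that read path (with purely hypothetical interface names, not CockroachDB's actual types): serve entries from the raft entry cache when possible, and fall back to a synchronous pebble read on a miss.

```go
package replicasketch

import "go.etcd.io/raft/v3/raftpb"

// entryCache and logReader are hypothetical stand-ins for the raft entry
// cache and the pebble-backed raft log.
type entryCache interface {
	// Get returns the cached entries in [lo, hi) and whether the whole span
	// was present in the cache.
	Get(lo, hi uint64) ([]raftpb.Entry, bool)
}

type logReader interface {
	// LoadEntries reads entries in [lo, hi) from the durable raft log.
	LoadEntries(lo, hi uint64) ([]raftpb.Entry, error)
}

// entries sketches the read path behind raft.Storage.Entries.
func entries(cache entryCache, log logReader, lo, hi uint64) ([]raftpb.Entry, error) {
	// Fast path: raft entry cache hit, no disk I/O.
	if ents, ok := cache.Get(lo, hi); ok {
		return ents, nil
	}
	// Slow path: a synchronous pebble read performed inline on the raft
	// scheduler goroutine; this is the latency this issue is about.
	return log.LoadEntries(lo, hi)
}
```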

Example graphs

Entry cache hit rates

| Node | Accesses | Hits   | Hit Rate |
|------|----------|--------|----------|
| n1   | 314468   | 308334 | 98.1%    |
| n2   | 276748   | 260645 | 94.2%    |
| n3   | 271915   | 255306 | 93.9%    |
| n4   | 325052   | 320766 | 98.7%    |
| n5   | 326403   | 321934 | 98.6%    |

Raft scheduler latencies

[Screenshots: raft scheduler latency graphs, including a node with a high raft entry cache hit rate (n4) and a node with a low hit rate (n3)]

An alternate design would be to make these disk reads async on raft entry cache misses. Instead of blocking on the log iteration, raft.Storage.Entries could support returning a new ErrEntriesTemporarilyUnavailable error which instructs etcd/raft to retry the read later. This would allow the event loop to continue processing. When the read completes, the event loop would be notified and the read would be retried from the cache (or some other data structure that has no risk of eviction before the read is retried).

This would drive down tail latency for raft writes in cases where the raft entry cache has a less than perfect hit rate.
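
One possible shape of this proposal, sketched with the same kind of hypothetical cache/log interfaces as above. The sentinel error name comes from the issue text; everything else (the pinning cache, the scheduler hook) is illustrative, not a real API.

```go
package replicasketch

import (
	"errors"

	"go.etcd.io/raft/v3/raftpb"
)

// ErrEntriesTemporarilyUnavailable is the sentinel error proposed above; it
// would tell etcd/raft to retry the read later rather than block on it.
var ErrEntriesTemporarilyUnavailable = errors.New("entries temporarily unavailable")

// pinningCache is a hypothetical entry cache whose PutPinned pins entries so
// they cannot be evicted before the retried read consumes them.
type pinningCache interface {
	Get(lo, hi uint64) ([]raftpb.Entry, bool)
	PutPinned(ents []raftpb.Entry)
}

type asyncLogReader interface {
	LoadEntries(lo, hi, maxBytes uint64) ([]raftpb.Entry, error)
}

type asyncStorage struct {
	cache    pinningCache
	log      asyncLogReader
	schedule func() // re-enqueue this replica on the raft scheduler
}

// Entries has the shape of the read side of raft.Storage.
func (s *asyncStorage) Entries(lo, hi, maxBytes uint64) ([]raftpb.Entry, error) {
	if ents, ok := s.cache.Get(lo, hi); ok {
		return ents, nil // fast path: no disk I/O on the raft goroutine
	}
	// Slow path, moved off the raft goroutine: read from pebble, pin the
	// result so it cannot be evicted, then poke the scheduler so the Ready
	// loop retries the read and finds it in the cache.
	go func() {
		if ents, err := s.log.LoadEntries(lo, hi, maxBytes); err == nil {
			s.cache.PutPinned(ents)
		}
		s.schedule()
	}()
	return nil, ErrEntriesTemporarilyUnavailable
}
```

The important property, per the issue text, is that the prefetched entries land somewhere with no risk of eviction before the retried read, hence the pinning cache in this sketch.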

Jira issue: CRDB-29234

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Jun 29, 2023
@blathers-crl

blathers-crl bot commented Jun 29, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

Wouldn't it be better to do application on a separate thread, as outlined in #94854? That model makes much more sense to me. Or is this intended as a stopgap in the meanwhile?

@tbg
Member

tbg commented Jun 30, 2023

Log application does disk writes (though unsynced, so likely buffered in memory), but to construct the list of entries we need to do synchronous (read) IO. So these are separate areas we can optimize. If the writes can be buffered in memory (i.e. overall write throughput is below what the device/LSM can sustain, so the apply loop rarely gets blocked in the first place), the reads are more important (especially if each batch is small, so the fixed overhead dominates). So this optimization makes sense to me.

It seems related to ideas of @pavelkalinnikov's about turning the raft storage model inside out by never reading from disk inside of raft (and, in the limit, never letting it even hold the contents of the log entries).

@erikgrinaker
Contributor

> Log application does disk writes (though unsynced, so likely buffered in memory), but to construct the list of entries we need to do synchronous (read) IO. So these are separate areas we can optimize.

If we had a separate apply thread, wouldn't that do both the log reads and state writes?

@pav-kv
Collaborator

pav-kv commented Jun 30, 2023

The "apply thread" can do either read+write, or only write. The extent of what it can/should do depends on raft API. At the moment the "apply thread" could only do writes because raft does the reads for us.

> It seems related to ideas of @pavelkalinnikov's about turning the raft storage model inside out by never reading from disk inside of raft (and, in the limit, never letting it even hold the contents of the log entries).

Yeah, this issue falls into etcd-io/raft#64.

> Instead of blocking on the log iteration, raft.Storage.Entries could support returning a new ErrEntriesTemporarilyUnavailable error which instructs etcd/raft to retry the read later. This would allow the event loop to continue processing. When the read completes, the event loop would be notified and the read would be retried from the cache (or some other data structure that has no risk of eviction before the read is retried).

This would do. I'm slightly in favour of the other approach though: moving responsibility out of raft instead of making its API more nuanced.

@tbg
Member

tbg commented Jun 30, 2023

edit: responding to Erik's comment, not Pavel's. Our wires crossed despite sitting next to each other. :)

No, the way it would work is that raft would still read the entries from disk first, then delegate those to the apply thread. In other words, the entry reads wouldn't move off the main goroutine.

@erikgrinaker
Contributor

I see. That's no good.

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@pav-kv
Collaborator

pav-kv commented Oct 22, 2024

With the introduction of the LogSnapshot API in #130967, we should be able to decouple log reads from the main raft loop.

The behaviour would be along the lines of:

  • When there are entries to apply, Ready indicates that, but does not prefetch the entries for us.
  • We grab a log snapshot (which should be cheap), and pass it to an asynchronous "apply" worker.
  • The RawNode continues making progress.
  • The entries are read from the log snapshot and applied asynchronously.

There are technicalities in how LogSnapshot is implemented. Today it holds raftMu, so it's not completely asynchronous and, e.g., blocks log writes. However, LogSnapshot can be relaxed to allow log appends, and to block the main loop only if the leader term changes and the log is being truncated below the point at which we took the log snapshot. Since the apply loop is only concerned with the committed entries, we should take the log snapshot at the committed index, and we have a guarantee that the log won't ever be truncated below that point; thus, the log snapshot will never block the main loop. A rough sketch of this flow follows below.

Using this approach has additional benefits: the apply loop can pace log reads and prevent OOMs.
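
To make the shape concrete, here is a very rough sketch of that flow with hypothetical types; the real LogSnapshot API from #130967 and the eventual worker wiring will differ. The main loop takes a (cheap) log snapshot and hands it, together with the (applied, committed] range, to an asynchronous apply worker that performs the entry reads and application off the main loop.

```go
package applysketch

import "go.etcd.io/raft/v3/raftpb"

// logSnapshot is a hypothetical stand-in for the LogSnapshot API: a stable
// view of the raft log that can be read outside the main raft loop.
type logSnapshot interface {
	// Entries reads log entries in [lo, hi).
	Entries(lo, hi uint64) ([]raftpb.Entry, error)
}

// applyTask asks the worker to apply entries in (appliedIndex, commitIndex].
type applyTask struct {
	snap                      logSnapshot
	appliedIndex, commitIndex uint64
}

// mainLoopStep is the part of Ready handling that changes: instead of
// reading and applying committed entries inline, it grabs a log snapshot
// and enqueues an apply task, then keeps making progress.
func mainLoopStep(snap logSnapshot, appliedIndex, commitIndex uint64, applyQ chan<- applyTask) {
	if commitIndex > appliedIndex {
		applyQ <- applyTask{snap: snap, appliedIndex: appliedIndex, commitIndex: commitIndex}
	}
}

// applyWorker runs asynchronously: the entry reads (which may hit disk) and
// the application happen here, off the main raft loop. Real code would page
// through the range to pace reads and bound memory, as noted above.
func applyWorker(applyQ <-chan applyTask, apply func([]raftpb.Entry)) {
	for t := range applyQ {
		ents, err := t.snap.Entries(t.appliedIndex+1, t.commitIndex+1)
		if err != nil {
			continue // sketch only: real code would handle errors
		}
		apply(ents)
	}
}
```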
