kv: make disk reads asynchronous with respect to Raft state machine #105850

Open
nvanbenschoten opened this issue Jun 29, 2023 · 8 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team

Comments

@nvanbenschoten
Member

nvanbenschoten commented Jun 29, 2023

This issue is the "disk read" counterpart to #17500, which was addressed by etcd-io/raft#8 and #94165. To contextualize this issue, it may help to re-familiarize yourself with those, optionally with this presentation.

The raft state machine loop (handleRaftReady) is responsible for writing raft entries to the durable raft log, applying committed log entries to the state machine, and sending messages to peers. This event loop is the heart of the raft protocol and each raft write traverses it multiple times between proposal time and ack time. It is therefore important to keep the latency of this loop down, so that a slow iteration does not block writes in the pipeline and create cross-write interference.
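
For orientation, here is a minimal sketch of one such Ready-handling iteration on top of the etcd-io/raft RawNode API. The helper functions are hypothetical placeholders for CockroachDB's actual log storage, transport, and apply code, not its real implementation.

```go
package replicasketch

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// Placeholder hooks standing in for CockroachDB's log storage, transport,
// and state machine application code.
func appendToLog(hs raftpb.HardState, ents []raftpb.Entry) {}
func sendMessages(msgs []raftpb.Message)                   {}
func applyEntries(ents []raftpb.Entry)                     {}

// handleRaftReady sketches one iteration of the event loop described above,
// built on the etcd-io/raft RawNode API.
func handleRaftReady(rn *raft.RawNode) {
	if !rn.HasReady() {
		return
	}
	rd := rn.Ready()

	// Write new entries and the HardState to the durable raft log.
	// (#94165 made this step non-blocking so slow fsyncs don't stall the loop.)
	appendToLog(rd.HardState, rd.Entries)

	// Send outbound raft messages to peers.
	sendMessages(rd.Messages)

	// Apply committed entries to the state machine. Materializing
	// rd.CommittedEntries is where a synchronous disk read can occur on a
	// raft entry cache miss, which is the subject of this issue.
	applyEntries(rd.CommittedEntries)

	rn.Advance(rd)
}
```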

To that end, #94165 made raft log writes non-blocking in this loop, so that slow log writes (which must fsync) do not block other raft proposals.

Another case where the event loop may synchronously touch disk is when constructing the list of committed entries to apply. In the common case, this pulls from the raft entry cache, so it is fast. However, on raft entry cache misses, this reads from pebble. Reads from pebble can be slow (relative to a cache hit), which can slow down the event loop because they are performed inline. The effect of this can be seen directly on raft scheduling tail latency.
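
As a rough illustration of that read path (with purely hypothetical interface names, not CockroachDB's actual types): serve entries from the raft entry cache when possible, and fall back to a synchronous pebble read on a miss.

```go
package replicasketch

import "go.etcd.io/raft/v3/raftpb"

// entryCache and logReader are hypothetical stand-ins for the raft entry
// cache and the pebble-backed raft log.
type entryCache interface {
	// Get returns the cached entries in [lo, hi) and whether the whole span
	// was present in the cache.
	Get(lo, hi uint64) ([]raftpb.Entry, bool)
}

type logReader interface {
	// LoadEntries reads entries in [lo, hi) from the durable raft log.
	LoadEntries(lo, hi uint64) ([]raftpb.Entry, error)
}

// entries sketches the read path behind raft.Storage.Entries.
func entries(cache entryCache, log logReader, lo, hi uint64) ([]raftpb.Entry, error) {
	// Fast path: raft entry cache hit, no disk I/O.
	if ents, ok := cache.Get(lo, hi); ok {
		return ents, nil
	}
	// Slow path: a synchronous pebble read performed inline on the raft
	// scheduler goroutine; this is the latency this issue is about.
	return log.LoadEntries(lo, hi)
}
```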

Example graphs

Entry cache hit rates

| Node | Accesses | Hits   | Hit Rate |
|------|----------|--------|----------|
| n1   | 314468   | 308334 | 98.1%    |
| n2   | 276748   | 260645 | 94.2%    |
| n3   | 271915   | 255306 | 93.9%    |
| n4   | 325052   | 320766 | 98.7%    |
| n5   | 326403   | 321934 | 98.6%    |

Raft scheduler latencies

[Screenshots: raft scheduler latency graphs, including a node with a high raft entry cache hit rate (n4) and a node with a low hit rate (n3)]

An alternate design would be to make these disk reads async on raft entry cache misses. Instead of blocking on the log iteration, raft.Storage.Entries could support returning a new ErrEntriesTemporarilyUnavailable error which instructs etcd/raft to retry the read later. This would allow the event loop to continue processing. When the read completes, the event loop would be notified and the read would be retried from the cache (or some other data structure that has no risk of eviction before the read is retried).

This would drive down tail latency for raft writes in cases where the raft entry cache has a less than perfect hit rate.
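
One possible shape of this proposal, sketched with the same kind of hypothetical cache/log interfaces as above. The sentinel error name comes from the issue text; everything else (the pinning cache, the scheduler hook) is illustrative, not a real API.

```go
package replicasketch

import (
	"errors"

	"go.etcd.io/raft/v3/raftpb"
)

// ErrEntriesTemporarilyUnavailable is the sentinel error proposed above; it
// would tell etcd/raft to retry the read later rather than block on it.
var ErrEntriesTemporarilyUnavailable = errors.New("entries temporarily unavailable")

// pinningCache is a hypothetical entry cache whose PutPinned pins entries so
// they cannot be evicted before the retried read consumes them.
type pinningCache interface {
	Get(lo, hi uint64) ([]raftpb.Entry, bool)
	PutPinned(ents []raftpb.Entry)
}

type asyncLogReader interface {
	LoadEntries(lo, hi, maxBytes uint64) ([]raftpb.Entry, error)
}

type asyncStorage struct {
	cache    pinningCache
	log      asyncLogReader
	schedule func() // re-enqueue this replica on the raft scheduler
}

// Entries has the shape of the read side of raft.Storage.
func (s *asyncStorage) Entries(lo, hi, maxBytes uint64) ([]raftpb.Entry, error) {
	if ents, ok := s.cache.Get(lo, hi); ok {
		return ents, nil // fast path: no disk I/O on the raft goroutine
	}
	// Slow path, moved off the raft goroutine: read from pebble, pin the
	// result so it cannot be evicted, then poke the scheduler so the Ready
	// loop retries the read and finds it in the cache.
	go func() {
		if ents, err := s.log.LoadEntries(lo, hi, maxBytes); err == nil {
			s.cache.PutPinned(ents)
		}
		s.schedule()
	}()
	return nil, ErrEntriesTemporarilyUnavailable
}
```

The important property, per the issue text, is that the prefetched entries land somewhere with no risk of eviction before the retried read, hence the pinning cache in this sketch.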

Jira issue: CRDB-29234

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Jun 29, 2023
@blathers-crl

blathers-crl bot commented Jun 29, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

Wouldn't it be better to do application on a separate thread, as outlined in #94854? That model makes much more sense to me. Or is this intended as a stopgap in the meanwhile?

@tbg
Member

tbg commented Jun 30, 2023

Log application does disk writes (though unsynced, so likely buffered in memory), but to construct the list of entries we need to do synchronous (read) IO. So these are separate areas we can optimize. If the writes can be buffered in memory (i.e. overall write throughput is below what the device/LSM can sustain, so the apply loop rarely gets blocked in the first place), the reads are more important (especially if each batch is small, so the fixed overhead dominates). So this optimization makes sense to me.

It seems related to ideas of @pavelkalinnikov's about turning the raft storage model inside out by never reading from disk inside of raft (and, in the limit, never letting it even hold the contents of the log entries).

@erikgrinaker
Contributor

> Log application does disk writes (though unsynced, so likely buffered in memory), but to construct the list of entries we need to do synchronous (read) IO. So these are separate areas we can optimize.

If we had a separate apply thread, wouldn't that do both the log reads and state writes?

@pav-kv
Collaborator

pav-kv commented Jun 30, 2023

The "apply thread" can do either read+write, or only write. The extent of what it can/should do depends on raft API. At the moment the "apply thread" could only do writes because raft does the reads for us.

> It seems related to ideas of @pavelkalinnikov's about turning the raft storage model inside out by never reading from disk inside of raft (and, in the limit, never letting it even hold the contents of the log entries).

Yeah, this issue falls into etcd-io/raft#64.

> Instead of blocking on the log iteration, raft.Storage.Entries could support returning a new ErrEntriesTemporarilyUnavailable error which instructs etcd/raft to retry the read later. This would allow the event loop to continue processing. When the read completes, the event loop would be notified and the read would be retried from the cache (or some other data structure that has no risk of eviction before the read is retried).

This would do. I'm slightly in favour of the other approach though: moving responsibility out of raft instead of making its API more nuanced.

@tbg
Member

tbg commented Jun 30, 2023

edit: responding to Erik's comment, not Pavel's. Our wires crossed despite sitting next to each other. :)

No, the way it would work is that raft would still read the entries from disk first, then delegate those to the apply thread. In other words, the entry reads wouldn't move off the main goroutine.

@erikgrinaker
Contributor

I see. That's no good.

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@pav-kv
Collaborator

pav-kv commented Oct 22, 2024

With the introduction of the LogSnapshot API in #130967, we should be able to decouple log reads from the main raft loop.

The behaviour would be along the lines of:

  • When there are entries to apply, Ready indicates that, but does not prefetch the entries for us.
  • We grab a log snapshot (which should be cheap), and pass it to an asynchronous "apply" worker.
  • The RawNode continues making progress.
  • The entries are read from the log snapshot and applied asynchronously.

There are technicalities in how LogSnapshot is implemented. Today it holds raftMu, so it's not completely asynchronous and, e.g., blocks log writes. However, LogSnapshot can be relaxed to allow log appends, and to block the main loop only if the leader term changes and the log is being truncated below the point at which we took the log snapshot. Since the apply loop is only concerned with the committed entries, we should take the log snapshot at the committed index, and we have a guarantee that the log won't ever be truncated below that point; thus, the log snapshot will never block the main loop. A rough sketch of this flow follows below.

Using this approach has additional benefits: the apply loop can pace log reads and prevent OOMs.
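
To make the shape concrete, here is a very rough sketch of that flow with hypothetical types; the real LogSnapshot API from #130967 and the eventual worker wiring will differ. The main loop takes a (cheap) log snapshot and hands it, together with the (applied, committed] range, to an asynchronous apply worker that performs the entry reads and application off the main loop.

```go
package applysketch

import "go.etcd.io/raft/v3/raftpb"

// logSnapshot is a hypothetical stand-in for the LogSnapshot API: a stable
// view of the raft log that can be read outside the main raft loop.
type logSnapshot interface {
	// Entries reads log entries in [lo, hi).
	Entries(lo, hi uint64) ([]raftpb.Entry, error)
}

// applyTask asks the worker to apply entries in (appliedIndex, commitIndex].
type applyTask struct {
	snap                      logSnapshot
	appliedIndex, commitIndex uint64
}

// mainLoopStep is the part of Ready handling that changes: instead of
// reading and applying committed entries inline, it grabs a log snapshot
// and enqueues an apply task, then keeps making progress.
func mainLoopStep(snap logSnapshot, appliedIndex, commitIndex uint64, applyQ chan<- applyTask) {
	if commitIndex > appliedIndex {
		applyQ <- applyTask{snap: snap, appliedIndex: appliedIndex, commitIndex: commitIndex}
	}
}

// applyWorker runs asynchronously: the entry reads (which may hit disk) and
// the application happen here, off the main raft loop. Real code would page
// through the range to pace reads and bound memory, as noted above.
func applyWorker(applyQ <-chan applyTask, apply func([]raftpb.Entry)) {
	for t := range applyQ {
		ents, err := t.snap.Entries(t.appliedIndex+1, t.commitIndex+1)
		if err != nil {
			continue // sketch only: real code would handle errors
		}
		apply(ents)
	}
}
```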
