kv: make disk reads asynchronous with respect to Raft state machine #105850
Comments
cc @cockroachdb/replication
Wouldn't it be better to do application on a separate thread, as outlined in #94854? That model makes much more sense to me. Or is this intended as a stopgap in the meantime?
Log application does disk writes (though unsynced, so likely buffered in memory), but to construct the list of entries we need to do synchronous (read) IO. So these are separate areas we can optimize. If the writes can be buffered in memory (i.e. overall write throughput is below what the device/LSM can sustain, so the apply loop rarely gets blocked in the first place), the reads are more important (especially if each batch is small, so the fixed overhead dominates). So this optimization makes sense to me. It seems related to @pavelkalinnikov's ideas about turning the raft storage model inside out by never reading from disk inside of raft (and, in the limit, never letting it even hold the contents of the log entries).
If we had a separate apply thread, wouldn't that do both the log reads and state writes?
The "apply thread" can do either read+write, or only write. The extent of what it can/should do depends on raft API. At the moment the "apply thread" could only do writes because raft does the reads for us.
Yeah, this issue falls into etcd-io/raft#64.
This would do. I'm slightly in favour of the other approach though: moving responsibility out of raft instead of making its API more nuanced.
edit: responding to Erik's comment, not Pavel's. Our wires crossed despite sitting next to each other. :) No, the way it would work is that raft would still read the entries from disk first, then delegate those to the apply thread. In other words, the entry reads wouldn't move off the main goroutine. |
I see. That's no good. |
With the introduction of [...], the behaviour would be along the lines of: [...]
There are technicalities on how [...]. Using this approach has additional benefits: the apply loop can pace log reads and prevent OOMs.
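To illustrate that last point, here is a minimal, hypothetical sketch (none of these names exist in the codebase) of an apply loop that owns its own log reads and fetches committed entries in bounded batches, rather than materializing them all at once:

```go
package apply // hypothetical; all names below are illustrative

import "go.etcd.io/raft/v3/raftpb"

// entryReader is a stand-in for whatever component serves log reads
// (an entry cache backed by pebble).
type entryReader interface {
	// Entries returns entries in [lo, hi), limited to maxBytes total size.
	Entries(lo, hi, maxBytes uint64) ([]raftpb.Entry, error)
}

const maxApplyBatchBytes = 4 << 20 // hypothetical per-batch memory budget

// applyCommitted paces its own reads: it pulls committed entries in batches
// of at most maxApplyBatchBytes, so a large backlog of committed-but-unapplied
// entries never has to be held in memory all at once.
func applyCommitted(r entryReader, appliedIndex, commitIndex uint64, apply func([]raftpb.Entry) error) error {
	for next := appliedIndex + 1; next <= commitIndex; {
		ents, err := r.Entries(next, commitIndex+1, maxApplyBatchBytes)
		if err != nil {
			return err
		}
		if len(ents) == 0 {
			return nil // nothing more to apply
		}
		if err := apply(ents); err != nil {
			return err
		}
		next = ents[len(ents)-1].Index + 1
	}
	return nil
}
```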
This issue is the "disk read" counterpart to #17500, which was addressed by etcd-io/raft#8 and #94165. To contextualize this issue, it may be helpful to get re-familiarized with those, optionally with this presentation.
The raft state machine loop (`handleRaftReady`) is responsible for writing raft entries to the durable raft log, applying committed log entries to the state machine, and sending messages to peers. This event loop is the heart of the raft protocol, and each raft write traverses it multiple times between proposal time and ack time. It is therefore important to keep the latency of this loop down, so that a slow iteration does not block writes in the pipeline and create cross-write interference. To that end, #94165 made raft log writes non-blocking in this loop, so that slow log writes (which must fsync) do not block other raft proposals.
Another case where the event loop may synchronously touch disk is when constructing the list of committed entries to apply. In the common case, this pulls from the raft entry cache, so it is fast. However, on raft entry cache misses, this reads from pebble. Reads from pebble can be slow (relative to a cache hit), which can slow down the event loop because they are performed inline. The effect of this can be seen directly on raft scheduling tail latency.
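For illustration, a rough sketch of the shape of this read path follows. The types and helpers (`logStorage`, `entryCache`, `readEntriesFromPebble`) are hypothetical stand-ins, not the actual CockroachDB code:

```go
package logstore // hypothetical package; names are illustrative only

import "go.etcd.io/raft/v3/raftpb"

// entryCache is a stand-in for the raft entry cache.
type entryCache interface {
	Get(lo, hi, maxSize uint64) ([]raftpb.Entry, bool)
	Add(ents []raftpb.Entry)
}

type logStorage struct {
	cache entryCache
}

// readEntriesFromPebble is a placeholder for the pebble iterator read.
func (s *logStorage) readEntriesFromPebble(lo, hi, maxSize uint64) ([]raftpb.Entry, error) {
	return nil, nil
}

// Entries sketches the synchronous read path described above: an entry
// cache hit is served from memory, while a miss falls through to a pebble
// read performed inline on the goroutine running handleRaftReady, blocking
// the raft state machine loop for the duration of the disk read.
func (s *logStorage) Entries(lo, hi, maxSize uint64) ([]raftpb.Entry, error) {
	if ents, ok := s.cache.Get(lo, hi, maxSize); ok {
		return ents, nil // fast path: raft entry cache hit
	}
	ents, err := s.readEntriesFromPebble(lo, hi, maxSize) // slow path: disk read
	if err != nil {
		return nil, err
	}
	s.cache.Add(ents)
	return ents, nil
}
```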
Example graphs (omitted): raft entry cache hit rates and raft scheduler latencies, comparing a node with a high raft entry cache hit rate (n4) against one with a low hit rate (n3).
An alternate design would be to make these disk reads async on raft entry cache misses. Instead of blocking on the log iteration, `raft.Storage.Entries` could support returning a new `ErrEntriesTemporarilyUnavailable` error which instructs etcd/raft to retry the read later. This would allow the event loop to continue processing. When the read completes, the event loop would be notified and the read would be retried from the cache (or some other data structure that has no risk of eviction before the read is retried). This would drive down tail latency for raft writes in cases where the raft entry cache has a less-than-perfect hit rate.
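Building on the hypothetical `logStorage` sketch above (and additionally importing `errors`), the idea could look roughly like this. The sentinel name, the read queue, and the scheduler hook are all illustrative assumptions, not a concrete API:

```go
// ErrEntriesTemporarilyUnavailable is the hypothetical sentinel that would
// tell etcd/raft to skip these entries for now and retry on a later Ready.
var ErrEntriesTemporarilyUnavailable = errors.New("entries temporarily unavailable")

type readRequest struct{ lo, hi, maxSize uint64 }

// asyncLogStorage wraps the cache-backed storage with an async read queue.
type asyncLogStorage struct {
	logStorage
	reads       chan readRequest // pending cache-miss reads
	notifyReady func()           // re-enqueues the replica on the raft scheduler
}

// Entries serves cache hits inline as before. On a miss it enqueues an
// asynchronous read and returns the sentinel instead of blocking the loop.
func (s *asyncLogStorage) Entries(lo, hi, maxSize uint64) ([]raftpb.Entry, error) {
	if ents, ok := s.cache.Get(lo, hi, maxSize); ok {
		return ents, nil
	}
	select {
	case s.reads <- readRequest{lo, hi, maxSize}:
	default: // a read is already queued; don't pile up duplicates
	}
	return nil, ErrEntriesTemporarilyUnavailable
}

// asyncReadLoop runs on its own goroutine. A completed read populates the
// entry cache (or a side structure pinned against eviction), and the replica
// is re-enqueued on the raft scheduler so the retried read hits in memory.
func (s *asyncLogStorage) asyncReadLoop() {
	for req := range s.reads {
		ents, err := s.readEntriesFromPebble(req.lo, req.hi, req.maxSize)
		if err != nil {
			continue // error handling elided in this sketch
		}
		s.cache.Add(ents)
		s.notifyReady()
	}
}
```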
Jira issue: CRDB-29234