38954: storage: ack Raft proposals after Raft log commit, not state machine apply r=ajwerner a=nvanbenschoten

Informs #17500.

This is a partial revival of #18710 and a culmination of more recent thinking in #17500 (comment).

The change adjusts the Raft processing loop so that it acknowledges the success of Raft entries as soon as it learns that they have been durably committed to the Raft log, instead of after they have been applied to the proposer replica's replicated state machine. This not only pulls the application latency out of the hot path for Raft proposals, but it also pulls the next Raft ready iteration's write to the Raft log (with its associated fsync) out of the hot path for Raft proposals.

This is safe because a proposal through Raft is known to have succeeded as soon as it is replicated to a quorum of replicas (i.e. has committed in the Raft log). The proposal does not need to wait for its effects to be applied in order to know whether its changes will succeed or fail. The Raft log is the provider of atomicity and durability for replicated writes, not (ignoring log truncation) the replicated state machine itself, so a client can be confident in the result of a write as soon as the Raft log confirms that it has succeeded.

However, there are a few complications in acknowledging the success of a proposal at this stage:

1. Committing an entry in the Raft log and having the command in that entry succeed are similar but not equivalent concepts. Even if the entry achieves durability by replicating to a quorum of replicas, its command may still be rejected "beneath Raft": a (deterministic) check after replication may decide that the command will not be applied to the replicated state machine. In that case, the client waiting on the result of the command should not be informed of its success. Luckily, this check is cheap to perform, so we can do it both here and when applying the command. See `Replica.shouldApplyCommand`.

2. Some commands perform non-trivial work, such as updating Replica configuration state or performing Range splits. In those cases, the client is likely interested not only in knowing that the change has been sequenced in the Raft log, but also in knowing when the change has gone into effect. There's currently no exposed hook to ask for an acknowledgment only after a command has been applied, so for simplicity the current implementation only ever acks transactional writes before they have gone into effect. All other commands wait until they have been applied to ack their client.

3. Even though we can determine whether a command has succeeded without applying it, the effect of the command will not be visible to conflicting commands until it is applied. Because of this, the client can be informed of the success of a write at this point, but we cannot release that write's latches until the write has applied. See `ProposalData.signalProposalResult/finishApplication`, and the sketch after this list.
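To make the ordering concrete, here is a minimal, self-contained Go sketch of the ack-early, apply-later flow described above. It is not the CockroachDB implementation: the types and names (`proposal`, `shouldApply`, `ackCommittedEntries`, `applyEntries`, `leaseIndexMax`) are simplified, hypothetical stand-ins for the real `Replica.shouldApplyCommand` and `ProposalData.signalProposalResult/finishApplication` machinery.

```go
// Sketch only: illustrates acking a proposal once its entry is durably
// committed in the raft log, while holding its latches until apply.
// All names here are simplified stand-ins, not CockroachDB APIs.
package main

import "fmt"

// proposal is a client write waiting on a raft entry.
type proposal struct {
	leaseIndexMax uint64     // input to the deterministic below-raft check (illustrative)
	done          chan error // signaled to ack the client
	latches       func()     // released only once the command has applied
}

// shouldApply is the cheap, deterministic "beneath raft" rejection check.
// It must return the same answer here and again at application time.
func shouldApply(p *proposal, appliedLeaseIndex uint64) bool {
	return p.leaseIndexMax > appliedLeaseIndex
}

// ackCommittedEntries runs as soon as entries are known to be durably
// committed in the raft log, before they are applied to the state machine.
func ackCommittedEntries(ps []*proposal, appliedLeaseIndex uint64) {
	for _, p := range ps {
		if !shouldApply(p, appliedLeaseIndex) {
			// Command will be rejected beneath raft; do not ack success.
			// (A real implementation would re-propose or signal the error
			// at apply time.)
			continue
		}
		// Ack the client now: the raft log already guarantees durability.
		p.done <- nil
	}
}

// applyEntries runs later, applying each command to the state machine and
// only then releasing its latches, so conflicting commands cannot observe
// the write before its effects are visible.
func applyEntries(ps []*proposal) {
	for _, p := range ps {
		p.latches()
	}
}

func main() {
	p := &proposal{
		leaseIndexMax: 10,
		done:          make(chan error, 1),
		latches:       func() { fmt.Println("latches released") },
	}
	ackCommittedEntries([]*proposal{p}, 5) // entry committed in raft log
	fmt.Println("client acked:", <-p.done) // client unblocks before apply
	applyEntries([]*proposal{p})           // effects become visible
}
```

The key design point the sketch tries to capture is that the client-facing ack and the latch release are deliberately decoupled: durability is decided by the log, but visibility to conflicting commands is decided by application.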
### Benchmarks

The change appears to result in an **8-10%** improvement to throughput and a **6-10%** reduction in p50 latency across the board on kv0. I ran a series of tests with different node sizes and different workload concurrencies, and the win seemed pretty stable. This was also true regardless of whether the writes were to a single Raft group or to a large number of Raft groups.

```
name                           old ops/sec  new ops/sec  delta
kv0/cores=16/nodes=3/conc=32     24.1k ± 0%   26.1k ± 1%   +8.35%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=48     30.4k ± 1%   32.8k ± 1%   +8.02%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=64     34.6k ± 1%   37.6k ± 0%   +8.79%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72     46.6k ± 1%   50.8k ± 0%   +8.99%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108    58.8k ± 1%   64.0k ± 1%   +8.99%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144    68.1k ± 1%   74.5k ± 1%   +9.45%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=144    55.8k ± 1%   59.7k ± 2%   +7.12%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216    64.4k ± 4%   68.1k ± 4%   +5.65%  (p=0.016 n=5+5)
kv0/cores=72/nodes=3/conc=288    68.8k ± 2%   74.5k ± 3%   +8.39%  (p=0.008 n=5+5)

name                           old p50(ms)  new p50(ms)  delta
kv0/cores=16/nodes=3/conc=32      1.30 ± 0%    1.20 ± 0%   -7.69%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=48      1.50 ± 0%    1.40 ± 0%   -6.67%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=64      1.70 ± 0%    1.60 ± 0%   -5.88%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72      1.40 ± 0%    1.30 ± 0%   -7.14%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108     1.60 ± 0%    1.50 ± 0%   -6.25%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144     1.84 ± 3%    1.70 ± 0%   -7.61%  (p=0.000 n=5+4)
kv0/cores=72/nodes=3/conc=144     2.00 ± 0%    1.80 ± 0%  -10.00%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216     2.46 ± 2%    2.20 ± 0%  -10.57%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=288     2.80 ± 0%    2.60 ± 0%   -7.14%  (p=0.079 n=4+5)

name                           old p99(ms)  new p99(ms)  delta
kv0/cores=16/nodes=3/conc=32      3.50 ± 0%    3.50 ± 0%     ~     (all equal)
kv0/cores=16/nodes=3/conc=48      4.70 ± 0%    4.58 ± 3%     ~     (p=0.167 n=5+5)
kv0/cores=16/nodes=3/conc=64      5.50 ± 0%    5.20 ± 0%   -5.45%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72      5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108     5.80 ± 0%    5.50 ± 0%   -5.17%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144     6.48 ± 3%    6.18 ± 3%   -4.63%  (p=0.079 n=5+5)
kv0/cores=72/nodes=3/conc=144     11.0 ± 0%    10.5 ± 0%   -4.55%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216     13.4 ± 2%    13.2 ± 5%     ~     (p=0.683 n=5+5)
kv0/cores=72/nodes=3/conc=288     18.2 ± 4%    17.2 ± 3%   -5.70%  (p=0.079 n=5+5)
```

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>