storage: ack Raft proposals after Raft log commit, not state machine apply #38954
Conversation
Exciting results.
The separation between the two apply stages makes the code complicated. I wonder if this could be simplified by extracting a component that combines the entryGen and the batch and accesses the replica only through the necessary abstraction. At that point you could actually unit test it (for example, generating random batches and asserting that the right decisions are made by mocking out the replica), and the code would simplify (:crossed_fingers:) to something like
```go
// pre-stage
apper := newApplier(r, entryGen, cmdAppBatch)
apper.decodeAndMaybeAckEarly()
// regular stage (receives the apper from above)
apper.processAll()
```
the first argument to `newApplier` would be some
```go
type appI interface {
	applyEntry(...)
	canApplyEntry(...)
	// probably many more
}
```
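To illustrate the unit-testing angle, here is a minimal sketch of the kind of mock-backed test such an abstraction would enable. Every name in it (the appI methods, mockReplica, processAll, and the even-index policy) is hypothetical and purely for illustration:

```go
package apply_test

import "testing"

// appI is the hypothetical abstraction sketched above; the concrete Replica
// would implement it in production code.
type appI interface {
	canApplyEntry(e entry) bool
	applyEntry(e entry)
}

type entry struct{ index uint64 }

// mockReplica records which entries were applied so the test can assert on
// the applier's decisions without touching a real Replica.
type mockReplica struct {
	applied []uint64
}

// canApplyEntry uses an arbitrary policy (even indexes only) purely so the
// test has decisions to assert on.
func (m *mockReplica) canApplyEntry(e entry) bool { return e.index%2 == 0 }
func (m *mockReplica) applyEntry(e entry)         { m.applied = append(m.applied, e.index) }

// processAll stands in for the applier's regular stage.
func processAll(r appI, entries []entry) {
	for _, e := range entries {
		if r.canApplyEntry(e) {
			r.applyEntry(e)
		}
	}
}

func TestApplierDecisions(t *testing.T) {
	m := &mockReplica{}
	processAll(m, []entry{{1}, {2}, {3}, {4}})
	if len(m.applied) != 2 {
		t.Fatalf("expected 2 applied entries, got %d", len(m.applied))
	}
}
```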
Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 6 of 11 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/storage/client_replica_test.go, line 1819 at r3 (raw file):
```go
verifyKeysWithPrefix := func(prefix roachpb.Key, expectedKeys []roachpb.Key) {
	t.Helper()
	// Raft entry application is asynchronous, so we may not see the updates
```
How did you go about finding these now-flaky tests? Stress `./pkg/storage` for a while?
pkg/storage/client_split_test.go, line 686 at r3 (raw file):
```go
storeCfg.TestingKnobs.DisableSplitQueue = true
storeCfg.TestingKnobs.DisableMergeQueue = true
storeCfg.TestingKnobs.DisableRaftAckBeforeApplication = true
```
Mind adding a comment to each line like this explaining why we need it in the test?
pkg/storage/cmd_app_batch.go, line 26 at r3 (raw file):
```go
// iterating over committed entries to decode and apply.
//
// The entry and next methods should only be called if valid returns true.
```
s/should/must/
pkg/storage/replica_application.go, line 44 at r3 (raw file):
```go
// ackCommittedEntriesBeforeAppRaftMuLocked attempts to acknowledge the success
// of raft entries that have been durably committed to the raft log but have not
// yet been applied to the proposer replica's replicated state machine.
```
Add that whichever entries are handled here will be "unbound", that is, the delayed handling will not see them as needing signaling.
Also this method initializes the local proposals for all entries, which is confusing (you could think that this method is a pure optimization and commenting it out would leave things working, but that doesn't seem to be the case). Perhaps that can be rearranged into its own step or at least called out. Would it make sense to call retrieveLocalProposals
pkg/storage/replica_application.go, line 85 at r3 (raw file):
```go
	ctx context.Context, gen *entryGen, b *cmdAppBatch,
) (errExpl string, err error) {
	errExpl, err = b.decode(ctx, gen)
```
Add more comments here -- now you will either have nothing, or a chunk of trivial commands, or a nontrivial command. The common case is obviously that we get a chunk of trivial commands and that this is already all commands, but if we get unlucky there's a nontrivial command at the front and then a large chunk of trivial commands, in which case we won't be using the optimization. I think this is all fine, but please add counters for the number of commands that were in principle optimizable but were missed because they did not occur in the first batch. (Ideally this sits on the stats struct that @ajwerner also used in his PR).
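A tiny sketch of what such a counter could look like; the struct and field names here are assumptions for illustration, not the actual stats struct from that PR:

```go
package storage

// applyCommittedEntriesStats is an illustrative stand-in for the stats struct
// mentioned above.
type applyCommittedEntriesStats struct {
	batchesProcessed int
	entriesProcessed int
	// numAckableMissed counts commands that were in principle ackable before
	// application but were not decoded into the first batch, so the early-ack
	// optimization did not cover them.
	numAckableMissed int
}
```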
pkg/storage/replica_application.go, line 114 at r3 (raw file):
```go
// Signal the proposal's response channel with the result.
// Make a copy of the response to avoid data races.
```
Who are we racing with? `retrieveLocalProposals` has removed this from the proposal map already.
pkg/storage/replica_application.go, line 138 at r3 (raw file):
```go
// writes. This could be exposed as an option on the batch header instead.
func canAckBeforeApplication(ba *roachpb.BatchRequest) bool {
	return ba.IsTransactionWrite()
```
Shouldn't it also be trivial? You're not handling any state stuff in your early acks. The two probably coincide but it should be explicit.
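For concreteness, a sketch of the suggested shape of the check; `IsTransactionWrite` is taken from the snippet above, while `isTrivial` is a hypothetical helper standing in for whatever triviality check ends up being used:

```go
package storage

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// canAckBeforeApplication (sketch): require both that the batch is a
// transactional write and that it is "trivial", i.e. that it carries no
// replicated side effects beyond its write batch.
func canAckBeforeApplication(ba *roachpb.BatchRequest) bool {
	return ba.IsTransactionWrite() && isTrivial(ba)
}

// isTrivial is a placeholder for illustration; the real check would look at
// the evaluated command's replicated side effects rather than request flags.
func isTrivial(ba *roachpb.BatchRequest) bool {
	return !ba.IsAdmin() && !ba.IsReadOnly()
}
```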
pkg/storage/replica_application.go, line 217 at r3 (raw file):
Take it or leave it:
```go
// The caller early-acked some proposals that are being committed right now
// via ackCommittedEntriesBeforeAppRaftMuLocked, which already decoded the
// first batch and retrieved the local proposals.
```
pkg/storage/replica_raft.go, line 566 at r3 (raw file):
```go
// TODO(nvanbenschoten): I'm pretty sure that etcd/raft returns proposals
// in rd.Entries and rd.CommittedEntries at the same time for single peer
// Raft groups. In that case, we'll need to disable this optimization, as
```
This will also be the case whenever a peer is catching up. We should grab the first `Index` for entries being appended right now and disable the optimization for anything in `CommittedEntries` with indexes <= `Index`.
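A rough sketch of the guard this implies, phrased in terms of which committed entries remain safe to acknowledge early; the names are illustrative and the exact comparison in the final code may differ:

```go
package storage

import "go.etcd.io/etcd/raft/raftpb"

// canAckEarly reports whether a committed entry at committedIdx may be
// acknowledged before application in a Ready that is also appending
// appendedEntries. Entries at or above the first appended index overlap with
// the append and are not yet durable in the local log, so they must wait.
func canAckEarly(appendedEntries []raftpb.Entry, committedIdx uint64) bool {
	if len(appendedEntries) == 0 {
		return true // nothing is being appended in this Ready
	}
	return committedIdx < appendedEntries[0].Index
}
```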
pkg/storage/replica_test.go, line 8840 at r3 (raw file):
```go
}

// TestAckWriteBeforeApplication tests that the success of transactional writes
```
Would like to see the negative tested for some requests that aren't supposed to be early acked, i.e. a nontrivial one. I wish we could exercise this whole pipeline in more isolation, but it looks like we're slowly moving there.
Reviewed 5 of 11 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
@ajwerner I'm curious to hear your perspective on the
I think the basic idea makes a lot of sense. I had envisioned application primarily as a client concern of a replication package. In the code I had written committed entries get handed back to the client for application. I hadn't considered batching and the desire to decode and examine a batch before the rest of the ready gets processed, but given that we want to do that I'm supportive of encapsulating application logic above replication logic.
Where this could also come in handy is the datadriven raft testing I've been working on (not quite ready for prime time yet, but @nvanbenschoten has seen it). I'm introducing this primarily for testing the nitty-gritty of joint config change interactions including simulated crashes, but I was also thinking that it can be used by apps using Raft (i.e. etcd) to do similar testing of the storage and apply loop in a more end-to-end fashion. Doing so, it would be good to be able to mock out the actual command application.
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018. Fixes #37810. May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430)

If these two entries happened to land in the same application batch on the leaseholder then the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied.

The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have as obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex, so it would have added it back to the proposal map, but then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point this refresh would have hit a lease index error and would be reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client.

It seems like there should be a way for this to cause #39022, but it doesn't exactly line up. Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing its duration down to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion.

I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
39149: roachtest: skip kv/contention/nodes=4 for release-19.1 r=tbg a=nvanbenschoten

Fixes #39116. release-19.1 is susceptible to the issues described in #36089, so it won't reliably pass this test.

39160: storage: add DisableRaftLogQueue to StoreTestingKnobs r=tbg a=nvanbenschoten

Pulled from #38954, which I want to keep focused, especially with the PR's new secondary focus on refactoring entry application to be easier to mock and test. Release note: None

39161: storage: address TODO in TestPushTxnHeartbeatTimeout r=tbg a=nvanbenschoten

Pulled from #38954, which I want to keep focused, especially with the PR's new secondary focus on refactoring entry application to be easier to mock and test. Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
39158: storage: clean up below-Raft Split and Merge locking r=bdarnell a=nvanbenschoten

This PR contains three related cleanups that I stumbled into while cleaning up raft entry application in preparation for #38954.

The first cleanup is that we avoid improperly handling errors that may be returned when acquiring the split/merge locks. If this operation fails then we can't simply return an error and not apply the command. We need to be deterministic at this point. The only option we have is to fatal. See #19448 (comment).

The second cleanup is that we then remove stale code that was attempting to recover from failed split and merge application. This error handling only made sense in a pre-proposer evaluated KV world.

The third commit addresses a TODO to assert that the RHS range during a split is not initialized. The TODO looks like it went in before proposer evaluated KV, which protects us against both replays and reproposals.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
…pplication This will be pulled out into cockroachdb#38954 instead. Release note: None
39254: storage/apply: create apply package for raft entry application r=nvanbenschoten a=nvanbenschoten

The new package provides abstractions and routines associated with the application of committed raft entries to a replicated state machine. This was inspired by four driving forces:

- We've been having a number of discussions on the Core team about making storage abstractions more clear and easier to understand in isolation. One commonly discussed proposal is introducing a `storage/replicate` package that would encapsulate the concerns of raft replication (e.g. log manipulation, snapshots, leader election, heartbeats, etc.). This `storage/apply` package will fit in nicely alongside a replication abstraction.
- Initial discussion on #38954 concluded that adding an optimization to acknowledge clients after their raft entries have committed but before they had been applied with the current code structure was moving in the opposite direction and making things even harder to understand due to the introduction of more complex state management.
- Recent instability in this area (#38976, #39064, #39135, #39203) has revealed that there exists a high degree of difficulty involved in testing any of the logic in the area of raft entry application. This has naturally led to testing at a distance using tools like testing hooks, which is frustrating and delicate. As a result, we're missing tests for things like old migrations that we still need to support. We also have trouble writing regression tests when bugs do pop up.
- The proposed optimization in #17500 (comment) to apply committed raft entries to the Replica storage engine asynchronously, in a separate thread from the raft processing thread, will make entry application significantly more complex. For instance, we'll likely need to introduce a separate scheduler to coordinate entry application passes across Ranges on a node. The scheduler will likely want to prioritize leaders over followers and embed other policies to optimize for total system throughput. There's a strong desire to isolate this new complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of applying committed raft entries. To start, this makes the process easier to understand both in terms of the macro-level steps that are taken during application of a batch of entries and in terms of the impact that an individual command has on the replicated state machine. For instance, the PR helps provide answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit to write targeted unit tests. Not only can the `storage/apply` package be tested with a mock state machine (done in this PR), but we can test Replica's implementation of the state machine interface in isolation without needing to touch raft at all.

Finally, the refactor paves the way for making the proposed change in #38954 in a much cleaner way. This is demonstrated in the second commit, which is being included here to show why certain things were designed the way they were but will not be merged with this PR.
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
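As a rough illustration of the macro-level stages that the apply package description above formalizes (decode, stage write batches, commit to the engine, apply side effects, acknowledge proposers), here is a toy sketch; the types and method names are illustrative only and do not reflect the package's actual API:

```go
package apply_sketch

// command is an illustrative decoded raft entry.
type command struct {
	index       uint64
	writeBatch  []byte
	sideEffects func()
	ack         func(rejected bool)
}

// engine is an illustrative storage engine that can apply opaque write batches.
type engine interface {
	ApplyBatch(batches [][]byte)
}

// applyCommittedEntries sketches one application pass over a chunk of
// committed entries: stage every command's write batch, commit them to the
// engine, then apply side effects and acknowledge proposers.
func applyCommittedEntries(eng engine, cmds []command) {
	staged := make([][]byte, 0, len(cmds))
	for _, c := range cmds {
		staged = append(staged, c.writeBatch) // stage
	}
	eng.ApplyBatch(staged) // commit to the storage engine
	for _, c := range cmds {
		if c.sideEffects != nil {
			c.sideEffects() // non-trivial, in-memory effects
		}
		c.ack(false) // acknowledge the (local) proposer
	}
}
```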
3404a42 to 9ad4965 Compare
Ok, I've rebased this on top of #39254, which allowed a lot of the code here to dissolve away. Most of this has already been reviewed over there, so I'm not sure how much more love this PR needs on its own.
Outside of the rebase, the only big change is that we're now properly handling raft ready structs with overlap between the Entries and CommittedEntries slices. I updated TestAckWriteBeforeApplication to explore this case.
PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
pkg/storage/client_replica_test.go, line 1819 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
How did you go about finding these now-flaky tests? Stress `./pkg/storage` for a while?
Yes, exactly. I stressed the whole package for a while and kept skipping tests until it ran without issue. Most of this went away once we disabled the optimization for single-replica Ranges.
pkg/storage/replica_application.go, line 114 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Who are we racing with? `retrieveLocalProposals` has removed this from the proposal map already.
Done.
The refactor with the apply package made this really clean.
Reviewed 20 of 20 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/storage/replica_application_cmd.go, line 121 at r4 (raw file):
```go
// CanAckBeforeApplication implements the apply.CheckedCommand interface.
func (c *replicatedCmd) CanAckBeforeApplication() bool {
	// CanAckBeforeApplication determines whether the request type is compatable
```
s/compatable/compatible/
how's that one in our dictionary?
pkg/storage/store_test.go, line 2491 at r4 (raw file):
```go
// Query the range with a single scan, which should cause all intents
// to be resolved.
```
Why this here?
storage: ack Raft proposals after Raft log commit, not state machine apply

Informs cockroachdb#17500.

This is a partial revival of cockroachdb#18710 and a culmination of more recent thinking in cockroachdb#17500 (comment).

The change adjusts the Raft processing loop so that it acknowledges the success of raft entries as soon as it learns that they have been durably committed to the raft log instead of after they have been applied to the proposer replica's replicated state machine. This not only pulls the application latency out of the hot path for Raft proposals, but it also pulls the next raft ready iteration's write to its Raft log (with the associated fsync) out of the hot path for Raft proposals.

This is safe because a proposal through raft is known to have succeeded as soon as it is replicated to a quorum of replicas (i.e. has committed in the raft log). The proposal does not need to wait for its effects to be applied in order to know whether its changes will succeed or fail. The raft log is the provider of atomicity and durability for replicated writes, not (ignoring log truncation) the replicated state machine itself, so a client can be confident in the result of a write as soon as the raft log confirms that it has succeeded.

However, there are a few complications in acknowledging the success of a proposal at this stage:

1. Committing an entry in the raft log and having the command in that entry succeed are similar but not equivalent concepts. Even if the entry succeeds in achieving durability by replicating to a quorum of replicas, its command may still be rejected "beneath raft". This means that a (deterministic) check after replication decides that the command will not be applied to the replicated state machine. In that case, the client waiting on the result of the command should not be informed of its success. Luckily, this check is cheap to perform so we can do it here and when applying the command. See Replica.shouldApplyCommand.

2. Some commands perform non-trivial work such as updating Replica configuration state or performing Range splits. In those cases, it's likely that the client is interested in not only knowing whether it has succeeded in sequencing the change in the raft log, but also in knowing when the change has gone into effect. There's currently no exposed hook to ask for an acknowledgement only after a command has been applied, so for simplicity the current implementation only ever acks transactional writes before they have gone into effect. All other commands wait until they have been applied to ack their client.

3. Even though we can determine whether a command has succeeded without applying it, the effect of the command will not be visible to conflicting commands until it is applied. Because of this, the client can be informed of the success of a write at this point, but we cannot release that write's latches until the write has applied. See ProposalData.signalProposalResult/finishApplication.

### Benchmarks

The change appears to result in an **8-10%** improvement to throughput and a **6-10%** reduction in p50 latency across the board on kv0. I ran a series of tests with different node sizes and different workload concurrencies and the win seemed pretty stable. This was also true regardless of whether the writes were to a single Raft group or a large number of Raft groups.

```
name                           old ops/sec  new ops/sec  delta
kv0/cores=16/nodes=3/conc=32    24.1k ± 0%   26.1k ± 1%   +8.35%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=48    30.4k ± 1%   32.8k ± 1%   +8.02%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=64    34.6k ± 1%   37.6k ± 0%   +8.79%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72    46.6k ± 1%   50.8k ± 0%   +8.99%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108   58.8k ± 1%   64.0k ± 1%   +8.99%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144   68.1k ± 1%   74.5k ± 1%   +9.45%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=144   55.8k ± 1%   59.7k ± 2%   +7.12%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216   64.4k ± 4%   68.1k ± 4%   +5.65%  (p=0.016 n=5+5)
kv0/cores=72/nodes=3/conc=288   68.8k ± 2%   74.5k ± 3%   +8.39%  (p=0.008 n=5+5)

name                           old p50(ms)  new p50(ms)  delta
kv0/cores=16/nodes=3/conc=32     1.30 ± 0%    1.20 ± 0%   -7.69%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=48     1.50 ± 0%    1.40 ± 0%   -6.67%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=64     1.70 ± 0%    1.60 ± 0%   -5.88%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72     1.40 ± 0%    1.30 ± 0%   -7.14%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108    1.60 ± 0%    1.50 ± 0%   -6.25%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144    1.84 ± 3%    1.70 ± 0%   -7.61%  (p=0.000 n=5+4)
kv0/cores=72/nodes=3/conc=144    2.00 ± 0%    1.80 ± 0%  -10.00%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216    2.46 ± 2%    2.20 ± 0%  -10.57%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=288    2.80 ± 0%    2.60 ± 0%   -7.14%  (p=0.079 n=4+5)

name                           old p99(ms)  new p99(ms)  delta
kv0/cores=16/nodes=3/conc=32     3.50 ± 0%    3.50 ± 0%     ~     (all equal)
kv0/cores=16/nodes=3/conc=48     4.70 ± 0%    4.58 ± 3%     ~     (p=0.167 n=5+5)
kv0/cores=16/nodes=3/conc=64     5.50 ± 0%    5.20 ± 0%   -5.45%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72     5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108    5.80 ± 0%    5.50 ± 0%   -5.17%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144    6.48 ± 3%    6.18 ± 3%   -4.63%  (p=0.079 n=5+5)
kv0/cores=72/nodes=3/conc=144    11.0 ± 0%    10.5 ± 0%   -4.55%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216    13.4 ± 2%    13.2 ± 5%     ~     (p=0.683 n=5+5)
kv0/cores=72/nodes=3/conc=288    18.2 ± 4%    17.2 ± 3%   -5.70%  (p=0.079 n=5+5)
```

Release note (performance improvement): Raft entries no longer wait to be applied to the RocksDB storage engine before signaling their success to clients; they now only wait until they are committed in their Raft log.
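To make complications 1 and 3 above concrete, here is a toy sketch of the per-command decision; the struct, its fields, and maybeAckEarly are illustrative only and not the merged code:

```go
package storage_sketch

// committedCommand is an illustrative stand-in for a durably committed
// proposal, not the real proposal type.
type committedCommand struct {
	passesBelowRaftChecks bool // cf. Replica.shouldApplyCommand (complication 1)
	isTransactionalWrite  bool // only plain transactional writes ack early (complication 2)
	acked                 bool
}

// maybeAckEarly signals the client once the entry is durably committed, but
// only for commands that will definitely apply and that are plain
// transactional writes. Even when we ack here, latches are released only
// later, once the command has actually applied (complication 3).
func maybeAckEarly(c *committedCommand) {
	if !c.passesBelowRaftChecks || !c.isTransactionalWrite {
		return // the client waits for actual application
	}
	c.acked = true // signal the waiting client with its result now
}
```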
9ad4965 to 87aaea7 Compare
TFTRs
bors r=ajwerner
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @tbg)
pkg/storage/replica_application_cmd.go, line 121 at r4 (raw file):
Previously, ajwerner wrote…
s/compatable/compatible/
how's that one in our dictionary?
Done.
pkg/storage/store_test.go, line 2491 at r4 (raw file):
Previously, ajwerner wrote…
Why this here?
It was some cleanup I made in an earlier version of this PR that I figured I'd carry over.
Build failed (retrying...)
38954: storage: ack Raft proposals after Raft log commit, not state machine apply r=ajwerner a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded
There isn't an issue yet. Think this might be new @nvanbenschoten?
No, that was because of #39609. We shouldn't have been using
Cool, thanks.