ZOOKEEPER-4646: Committed txns may still be lost if followers crash after replying ACK of NEWLEADER but before writing txns to disk #1993
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
In brief, committed logs might be lost due to the follower's asynchronous transaction logging when replying ACK of NEWLEADER during the SYNC phase.
See ZOOKEEPER-4646 for details on the symptom, example trace, diagnostic, and possible fix idea.
Actually, this problem had first been raised in ZOOKEEPER-3911. However, the fixing patch of ZOOKEEPER-3911 does not solve the problem at the root. Besides, ZOOKEEPER-4685 is also caused with similar root cause, i.e., non-deterministic execution orders between the QuorumPeer thread and the SyncThread when the follower replies ACK of NEWLEADER.
Possible Fix
Considering ZOOKEEPER-4646 and ZOOKEEPER-4685, one possible fix is to guarantee the following execution orders to be satisfied:
There are several ways of implementation.
Option 1 : Multi-threaded collaboration between the follower's QuorumPeer thread and the SyncThread
The follower's QuorumPeer thread will not reply ACK of NEWLEADER until it is notified that the SyncThread has persisted the uncommitted logs to the disk. Use a CountDownLatch ( named
newleaderLatch
, for example) to record the number of uncommitted transactions that should be logged before the follower replies ACK of NEWLEADER. Before replying ACK of NEWLEADER, the follower's QuorumPeer thread will be blocked atnewleaderLatch.await(..)
if the count of thenewleaderLatch
is still non-zero. Whenever the SyncThread logs a transaction and finds thatnewleaderLatch.getCount() > 0
, it will callnewleaderLatch.countDown()
. Only after the count of thenewleaderLatch
turns to zero will the follower be able to reply ACK of NEWLEADER. The ACKs of above transaction proposals will be replied after the ACK of NEWLEADER is replied.- Pros & Cons
learner.newleaderLatch
inSendAckRequestProcessor.processRequest(..)
may affect the performance of follower's SyncThread replying ACK of PROPOSAL during BROADCAST phase.Option 2 (Selected) : All done by the follower's QuorumPeer thread
The uncommitted logs will be persisted to disk by the follower's QuorumPeer thread rather than SyncThread. That is, after receiving NEWLEADER, the follower's QuorumPeer thread will persist the uncommitted logs to disk, then reply ACK of NEWLEADER, and then reply pending ACKs of PROPOSALs.
- Pros & Cons
The solution in this patch implements Option 2 mainly for the performance consideration. With little performance penalty, it promises the safety property that all committed logs will not be lost. The solution is able to avoid the issues of ZOOKEEPER-4646, ZOOKEEPER-3911 and ZOOKEEPER-4685 at the same time.