Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-4394: Apply only committed requests in sync with leader before NEWLEADER ACK #2152

Merged
merged 4 commits into from
Sep 19, 2024

Conversation

AlphaCanisMajoris
Copy link
Contributor

@AlphaCanisMajoris AlphaCanisMajoris commented Mar 28, 2024

This fix aims to fix a bunch of issues in follower's syncWithLeader().
e.g.,
ZOOKEEPER-4394: NullPointerException when follower receives COMMIT after replying NEWLEADER ack in syncWithLeader().
Explanation: In syncWithLeader(), a follower may receive the COMMIT message after replying NEWLEADER ack. e.g., a follower receives the following messages in order:
DIFF -> PROPOSAL -> COMMIT -> PROPOSAL -> NEWLEADER -> COMMIT -> UPTODATE

NullPointerException will occur if the corresponding PROPOSAL has been processed and removed from the "packetsNotCommitted" during the follower process NEWLEADER.

To fix this, when the follower receives NEWLEADER, it does the following things in order:

  1. take snapshot if needed;
  2. persist and process the committed txns according to "packetsCommitted" synchronously (these txns have been committed by leader). Note that leave the outstanding proposals in "packetsNotCommitted" (not in "packetsCommitted") to be processed later;
  3. update currentEpoch;
  4. reply NEWLEADER ack.

After the follower replies UPTODATE ack, it persist the txns in "packetsNotCommitted" and commit the txns in "packetsCommitted" asynchronously.

Besides, this fixes some other issues in Learner.syncWithLeader():
ZOOKEEPER-3023: Flaky test: Zab1_0Test#testNormalFollowerRunWithDiff
ZOOKEEPER-4643: Committed txns lost when follower crashes after updating currentEpoch
ZOOKEEPER-4646: Committed txns lost when follower crashes after replying NEWLEADER ack
ZOOKEEPER-4685: Leader shutdown due to follower replies PROPOSAL ack before NEWLEADER ack in Synchronization phase

We have leveraged the TLA+ specifications of ZooKeeper and verified the correctness of this fix.

@AlphaCanisMajoris
Copy link
Contributor Author

AlphaCanisMajoris commented Mar 28, 2024

When the follower processes the NEWLEADER message, it only persists and processes the txns that the leader asks it to COMMIT. This guarantees no loss of committed txns because the follower syncs the leader's committed history before replying NEWLEADER ack (ZOOKEEPER-4646).

Besides, this fix removes the redundant acks for the PROPOSALs of committed txns in SYNCHRONIZATION phase. In fact, there is no need to reply PROPOSAL acks for these committed txns. Because these txns have been committed by the leader, their PROPOSAL acks are redundant for the leader (See Leader#processAck).

@AlphaCanisMajoris
Copy link
Contributor Author

AlphaCanisMajoris commented Mar 30, 2024

Analysis of #2111 and #1848:

#2111 can avoid ZOOKEEPER-4643, but it cannot fix other issues like ZOOKEEPER-4394: the NullPointerException problem when the follower receives COMMIT after replying NEWLEADER ack in syncWithLeader().

In #1848, when receiving NEWLEADER, the follower only persists and processes the packets according to "packetsCommitted". It still keeps the outstanding proposals in "packetsNotCommitted" to avoid NullPointerException when receiving COMMIT packet(s) right after replying NEWLEADER ack (ZOOKEEPER-4394).

Take #2111 and #1848 into consideration, this fix avoids a group of issues in SYNCHRONIZATION, including ZOOKEEPER-4643, ZOOKEEPER-4646, ZOOKEEPER-4685, ZOOKEEPER-4394 and the problem of flaky tests ZOOKEEPER-3023.

@AlphaCanisMajoris
Copy link
Contributor Author

Add a test case: Zab1_0Test#testNormalFollowerRun_ProcessCommitInSyncAfterAckNewLeader. This checks the case when the follower receives PROPOSAL, NEWLEADER, COMMIT & UPTODATE in order during Learner.syncWithLeader().

@eolivelli
Copy link
Contributor

One curiosity:

We have leveraged the TLA+ specifications of ZooKeeper and verified the correctness of this fix.

Who is 'we' ?
Can you share your tests?

ZOOKEEPER-4394: NullPointerException when follower receives COMMIT after replying NEWLEADER ack
ZOOKEEPER-4643: Committed txns lost when follower crashes after updating currentEpoch
ZOOKEEPER-4646: Committed txns lost when follower crashes after replying NEWLEADER ack
ZOOKEEPER-4685: Leader shutdown due to follower replies PROPOSAL ack before NEWLEADER ack in Synchronization phase
ZOOKEEPER-3023: Flaky tests: Zab1_0Test#testNormalFollowerRunWithDiff
@AlphaCanisMajoris
Copy link
Contributor Author

One curiosity:

We have leveraged the TLA+ specifications of ZooKeeper and verified the correctness of this fix.

Who is 'we' ?

We are a research team dedicated to the verification of distributed systems using TLA+.
We participated in the development of the TLA+ specifications for Zab & ZooKeeper.

Based on ZooKeeper's existing specification, we refine the actions according to the code implementation to explore the possible interleaving of multi-node and multi-threading events with failures. Then we check the specification to see whether it violates some invariants (derived from Zab paper and code) with the model checker.

For example, when we specify Learner.syncWithLeader(..) in version 3.9.1, the model checking found the issue traces
of ZOOKEEPER-4643, ZOOKEEPER-4646, ZOOKEEPER-4394, ZOOKEEPER-4685 and ZOOKEEPER-3023.

When specifying Learner.syncWithLeader(..) in version 3.9.2 (fixed by #2111), the issue traces of ZOOKEEPER-4394 and ZOOKEEPER-3023 can be detected.

More details of the specifications can be found in this repo.

Can you share your tests?

We verify this fix with the TLA+ specification zk_pr_2152.

Some details in this specification: according to the fix in this PR, the logic when a follower receives NEWLEADER is specified into three actions:

  • FollowerProcessNEWLEADER_pr2152_1: Take snapshot if needed; persist and commit requests according to "packetsCommitted".
  • FollowerProcessNEWLEADER_pr2152_2: Update currentEpoch.
  • FollowerProcessNEWLEADER_pr2152_3: Reply NEWLEADER ack.

The verification statistics are provided here. No violation is found during the checking with various configurations.

Copy link
Member

@kezhuw kezhuw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I have left some comments though.

Sorry for late. It has been a while since #1848(2 years) and thought on #1925(nearly 1 year) for me to bring back.

@li4wang @yisong-yue @jonmv @hanm Sorry for ping you, but would you mind take a look for this ?

/*
* @see https://github.com/apache/zookeeper/pull/1848
* Persist and process the committed txns in "packetsNotCommitted"
* according to "packetsCommitted", which have been committed by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is somewhat misunderstanding. The key is to log these committed ones, they are considered committed before election by the paper. All the reason we touch packetsNotCommitted here is to make sure it is not logRequest again in broadcast phase. I think it might be better to rename packetsNotCommitted to packetsNotLogged as @jeffrey-xiao did in #1930. "log" is a disk operation, "commit" is an agreement. What we want here should be "log committed txns agreed in election".

Coming into the implementation, new proposals could still be committed before NEWLEADER since LearnerHandler does not issue NEWLEADER right after these committed txns. But it does not harm us here as we are potentially to persist more but not less and new leader expect no ack for committed ones.

*
* @see https://issues.apache.org/jira/browse/ZOOKEEPER-4394
* Keep the outstanding proposals in "packetsNotCommitted" to avoid
* NullPointerException when the follower receives COMMIT packet(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given my comments above, then we should not clear packetsNotCommitted apparently. All txns not in packetsCommitted are proposals from new election.

long start = System.currentTimeMillis();
while (createSessionZxid != f.fzk.getLastProcessedZxid()
&& (System.currentTimeMillis() - start) < 50) {
Thread.sleep(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When specifying Learner.syncWithLeader(..) in version 3.9.2 (fixed by #2111), the issue traces of ZOOKEEPER-4394 and ZOOKEEPER-3023 can be detected.

I did not see ZOOKEEPER-3023 after #2111. But if you are verifying this using TLA, this is doomed to failure. I am +1 to revert this to pre ZOOKEEPER-2678.

// Leader sends an outstanding proposal
long proposalZxid = ZxidUtils.makeZxid(1, 1001);
proposeSetData(qp, proposalZxid, "data2", 2);
oa.writeRecord(qp, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is what ZOOKEEPER-4394 tried to report. New proposals are issued before NEWLEADER. This is the gap between paper and implementation.

// The outstanding proposal has not been persisted yet
ZKDatabase zkDb2 = new ZKDatabase(new FileTxnSnapLog(logDir, snapDir));
long lastZxid = zkDb2.loadDataBase();
assertEquals("data1", new String(zkDb2.getData("/foo", stat, null)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems an extra enforce to the paper. But given that we are driving the test step by step and we are testing implementation, I am +1 on this. By the paper, we should not have this problem and assertions should still hold anyway as there are no new proposals from new leader before NEWLEADER.

@jonmv
Copy link
Contributor

jonmv commented Apr 3, 2024

Hi, this is great!

As luck would have it, I had to rebase my fork onto 3.9.2 a few weeks ago, and due to conflicts with #2111 I already reviewed that commit. My conclusion then was that the NPE this PR fixes still remained an issue, and I think there are additional fixes in #1925 related to shutdown of the ZK database/server, that would also lead to (mostly harmless) duplicate series in the transaction log.

I've dedicated some time for reviewing this PR, and will also make sure to mint a fresh PR with any outstanding fixes from #1925, in the hope that I may finally close that fork.

@jonmv
Copy link
Contributor

jonmv commented Apr 4, 2024

I think the changes in this PR look good 👍

From what I can tell, the union of #2111, this, and #2154 (which I just opened) should together cover #1925, which is now a battle-proven patch, used by vespa.ai for the last year and a half.

@AlphaCanisMajoris
Copy link
Contributor Author

Hi, @kezhuw. thanks a lot for your careful review! I've updated the code fix in this pr. Could u please take some time to check the new commit and see if there is anything else that needs improvement? Thanks!

@AlphaCanisMajoris
Copy link
Contributor Author

Hi @eolivelli. Sorry for bother you, but would you mind take a look at this pr and consider merging it?

This pr fixes a bunch of issues in follower's syncWithLeader() that stay unresolved for a long time. Besides, it can drive the following fixes like #2154.
It is built based on the latest fix #2111. It has been checked by the new added test case Zab1_0Test#testNormalFollowerRun_ProcessCommitInSyncAfterAckNewLeader and the TLA+ specification verification results.

@tsuna
Copy link

tsuna commented Jun 13, 2024

Will this fix be backported in the 3.8 train? We just hit this bug on one of our clusters, it's a shame we've had various fixes up for review for over 20 months, I would appreciate you guys pushing this through the finish line so this critical issue can be closed. Thanks!

@tsuna
Copy link

tsuna commented Jun 19, 2024

Hi folks, is there any interest in merging this fix soon and cutting a release?

@li4wang
Copy link
Contributor

li4wang commented Jul 4, 2024

I will setup some time to review it early next week.

@AlphaCanisMajoris
Copy link
Contributor Author

I will setup some time to review it early next week.

Hi @li4wang, thanks for your participation in reviewing this pr! Do you think it ok to merge this fix?

@AlphaCanisMajoris
Copy link
Contributor Author

Hi @kezhuw. Really appreciate your careful review of this pr the other days! Would u mind take some time to check the pr and see if it could be merged or not? Thanks a lot!

@kezhuw
Copy link
Member

kezhuw commented Aug 5, 2024

@AlphaCanisMajoris I think we need one more binding approval to gain confidence. Most of no trivial prs need two binding approvals by convention.

@AlphaCanisMajoris
Copy link
Contributor Author

@kezhuw Thanks for your review and approval! By the way, is it possible to invite or assign some developers familiar to ZooKeeper's design logic to review this fix so we can accelerate the merge progress and close this issue before cutting the next release? (Which is strongly desired by multiple developers >..<)

@tsuna
Copy link

tsuna commented Aug 8, 2024

I would appreciate the maintainers' attention here, this is a critical data corruption bug in ZK. We just ran into it again on one of our dev clusters.

@kezhuw
Copy link
Member

kezhuw commented Aug 9, 2024

@tsuna Which ZooKeeper version did you run ? I saw messages you posted on ZOOKEEPER-4394. I guess it is 3.8.2.

ZOOKEEPER-4785 has been landed in 3.8.4 and 3.9.2. Though, ZOOKEEPER-4394 was still left behind, but I think the txn loss ZOOKEEPER-4643 and ZOOKEEPER-4646 reported has already been fixed if I understand correctly. Hmm, there might be still paths to data inconsistency as ZOOKEEPER-4541 and ZOOKEEPER-4712 reported.

@eolivelli @anmolnar @symat @maoling @tisonkun @li4wang @cnauroth Could anyone please take a look at this ? After #2111(ZOOKEEPER-4785) merged, this is basically an alternative to #1930 (@jeffrey-xiao)

ZOOKEEPER-4394 describes the issue, following is my understandings:

  1. Learner::syncWithLeader assumes/demands no outstanding PROPOSAL before NEWLEADER.
  2. But leader requires only quorum acks of NEWLEADER to issue outstanding PROPOSALs. This means learner is possible to see uncommitted proposal on receiving NEWLEADER.
  3. The combination of above two breaks following up COMMIT.

Besides above, this patch reverts Zab1_0Test.testNormalFollowerRunWithDiff to prior ZOOKEEPER-2678 which I think it is a good as I expressed in ZOOKEEPER-3023.

@tsuna
Copy link

tsuna commented Aug 14, 2024

@tsuna Which ZooKeeper version did you run ? I saw messages you posted on ZOOKEEPER-4394. I guess it is 3.8.2.

Yes ZK 3.8.2.
I was under the impression, based on the earlier comment from @AlphaCanisMajoris, that the "fix for ZOOKEEPER-4643 cannot fix other issues like ZOOKEEPER-4394" hence why I have been waiting for a new ZK 3.8.x release with this fix merged in.

It's been so long at this point we are debating just patching the fix in and running a custom build. 🫤

@AlphaCanisMajoris
Copy link
Contributor Author

@kezhuw Thanks for your review and approval! By the way, is it possible to invite or assign some developers familiar to ZooKeeper's design logic to review this fix so we can accelerate the merge progress and close this issue before cutting the next release? (Which is strongly desired by multiple developers >..<)

Hi @kezhuw. Sorry to bother you again on the progress of this pr. Is it possible to assign someone to review this pr? Thanks!

// "packetsCommitted" later.
sock.setSoTimeout(self.tickTime * self.syncLimit);
self.setSyncMode(QuorumPeer.SyncMode.NONE);
zk.startupWithoutServing();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this startupWithoutServing anymore. It was introduced in #1445 to process asynchronous log process. And now we process log synchronously.

This should solve problem @changruill raised in #2154 (comment)

@kezhuw
Copy link
Member

kezhuw commented Aug 26, 2024

@AlphaCanisMajoris Sorry for the long waiting. There is no such "assign" since most of us if not all are volunteers. But I think you cloud send a request of review along with brief detail of the proposal to dev mailing list.

@tsuna
Copy link

tsuna commented Sep 18, 2024

Is there a thread on the dev mailing list for this proposal? I don't see one.

@AlphaCanisMajoris
Copy link
Contributor Author

Is there a thread on the dev mailing list for this proposal? I don't see one.

Hi @anmolnar. Is it possible to merge this pr in the upcoming release of version 3.9.3? This critical issue has stayed unresolved for a long time, and it bothers developers a lot.

Copy link
Contributor

@anmolnar anmolnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

@anmolnar
Copy link
Contributor

@kezhuw @AlphaCanisMajoris @tsuna Please close all outstanding PR's which are superseded by this patch.
Also please help organizing outstanding patches to wrap up the release. Thanks!

@@ -784,7 +807,9 @@ protected void syncWithLeader(long newLeaderZxid) throws Exception {
}
ack.setZxid(ZxidUtils.makeZxid(newEpoch, 0));
writePacket(ack, true);
zk.startServing();
sock.setSoTimeout(self.tickTime * self.syncLimit);
self.setSyncMode(QuorumPeer.SyncMode.NONE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlphaCanisMajoris Shall we move this to right after self.setCurrentEpoch(newEpoch) ? I think self.setCurrentEpoch(newEpoch) is the signal that we are synced with new epoch leader and all left was waiting UPTODATE to open the gate. Given that self.getSyncMode() was used to determine whether to fully shutdown (a.k.a. clear memory database tree). I think we could promote this a bit earlier.

@kezhuw kezhuw changed the title ZOOKEEPER-4643, ZOOKEEPER-4646 & ZOOKEEPER-4394: Committed txns lost / NullPointerException in Learner.syncWithLeader() ZOOKEEPER-4394: Apply only committed requests in sync with leader before NEWLEADER ACK Sep 19, 2024
@anmolnar
Copy link
Contributor

Looks like cppunit tests are stuck. I'll merge the PR.

@anmolnar anmolnar merged commit 9fcb805 into apache:master Sep 19, 2024
13 of 14 checks passed
asfgit pushed a commit that referenced this pull request Sep 19, 2024
…ore NEWLEADER ACK

Reviewers: kezhuw, anmolnar, kezhuw
Author: AlphaCanisMajoris
Closes #2152 from AlphaCanisMajoris/ZK-4643

(cherry picked from commit 9fcb805)
Signed-off-by: Andor Molnar <andor@apache.org>
@anmolnar
Copy link
Contributor

Done. Thanks @AlphaCanisMajoris ! What is your jira id?

Please double check the branch-3.9 backport commit, there was a small conflict that I had to resolve.

@AlphaCanisMajoris
Copy link
Contributor Author

AlphaCanisMajoris commented Sep 19, 2024

Done. Thanks @AlphaCanisMajoris ! What is your jira id?

Please double check the branch-3.9 backport commit, there was a small conflict that I had to resolve.

OK I'll have a check.

@AlphaCanisMajoris
Copy link
Contributor Author

Besides, it seems that ZOOKEEPER-4685 has been resolved by this pr.

@anmolnar
Copy link
Contributor

anmolnar commented Sep 19, 2024

Done. Thanks @AlphaCanisMajoris ! What is your jira id?

Please double check the branch-3.9 backport commit, there was a small conflict that I had to resolve.

Reverted, because it broke the build.
https://ci-hadoop.apache.org/view/ZooKeeper/job/zookeeper-multi-branch-build/job/branch-3.9/465/

@AlphaCanisMajoris Would you please create separate pull request for 3.9 branch?

@AlphaCanisMajoris
Copy link
Contributor Author

Done. Thanks @AlphaCanisMajoris ! What is your jira id?
Please double check the branch-3.9 backport commit, there was a small conflict that I had to resolve.

Reverted, because it broke the build. https://ci-hadoop.apache.org/view/ZooKeeper/job/zookeeper-multi-branch-build/job/branch-3.9/465/

@AlphaCanisMajoris Would you please create separate pull request for 3.9 branch?

OK I'll create a new pr for branch-3.9.

anmolnar pushed a commit that referenced this pull request Sep 20, 2024
…ore NEWLEADER ACK (#2188)

Reviewers: kezhuw, anmolnar, kezhuw
Author: AlphaCanisMajoris
Closes #2152 from AlphaCanisMajoris/ZK-4643
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants