ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB (3.4) #157

revans2 · 2017-01-26T15:23:01Z

This patch addresses recovery time when a leader is lost on a large DB.

It does this by not clearing the DB before leader election begins and by not taking a snapshot during the SYNC phase when the sync is a DIFF. To skip the snapshot, it buffers the proposals and commits, just as the code currently does for proposals/commits sent after the NEWLEADER and before the UPTODATE messages.
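A minimal sketch of the buffering idea described above, assuming a toy model in which a transaction is just its zxid. `BufferedSyncSketch` and its method names are hypothetical and are not taken from `Learner.java`; the sketch only shows that nothing is applied until UPTODATE arrives.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified stand-in for a learner's sync loop (not the real Learner class).
class BufferedSyncSketch {
    // Proposals received during sync that have not been committed yet.
    private final Deque<Long> pendingProposals = new ArrayDeque<>();
    // Zxids whose COMMIT arrived during sync, in arrival order.
    private final List<Long> committed = new ArrayList<>();
    // Zxids applied to the in-memory state once the sync is finished.
    private final List<Long> applied = new ArrayList<>();

    // Buffer a proposal instead of applying it right away.
    void onProposal(long zxid) {
        pendingProposals.addLast(zxid);
    }

    // A commit must match the oldest buffered proposal; it is buffered as well.
    void onCommit(long zxid) {
        Long head = pendingProposals.peekFirst();
        if (head == null || head != zxid) {
            throw new IllegalStateException(
                "commit 0x" + Long.toHexString(zxid) + " does not match the oldest pending proposal");
        }
        pendingProposals.removeFirst();
        committed.add(zxid);
    }

    // On UPTODATE, apply everything that was committed while syncing.
    void onUpToDate() {
        applied.addAll(committed);
        committed.clear();
    }

    List<Long> applied() {
        return new ArrayList<>(applied);
    }

    int pendingCount() {
        return pendingProposals.size();
    }
}
```

In this sketch a proposal that was never committed simply stays in the pending queue after `onUpToDate()`.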

If a SNAP is sent, we cannot avoid writing out the full snapshot because there is no other way to make sure the DB on disk is in sync with what is in memory. Otherwise, any edits written to the edit log before a background snapshot happened could be applied on top of an incorrect snapshot.

This same optimization should work for TRUNC too, but I opted not to do it for TRUNC because TRUNC is rare and TRUNC by its very nature already forces the DB to be reread after the edit logs are modified. So it would still not be fast.
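The choice described above (skip the snapshot only for DIFF, keep it for SNAP and TRUNC) can be sketched as follows. The enum and class names are hypothetical; the patch itself works with the packet types in `Learner.java`, and only the variable names `snapshotNeeded` and `writeToEditLog` appear in the diff under review.

```java
// Hypothetical names for the three sync modes discussed in this PR.
enum SyncType { DIFF, SNAP, TRUNC }

class SyncPlanSketch {
    // Only a DIFF sync can skip the snapshot: SNAP replaces the in-memory DB,
    // and TRUNC was deliberately left on the existing (slower) path.
    static boolean snapshotNeeded(SyncType type) {
        return type != SyncType.DIFF;
    }

    // When no snapshot will be taken, the buffered transactions have to go
    // to the log so that disk and memory stay consistent.
    static boolean writeToLog(SyncType type) {
        return !snapshotNeeded(type);
    }
}
```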

In practice, instead of taking 5+ minutes for the cluster to recover from losing a leader, it now takes about 3 seconds.

I am happy to port this to 3.5 if it looks good.

fpj

I spent some time looking at this change and I don't see a problem; it looks good. I have one minor change request and one comment about tests. It would be good to check that our test cases already cover enough and, if not, to add more test cases.

fpj · 2017-01-28T18:15:06Z

src/java/main/org/apache/zookeeper/server/quorum/Learner.java

-            boolean snapshotTaken = false;
+            boolean isPreZAB1_0 = true;
+            //If we are not going to take the snapshot be sure the edits are not applied in memory
+            boolean writeToEditLog = !snapshotNeeded;


The changes here use "edit" to refer to txns. I'd rather use "txn" to be consistent across the project. Specifically, here you're using EditLog to refer to the TxnLog; please change it accordingly.

Sorry about that; I am still a bit new to the internal terminology. I will update it.

fpj · 2017-01-28T19:13:58Z

src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java

@@ -839,6 +839,13 @@ public void converseWithFollower(InputArchive ia, OutputArchive oa,
                    Assert.assertEquals(1, f.self.getAcceptedEpoch());
                    Assert.assertEquals(1, f.self.getCurrentEpoch());

+                    //Wait for the edits to be written out


I need to think some more about whether it makes sense to add test cases for this. The test cases we already have probably cover this well enough, given that there is no real change of behavior.

This change here is necessary, though. We don't really care about time in general in our tests because we can never be sure of the timing we will get across runs and with different settings.
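Since the writer thread does not signal when it is done, the test has to poll with a deadline. A minimal sketch of such a bounded wait, assuming a hypothetical helper `WaitSketch.waitFor` (the actual test inlines a loop on `getLastProcessedZxid()` instead):

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper; not part of the ZooKeeper test code.
class WaitSketch {
    // Polls the condition until it holds or the timeout expires.
    // Returns true if the condition became true in time, false otherwise.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the caller decides whether this is a failure
            }
            try {
                Thread.sleep(1); // short pause so the loop does not spin
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

The assertion that follows the wait still decides pass or fail; the timeout only bounds how long the test is willing to wait.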

fpj · 2017-01-28T19:16:43Z

Thanks for the patch @revans2 . It makes sense to port this change to both 3.5 and master.

revans2 · 2017-01-30T14:27:49Z

@fpj Thanks for the review. I will update the comments and start porting it to other lines.

revans2 · 2017-01-30T16:07:45Z

@fpj I addressed your review comments, and I also found another race in the ZAB test that I addressed. Apparently the log line I removed was taking enough time for the transactions to be fully flushed. When I removed it the test would occasionally fail.

Please also take a look at #158 and #159 for master and branch 3.5.

eribeiro · 2017-01-30T16:58:48Z

Hey @revans2, FYI: I was able to apply both #158 and #159 without any explicit conflict on branch-3.5 and master, and #157 on branch-3.4 (but not on the others cited previously). So you really don't need 3 PRs, just 2: one for branch-3.4 and the other for branch-3.5 and master, right?

If I am right then putting a comment to apply either #158 or #159 (up to you) to both branch-3.5 and master should be enough, IMHO.

revans2 · 2017-01-30T18:31:04Z

@eribeiro Will do. It is a clean cherry-pick between master and 3.5.

afine

The change made sense to me, although I think it would be great to test that we are only snapshotting at the appropriate times.

afine · 2017-01-31T00:20:25Z

src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java

+                    //Wait for the transactions to be written out. The thread that writes them out
+                    // does not send anything back when it is done.
+                    long start = System.currentTimeMillis();
+                    while (createSessionZxid != f.fzk.getLastProcessedZxid() && (System.currentTimeMillis() - start) < 50) {


Just an idea; I'm not sure if it is worth the effort, and it may be outside the scope of this patch.

We could play with the test infrastructure here a little bit and do some dependency injection in createFollower that lets us track whether DB clearing and snapshotting occur when expected.

It does seem a bit beyond the scope of this. But if you really want me to, I can look into it.

It would be nice, and a good way of actually validating that everything is behaving as expected.

OK, will see what I can do.

afine · 2017-01-31T00:20:48Z

src/java/main/org/apache/zookeeper/server/quorum/Learner.java

@@ -321,13 +321,16 @@ protected void syncWithLeader(long newLeaderZxid) throws IOException, Interrupte
        QuorumPacket ack = new QuorumPacket(Leader.ACK, 0, null, null);
        QuorumPacket qp = new QuorumPacket();
        long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);
-
-        readPacket(qp);   
+        //In the DIFF case we don't need to do a snapshot because the transactions will sync on top of any existing snapshot


nit: I think we generally put spaces after the //

hanm · 2017-01-31T18:18:41Z

src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java

+        //  * When a new quorum is established we can still apply the diff
+        //    on top of the same zkDb data
+        //  * If we fetch a new snapshot from leader, the zkDb will be
+        //    cleared anyway before loading the snapshot



There is one case where we may still want to clear the db here: when one of the ZooKeeper critical threads (such as the * processors or session trackers) fails, the ZooKeeper server will shut down (see runFromConfig) and consequently invoke ZooKeeper#shutdown. In this case, I don't see a particular reason not to clear the db. Not doing it might be fine (one could argue the server will be dead anyway), but I lean towards the safe side and would clear the db. One way to do that conditionally is to add a Boolean parameter to ZooKeeper#shutdown so we have fine-grained control over which code paths clear the db.
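The suggestion above can be sketched as a shutdown overload with a flag. `ServerSketch` and the parameter name `clearDb` are hypothetical and only illustrate the shape of the idea, not the code that was eventually merged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy server; the map stands in for the in-memory database.
class ServerSketch {
    private final Map<String, String> db = new HashMap<>();
    private boolean running = true;

    void put(String path, String data) {
        db.put(path, data);
    }

    // Default path (e.g. before a new leader election): keep the data in memory
    // so a later DIFF sync can be applied on top of it.
    void shutdown() {
        shutdown(false);
    }

    // Callers that are tearing the server down for good pass true to clear the db.
    void shutdown(boolean clearDb) {
        running = false;
        if (clearDb) {
            db.clear();
        }
    }

    int dbSize() {
        return db.size();
    }

    boolean isRunning() {
        return running;
    }
}
```

Keeping the no-argument overload means existing callers retain the fast path, while the critical-thread failure path can opt in to clearing.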

revans2 · 2017-01-31T20:50:39Z

@hanm I addressed your review comments.

hanm · 2017-02-01T06:18:10Z

@revans2 the change looks good, thanks.

hanm · 2017-02-05T06:03:22Z

Had another look at the patch, specifically the changes on Learner. Looks good to me. +1.

revans2 · 2017-02-06T15:39:42Z

@afine I updated the test to spy on the LearnerZooKeeperServer instance and check if and when takeSnapshot was called. I think this fulfills your desires, but with tests there can always be more. So, if you do want more tests or other areas covered please let me know.
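A rough illustration of the kind of check described above, using a hand-rolled counting subclass in place of a mocking-library spy so the example stays self-contained. `LearnerServerSketch`, `CountingServer`, and the `sync(boolean)` method are hypothetical; only the method name `takeSnapshot` comes from the comment.

```java
// Hypothetical stand-in for the server under test.
class LearnerServerSketch {
    // In a real server this would write the whole DB to disk.
    void takeSnapshot() {
    }

    // Simplified sync: only a non-DIFF sync takes a snapshot.
    void sync(boolean isDiff) {
        if (!isDiff) {
            takeSnapshot();
        }
    }
}

// Counts calls to takeSnapshot so a test can assert if and when it ran.
class CountingServer extends LearnerServerSketch {
    int snapshotCalls = 0;

    @Override
    void takeSnapshot() {
        snapshotCalls++;
        super.takeSnapshot();
    }
}
```

A test would then drive a DIFF sync and assert that `snapshotCalls` is still zero, and drive a SNAP sync and assert that it is one.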

afine · 2017-02-06T21:44:29Z

Thanks @revans2

+1 lgtm

revans2 · 2017-02-09T15:35:38Z

Is there any more I need to do to get this merged in?

hanm · 2017-02-09T18:53:18Z

@revans2 No more work is required, the patch is ready, but I am not sure if this should be included in the upcoming 3.4.10 release. If not we will wait until 3.4.10 is out to merge this into branch-3.4. @rakeshadr Do you think this should be included in 3.4.10?

The PR to master #159 could be merged in, I'll have another look and merge it today.

… DBs (master)

This is the master version of #157

Author: Robert (Bobby) Evans <evans@yahoo-inc.com>

Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org>

Closes #159 from revans2/ZOOKEEPER-2678-master and squashes the following commits:

69fbe19 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments
a432642 [Robert (Bobby) Evans] ZOOKEEPER-2678: Improved test to verify snapshot times
742367e [Robert (Bobby) Evans] Addressed review comments
f4c5b0e [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs

rakeshadr · 2017-02-13T02:25:59Z

@hanm, mostly it makes sense to me, as you all have done detailed code reviews and there are no open comments now. I hope the code path is thoroughly discussed/tested? I'd like to freeze the code changes asap.

revans2 · 2017-02-13T14:55:45Z

@rakeshadr If it makes you feel any better we have been running with an older version of this patch in production for a while. We have used it as part of a rolling upgrade at least 10 times in production where if it were not there we would have had some very painful outages.

I have also manually tested it at least 50 times shooting the leader under load (10,000 operations/second) on a 3.4 GB DB, watching it recover, and then validating the integrity of the DB to be sure we didn't get any corruption.

rakeshadr · 2017-02-14T16:54:59Z

Thanks @revans2 for the useful results. I spent some time going through the changes, and they look good to me. @hanm, please go ahead with merging this to branch-3.4.

… DB (3.4)

This patch addresses recovery time when a leader is lost on a large DB.

It does this by not clearing the DB before leader election begins, and by avoiding taking a snapshot as part of the SYNC phase, specifically for a DIFF sync. It does this by buffering the proposals and commits just like the code currently does for proposals/commits sent after the NEWLEADER and before the UPTODATE messages.

If a SNAP is sent we cannot avoid writing out the full snapshot because there is no other way to make sure the disk DB is in sync with what is in memory. So any edits to the edit log before a background snapshot happened could possibly be applied on top of an incorrect snapshot.

This same optimization should work for TRUNC too, but I opted not to do it for TRUNC because TRUNC is rare and TRUNC by its very nature already forces the DB to be reread after the edit logs are modified. So it would still not be fast.

In practice this makes it so instead of taking 5+ mins for the cluster to recover from losing a leader it now takes about 3 seconds.

I am happy to port this to 3.5. if it looks good.

Author: Robert (Bobby) Evans <evans@yahoo-inc.com>

Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org>

Closes #157 from revans2/ZOOKEEPER-2678 and squashes the following commits:

d079617 [Robert (Bobby) Evans] ZOOKEEPER-2678: Improved test to verify snapshot times
dcbf325 [Robert (Bobby) Evans] Addressed review comments
f57c384 [Robert (Bobby) Evans] ZOOKEEPER-2678: Fixed another race
f705293 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments
5aa2562 [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs

hanm · 2017-04-18T00:04:48Z

@revans2 Please close this pull request; it's merged.

fregatte123 · 2020-08-15T08:06:44Z

Suppose the following situation. zxid_n is the largest zxid on Participant A, which has just come back from downtime, and zxid_n has not been acknowledged by a quorum. Assume Participant A is elected leader and a follower synchronizes with it using DIFF. After sending UPTODATE, the leader can already serve clients, but at this point zxid_n has still not been persisted by a quorum of followers.

Now suppose a client connects to the leader and sees zxid_n, and then both the leader and the follower go down. For some reason the leader cannot be restarted, but the follower starts normally, so a new leader can only be elected from the followers. Because the data the follower received through the DIFF sync was still only in memory and had not yet been persisted, the newly elected leader does not have zxid_n. But zxid_n was already seen by a client on the old leader, so the data view becomes inconsistent.

Is the above situation possible?

anmolnar · 2020-08-22T15:41:51Z

@fregatte123 You should post this question to the dev@ or user@ list to reach a wider audience. I'm not sure how many devs are monitoring closed PRs; probably not many.

ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs

5aa2562

fpj requested changes Jan 28, 2017

Robert (Bobby) Evans added 2 commits January 30, 2017 09:41

ZOOKEEPER-2678: Addressed review comments

f705293

ZOOKEEPER-2678: Fixed another race

f57c384

This was referenced Jan 30, 2017

ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs (3.5) #158

Closed

ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs (master) #159

Closed

revans2 changed the title ~~ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB~~ ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB (3.4) Jan 30, 2017

afine reviewed Jan 31, 2017

hanm reviewed Jan 31, 2017

Addressed review comments

dcbf325

ZOOKEEPER-2678: Improved test to verify snapshot times

d079617

merlimat mentioned this pull request Mar 5, 2017

Support multiple zookeeper quorum to store cluster-management-configu… apache/pulsar#196

Closed

revans2 closed this Apr 18, 2017

kezhuw mentioned this pull request Apr 5, 2022

ZOOKEEPER-3023: Sync and commit diff log entries before NEWLEADER ack #1848

Closed

jonmv mentioned this pull request Apr 4, 2024

ZOOKEEPER-4712: Fix partially shutdown of ZooKeeperServer and its processors #2154

Merged
