ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB (3.4) #157

Closed
wants to merge 5 commits into from

@revans2 revans2 commented Jan 26, 2017

This patch addresses recovery time when a leader is lost on a large DB.

It does this in two ways: it no longer clears the DB before leader election begins, and it avoids taking a snapshot during the SYNC phase when the sync is a DIFF. Instead, the proposals and commits are buffered, just as the code currently does for proposals/commits sent after the NEWLEADER and before the UPTODATE messages.
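
A minimal sketch of the buffering idea described above, not the patch's actual code. The class `DiffSyncSketch` and all of its method names are hypothetical; it only models zxids, and it assumes commits arrive in the same order as their proposals.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified model of a learner handling a DIFF sync.
// None of these names come from the ZooKeeper code base.
final class DiffSyncSketch {
    // Proposals received during sync but not yet committed, in arrival order.
    private final Deque<Long> pendingProposals = new ArrayDeque<>();
    // Zxids committed during sync that still have to reach the txn log.
    private final List<Long> committedDuringSync = new ArrayList<>();
    // Zxids handed to the normal logging path once the sync is over.
    private final List<Long> logged = new ArrayList<>();

    void onProposal(long zxid) {
        pendingProposals.addLast(zxid);
    }

    void onCommit(long zxid) {
        // A commit must match the oldest outstanding proposal.
        Long head = pendingProposals.peekFirst();
        if (head == null || head != zxid) {
            throw new IllegalStateException(
                "commit 0x" + Long.toHexString(zxid) + " does not match head proposal");
        }
        pendingProposals.removeFirst();
        committedDuringSync.add(zxid);
    }

    // Called when the sync finishes: replay the buffered commits through the
    // normal path so they are logged, instead of writing a full snapshot.
    void onUpToDate() {
        logged.addAll(committedDuringSync);
        committedDuringSync.clear();
    }

    List<Long> logged() {
        return logged;
    }

    int pendingCount() {
        return pendingProposals.size();
    }
}
```

In this model nothing is written until `onUpToDate()` runs, so the cost of finishing a sync grows with the size of the diff rather than with the size of the whole DB.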

If a SNAP is sent, we cannot avoid writing out the full snapshot, because there is no other way to make sure the on-disk DB is in sync with what is in memory. Otherwise, any edits written to the edit log before a background snapshot happened could be applied on top of an incorrect snapshot.

The same optimization should work for TRUNC too, but I opted not to do it: TRUNC is rare, and by its very nature it already forces the DB to be reread after the edit logs are modified, so it would still not be fast.
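
The per-sync-type decision described in the paragraphs above can be sketched as follows. This is an illustration under stated assumptions: the `SyncType` enum and both method names are hypothetical, and the only rule taken from the patch is that the "write to log" flag is the negation of "snapshot needed".

```java
// Hypothetical sketch; the enum and method names are illustrative only.
final class SnapshotPolicySketch {
    enum SyncType { DIFF, SNAP, TRUNC }

    // Returns true when the learner should write a full snapshot during sync.
    static boolean snapshotNeeded(SyncType type) {
        switch (type) {
            case DIFF:
                // Transactions apply on top of the existing on-disk snapshot.
                return false;
            case SNAP:
                // Memory was replaced wholesale, so disk must be brought in line.
                return true;
            case TRUNC:
                // Deliberately left unoptimized: the DB is reread anyway.
                return true;
            default:
                throw new IllegalArgumentException("unknown sync type: " + type);
        }
    }

    // When no snapshot is taken, incoming transactions are buffered for the
    // log rather than applied directly in memory.
    static boolean writeToLog(SyncType type) {
        return !snapshotNeeded(type);
    }
}
```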

In practice, instead of taking 5+ minutes for the cluster to recover from losing a leader, it now takes about 3 seconds.

I am happy to port this to 3.5 if it looks good.

Contributor

@fpj fpj left a comment

I spent some time looking at this change and I don't see a problem; it looks good. I have one minor change request and one comment about tests. It would be good to check that our existing test cases cover enough, and to add more if they do not.

boolean snapshotTaken = false;
boolean isPreZAB1_0 = true;
//If we are not going to take the snapshot be sure the edits are not applied in memory
boolean writeToEditLog = !snapshotNeeded;
Contributor

The changes here are using edit to refer to txns. I'd rather use txn to be consistent across the project. Specifically here, you're using EditLog to refer to the TxnLog, please change accordingly to have it consistent across the project.

Author

Sorry about that, I am still a bit new to the internal terminology. I will update it.

@@ -839,6 +839,13 @@ public void converseWithFollower(InputArchive ia, OutputArchive oa,
Assert.assertEquals(1, f.self.getAcceptedEpoch());
Assert.assertEquals(1, f.self.getCurrentEpoch());

//Wait for the edits to be written out
Contributor

I need to think some more about whether it makes any sense to add test cases for this. The test cases we already have probably cover this well enough, given that there is no real change of behavior.

This change here is necessary, though. We don't really care about time in general in our tests because we can never be sure of the timing we will get across runs and with different settings.

Contributor

fpj commented Jan 28, 2017

Thanks for the patch @revans2 . It makes sense to port this change to both 3.5 and master.

Author

revans2 commented Jan 30, 2017

@fpj Thanks for the review. I will update the comments and start porting it to the other lines.

revans2 changed the title from "ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB" to "ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB (3.4)" on Jan 30, 2017

Author

revans2 commented Jan 30, 2017

@fpj I addressed your review comments, and I also found another race in the ZAB test that I addressed. Apparently the log line I removed was taking enough time for the transactions to be fully flushed. When I removed it the test would occasionally fail.
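
A race like this is usually handled by polling for the expected state with an upper bound instead of relying on incidental delays. The helper below is a minimal sketch under that assumption; `WaitSketch` and `waitFor` are hypothetical names and are not part of the ZooKeeper test code.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper; not part of the ZooKeeper test utilities.
final class WaitSketch {
    // Polls until the condition holds or the timeout elapses.
    // Returns the last observed value of the condition.
    static boolean waitFor(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            // Overflow-safe comparison against the deadline.
            if (System.nanoTime() - deadline >= 0) {
                return condition.getAsBoolean();
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return condition.getAsBoolean();
            }
        }
        return true;
    }
}
```

A test would then assert on the returned value, for example waiting until the last processed zxid equals the expected one, so that a slow run fails with a clear message instead of racing.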

Please also take a look at #158 and #159 for master and branch 3.5.

Contributor

eribeiro commented Jan 30, 2017

Hey @revans2, FYI: I was able to apply both #158 and #159 without any explicit conflict on branch-3.5 and master, and #157 on branch-3.4 (but not on the others cited previously). So you really don't need 3 PRs, just 2: one for branch-3.4 and the other for branch-3.5 and master, right?

If I am right then putting a comment to apply either #158 or #159 (up to you) to both branch-3.5 and master should be enough, IMHO.

Author

revans2 commented Jan 30, 2017

@eribeiro Will do. It is a clean cherry-pick between master and 3.5.

Contributor

@afine afine left a comment

The change made sense to me, although I think it would be great to test that we are only snapshotting at the appropriate times.

//Wait for the transactions to be written out. The thread that writes them out
// does not send anything back when it is done.
long start = System.currentTimeMillis();
while (createSessionZxid != f.fzk.getLastProcessedZxid() && (System.currentTimeMillis() - start) < 50) {
Contributor

@afine afine Jan 31, 2017

Just an idea; I am not sure if it is worth the effort, and it may be outside the scope of this patch.

We could play with the test infrastructure here a little bit and do some dependency injection in createFollower that lets us track whether db clearing and snapshotting occur when expected.

Author

It does seem a bit beyond the scope of this. But if you really want me to I can look into it.

Contributor

It would be nice; it is a good way of actually validating that everything is behaving as expected.

Author

OK, I will see what I can do.

@@ -321,13 +321,16 @@ protected void syncWithLeader(long newLeaderZxid) throws IOException, Interrupte
QuorumPacket ack = new QuorumPacket(Leader.ACK, 0, null, null);
QuorumPacket qp = new QuorumPacket();
long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);

readPacket(qp);
//In the DIFF case we don't need to do a snapshot because the transactions will sync on top of any existing snapshot
Contributor

nit: I think we generally put spaces after the //

// * When a new quorum is established we can still apply the diff
// on top of the same zkDb data
// * If we fetch a new snapshot from leader, the zkDb will be
// cleared anyway before loading the snapshot

Contributor

There is one case where we may still want to clear the db here: when one of the ZooKeeper critical threads (such as * processors, session trackers) fails, the ZooKeeper server will shut down (see runFromConfig) and consequently invoke ZooKeeper#shutdown. In this case I don't see a particular reason not to clear the db. Not doing it might be fine (one could argue the server will be dead anyway), but I lean towards the safe side and would clean the db. One way to do that conditionally is to add a Boolean parameter to ZooKeeper#shutdown, so we have fine-grained control over when to clear the db in each code path.
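
A minimal sketch of the suggestion above, assuming a Boolean flag on the shutdown method. `ShutdownSketch` is a hypothetical stand-in for a server that holds an in-memory database, and the parameter name `clearDb` is an assumption made for illustration; neither is taken from the ZooKeeper code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a server holding an in-memory database.
final class ShutdownSketch {
    private final Map<String, String> db = new HashMap<>();
    private boolean running = true;

    void put(String path, String data) {
        db.put(path, data);
    }

    int dbSize() {
        return db.size();
    }

    boolean isRunning() {
        return running;
    }

    // Default path, e.g. before a new leader election: keep the in-memory
    // data so a later DIFF sync can be applied on top of it.
    void shutdown() {
        shutdown(false);
    }

    // clearDb = true is meant for paths where the process is going away,
    // e.g. after a critical thread has failed.
    void shutdown(boolean clearDb) {
        running = false;
        if (clearDb) {
            db.clear();
        }
    }
}
```

Keeping the no-argument overload as the non-clearing path means existing callers get the faster recovery behavior, while the callers that want a clean slate have to ask for it explicitly.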

Author

revans2 commented Jan 31, 2017

@hanm I addressed your review comments.

Contributor

hanm commented Feb 1, 2017

@revans2 the change looks good, thanks.

Contributor

hanm commented Feb 5, 2017

Had another look at the patch, specifically the changes on Learner. Looks good to me. +1.

Author

revans2 commented Feb 6, 2017

@afine I updated the test to spy on the LearnerZooKeeperServer instance and check if and when takeSnapshot was called. I think this fulfills your desires, but with tests there can always be more. So, if you do want more tests or other areas covered please let me know.
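
The idea of spying on the server to see if and when a snapshot is taken can be illustrated with a hand-rolled spy. This is a sketch only: `FakeLearnerServer`, `SpyingServer`, and the string-based `sync` method are hypothetical and do not reflect the real LearnerZooKeeperServer API or the mocking library used by the actual test.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotSpySketch {
    // Hypothetical stand-in for the server under test.
    static class FakeLearnerServer {
        void takeSnapshot() {
            // A real server would write its state to disk here.
        }

        void sync(String syncType) {
            // Only a full SNAP sync triggers a snapshot in this model.
            if ("SNAP".equals(syncType)) {
                takeSnapshot();
            }
        }
    }

    // Records every call in order, then delegates to the real behavior.
    static class SpyingServer extends FakeLearnerServer {
        final List<String> events = new ArrayList<>();

        @Override
        void takeSnapshot() {
            events.add("takeSnapshot");
            super.takeSnapshot();
        }

        @Override
        void sync(String syncType) {
            events.add("sync:" + syncType);
            super.sync(syncType);
        }
    }
}
```

Recording an ordered event list, rather than a simple counter, lets a test assert both that no snapshot happened during a DIFF and that a snapshot during a SNAP happened at the expected point.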

Contributor

afine commented Feb 6, 2017

Thanks @revans2

+1 lgtm

Author

revans2 commented Feb 9, 2017

Is there any more I need to do to get this merged in?

Contributor

hanm commented Feb 9, 2017

@revans2 No more work is required, the patch is ready, but I am not sure if this should be included in the upcoming 3.4.10 release. If not we will wait until 3.4.10 is out to merge this into branch-3.4. @rakeshadr Do you think this should be included in 3.4.10?

The PR to master #159 could be merged in, I'll have another look and merge it today.

asfgit pushed a commit that referenced this pull request Feb 11, 2017
… DBs (master)

This is the master version of #157

Author: Robert (Bobby) Evans <evans@yahoo-inc.com>

Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org>

Closes #159 from revans2/ZOOKEEPER-2678-master and squashes the following commits:

69fbe19 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments
a432642 [Robert (Bobby) Evans] ZOOKEEPER-2678:  Improved test to verify snapshot times
742367e [Robert (Bobby) Evans] Addressed review comments
f4c5b0e [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs
@rakeshadr
Contributor

@hanm, mostly it makes sense to me, as you all have done detailed code reviews and there are no open comments now. I hope the code path has been thoroughly discussed and tested? I'd like to freeze the code changes as soon as possible.

Author

revans2 commented Feb 13, 2017

@rakeshadr If it makes you feel any better we have been running with an older version of this patch in production for a while. We have used it as part of a rolling upgrade at least 10 times in production where if it were not there we would have had some very painful outages.

I have also manually tested it at least 50 times shooting the leader under load (10,000 operations/second) on a 3.4 GB DB, watching it recover, and then validating the integrity of the DB to be sure we didn't get any corruption.

@rakeshadr
Contributor

Thanks @revans2 for the useful results. I spent some time going through the changes and they look good to me. @hanm, please go ahead with merging this to branch-3.4.

asfgit pushed a commit that referenced this pull request Feb 14, 2017
… DB (3.4)

Author: Robert (Bobby) Evans <evans@yahoo-inc.com>

Reviewers: Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Abraham Fine <afine@apache.org>, Michael Han <hanm@apache.org>

Closes #157 from revans2/ZOOKEEPER-2678 and squashes the following commits:

d079617 [Robert (Bobby) Evans] ZOOKEEPER-2678:  Improved test to verify snapshot times
dcbf325 [Robert (Bobby) Evans] Addressed review comments
f57c384 [Robert (Bobby) Evans] ZOOKEEPER-2678: Fixed another race
f705293 [Robert (Bobby) Evans] ZOOKEEPER-2678: Addressed review comments
5aa2562 [Robert (Bobby) Evans] ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs
asfgit pushed a commit that referenced this pull request Feb 16, 2017
… DBs (master)
Contributor

hanm commented Apr 18, 2017

@revans2 Please close this pull request; it's merged.

@revans2 revans2 closed this Apr 18, 2017
lvfangmin pushed a commit to lvfangmin/zookeeper that referenced this pull request Jun 17, 2018
… DBs (master)
fregatte123 commented Aug 15, 2020

Suppose the following situation.

zxid_n is the largest zxid of Participant A (which has just come back from downtime), and zxid_n has not been acknowledged by a quorum. Suppose Participant A is elected leader and a follower then synchronizes with it using DIFF. After sending UPTODATE, the leader can already serve clients, but at this point zxid_n has not yet been persisted by a quorum of followers. Suppose a client connects to the leader and sees zxid_n, and then both the leader and the follower go down. For some reason the leader cannot be restarted, while the follower starts normally, so a new leader can only be elected from the followers. Because the data the follower received through the DIFF sync was still only in memory and had not yet been persisted, the newly elected leader does not have zxid_n. Since zxid_n was already seen by a client on the old leader, the data view becomes inconsistent.

Is the above situation possible?

@anmolnar
Contributor

@fregatte123 You should post this question to the dev@ or user@ list to reach an audience. I'm not sure how many devs are monitoring closed PRs; probably not many.

RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022
… DBs (master)