
ZOOKEEPER-4712: Fix partially shutdown of ZooKeeperServer and its processors #2154

Merged on Sep 20, 2024 (9 commits)

Conversation

jonmv
Contributor

@jonmv jonmv commented Apr 4, 2024

This PR fixes the shutdown errors that were added in #157, and also avoids a common NPE during ZK shutdown from a learner, when the leader shuts down (first commit).
Together with #2111 and #2152, this should cover all the fixes in #1925.

We've had the forked ZK from #1925 running embedded in hundreds, if not thousands, of ZK clusters, with rolling restarts most days, and we've had zero cases of inconsistent data since we patched—one or a few cases per week before that.
(We still sometimes see ephemeral nodes remain after the leader is brutally taken down, i.e., with Runtime.halt(), but this looks different; it seems clearing out client sessions, and their ephemeral nodes, simply isn't done when death is too sudden.)

@AlphaCanisMajoris
Contributor

LGTM.

Together with #2111 and #2152, this should cover all the fixes in #1925.

Yes I believe so.

BTW, could you please recommit this PR, since it wasn't built successfully?

@tsuna

tsuna commented Jun 13, 2024

Will this fix be backported to the 3.8 train? We just hit this bug on one of our clusters. It's a shame that various fixes have been up for review for over 20 months; I would appreciate you guys pushing this over the finish line so this critical issue can be closed. Thanks!

@changruill

One more thing: the resources (threads, processors, ...) created in startupWithServerState(State.INITIAL) won't be released on shutdown, because canShutdown does not include the condition state == State.INITIAL. This leak would occur before ZooKeeperServer.state changes to State.RUNNING (that is, before the follower reads an UPTODATE packet).
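
The leak described above can be illustrated with a minimal sketch. Everything here is hypothetical: LifecycleSketch, its resource list, and the allowShutdownFromInitial switch are stand-ins invented for illustration, and the guard conditions other than the INITIAL one are an assumption rather than a copy of the real ZooKeeperServer.canShutdown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a server lifecycle; NOT the real ZooKeeperServer.
class LifecycleSketch {
    enum State { INITIAL, RUNNING, ERROR, SHUTDOWN }

    private State state = State.INITIAL;
    private final List<String> liveResources = new ArrayList<>();
    private final boolean allowShutdownFromInitial;

    LifecycleSketch(boolean allowShutdownFromInitial) {
        this.allowShutdownFromInitial = allowShutdownFromInitial;
    }

    // Resources are created while the state is still INITIAL.
    void startup() {
        liveResources.add("request-processor");
        liveResources.add("session-tracker");
    }

    void markRunning() {
        state = State.RUNNING;
    }

    // Assumed guard: without the INITIAL clause, a shutdown issued
    // between startup() and markRunning() is silently rejected.
    private boolean canShutdown() {
        return state == State.RUNNING
            || state == State.ERROR
            || (allowShutdownFromInitial && state == State.INITIAL);
    }

    void shutdown() {
        if (!canShutdown()) {
            return; // nothing is released on this path
        }
        liveResources.clear();
        state = State.SHUTDOWN;
    }

    int liveResourceCount() {
        return liveResources.size();
    }
}
```

With allowShutdownFromInitial set to false, calling startup() and then shutdown() leaves both resources alive; with it set to true, the same sequence releases them.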

@tsuna

tsuna commented Jun 19, 2024

This is a pretty small, targeted fix now, is there anything controversial about it or would it be possible to merge it and cut a release soon?
The change needs to be rebased.

@jonmv jonmv force-pushed the jonmv/ZOOKEEPER-4541-take-2 branch from 8085a35 to 5d7aa33 Compare August 12, 2024 12:47
@jonmv
Contributor Author

jonmv commented Aug 12, 2024

I reworked the Learner-socket-close code to be a bit less prone to misuse, while handling a conflict with the first commit, just now.

@kezhuw kezhuw self-requested a review August 12, 2024 13:36
@jonmv
Contributor Author

jonmv commented Aug 28, 2024

d2ee4dd fixes the last comment by @changruill.

@tsuna

tsuna commented Sep 18, 2024

Hello, would it be possible to move forward with this PR and merge it and cut a release? Are there any outstanding concerns or objections?

Contributor

@anmolnar anmolnar left a comment

lgtm.

Comment on lines +904 to +906
if (sock == null) { // Closing before establishing the connection is a noop
    return;
}
Contributor

nit: Why have you moved the null check inside the lock?

Contributor Author

Close could be called from different threads, and sockBeingClosed ensures memory visibility for sock, as it's set after sock is assigned in connectToLeader.
The only other thread I can see that closes the learner, right now, is the sync processor, which is initialised after sock is assigned, so it works as-was, but I still prefer to be explicit about this.
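
The visibility argument above can be sketched roughly as follows. This is a simplified illustration, not the actual Learner code: LearnerSocketSketch, connect, and the boolean return value are hypothetical, and a Closeable stands in for the socket. The point is only that a plain field written before an atomic write is visible to any thread whose atomic read or compare-and-set follows that write.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names mirror the discussion but this is not the real Learner.
class LearnerSocketSketch {
    private Closeable sock; // plain (non-volatile) field
    private final AtomicBoolean sockBeingClosed = new AtomicBoolean(false);

    void connect(Closeable newSock) {
        sock = newSock;
        // Atomic write AFTER the assignment: publishes sock to threads
        // that subsequently read or compare-and-set the flag.
        sockBeingClosed.set(false);
    }

    // Returns true only if this call actually closed the socket.
    boolean closeSocket() {
        // Acts as the guard: at most one caller gets past this point.
        if (!sockBeingClosed.compareAndSet(false, true)) {
            return false; // another close is in progress or already done
        }
        // Null check sits behind the guard, so it reads sock after the atomic access.
        if (sock == null) {
            return false; // closing before connecting is a no-op
        }
        try {
            sock.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }
}
```

In this sketch a close issued before connect simply does nothing, and connect re-arms the flag, so a later close still works.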

@anmolnar
Contributor

@jonmv @tsuna Please close any outstanding PR/Jira ticket that you think is already superseded by something else.
Also please follow the e-mail thread on the dev list which tries to wrap up what's needed for the release. Thanks.

@kezhuw kezhuw changed the title from "Jonmv/zookeeper 4541 take 2" to "ZOOKEEPER-4712: Fix partially shutdown of ZooKeeperServer and its processors" Sep 19, 2024
Member

@kezhuw kezhuw left a comment

+1 in general.

I left inline one comment about the overriding method.

@@ -46,7 +48,7 @@ public void processRequest(Request si) {
learner.writePacket(qp, false);
} catch (IOException e) {
LOG.warn("Closing connection to leader, exception during packet send", e);
- learner.closeSockSync();
+ learner.closeSocket();
Member

+1 on this.

This is the leftover of ZOOKEEPER-4409. We should respect zookeeper.learner.closeSocketAsync here.
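
As a rough sketch of what respecting that setting means: a single close entry point that runs the close inline or hands it to another thread, depending on a flag. The class, method names, and use of CompletableFuture are hypothetical, and reading the property with Boolean.getBoolean is an assumption; ZooKeeper's actual property handling may differ.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; not the real Learner.closeSocket implementation.
class CloseDispatchSketch {
    // Property name taken from the comment above.
    static final String CLOSE_ASYNC_PROPERTY = "zookeeper.learner.closeSocketAsync";

    // Runs closeAction on the calling thread, or on a pool thread when async is true.
    static CompletableFuture<Void> closeSocket(Runnable closeAction, boolean async) {
        if (async) {
            return CompletableFuture.runAsync(closeAction);
        }
        closeAction.run();
        return CompletableFuture.completedFuture(null);
    }

    // Convenience overload: reads the flag from the system property (assumed mechanism).
    static CompletableFuture<Void> closeSocket(Runnable closeAction) {
        return closeSocket(closeAction, Boolean.getBoolean(CLOSE_ASYNC_PROPERTY));
    }
}
```

Callers such as the request processor in the diff above would then always go through closeSocket, rather than forcing the synchronous path.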

@kezhuw
Member

kezhuw commented Sep 19, 2024

Hi @jonmv, I added a test case for this PR in jonmv#1. Could you please take a look?

jonmv and others added 2 commits September 20, 2024 08:18
ZOOKEEPER-4712: Add test case to assert request processors got shutdown in ZooKeeperServer::shutdown
@@ -97,7 +97,7 @@ public Boolean answer(InvocationOnMock invocation) throws Throwable {
}
});

- ZKDatabase database = new ZKDatabase(null);
+ ZKDatabase database = new ZKDatabase(mock(FileTxnSnapLog.class));
Contributor Author

Otherwise: NPE during fast-forward during shutdown.

@jonmv
Contributor Author

jonmv commented Sep 20, 2024

A seemingly unrelated unit test failed in the PR jenkins job: it sleeps for a while, while waiting for some state that wasn't reached within timeout; it doesn't depend on shutdown logic, and the test also runs fine locally.

// cleared anyway before loading the snapshot
try {
// This will fast-forward the database to the last recorded transaction
zkDb.fastForwardDataBase();
Member

I saw it access metrics after unregisterMetrics. It would probably be good to order it before shutdownComponents, assuming it requires fully functional server components.

Contributor Author

The metrics used by the FileTxnSnapLog are not "owned" by the ZooKeeperServer, or its children, and not unregistered here, so that shouldn't be a problem.
On the other hand, as the parent ZooKeeperServer is a dependency of its child classes, I would generally shut down all their components before the parent, and I consider the zkDb to be owned by the parent in this case, which suggests the current order of shutdown is correct. That said, "shutdown is hard" 😂

Member

The metrics used by the FileTxnSnapLog are not "owned" by the ZooKeeperServer, or its children, and not unregistered here, so that shouldn't be a problem.

It is true that running multiple ZooKeeper instances in one JVM would probably be hard. It is false that it unregisters metrics during shutdown.

I am ok to keep it unchanged.

Member

@kezhuw kezhuw left a comment

LGTM

I left a comment about the order of zkDb.fastForwardDataBase() relative to shutdownComponents.

@anmolnar anmolnar merged commit bc9afbf into apache:master Sep 20, 2024
14 checks passed
asfgit pushed a commit that referenced this pull request Sep 20, 2024
…cessors

Reviewers: anmolnar, kezhuw, kezhuw, kezhuw
Author: jonmv
Closes #2154 from jonmv/jonmv/ZOOKEEPER-4541-take-2

(cherry picked from commit bc9afbf)
Signed-off-by: Andor Molnar <andor@apache.org>
@anmolnar
Contributor

Merged. Thanks @jonmv !

@anmolnar
Contributor

@jonmv What is your Jira id?

@jonmv
Contributor Author

jonmv commented Sep 21, 2024

@jonmv What is your Jira id?

It's also "jonmv".

@jonmv jonmv deleted the jonmv/ZOOKEEPER-4541-take-2 branch September 23, 2024 07:53
@anmolnar
Contributor

@jonmv What is your Jira id?

It's also "jonmv".

All done. Thank you!

@bbarnes52

Hey @jonmv thanks for the fix. Can https://issues.apache.org/jira/browse/ZOOKEEPER-4502 (referenced in #1925) be closed now that this is merged?

@kezhuw
Member

kezhuw commented Oct 29, 2024

I have closed ZOOKEEPER-4502.

@bbarnes52 Thank you for your information!
