-
Notifications
You must be signed in to change notification settings - Fork 7.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZOOKEEPER-4712: Fix partially shutdown of ZooKeeperServer and its processors #2154
Conversation
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/SendAckRequestProcessor.java
Outdated
Show resolved
Hide resolved
Will this fix be backported in the 3.8 train? We just hit this bug on one of our clusters, it's a shame we've had various fixes up for review for over 20 months, I would appreciate you guys pushing this through the finish line so this critical issue can be closed. Thanks! |
One more thing, the resources (Threads/Processors...) created in |
This is a pretty small, targeted fix now, is there anything controversial about it or would it be possible to merge it and cut a release soon? |
… down from Learner
8085a35
to
5d7aa33
Compare
I reworked the Learner-socket-close code to be a bit less prone to misuse, while handling a conflict with the first commit, just now. |
d2ee4dd fixes the last comment by @changruill. |
Hello, would it be possible to move forward with this PR and merge it and cut a release? Are there any outstanding concerns or objections? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
if (sock == null) { // Closing before establishing the connection is a noop | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why have you moved the null check inside the lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close could be called from different threads, and sockBeingClosed
ensures memory visibility for sock
, as it's set after sock
is assigned in connectToLeader
.
The only other thread I can see that closes the learner, right now, is the sync processor, which is initialised after sock
is assigned, so it works as-was, but I still prefer to be explicit about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 in general.
I left inline one comment about the overriding method.
@@ -46,7 +48,7 @@ public void processRequest(Request si) { | |||
learner.writePacket(qp, false); | |||
} catch (IOException e) { | |||
LOG.warn("Closing connection to leader, exception during packet send", e); | |||
learner.closeSockSync(); | |||
learner.closeSocket(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this.
This is the leftover of ZOOKEEPER-4409. We should respect zookeeper.learner.closeSocketAsync
here.
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerZooKeeperServer.java
Outdated
Show resolved
Hide resolved
…wn in ZooKeeperServer::shutdown
ZOOKEEPER-4712: Add test case to assert request processors got shutdown in ZooKeeperServer::shutdown
zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServerMain.java
Show resolved
Hide resolved
@@ -97,7 +97,7 @@ public Boolean answer(InvocationOnMock invocation) throws Throwable { | |||
} | |||
}); | |||
|
|||
ZKDatabase database = new ZKDatabase(null); | |||
ZKDatabase database = new ZKDatabase(mock(FileTxnSnapLog.class)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise: NPE during fast-forward during shutdown.
A seemingly unrelated unit test failed in the PR jenkins job: it sleeps for a while, while waiting for some state that wasn't reached within timeout; it doesn't depend on shutdown logic, and the test also runs fine locally. |
// cleared anyway before loading the snapshot | ||
try { | ||
// This will fast-forward the database to the last recorded transaction | ||
zkDb.fastForwardDataBase(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw its access to metrics after unregisterMetrics
. It probably be a good to order it before shutdownComponents
, that is assuming it require full functional server components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics used by the FileTxnSnapLog
are not "owned" by the ZooKeeperServer
, or its children, and not unregistered here, so that shouldn't be a problem.
On the other hand, as the parent ZooKeeperServer
is a dependency of its child classes, I would generally shut down all their components before the parent, and I consider the zkDb
to be owned by the parent in this case, which suggests the current order of shutdown is correct. That said, "shutdown is hard" 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics used by the FileTxnSnapLog are not "owned" by the ZooKeeperServer, or its children, and not unregistered here, so that shouldn't be a problem.
It is true that running multiple ZooKeeper instances in one JVM probably be hard. It is false that it unregisters metrics during shutdown.
I am ok to keep it unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I left a comments about the order of zkDb.fastForwardDataBase()
with shutdownComponents
.
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LeaderZooKeeperServer.java
Outdated
Show resolved
Hide resolved
…cessors Reviewers: anmolnar, kezhuw, kezhuw, kezhuw Author: jonmv Closes #2154 from jonmv/jonmv/ZOOKEEPER-4541-take-2 (cherry picked from commit bc9afbf) Signed-off-by: Andor Molnar <andor@apache.org>
Merged. Thanks @jonmv ! |
@jonmv What is your Jira id? |
It's also "jonmv". |
All done. Thank you! |
Hey @jonmv thanks for the fix. Can https://issues.apache.org/jira/browse/ZOOKEEPER-4502 (referenced in #1925) be closed now that this is merged? |
I have closed ZOOKEEPER-4502. @bbarnes52 Thank you for your information! |
This PR fixes the shutdown errors that were added in #157, and also avoids a common NPE during ZK shutdown from a learner, when the leader shuts down (first commit).
Together with #2111 and #2152, this should cover all the fixes in #1925.
We've had the forked ZK from #1925 running embedded in hundreds, if not thousands, of ZK clusters, with rolling restarts most days, and we've had zero cases of inconsistent data since we patched—one or a few cases per week before that.
(We still sometimes see ephemeral nodes remain after the leader is brutally taken down, i.e., with
Runtime.halt()
, but this looks different; it seems clearing out client sessions, and their ephemeral nodes, simply isn't done when death is too sudden.)