Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutting down broker hangs in ModularLoadManagerImpl.disableBroker #15643

Closed
lhotari opened this issue May 18, 2022 · 7 comments
Closed

Shutting down broker hangs in ModularLoadManagerImpl.disableBroker #15643

lhotari opened this issue May 18, 2022 · 7 comments

Comments

@lhotari
Copy link
Member

lhotari commented May 18, 2022

Tests sporadically get stuck in the broker shutdown sequence. This could also be a production code issue.

Relevant stack traces:

"main" #1 prio=5 os_prio=0 cpu=44417.24ms elapsed=3524.83s tid=0x00007fa768024380 nid=0xac3 waiting on condition  [0x00007fa76d57d000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.3/Native Method)
	- parking to wait for  <0x00000000c2e54210> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.3/LockSupport.java:211)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.3/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.3/ForkJoinPool.java:3463)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.3/ForkJoinPool.java:3434)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.3/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.join(java.base@17.0.3/CompletableFuture.java:2117)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.disableBroker(ModularLoadManagerImpl.java:621)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.disableBroker(ModularLoadManagerWrapper.java:47)
	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:896)
	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:887)
	at org.apache.pulsar.broker.service.BrokerService.closeAsync(BrokerService.java:732)
	at org.apache.pulsar.broker.PulsarService.closeAsync(PulsarService.java:452)
	at org.apache.pulsar.broker.PulsarService.close(PulsarService.java:374)
	at org.apache.pulsar.functions.worker.PulsarFunctionTlsTest.tearDown(PulsarFunctionTlsTest.java:182)
"pulsar-modular-load-manager-3622-1" #6477 prio=5 os_prio=0 cpu=1.22ms elapsed=3122.86s tid=0x00007fa73c12a470 nid=0x108a0 waiting on condition  [0x00007fa6fba45000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.3/Native Method)
	- parking to wait for  <0x00000000c2e558f8> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.3/LockSupport.java:211)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.3/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.3/ForkJoinPool.java:3463)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.3/ForkJoinPool.java:3434)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.3/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.get(java.base@17.0.3/CompletableFuture.java:2072)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.updateAllBrokerData(ModularLoadManagerImpl.java:506)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.updateAll(ModularLoadManagerImpl.java:480)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl$$Lambda$611/0x00000008012f8d00.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.3/Executors.java:539)
	at java.util.concurrent.FutureTask.run(java.base@17.0.3/FutureTask.java:264)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@17.0.3/ScheduledThreadPoolExecutor.java:304)

Full thread dump in jstack.review

This seems to be related to a race condition in the shutdown sequence.

// shutdown loadmanager before shutting down the broker
executorServicesShutdown.shutdown(loadManagerExecutor);
LoadManager loadManager = this.loadManager.get();
if (loadManager != null) {
loadManager.stop();
}
List<CompletableFuture<Void>> asyncCloseFutures = new ArrayList<>();
if (this.brokerService != null) {
CompletableFuture<Void> brokerCloseFuture = this.brokerService.closeAsync();
if (this.transactionMetadataStoreService != null) {
asyncCloseFutures.add(brokerCloseFuture.whenComplete((__, ___) -> {
// close transactionMetadataStoreService after the broker has been closed
this.transactionMetadataStoreService.close();
this.transactionMetadataStoreService = null;
}));
} else {
asyncCloseFutures.add(brokerCloseFuture);
}
this.brokerService = null;
}

In #10365 was moved to run the load manager stop before the broker service close. This seems to cause issues in broker shutdown when unloadNamespaceBundlesGracefully calls load manager's disableBroker method:

public void unloadNamespaceBundlesGracefully(int maxConcurrentUnload, boolean closeWithoutWaitingClientDisconnect) {
try {
log.info("Unloading namespace-bundles...");
// make broker-node unavailable from the cluster
if (pulsar.getLoadManager() != null && pulsar.getLoadManager().get() != null) {
try {
pulsar.getLoadManager().get().disableBroker();
} catch (PulsarServerException.NotFoundException ne) {
log.warn("Broker load-manager znode doesn't exist ", ne);
// still continue and release bundle ownership as broker's registration node doesn't exist.
}
}

@lhotari
Copy link
Member Author

lhotari commented May 18, 2022

One observation is that LockManagerImpl's asyncClose method doesn't wait for locks to be released:

return FutureUtils.collect(
locks.values().stream()
.map(ResourceLock::release)
.collect(Collectors.toList()))
.thenApply(x -> null);

This could cause a race.

@lhotari
Copy link
Member Author

lhotari commented May 18, 2022

@github-actions
Copy link

The issue had no activity for 30 days, mark with Stale label.

@lhotari
Copy link
Member Author

lhotari commented Aug 26, 2022

recent stack trace:

"main" #1 prio=5 os_prio=0 cpu=27971.85ms elapsed=3533.18s tid=0x00007f9d68026e70 nid=0xaca waiting on condition  [0x00007f9d6e471000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.4/Native Method)
	- parking to wait for  <0x000010003a8208b0> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4/LockSupport.java:211)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.4/ForkJoinPool.java:3463)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4/ForkJoinPool.java:3434)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.join(java.base@17.0.4/CompletableFuture.java:2117)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.disableBroker(ModularLoadManagerImpl.java:624)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.disableBroker(ModularLoadManagerWrapper.java:48)
	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:900)
	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:891)
	at org.apache.pulsar.broker.service.BrokerService.closeAsync(BrokerService.java:736)
	at org.apache.pulsar.broker.PulsarService.closeAsync(PulsarService.java:459)
	at org.apache.pulsar.broker.PulsarService.close(PulsarService.java:383)
	at org.apache.pulsar.functions.worker.PulsarFunctionTlsTest.tearDown(PulsarFunctionTlsTest.java:189)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@17.0.4/Native Method)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@17.0.4/NativeMethodAccessorImpl.java:77)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.4/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(java.base@17.0.4/Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:61)
	at org.testng.internal.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:366)
	at org.testng.internal.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:320)
	at org.testng.internal.TestInvoker.runConfigMethods(TestInvoker.java:701)
	at org.testng.internal.TestInvoker.runAfterGroupsConfigurations(TestInvoker.java:677)
	at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:661)
	at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174)
	at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
	at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
	at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
	at org.testng.TestRunner$$Lambda$175/0x0000000800d9ac08.accept(Unknown Source)
	at java.util.ArrayList.forEach(java.base@17.0.4/ArrayList.java:1511)
	at org.testng.TestRunner.privateRun(TestRunner.java:764)
	at org.testng.TestRunner.run(TestRunner.java:585)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
	at org.testng.SuiteRunner.run(SuiteRunner.java:286)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
	at org.testng.TestNG.runSuites(TestNG.java:1069)
	at org.testng.TestNG.run(TestNG.java:1037)
	at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:123)
	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90)
	at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

at https://github.com/apache/pulsar/runs/8026627215?check_suite_focus=true#step:11:77

@github-actions
Copy link

The issue had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Sep 27, 2022
@lhotari
Copy link
Member Author

lhotari commented Jan 2, 2023

This might be fixed by #19055

@lhotari
Copy link
Member Author

lhotari commented Jan 2, 2023

fixed by #15755

@lhotari lhotari closed this as completed Jan 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant