Flaky-test: BrokerRegistryIntegrationTest.testRecoverFromNodeDeletion #23365

Closed
1 of 2 tasks
lhotari opened this issue Sep 28, 2024 · 1 comment · Fixed by #23371

Comments

@lhotari
Member

lhotari commented Sep 28, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Example failure

https://github.com/apache/pulsar/actions/runs/11074101345/job/30796634094?pr=23362#step:11:1686

Exception stacktrace

  Error:  org.apache.pulsar.broker.loadbalance.extensions.BrokerRegistryIntegrationTest.testRecoverFromNodeDeletion  Time elapsed: 5.037 s  <<< FAILURE!
  org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a org.apache.pulsar.broker.loadbalance.extensions.BrokerRegistryIntegrationTest lists don't have the same size expected [1] but found [0] within 3 seconds.
  	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
  	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
  	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
  	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
  	at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:769)
  	at org.apache.pulsar.broker.loadbalance.extensions.BrokerRegistryIntegrationTest.testRecoverFromNodeDeletion(BrokerRegistryIntegrationTest.java:78)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base/java.lang.Thread.run(Thread.java:840)
  Caused by: java.lang.AssertionError: lists don't have the same size expected [1] but found [0]
  	at org.testng.Assert.fail(Assert.java:110)
  	at org.testng.Assert.failNotEquals(Assert.java:1577)
  	at org.testng.Assert.assertEqualsImpl(Assert.java:149)
  	at org.testng.Assert.assertEquals(Assert.java:131)
  	at org.testng.Assert.assertEquals(Assert.java:1418)
  	at org.testng.Assert.assertEquals(Assert.java:1382)
  	at org.testng.Assert.assertEquals(Assert.java:1629)
  	at org.testng.Assert.assertEquals(Assert.java:1605)
  	at org.apache.pulsar.broker.loadbalance.extensions.BrokerRegistryIntegrationTest.lambda$testRecoverFromNodeDeletion$1(BrokerRegistryIntegrationTest.java:78)
  	at org.awaitility.core.AssertionCondition.lambda$new$0(AssertionCondition.java:53)
  	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:248)
  	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:235)
  	... 4 more

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@BewareMyPower
Contributor

The direct cause is that, since #23349, ServiceUnitStateTableViewImpl#flush calls TableView#refresh to refresh the internal cache.

However, TableViewImpl has a bug: a refresh on an empty topic gets stuck. The following patch reproduces it:

diff --git a/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/TableViewTest.java b/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/TableViewTest.java
index 61ab4de8a3..5448751160 100644
--- a/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/TableViewTest.java
+++ b/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/TableViewTest.java
@@ -173,6 +173,9 @@ public class TableViewTest extends MockedPulsarServiceBaseTest {
         TableView<byte[]> tv = pulsarClient.newTableView(Schema.BYTES)
                 .topic(topic)
                 .create();
+        // Verify refresh can handle the case when the topic is empty
+        tv.refreshAsync().get(3, TimeUnit.SECONDS);
+
         // 2. Add a listen action to provide the test environment.
         // The listen action will be triggered when there are incoming messages every time.
         // This is a sync operation, so sleep in the listen action can slow down the reading rate of messages.
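
To illustrate how a refresh can get stuck on an empty topic, here is a toy model. It is not TableViewImpl's actual code: the class `ToyRefreshSketch`, its fields, and its methods are all hypothetical. It assumes that pending refresh futures are completed only from the message-read callback, so with no messages nothing ever completes them, and that checking "nothing left to read" at refresh time avoids the hang.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical toy model, NOT the real TableViewImpl.
public class ToyRefreshSketch {
    private final List<CompletableFuture<Void>> pendingRefreshes = new ArrayList<>();
    private final boolean completeWhenNothingToRead;
    private long published; // messages available in the toy topic
    private long read;      // messages the toy view has consumed

    public ToyRefreshSketch(boolean completeWhenNothingToRead) {
        this.completeWhenNothingToRead = completeWhenNothingToRead;
    }

    public synchronized void publish() {
        published++;
    }

    public synchronized CompletableFuture<Void> refreshAsync() {
        CompletableFuture<Void> future = new CompletableFuture<>();
        if (completeWhenNothingToRead && read >= published) {
            // Nothing left to read (including the empty-topic case): finish right away.
            future.complete(null);
        } else {
            // Otherwise wait for the read callback to catch up.
            pendingRefreshes.add(future);
        }
        return future;
    }

    // Called once per consumed message; the only other place futures are completed.
    public synchronized void onMessageRead() {
        read++;
        if (read >= published) {
            pendingRefreshes.forEach(f -> f.complete(null));
            pendingRefreshes.clear();
        }
    }
}
```

With `new ToyRefreshSketch(false)`, a `refreshAsync()` on an empty topic returns a future that no callback will ever complete, which mirrors the timeout in the patch above. With `true`, the same call completes immediately.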

Another possible cause is that BrokerRegistryImpl#registerAsync is called in the load manager thread when it detects that the node has been deleted. However, this thread can itself be blocked by blocking calls such as the flush method above, which would delay the re-registration.
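
A minimal sketch of this second scenario, using only `java.util.concurrent`. It assumes the load manager thread behaves like a single-threaded executor. The class and method names are hypothetical, and the two tasks are stand-ins for the blocking flush and the re-registration, not Pulsar code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockedExecutorSketch {

    /**
     * Returns true if the second task finished within timeoutMs while the
     * first task was still blocking the only worker thread.
     */
    public static boolean secondTaskRanWhileFirstBlocked(long timeoutMs) throws Exception {
        // Assumption: one worker thread, like a dedicated load manager thread.
        ExecutorService singleThread = Executors.newSingleThreadExecutor();
        CountDownLatch stuck = new CountDownLatch(1);
        try {
            // Stand-in for a blocking call, e.g. a flush waiting on a stuck refresh.
            singleThread.submit(() -> {
                stuck.await();
                return null;
            });
            // Stand-in for the re-registration queued behind the blocking call.
            Future<String> reRegister = singleThread.submit(() -> "registered");
            try {
                reRegister.get(timeoutMs, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            stuck.countDown();          // release the blocked task
            singleThread.shutdownNow(); // clean up the worker thread
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(secondTaskRanWhileFirstBlocked(200)); // prints "false"
    }
}
```

The second task cannot start until the first one returns, so an assertion that waits only a few seconds for the registration can time out even though the registration was requested.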
