Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky-test: AdvertisedListenersMultiBrokerLeaderElectionTest.shouldProvideConsistentAnswerToTopicLookupsUsingAdminApi #22239

Closed
1 of 2 tasks
lhotari opened this issue Mar 11, 2024 · 8 comments · Fixed by #22250 or #22249

Comments

@lhotari
Copy link
Member

lhotari commented Mar 11, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Example failure

https://github.com/apache/pulsar/actions/runs/8226863359/job/22495219489?pr=22230#step:11:1567

Exception stacktrace

  Error:  Tests run: 8, Failures: 1, Errors: 0, Skipped: 4, Time elapsed: 230.373 s <<< FAILURE! - in org.apache.pulsar.broker.loadbalance.AdvertisedListenersMultiBrokerLeaderElectionTest
  Error:  org.apache.pulsar.broker.loadbalance.AdvertisedListenersMultiBrokerLeaderElectionTest.shouldProvideConsistentAnswerToTopicLookupsUsingAdminApi  Time elapsed: 60.865 s  <<< FAILURE!
  org.testng.internal.thread.ThreadTimeoutException: Method org.apache.pulsar.broker.loadbalance.MultiBrokerLeaderElectionTest.shouldProvideConsistentAnswerToTopicLookupsUsingAdminApi() didn't finish within the time-out 60000
  	at java.base@17.0.10/jdk.internal.misc.Unsafe.park(Native Method)
  	at java.base@17.0.10/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
  	at java.base@17.0.10/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
  	at java.base@17.0.10/java.util.concurrent.FutureTask.get(FutureTask.java:190)
  	at app//org.apache.pulsar.broker.loadbalance.MultiBrokerLeaderElectionTest.shouldProvideConsistentAnswerToTopicLookupsUsingAdminApi(MultiBrokerLeaderElectionTest.java:134)
  	at java.base@17.0.10/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base@17.0.10/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base@17.0.10/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base@17.0.10/java.lang.reflect.Method.invoke(Method.java:568)
  	at app//org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at app//org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at app//org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at app//org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base@17.0.10/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base@17.0.10/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base@17.0.10/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base@17.0.10/java.lang.Thread.run(Thread.java:840)

logs at https://gist.github.com/lhotari/85729c36d42d5d296dfb39178314aa53 .

Notice! This might be a real issue since the test is also testing that leader election doesn't starve.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@lhotari
Copy link
Member Author

lhotari commented Mar 12, 2024

@lhotari
Copy link
Member Author

lhotari commented Mar 12, 2024

I'm able to reproduce the issue by running the test in Docker with 1 CPU.

Using shell script functions from https://github.com/lhotari/pulsar-contributor-toolbox/blob/master/functions/pulsar-contributor-toolbox-functions.sh and running

ptbx_run_test_in_docker --cpus=1 -pl pulsar-broker -Dtest=AdvertisedListenersMultiBrokerLeaderElectionTest

I'll check what causes the test to take longer than 60 seconds so that it times out.

@lhotari
Copy link
Member Author

lhotari commented Mar 12, 2024

I did some profiling with async-profiler and it seems mainly about Mockito that shows up in profiling.

@lhotari
Copy link
Member Author

lhotari commented Mar 12, 2024

I created #22250 to address the performance issue caused by Mockito in AdvertisedListenersMultiBrokerLeaderElectionTest . /cc @liangyepianzhou

@lhotari
Copy link
Member Author

lhotari commented Mar 12, 2024

It's possible that OTel integration is somehow impacting the slowness together with the Mockito overhead.
OTel integration to Topic lookup metrics was added in #22058 and Otel integration in #22010.

I happen to see these on the logs so that's why Otel came to mind. I didn't see Otel show up in CPU profiling.

[WARNING] [stderr] Mar 12, 2024 1:01:59 PM io.opentelemetry.sdk.OpenTelemetrySdk shutdown
[WARNING] [stderr] INFO: Multiple shutdown calls
[WARNING] [stderr] Mar 12, 2024 1:01:59 PM io.opentelemetry.sdk.OpenTelemetrySdk shutdown
[WARNING] [stderr] INFO: Multiple shutdown calls
[WARNING] [stderr] Mar 12, 2024 1:01:59 PM io.opentelemetry.sdk.OpenTelemetrySdk shutdown
[WARNING] [stderr] INFO: Multiple shutdown calls
[WARNING] [stderr] Mar 12, 2024 1:01:59 PM io.opentelemetry.sdk.OpenTelemetrySdk shutdown
[WARNING] [stderr] INFO: Multiple shutdown calls
[WARNING] [stderr] Mar 12, 2024 1:01:59 PM io.opentelemetry.sdk.OpenTelemetrySdk shutdown
[WARNING] [stderr] INFO: Multiple shutdown calls
[WARNING] [stderr] Mar 12, 2024 1:01:59 PM io.opentelemetry.sdk.OpenTelemetrySdk shutdown
[WARNING] [stderr] INFO: Multiple shutdown calls

@liangyepianzhou
Copy link
Contributor

@lhotari Thanks for the update.

@lhotari
Copy link
Member Author

lhotari commented Mar 12, 2024

@lhotari Thanks for the update.

@liangyepianzhou please also review #22251 , that's a PR that will reduce the logging overhead in tests. Noticed that in profiling.

@lhotari lhotari reopened this Mar 12, 2024
@lhotari
Copy link
Member Author

lhotari commented Mar 12, 2024

#22249 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment