Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent test failure on BacklogQuotaManagerTest.testAheadProducerOnHoldTimeout #33

Closed
merlimat opened this issue Sep 22, 2016 · 1 comment
Assignees
Labels
area/test type/bug The PR fixed a bug or issue reported a bug
Milestone

Comments

@merlimat
Copy link
Contributor

Seen intermittently in travis builds. Unfortunately the exception stack trace is not being printed fully.

com.yahoo.pulsar.client.admin.internal.BrokerStatsImpl$$EnhancerByMockitoWithCGLIB$$7141231b@443bf7aa
Tests run: 361, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 825.674 sec <<< FAILURE! - in TestSuite
testAheadProducerOnHoldTimeout(com.yahoo.pulsar.broker.service.BacklogQuotaManagerTest)  Time elapsed: 8.691 sec  <<< FAILURE!
io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
    at com.yahoo.pulsar.broker.service.BacklogQuotaManagerTest.testAheadProducerOnHoldTimeout(BacklogQuotaManagerTest.java:504)
@merlimat merlimat added this to the 1.15 milestone Sep 22, 2016
@merlimat merlimat self-assigned this Sep 22, 2016
@merlimat
Copy link
Contributor Author

Got a stack trace of one failed run out of 100 passed :

Got exception : io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
    at io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:101)
    at com.yahoo.pulsar.client.impl.ProducerImpl.lambda$closeAsync$0(ProducerImpl.java:376)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at com.yahoo.pulsar.client.impl.ProducerImpl.closeAsync(ProducerImpl.java:376)
    at com.yahoo.pulsar.client.impl.PulsarClientImpl.lambda$closeAsync$4(PulsarClientImpl.java:314)
    at java.util.ArrayList.forEach(ArrayList.java:1249)
    at com.yahoo.pulsar.client.impl.PulsarClientImpl.closeAsync(PulsarClientImpl.java:314)
    at com.yahoo.pulsar.client.impl.PulsarClientImpl.close(PulsarClientImpl.java:287)
    at com.yahoo.pulsar.broker.service.BacklogQuotaManagerTest.testAheadProducerOnHoldTimeout(BacklogQuotaManagerTest.java:506)
    at sun.reflect.GeneratedMethodAccessor134.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    at org.testng.TestRunner.privateRun(TestRunner.java:767)
    at org.testng.TestRunner.run(TestRunner.java:617)
    at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
    at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:329)
    at org.testng.SuiteRunner.privateRun(SuiteRunner.java:291)
    at org.testng.SuiteRunner.run(SuiteRunner.java:240)
    at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
    at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
    at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224)
    at org.testng.TestNG.runSuitesLocally(TestNG.java:1149)
    at org.testng.TestNG.run(TestNG.java:1057)
    at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:132)
    at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
    at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:99)
    at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:147)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121)
Tests run: 100, Failures: 1, Errors: 0, Skipped: 32, Time elapsed: 689.745 sec <<< FAILURE! - in com.yahoo.pulsar.broker.service.BacklogQuotaManagerTest

merlimat added a commit to merlimat/pulsar that referenced this issue Sep 22, 2016
Fixes apache#33: Intermittent test failure on BacklogQuotaManagerTest.testAheadProducerOnHoldTimeout
@merlimat merlimat added the type/bug The PR fixed a bug or issue reported a bug label Sep 22, 2016
sijie pushed a commit to sijie/pulsar that referenced this issue Mar 4, 2018
* Create pulsar-functions module (#1)

* Create pulsar-functions module

* rename `sdk` package to `api`

* Added the first cut of the Java interface for Pulsar functions (#2)

* Added the first cut of the pulsar submit

* Working
massakam pushed a commit to massakam/pulsar that referenced this issue Aug 6, 2020
wolfstudy pushed a commit that referenced this issue Oct 30, 2020
…eadlock and blocking IO on ZK thread (#8304)

### Motivation
We have been seeing broker restarts due to zk-session timeout and that's because of #4635 and 
```
"pulsar-ordered-OrderedExecutor-4-0-EventThread" #33 daemon prio=5 os_prio=0 cpu=36314.97ms elapsed=698.44s tid=0x00007f8114029790 nid=0x2a31 waiting on condition  [0x00007f8170575000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
        - parking to wait for  <0x00001001a1c1aa50> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6/LockSupport.java:234)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.6/CompletableFuture.java:1798)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.6/ForkJoinPool.java:3128)
        at java.util.concurrent.CompletableFuture.timedGet(java.base@11.0.6/CompletableFuture.java:1868)
        at java.util.concurrent.CompletableFuture.get(java.base@11.0.6/CompletableFuture.java:2021)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.getData(ZooKeeperCache.java:293)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.getData(ZooKeeperCache.java:238)
        at org.apache.pulsar.zookeeper.ZkIsolatedBookieEnsemblePlacementPolicy.getBlacklistedBookies(ZkIsolatedBookieEnsemblePlacementPolicy.java:150)
        at org.apache.pulsar.zookeeper.ZkIsolatedBookieEnsemblePlacementPolicy.newEnsemble(ZkIsolatedBookieEnsemblePlacementPolicy.java:123)
        at org.apache.bookkeeper.client.BookieWatcherImpl.newEnsemble(BookieWatcherImpl.java:233)
        at org.apache.bookkeeper.client.LedgerCreateOp.initiate(LedgerCreateOp.java:141)
        at org.apache.bookkeeper.client.BookKeeper.asyncCreateLedger(BookKeeper.java:831)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncCreateLedger(ManagedLedgerImpl.java:3063)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.ledgerClosed(ManagedLedgerImpl.java:1378)
        - locked <0x0000100035d21d60> (a org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl)
        at org.apache.bookkeeper.mledger.impl.OpAddEntry.closeComplete(OpAddEntry.java:188)
        at org.apache.bookkeeper.client.LedgerHandle$5.lambda$safeRun$0(LedgerHandle.java:556)
        at org.apache.bookkeeper.client.LedgerHandle$5$$Lambda$935/0x00007f7f30252908.accept(Unknown Source)
  
```

The main reason of such zk-session timeout and broker-restart is all zk-event threads get blocked on ZK-Cache, [zk-session-wather](https://github.com/apache/pulsar/blob/master/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperSessionWatcher.java) can't complete keep-alive and eventually zk-session timeout. Zk-thread gets blocked on `ZooKeeperCache` because [zk-cache](https://github.com/apache/pulsar/blob/master/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperCache.java#L111) expires cache-entry at every 5 mins and Zk-thread misses the cache and tries to retrieve from zk. once, all zk-thread are blocked on Zk-Cache to get data from zk, it creates a deadlock and broker sees zk-session timeout.

One of the solutions is to keep entries always available in cache, avoid cache miss and refresh zk-cache in background. this solution will make sure that zk-thread will not see cache-miss and will not be blocked. 

### Modification
- Refresh Zk-Cache in background without invalidating cache data so, it avoids zk-cache once entry is loaded at first time.

### Result
It wil fix #4635
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this issue Nov 13, 2020
…eadlock and blocking IO on ZK thread (apache#8304)

### Motivation
We have been seeing broker restarts due to zk-session timeout and that's because of apache#4635 and 
```
"pulsar-ordered-OrderedExecutor-4-0-EventThread" apache#33 daemon prio=5 os_prio=0 cpu=36314.97ms elapsed=698.44s tid=0x00007f8114029790 nid=0x2a31 waiting on condition  [0x00007f8170575000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
        - parking to wait for  <0x00001001a1c1aa50> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6/LockSupport.java:234)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.6/CompletableFuture.java:1798)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.6/ForkJoinPool.java:3128)
        at java.util.concurrent.CompletableFuture.timedGet(java.base@11.0.6/CompletableFuture.java:1868)
        at java.util.concurrent.CompletableFuture.get(java.base@11.0.6/CompletableFuture.java:2021)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.getData(ZooKeeperCache.java:293)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.getData(ZooKeeperCache.java:238)
        at org.apache.pulsar.zookeeper.ZkIsolatedBookieEnsemblePlacementPolicy.getBlacklistedBookies(ZkIsolatedBookieEnsemblePlacementPolicy.java:150)
        at org.apache.pulsar.zookeeper.ZkIsolatedBookieEnsemblePlacementPolicy.newEnsemble(ZkIsolatedBookieEnsemblePlacementPolicy.java:123)
        at org.apache.bookkeeper.client.BookieWatcherImpl.newEnsemble(BookieWatcherImpl.java:233)
        at org.apache.bookkeeper.client.LedgerCreateOp.initiate(LedgerCreateOp.java:141)
        at org.apache.bookkeeper.client.BookKeeper.asyncCreateLedger(BookKeeper.java:831)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncCreateLedger(ManagedLedgerImpl.java:3063)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.ledgerClosed(ManagedLedgerImpl.java:1378)
        - locked <0x0000100035d21d60> (a org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl)
        at org.apache.bookkeeper.mledger.impl.OpAddEntry.closeComplete(OpAddEntry.java:188)
        at org.apache.bookkeeper.client.LedgerHandle$5.lambda$safeRun$0(LedgerHandle.java:556)
        at org.apache.bookkeeper.client.LedgerHandle$5$$Lambda$935/0x00007f7f30252908.accept(Unknown Source)
  
```

The main reason of such zk-session timeout and broker-restart is all zk-event threads get blocked on ZK-Cache, [zk-session-wather](https://github.com/apache/pulsar/blob/master/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperSessionWatcher.java) can't complete keep-alive and eventually zk-session timeout. Zk-thread gets blocked on `ZooKeeperCache` because [zk-cache](https://github.com/apache/pulsar/blob/master/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperCache.java#L111) expires cache-entry at every 5 mins and Zk-thread misses the cache and tries to retrieve from zk. once, all zk-thread are blocked on Zk-Cache to get data from zk, it creates a deadlock and broker sees zk-session timeout.

One of the solutions is to keep entries always available in cache, avoid cache miss and refresh zk-cache in background. this solution will make sure that zk-thread will not see cache-miss and will not be blocked. 

### Modification
- Refresh Zk-Cache in background without invalidating cache data so, it avoids zk-cache once entry is loaded at first time.

### Result
It wil fix apache#4635
flowchartsman pushed a commit to flowchartsman/pulsar that referenced this issue Nov 17, 2020
…eadlock and blocking IO on ZK thread (apache#8304)

### Motivation
We have been seeing broker restarts due to zk-session timeout and that's because of apache#4635 and 
```
"pulsar-ordered-OrderedExecutor-4-0-EventThread" apache#33 daemon prio=5 os_prio=0 cpu=36314.97ms elapsed=698.44s tid=0x00007f8114029790 nid=0x2a31 waiting on condition  [0x00007f8170575000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
        - parking to wait for  <0x00001001a1c1aa50> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6/LockSupport.java:234)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.6/CompletableFuture.java:1798)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.6/ForkJoinPool.java:3128)
        at java.util.concurrent.CompletableFuture.timedGet(java.base@11.0.6/CompletableFuture.java:1868)
        at java.util.concurrent.CompletableFuture.get(java.base@11.0.6/CompletableFuture.java:2021)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.getData(ZooKeeperCache.java:293)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.getData(ZooKeeperCache.java:238)
        at org.apache.pulsar.zookeeper.ZkIsolatedBookieEnsemblePlacementPolicy.getBlacklistedBookies(ZkIsolatedBookieEnsemblePlacementPolicy.java:150)
        at org.apache.pulsar.zookeeper.ZkIsolatedBookieEnsemblePlacementPolicy.newEnsemble(ZkIsolatedBookieEnsemblePlacementPolicy.java:123)
        at org.apache.bookkeeper.client.BookieWatcherImpl.newEnsemble(BookieWatcherImpl.java:233)
        at org.apache.bookkeeper.client.LedgerCreateOp.initiate(LedgerCreateOp.java:141)
        at org.apache.bookkeeper.client.BookKeeper.asyncCreateLedger(BookKeeper.java:831)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncCreateLedger(ManagedLedgerImpl.java:3063)
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.ledgerClosed(ManagedLedgerImpl.java:1378)
        - locked <0x0000100035d21d60> (a org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl)
        at org.apache.bookkeeper.mledger.impl.OpAddEntry.closeComplete(OpAddEntry.java:188)
        at org.apache.bookkeeper.client.LedgerHandle$5.lambda$safeRun$0(LedgerHandle.java:556)
        at org.apache.bookkeeper.client.LedgerHandle$5$$Lambda$935/0x00007f7f30252908.accept(Unknown Source)
  
```

The main reason of such zk-session timeout and broker-restart is all zk-event threads get blocked on ZK-Cache, [zk-session-wather](https://github.com/apache/pulsar/blob/master/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperSessionWatcher.java) can't complete keep-alive and eventually zk-session timeout. Zk-thread gets blocked on `ZooKeeperCache` because [zk-cache](https://github.com/apache/pulsar/blob/master/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperCache.java#L111) expires cache-entry at every 5 mins and Zk-thread misses the cache and tries to retrieve from zk. once, all zk-thread are blocked on Zk-Cache to get data from zk, it creates a deadlock and broker sees zk-session timeout.

One of the solutions is to keep entries always available in cache, avoid cache miss and refresh zk-cache in background. this solution will make sure that zk-thread will not see cache-miss and will not be blocked. 

### Modification
- Refresh Zk-Cache in background without invalidating cache data so, it avoids zk-cache once entry is loaded at first time.

### Result
It wil fix apache#4635
hangc0276 pushed a commit to hangc0276/pulsar that referenced this issue May 26, 2021
* Update README.md

* Update README.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/test type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

1 participant