Thread leak in TribeNode when a cluster is offline #19370

Closed
escheie opened this issue Jul 11, 2016 · 6 comments

escheie commented Jul 11, 2016

Elasticsearch version: 2.3.4
JVM version: 1.8.0_91
OS version: RedHat 6.5

We are using the TribeNode feature to enable search across a number of geographically distributed Elasticsearch clusters. Occasionally, when we take one of these clusters completely offline, we find that our TribeNode hits the following exception:

java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
        at org.elasticsearch.threadpool.ThreadPool$ThreadedRunnable.run(ThreadPool.java:676)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:640)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

This exception is thrown because of thread exhaustion: the TribeNode creates a new thread every couple of seconds and never reclaims it. Below is the stack trace of the leaked threads:

java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
       java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
       java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
       org.elasticsearch.common.util.concurrent.KeyedLock.acquire(KeyedLock.java:75)
       org.elasticsearch.transport.netty.NettyTransport.disconnectFromNode(NettyTransport.java:1063)
       org.elasticsearch.transport.TransportService.disconnectFromNode(TransportService.java:274)
       org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$2$1.doRun(UnicastZenPing.java:258)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:745)

Steps to reproduce:
Create a TribeNode configuration where one cluster is offline. It is not enough that the processes are shut down while the machine stays online; the hosts listed in discovery.zen.ping.unicast.hosts for the offline cluster must themselves be offline and not respond to ping/connection attempts. Here is a simple configuration I was able to use to reproduce the problem.


---
cluster.name: "thread-leak-test"
node.name: "thread-leak-node"
http.port: "9201"
http.host: "127.0.0.1"
tribe:
  online-cluster:
    cluster.name: "online-cluster"
    discovery.zen.ping.unicast.hosts:
    - "localhost"
  offline-cluster:
    cluster.name: "offline-cluster"
    discovery.zen.ping.unicast.hosts:
    - "10.10.10.10"

Start the Tribe node. Observe that the number of threads continues to grow unbounded (ps -m <pid> | wc -l) until OutOfMemoryError: unable to create new native thread exceptions are thrown.

This issue appears similar to the problem described in #8057.

tlrx commented Aug 1, 2016

@escheie Thanks for reporting! I reproduced it and will come back with a potential fix.

escheie commented Aug 5, 2016

Thanks tlrx for looking into the issue. I've found that setting discovery.zen.ping.unicast.concurrent_connects to 1 (the default is 10) limits the number of threads that pile up, because the lock gets released to the threads waiting on disconnect every 30s, between connection timeouts. When it was 10, the connect threads could theoretically hold the lock forever, preventing the disconnect threads from ever getting a chance to complete.
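
To make the contention easier to picture, here is a minimal, self-contained Java sketch of the pattern described above. It is not the actual Elasticsearch KeyedLock/NettyTransport code; it only assumes, as in the stack traces above, that connect and disconnect for the same node serialize on one per-node lock, and that a connect attempt against an unreachable host holds that lock for the full connect timeout. The class name, node address, and timings are illustrative only.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustration only: a simplified per-node lock, loosely modelled on the idea
// behind KeyedLock, showing how disconnect tasks queue up behind connect
// attempts that hold the lock for the whole connect timeout.
public class PerNodeLockDemo {

    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    private static ReentrantLock lockFor(String node) {
        return LOCKS.computeIfAbsent(node, n -> new ReentrantLock(false)); // non-fair, like the default
    }

    // Stands in for a connect attempt that blocks until the connect timeout
    // expires because the target host never answers.
    static void connect(String node, long timeoutMillis) throws InterruptedException {
        ReentrantLock lock = lockFor(node);
        lock.lock();
        try {
            Thread.sleep(timeoutMillis);
        } finally {
            lock.unlock();
        }
    }

    // Stands in for the disconnect that gets scheduled after each ping round.
    static void disconnect(String node) {
        ReentrantLock lock = lockFor(node);
        lock.lock();
        try {
            System.out.println(Thread.currentThread().getName() + " finally ran for " + node);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        String offlineNode = "10.10.10.10:9300";
        // Ten "concurrent_connects" workers keep re-acquiring the per-node lock...
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                for (int round = 0; round < 5; round++) {
                    try {
                        connect(offlineNode, 3_000);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }, "connect-" + i).start();
        }
        // ...while disconnect tasks started a moment later pile up behind them.
        for (int i = 0; i < 5; i++) {
            new Thread(() -> disconnect(offlineNode), "disconnect-" + i).start();
        }
    }
}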

tlrx commented Aug 5, 2016

@escheie True. But I think the comment made in #19719 makes sense, and changing the connection timeout for pings will help. I still need to think about it again, though.

escheie commented Aug 5, 2016

With more testing overnight, I found that setting "discovery.zen.ping.unicast.concurrent_connects" to 1 only works if TRACE logging is enabled for discovery. It seems the little extra time the connect thread spends logging gives the other threads performing the disconnect a chance to get the lock. Would a shorter connect timeout help if it is still longer than the interval at which connects are attempted, which appears to be every 1.5 seconds?

Looks like the KeyedLock used by NettyTransport in 2.x and TcpTransport in 5.x supports a fair option so that threads acquire the lock in the order they request it. This fair option is currently set to false. If threads are able to obtain the lock in the order they request it, that should ensure the disconnect threads get a chance to run between connection attempts. I suppose enabling the fair option would result in a significant performance penalty, though, so it is probably not an option.
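
Here is a minimal sketch of why fairness matters, using only the plain JDK ReentrantLock (which, per the stack trace above, is what KeyedLock acquires under the hood): with a fair lock, queued threads acquire it in arrival order, so a waiting disconnect can no longer be overtaken indefinitely by later connect attempts. The thread names are just labels for the demo.

import java.util.concurrent.locks.ReentrantLock;

// Demo of fair lock ordering: once threads are queued on a fair ReentrantLock,
// they acquire it in the order they arrived, so the "disconnect" thread below
// runs before the connect attempt that queued after it.
public class FairLockDemo {

    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock(true); // fair = true

        lock.lock(); // hold the lock so the workers below have to queue
        String[] names = {"connect-1", "connect-2", "disconnect", "connect-3"};
        for (String name : names) {
            Thread t = new Thread(() -> {
                lock.lock();
                try {
                    System.out.println(Thread.currentThread().getName() + " acquired the lock");
                } finally {
                    lock.unlock();
                }
            }, name);
            t.start();
            // Wait until the thread is actually parked on the lock so the queue
            // order matches the order in which the threads were started.
            while (!lock.hasQueuedThread(t)) {
                Thread.sleep(1);
            }
        }
        lock.unlock(); // the queued threads now acquire in FIFO order
    }
}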

escheie commented Aug 5, 2016

I've confirmed that enabling the "fair" flag in the KeyedLock does prevent the number of threads from growing unbounded. The maximum number of threads that pile up equals (discovery.zen.ping.unicast.concurrent_connects * connect_timeout) / (disconnect frequency) = (10 * 30s) / 3s = 100. This number can be reduced by lowering discovery.zen.ping.unicast.concurrent_connects in the configuration, or further still if the connect_timeout is also lowered as proposed.

Since it looks like the KeyedLock is only used during connect and disconnect, and not for connection lookup, enabling the fair flag may not impact performance as much as I previously feared.

tlrx commented Aug 24, 2016

Thanks @escheie! Your effort and investigation are great.

I do feel like the issue happens because we try to disconnect from nodes even if we never succeeded in connecting to them, and UnicastZenPing blindly piles up threads to disconnect them (and those threads try to acquire a lock and slow down the ping threads too).

I proposed a new fix in #19719; I'm wondering if it works for you too.
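
To sketch the general idea (this is not the actual change in #19719, and the NodeConnections interface below is a hypothetical stand-in for the relevant parts of TransportService): only nodes that were actually connected during a ping round get a disconnect scheduled, so an unreachable cluster never piles up disconnect tasks.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: track which nodes were really connected during a ping round and
// tear down just those, instead of blindly scheduling a disconnect per target.
public class DisconnectOnlyConnectedSketch {

    // Hypothetical stand-in for the parts of TransportService used here.
    interface NodeConnections {
        boolean nodeConnected(String node);
        void disconnectFromNode(String node);
    }

    private final Set<String> connectedThisRound = ConcurrentHashMap.newKeySet();
    private final NodeConnections connections;

    DisconnectOnlyConnectedSketch(NodeConnections connections) {
        this.connections = connections;
    }

    // Called when a temporary connection to a ping target actually succeeds.
    void onConnected(String node) {
        connectedThisRound.add(node);
    }

    // Called when the ping round finishes.
    void finishPingRound() {
        for (String node : connectedThisRound) {
            // Unreachable nodes never make it into the set, so no disconnect
            // task (and no thread parked on the per-node lock) is created for them.
            if (connections.nodeConnected(node)) {
                connections.disconnectFromNode(node);
            }
        }
        connectedThisRound.clear();
    }
}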

bleskes added a commit that referenced this issue Dec 21, 2016
The `UnicastZenPing` shows its age and is the result of many small changes. The current state of affairs is confusing and hard to reason about. This PR cleans it up (while following the same original intentions). Highlights of the changes are:

1) Clear 3-round flow - no interleaving of scheduling.
2) The previous implementation made a best-effort attempt to wait for ongoing pings to be sent and completed. The pings were guaranteed to complete because each used the total ping duration as a timeout. This made it hard to reason about the total ping duration and the flow of the code. All of this is removed now, and a ping should just complete within the given duration or not be counted (note that it was very handy for testing, but I moved the needed sync logic to the test).
3) Because of (2), the ping scheduling changed a bit to give the last round a chance to complete. We now ping at the beginning, at 1/3, and at 2/3 of the duration.
4) To offset for (3) a bit, incoming ping requests are now added to ongoing ping collections.
5) UnicastZenPing never establishes full-blown connections (but does reuse them if they are already there). Relates to #22120
6) Discovery host providers are only used once per pinging round. Closes #21739
7) Usage of the ability to open a connection without connecting to a node ( #22194 ) and shorter connection timeouts helps with connections piling up. Closes #19370
8) Beefed up the tests and sped them up.
9) Removed the light profile from production code.
bleskes added a commit that referenced this issue Dec 23, 2016