Thread leak in TribeNode when a cluster is offline #19370

Closed
escheie opened this issue Jul 11, 2016 · 6 comments

escheie commented Jul 11, 2016

Elasticsearch version: 2.3.4
JVM version: 1.8.0_91
OS version: RedHat 6.5

We are using the TribeNode feature to enable search across a number of geographically distributed Elasticsearch clusters. Occasionally, when we take one of these clusters completely offline, we find that our TribeNode hits the following exception:

java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
        at org.elasticsearch.threadpool.ThreadPool$ThreadedRunnable.run(ThreadPool.java:676)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:640)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

This exception is thrown because of thread exhaustion: the TribeNode creates a new thread every couple of seconds and never reclaims it. Below is the stack trace of the leaked threads:

java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
       java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
       java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
       org.elasticsearch.common.util.concurrent.KeyedLock.acquire(KeyedLock.java:75)
       org.elasticsearch.transport.netty.NettyTransport.disconnectFromNode(NettyTransport.java:1063)
       org.elasticsearch.transport.TransportService.disconnectFromNode(TransportService.java:274)
       org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$2$1.doRun(UnicastZenPing.java:258)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:745)

Steps to reproduce:
Create a TribeNode configuration where one cluster is offline. It is not enough that the processes are shut down while the machine stays online; the hosts listed in discovery.zen.ping.unicast.hosts for the offline cluster must themselves be offline and not respond to ping/connection attempts. Here is a simple configuration I was able to use to reproduce the problem.


---
cluster.name: "thread-leak-test"
node.name: "thread-leak-node"
http.port: "9201"
http.host: "127.0.0.1"
tribe:
  online-cluster:
    cluster.name: "online-cluster"
    discovery.zen.ping.unicast.hosts:
    - "localhost"
  offline-cluster:
    cluster.name: "offline-cluster"
    discovery.zen.ping.unicast.hosts:
    - "10.10.10.10"

Start the Tribe node. Observe that the number of threads continues to grow unbounded (ps -m <pid> | wc -l) until OutOfMemoryError: unable to create new native thread exceptions are thrown.

This issue appears similar to the problem described in #8057.

tlrx commented Aug 1, 2016

@escheie Thanks for reporting! I reproduced it and will come back with a potential fix.

escheie commented Aug 5, 2016

Thanks tlrx for looking into the issue. I've found that setting discovery.zen.ping.unicast.concurrent_connects to 1 (the default is 10) limits the number of threads that pile up, because the lock gets released to the threads waiting on disconnect every 30s, between connection timeouts. When it was 10, the connect threads could theoretically hold the lock forever, preventing the disconnect threads from ever getting a chance to complete.
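
To make the contention easier to picture, here is a minimal, self-contained Java sketch of the pattern described above. It is not the actual Elasticsearch KeyedLock/NettyTransport code; it only assumes, as in the stack traces above, that connect and disconnect for the same node serialize on one per-node lock, and that a connect attempt against an unreachable host holds that lock for the full connect timeout. The class name, node address, and timings are illustrative only.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustration only: a simplified per-node lock, loosely modelled on the idea
// behind KeyedLock, showing how disconnect tasks queue up behind connect
// attempts that hold the lock for the whole connect timeout.
public class PerNodeLockDemo {

    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    private static ReentrantLock lockFor(String node) {
        return LOCKS.computeIfAbsent(node, n -> new ReentrantLock(false)); // non-fair, like the default
    }

    // Stands in for a connect attempt that blocks until the connect timeout
    // expires because the target host never answers.
    static void connect(String node, long timeoutMillis) throws InterruptedException {
        ReentrantLock lock = lockFor(node);
        lock.lock();
        try {
            Thread.sleep(timeoutMillis);
        } finally {
            lock.unlock();
        }
    }

    // Stands in for the disconnect that gets scheduled after each ping round.
    static void disconnect(String node) {
        ReentrantLock lock = lockFor(node);
        lock.lock();
        try {
            System.out.println(Thread.currentThread().getName() + " finally ran for " + node);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        String offlineNode = "10.10.10.10:9300";
        // Ten "concurrent_connects" workers keep re-acquiring the per-node lock...
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                for (int round = 0; round < 5; round++) {
                    try {
                        connect(offlineNode, 3_000);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }, "connect-" + i).start();
        }
        // ...while disconnect tasks started a moment later pile up behind them.
        for (int i = 0; i < 5; i++) {
            new Thread(() -> disconnect(offlineNode), "disconnect-" + i).start();
        }
    }
}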

tlrx commented Aug 5, 2016

@escheie True. But I think the comment made in #19719 makes sense, and changing the connection timeout for pings will help. I still need to think about it again, though.

escheie commented Aug 5, 2016

With more testing overnight, I found that setting "discovery.zen.ping.unicast.concurrent_connects" to 1 only works if TRACE logging is enabled for discovery. It seems the little extra time the connect thread spends logging gives the other threads performing the disconnect a chance to get the lock. Would a shorter connect timeout help if it is still longer than the interval at which connects are attempted, which appears to be every 1.5 seconds?

Looks like the KeyedLock used by NettyTransport in 2.x and TcpTransport in 5.x supports a fair option so that threads acquire the lock in the order they request it. This fair option is currently set to false. If threads are able to obtain the lock in the order they request it, that should ensure the disconnect threads get a chance to run between connection attempts. I suppose enabling the fair option would result in a significant performance penalty, though, so it is probably not an option.
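
Here is a minimal sketch of why fairness matters, using only the plain JDK ReentrantLock (which, per the stack trace above, is what KeyedLock acquires under the hood): with a fair lock, queued threads acquire it in arrival order, so a waiting disconnect can no longer be overtaken indefinitely by later connect attempts. The thread names are just labels for the demo.

import java.util.concurrent.locks.ReentrantLock;

// Demo of fair lock ordering: once threads are queued on a fair ReentrantLock,
// they acquire it in the order they arrived, so the "disconnect" thread below
// runs before the connect attempt that queued after it.
public class FairLockDemo {

    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock(true); // fair = true

        lock.lock(); // hold the lock so the workers below have to queue
        String[] names = {"connect-1", "connect-2", "disconnect", "connect-3"};
        for (String name : names) {
            Thread t = new Thread(() -> {
                lock.lock();
                try {
                    System.out.println(Thread.currentThread().getName() + " acquired the lock");
                } finally {
                    lock.unlock();
                }
            }, name);
            t.start();
            // Wait until the thread is actually parked on the lock so the queue
            // order matches the order in which the threads were started.
            while (!lock.hasQueuedThread(t)) {
                Thread.sleep(1);
            }
        }
        lock.unlock(); // the queued threads now acquire in FIFO order
    }
}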

escheie commented Aug 5, 2016

I've confirmed that enabling the "fair" flag in the KeyedLock does prevent the number of threads from growing unbounded. The maximum number of threads that pile up equals (discovery.zen.ping.unicast.concurrent_connects * connect_timeout) / (disconnect frequency) = (10 * 30s) / 3s = 100. This number can be reduced by lowering discovery.zen.ping.unicast.concurrent_connects in the configuration, or further still if the connect_timeout is also lowered as proposed.

Since it looks like the KeyedLock is only used during connect and disconnect, and not for connection lookup, enabling the fair flag may not impact performance as much as I previously feared.

tlrx commented Aug 24, 2016

Thanks @escheie! Your effort and investigation are great.

I do feel like the issue happens because we try to disconnect from nodes even if we never succeeded in connecting to them, and UnicastZenPing blindly piles up threads to disconnect them (and those threads try to acquire a lock and slow down the ping threads too).

I proposed a new fix in #19719; I'm wondering if it works for you too.
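
To sketch the general idea (this is not the actual change in #19719, and the NodeConnections interface below is a hypothetical stand-in for the relevant parts of TransportService): only nodes that were actually connected during a ping round get a disconnect scheduled, so an unreachable cluster never piles up disconnect tasks.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: track which nodes were really connected during a ping round and
// tear down just those, instead of blindly scheduling a disconnect per target.
public class DisconnectOnlyConnectedSketch {

    // Hypothetical stand-in for the parts of TransportService used here.
    interface NodeConnections {
        boolean nodeConnected(String node);
        void disconnectFromNode(String node);
    }

    private final Set<String> connectedThisRound = ConcurrentHashMap.newKeySet();
    private final NodeConnections connections;

    DisconnectOnlyConnectedSketch(NodeConnections connections) {
        this.connections = connections;
    }

    // Called when a temporary connection to a ping target actually succeeds.
    void onConnected(String node) {
        connectedThisRound.add(node);
    }

    // Called when the ping round finishes.
    void finishPingRound() {
        for (String node : connectedThisRound) {
            // Unreachable nodes never make it into the set, so no disconnect
            // task (and no thread parked on the per-node lock) is created for them.
            if (connections.nodeConnected(node)) {
                connections.disconnectFromNode(node);
            }
        }
        connectedThisRound.clear();
    }
}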

bleskes added a commit that referenced this issue Dec 21, 2016
The `UnicastZenPing` shows its age and is the result of many small changes. The current state of affairs is confusing and hard to reason about. This PR cleans it up (while following the same original intentions). Highlights of the changes are:

1) Clear 3-round flow - no interleaving of scheduling.
2) The previous implementation made a best-effort attempt to wait for ongoing pings to be sent and completed. The pings were guaranteed to complete because each used the total ping duration as a timeout. This made it hard to reason about the total ping duration and the flow of the code. All of this is removed now, and a ping should just complete within the given duration or not be counted (note that it was very handy for testing, but I moved the needed sync logic to the test).
3) Because of (2), the ping scheduling changed a bit to give the last round a chance to complete. We now ping at the beginning, at 1/3, and at 2/3 of the duration.
4) To offset for (3) a bit, incoming ping requests are now added to ongoing ping collections.
5) UnicastZenPing never establishes full-blown connections (but does reuse them if they are already there). Relates to #22120
6) Discovery host providers are only used once per pinging round. Closes #21739
7) Usage of the ability to open a connection without connecting to a node ( #22194 ) and shorter connection timeouts helps with connections piling up. Closes #19370
8) Beefed up the tests and sped them up.
9) Removed the light profile from production code.
bleskes added a commit that referenced this issue Dec 23, 2016