
[BUG] Too many get snapshot calls causing generic threadpool to be busy completely #1788

Closed
shwetathareja opened this issue Dec 22, 2021 · 12 comments · Fixed by #8377
Labels: bug (Something isn't working), distributed framework

Comments

@shwetathareja
Member

shwetathareja commented Dec 22, 2021

Describe the bug
Too many get snapshot calls against a custom repository with more than 70k snapshots exhausted the generic threadpool completely. The threads were stuck in an indefinite wait state, which looks like a deadlock.

Observed this behavior in 7.10 and 7.1, but the stack traces were different.

In 7.10, the code that fetches the repository data sits behind a blocking Future.get call, while repositoriesService.getRepositoryData dispatches its internal work onto the same generic threadpool; with the pool exhausted, this caused the deadlock:
threadPool.generic().execute(ActionRunnable.wrap(listener, this::doGetRepositoryData));
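
A minimal, self-contained sketch of that self-deadlock pattern (hypothetical names; a tiny 2-thread pool stands in for the generic threadpool): every worker blocks on a future whose completing task is queued on the same, already-exhausted pool, so nothing ever completes.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GenericPoolSelfDeadlockSketch {
    // Stand-in for the generic threadpool, shrunk to 2 threads so the hang is immediate.
    static final ExecutorService GENERIC = Executors.newFixedThreadPool(2);

    // Mimics getRepositoryData: the actual work is dispatched onto the same pool the caller occupies.
    static CompletableFuture<String> getRepositoryData() {
        CompletableFuture<String> future = new CompletableFuture<>();
        GENERIC.execute(() -> future.complete("repository-data"));
        return future;
    }

    public static void main(String[] args) {
        // Fill every worker with a caller that blocks on the future; no worker is left
        // to run the completing task, so all of them wait forever (WAITING on parking,
        // as in the 7.10 thread dump below).
        for (int i = 0; i < 2; i++) {
            GENERIC.execute(() -> getRepositoryData().join());
        }
    }
}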

This issue would exist in OpenSearch as well; I haven't tried an explicit repro yet.

In 7.1, though, there was no deadlock, but the get snapshot calls were taking really long to finish, keeping the whole threadpool busy.

Expected behavior
The TransportGetSnapshotsAction shouldn't block waiting on
repositoriesService.getRepositoryData and should move to async processing instead.

Repository.getRepositoryData was made async in 0acba44, and the blocking call was
removed from the snapshot status API (TransportSnapshotsStatusAction) in 1cde4a6.

Due to this, pending tasks were stuck for hours on the master node.

7.10

All 128 threads of the generic threadpool were busy waiting:

"elasticsearch[dd229155bf9aa56a13617a858a541448][generic][T#234]" #6225 daemon prio=5 os_prio=0 cpu=308056.43ms elapsed=274126.86s tid=0x00007fd52424b000 nid=0x32ed waiting on condition  [0x00007fd33a257000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
	- parking to wait for  <0x00000004f9b75ec0> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.6/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.6/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.6/AbstractQueuedSynchronizer.java:1039)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.6/AbstractQueuedSynchronizer.java:1345)
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:259)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
	at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
	at org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:33)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:114)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:67)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:100)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:173)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$$Lambda$3646/0x00000008018b2040.accept(Unknown Source)
	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:752)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)

   Locked ownable synchronizers:
	- <0x00000004739d8ce0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

7.1

All 128 threads were either waiting for an S3 connection or running in the conscrypt library code:

"elasticsearch[e7e76d2dd2e0ed6731c77440058decee][generic][T#367]" #1145 daemon prio=5 os_prio=0 cpu=703963.76ms elapsed=48932.59s tid=0x00007f89f4294000 nid=0x5297 waiting on condition  [0x00007f894f7cb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
	- parking to wait for  <0x00000005ccf7c690> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkUntil(java.base@11.0.6/LockSupport.java:275)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(java.base@11.0.6/AbstractQueuedSynchronizer.java:2166)
	at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:377)
	at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
	at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:245)
	- locked <0x000000058546b608> (a org.apache.http.pool.AbstractConnPool$2)
	at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:304)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:280)
	at jdk.internal.reflect.GeneratedMethodAccessor191.invoke(Unknown Source)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.6/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(java.base@11.0.6/Method.java:566)
	at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
	at com.amazonaws.http.conn.$Proxy112.get(Unknown Source)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1323)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5054)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5000)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1486)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1341)
	at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$readBlob$1(S3BlobContainer.java:98)
	at org.elasticsearch.repositories.s3.S3BlobContainer$$Lambda$3012/0x0000000801697c40.run(Unknown Source)
	at java.security.AccessController.doPrivileged(java.base@11.0.6/Native Method)
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivileged(SocketAccess.java:42)
	at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:98)
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:101)
	at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:93)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:718)
	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:239)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:135)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:760)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)
"elasticsearch[e7e76d2dd2e0ed6731c77440058decee][generic][T#368]" #1147 daemon prio=5 os_prio=0 cpu=702831.23ms elapsed=48912.59s tid=0x00007f89f4131800 nid=0x52ae runnable  [0x00007f8972ce9000]
   java.lang.Thread.State: RUNNABLE
	at org.conscrypt.NativeCrypto.SSL_read(Native Method)
	at org.conscrypt.NativeSsl.read(NativeSsl.java:409)
	at org.conscrypt.ConscryptFileDescriptorSocket$SSLInputStream.read(ConscryptFileDescriptorSocket.java:548)
	- locked <0x00000005a1a28020> (a java.lang.Object)
	at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
	at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
	at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
	at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
	at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
	at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:82)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1323)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5054)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5000)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1486)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1341)
	at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$readBlob$1(S3BlobContainer.java:98)
	at org.elasticsearch.repositories.s3.S3BlobContainer$$Lambda$3012/0x0000000801697c40.run(Unknown Source)
	at java.security.AccessController.doPrivileged(java.base@11.0.6/Native Method)
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivileged(SocketAccess.java:42)
	at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:98)
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:101)
	at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:93)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:718)
	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:239)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:135)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:760)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)
@shwetathareja shwetathareja added bug Something isn't working untriaged labels Dec 22, 2021
@shwetathareja
Member Author

Also, we should support timing out and cancelling get snapshot calls if they take longer than the configured timeout.

@shwetathareja
Member Author

During the issue, GET Snapshot API calls were taking ~1.5 hours

@xuezhou25
Contributor

Is anyone currently working on this issue? If not, I can take a look.

@shwetathareja
Member Author

@xuezhou25: feel free to take a stab at it.

@shwetathareja
Member Author

@xuezhou25

  1. We should look into making the code not block on the response from getRepositoryData here:
    repositoryData = PlainActionFuture.get(fut -> repositoriesService.getRepositoryData(repository, fut));
    and pass a listener to it instead of this fut object (a minimal sketch of this listener-based shape is included after this list).

You can check PR 1cde4a6 for how this call was switched to an async listener in the TransportSnapshotsStatusAction class. Code ref:

repositoriesService.getRepositoryData(repositoryName, repositoryDataListener);

  2. The above change wouldn't address the problem of getSnapshots taking really long; for this we should look into 2 ways:
    a. Timing out the request
    b. Cancelling the long running request to free up the threads.
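
A minimal, self-contained sketch of point 1 (not the actual OpenSearch classes; the ActionListener interface here only mimics the real one, and the names are illustrative): instead of parking the worker in PlainActionFuture.get, the downstream listener is handed to getRepositoryData and the response is built in its callback, so the generic thread is released immediately.

import java.util.function.Consumer;

interface ActionListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

class AsyncGetSnapshotsSketch {
    // Before: repositoryData = PlainActionFuture.get(fut -> getRepositoryData(repository, fut)) blocks the worker.
    // After: pass a listener through and continue in the callback.
    static void masterOperation(Consumer<ActionListener<String>> getRepositoryData,
                                ActionListener<String> listener) {
        getRepositoryData.accept(new ActionListener<String>() {
            @Override
            public void onResponse(String repositoryData) {
                // continue resolving snapshots asynchronously, then answer the caller
                listener.onResponse("snapshots resolved from " + repositoryData);
            }

            @Override
            public void onFailure(Exception e) {
                listener.onFailure(e);
            }
        });
    }
}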

@piyushdaftary
Contributor

Using the get snapshots API can be expensive, as the API requires a read from the repository for each index/shard in each snapshot.

The main problem I see with a long running get snapshots call is that it blocks threads of the GENERIC threadpool, which in turn impacts other functionality of the OpenSearch cluster.

In my opinion, to address the issue we should explore the following 3 options:

  1. Limit the number of threads the get snapshots API can use, so that it doesn't block other modules of OpenSearch using the GENERIC threadpool
  2. Make the get snapshots API use the SNAPSHOT threadpool instead of GENERIC
  3. Maintain a metadata file at the repository level with all snapshots' meta information, so that the get snapshots operation is just a single file read

Timeout/cancellation can be a good option in some cases, but it won't serve those who want to get snapshot info from a really large repository without impacting other functionality.

@shwetathareja @xuezhou25 : WDYT ?

@shwetathareja
Member Author

@piyushdaftary: Right now there is a deadlock in the code due to the blocking call. Even if you move to a different "snapshot" threadpool, the blocking wait would still be there, and if the snapshot threadpool is exhausted those threads would remain waiting. This bug has to be fixed either way.

Regarding having a different threadpool for get snapshot: it is not a critical operation, hence I didn't suggest a separate threadpool for it. The current "snapshot" threadpool is used by RepositoryService, and I don't think it is a good idea to club the get snapshot API with that, as it can affect actual snapshots being taken.

Also, the "generic" threadpool caters to most of the APIs, and creating a threadpool per API would not be a good decision.

@anasalkouz
Member

@xuezhou25 are you still actively working on this issue?

@Bukhtawar
Collaborator

Bukhtawar commented Apr 30, 2023

All cluster PUT settings calls are getting stuck due to this. The publication of the cluster state relies on this thread pool, so it's critical we address this sooner.

Source  |  Priority  |  TimeInQueue 
put-mapping [team-owl-2023.04.25/ZJVmPnKfRq2xJG0NaTdYbA] | HIGH | 1.5d 
cluster_update_settings | IMMEDIATE | 16.9h 
cluster_update_settings | IMMEDIATE | 16.9h 
cluster_update_settings | IMMEDIATE | 16.9h 
cluster_update_settings | IMMEDIATE | 16.9h 
cluster_update_settings | IMMEDIATE | 16.8h 
cluster_update_settings | IMMEDIATE | 16.8h 
cluster_update_settings | IMMEDIATE | 16.7h 
cluster_update_settings | IMMEDIATE | 16.7h 
cluster_update_settings | IMMEDIATE | 16.7h 
cluster_update_settings | IMMEDIATE | 16.7h 
cluster_update_settings | IMMEDIATE | 16.6h 
cluster_update_settings | IMMEDIATE | 16.6h 
cluster_update_settings | IMMEDIATE | 16.6h 
cluster_update_settings | IMMEDIATE | 16.5h 
cluster_update_settings | IMMEDIATE | 16.5h 
cluster_update_settings | IMMEDIATE | 16.5h 
cluster_update_settings | IMMEDIATE | 16.4h 
cluster_update_settings | IMMEDIATE | 16.4h 
cluster_update_settings | IMMEDIATE | 16.4h 
cluster_update_settings | IMMEDIATE | 16.3h 
cluster_update_settings | IMMEDIATE | 16.3h 
cluster_update_settings | IMMEDIATE | 16.3h 
cluster_update_settings | IMMEDIATE | 16.2h 
cluster_update_settings | IMMEDIATE | 16.2h 
cluster_update_settings | IMMEDIATE | 16.2h 
cluster_update_settings | IMMEDIATE | 16.2h 
cluster_update_settings | IMMEDIATE | 16.1h 
cluster_update_settings | IMMEDIATE | 16.1h

The problem with this state is that it can stall all write traffic if there is a dynamic mapping update.

"elasticsearch[d2462efc415af6277ab11128322f721d][generic][T#764]" #154517 daemon prio=5 os_prio=0 cpu=229285.84ms elapsed=244877.47s tid=0x00007f1424727340 nid=0x28e waiting on condition  [0x00007f12f4996000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
	- parking to wait for  <0x0000000540505870> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.17/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.17/AbstractQueuedSynchronizer.java:1039)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.17/AbstractQueuedSynchronizer.java:1345)
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:259)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
	at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
	at org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:33)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:114)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:67)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:100)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:173)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$$Lambda$3847/0x0000000801915040.accept(Unknown Source)
	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:752)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.17/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.17/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)

   Locked ownable synchronizers:
	- <0x0000000404a1e888> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[d2462efc415af6277ab11128322f721d][generic][T#769]" #154522 daemon prio=5 os_prio=0 cpu=226656.17ms elapsed=244877.47s tid=0x00007f13dc4c8290 nid=0x291 waiting on condition  [0x00007f12afbfa000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
	- parking to wait for  <0x000000054050ff18> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.17/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.17/AbstractQueuedSynchronizer.java:1039)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.17/AbstractQueuedSynchronizer.java:1345)
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:259)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
	at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
	at org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:33)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:114)
	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:67)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:100)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:173)
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$$Lambda$3847/0x0000000801915040.accept(Unknown Source)
	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:752)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.17/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.17/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)

   Locked ownable synchronizers:
	- <0x0000000404a1e8e8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

@amkhar
Contributor

amkhar commented Jun 1, 2023

Starting points to explore:
Code paths taking too much CPU/memory.

getRepositoryData, and inside it RepositoryData.snapshotsFromXContent, plus the cacheRepositoryData call:

cacheRepositoryData(
    BytesReference.bytes(loaded.snapshotsToXContent(XContentFactory.jsonBuilder(), Version.CURRENT)),
    genToLoad
);

Note: this analysis is based on a setup with a 100 node cluster, 4K indices, and 400 snapshots.

Attaching flame graph for the same.

cached-code-steady-state-alloc-26-c1.txt
cached-code-steady-state-cpu-26-c1.txt

@indrajohn7
Contributor

Had an initial deep dive on this issue; below are a few findings:

  1. This code path: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java#L1610 iterates over the index-N snapshot objects in the blob repo to retrieve the latest generation ID of the snapshot. This keeps happening when reading the latest generation throws an exception during I/O and the loop continues. If there are incremental updates to the latest snapshot repo generation, and the repo is deleted or corrupted in an incremental fashion, the while loop repeats indefinitely and the GENERIC threadpool stays stuck (see the illustrative sketch after this list).

  2. Moving this to a listener based async call: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java#L143.
    • Currently the above code path is blocking the GENERIC threadpool.

  3. Load the repository data: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java#L1638
    • The snapshot XContent parsing takes most of the CPU samples.

  4. Cache the repository data: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java#L1642

  5. Can the doGetRepositoryData call move to the SNAPSHOT threadpool instead of the GENERIC one?
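
An illustrative, self-contained sketch of the unbounded retry described in point 1 (this is not the actual BlobStoreRepository code; the helper names are made up): with no retry limit, a repository whose latest index-N generation keeps failing to read pins the worker thread indefinitely.

import java.io.IOException;

class IndexNRetrySketch {
    // Hypothetical stand-in for reading the index-N blob at a given generation; always fails here.
    static String readIndexN(long generation) throws IOException {
        throw new IOException("blob missing or corrupted for generation " + generation);
    }

    // Hypothetical stand-in for re-resolving the latest generation; an incrementally
    // updated or corrupted repo keeps producing a new generation to try.
    static long resolveLatestGeneration(long previous) {
        return previous + 1;
    }

    static String getRepositoryData(long latestKnownGeneration) {
        long genToLoad = latestKnownGeneration;
        while (true) {                              // no upper bound on retries
            try {
                return readIndexN(genToLoad);
            } catch (IOException e) {
                genToLoad = resolveLatestGeneration(genToLoad);
                // loop continues; the GENERIC worker running this is never released
            }
        }
    }
}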

@indrajohn7
Contributor

indrajohn7 commented Jun 9, 2023

Reproduction steps:

  1. A cluster with 3 dedicated master nodes and 2 data nodes.
  2. 500K shards across 1000 indices.
  3. Too many concurrent get snapshot calls.
  4. The flamegraph suggests that 27% of CPU samples are in the get_snapshot code path.
  5. 16% of samples are from BlobStoreRepository.doGetRepositoryData(): https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java#L1598
  6. 11% of samples are from TransportGetSnapshotsAction.snapshots(): https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java#L180
  • There are ~5-10k pending tasks in the queue at the same time, while snapshot and index creation are processing in parallel.
  • Sample pending tasks:
    11709069 7.1m NORMAL update snapshot state
    11709070 7.1m NORMAL update snapshot state
    11709071 7.1m NORMAL update snapshot state
    11709072 7.1m NORMAL update snapshot state
    11709073 7.1m NORMAL update snapshot state
    11706573 7.1m NORMAL update snapshot state
    11705322 7.1m NORMAL update snapshot state
    11706574 7.1m NORMAL update snapshot state
