Moving get snapshot requests to listener based async calls #8347

indrajohn7 · 2023-06-29T12:08:30Z

Description

This draft PR is for discussing the optimisation that moves get_snapshot calls to a listener-based async model.

TransportGetSnapshotsAction shouldn't block while waiting on
repositoriesService.getRepositoryData; it should move to async processing.

Because of this blocking wait, pending tasks were stuck for hours on master.
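
For discussion, here is a minimal sketch of the difference between the blocking wait and the listener-based model. It is not the code in this PR: the `Listener` interface, the `getRepositoryData` stub and the `"repo-data"` payload are hypothetical stand-ins, and `CompletableFuture` is used only to keep the example self-contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ListenerSketch {
    /** Hypothetical stand-in for an ActionListener-style callback. */
    interface Listener<T> {
        void onResponse(T value);
        void onFailure(Exception e);
    }

    /** Hypothetical repository lookup: completes the listener on an I/O thread. */
    static void getRepositoryData(ExecutorService io, Listener<String> listener) {
        io.execute(() -> listener.onResponse("repo-data"));
    }

    /** Adapts a CompletableFuture to the Listener interface. */
    static <T> Listener<T> completing(CompletableFuture<T> future) {
        return new Listener<T>() {
            @Override public void onResponse(T value) { future.complete(value); }
            @Override public void onFailure(Exception e) { future.completeExceptionally(e); }
        };
    }

    /** Blocking style: the calling thread parks until the data arrives. */
    static String blockingGet(ExecutorService io) {
        CompletableFuture<String> future = new CompletableFuture<>();
        getRepositoryData(io, completing(future));
        return future.join(); // the caller's thread is held here
    }

    /** Listener style: returns at once; the continuation runs when the data arrives. */
    static CompletableFuture<String> asyncGet(ExecutorService io) {
        CompletableFuture<String> data = new CompletableFuture<>();
        getRepositoryData(io, completing(data));
        return data.thenApply(d -> "snapshots-from-" + d);
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            System.out.println(blockingGet(io));
            // join() here only keeps the demo alive until the callback has printed.
            asyncGet(io).thenAccept(System.out::println).join();
        } finally {
            io.shutdown();
        }
    }
}
```

In the blocking variant the calling thread does nothing useful while it waits; in the listener variant it is free to pick up other work as soon as the request has been handed off.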

Reproduction:

700 shards.
Concurrent create_snapshot() calls and PUT mapping requests.
140 concurrent get_snapshot calls.
This keeps all the generic threadpool threads busy, and the pending_tasks queue piles up.

"opensearch[c007f9bc9cbee0de09eb93767897e305][generic][T#24]" #131865 daemon prio=5 os_prio=0 cpu=1319.41ms elapsed=20.31s tid=0x00007effe406a2b0 nid=0x1d2f waiting on condition  [0x00007effa9b92000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.18/Native Method)
        - parking to wait for  <0x00000006ac9752d8> (a org.opensearch.common.util.concurrent.BaseFuture$Sync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.18/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.18/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.18/AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.18/AbstractQueuedSynchronizer.java:1345)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:272)
        at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:104)
        at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:74)
        at org.opensearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:55)
        at org.opensearch.action.support.PlainActionFuture.get(PlainActionFuture.java:51)
        at org.opensearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.clusterManagerOperation(TransportGetSnapshotsAction.java:143)
        at org.opensearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.clusterManagerOperation(TransportGetSnapshotsAction.java:82)
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.masterOperation(TransportClusterManagerNodeAction.java:144)
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.clusterManagerOperation(TransportClusterManagerNodeAction.java:153)
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction.lambda$doStart$3(TransportClusterManagerNodeAction.java:269)
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$$Lambda$5035/0x00000008015ff440.accept(Unknown Source)
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:815)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.18/Thread.java:829)
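
The effect seen in this thread dump can be reproduced in miniature. The sketch below is a simplification under stated assumptions: it uses a fixed-size `ThreadPoolExecutor` rather than OpenSearch's scaling generic pool, and the class and method names are hypothetical. Every worker parks on a latch (standing in for the `PlainActionFuture.get` wait), so any task submitted afterwards can only sit in the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSaturationSketch {
    /**
     * Fills a pool of {@code workers} threads with tasks that park on a latch,
     * submits {@code extra} more tasks, and returns how many of them are queued.
     */
    static int queuedBehindBlockedWorkers(int workers, int extra) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            workers, workers, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(workers);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < workers; i++) {
                pool.execute(() -> {
                    started.countDown();
                    awaitQuietly(release); // stands in for the blocking future.get()
                });
            }
            awaitQuietly(started); // every worker is now parked
            for (int i = 0; i < extra; i++) {
                pool.execute(() -> { }); // no free worker, so this is queued
            }
            return pool.getQueue().size();
        } finally {
            release.countDown(); // unblock the workers so the pool can drain
            pool.shutdown();
        }
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(queuedBehindBlockedWorkers(2, 3) + " tasks queued behind 2 blocked workers");
    }
}
```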

Pending tasks queue grows too large:

5252  1.1d NORMAL update snapshot after shards started [false] or node configuration changed [true]
10377   12h NORMAL cluster_reroute(reroute after starting shards)
13056  7.2h NORMAL cluster_reroute(reroute after starting shards)
11170 10.3h NORMAL cluster_reroute(reroute after starting shards)
15289 11.7m HIGH   shard-failed
15283 11.7m HIGH   shard-failed
15269   28m HIGH   shard-failed
9039 14.3h NORMAL cluster_reroute(reroute after starting shards)

% curl localhost:9200/_cat/pending_tasks | wc -l
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  411k  100  411k    0     0  1674k      0 --:--:-- --:--:-- --:--:-- 1680k
**10022**

Pending tasks queue piles up to > 10K.

With Fix:

With a similar cluster configuration:
The pending task queue never grows beyond 10-20 entries under a similar number of concurrent create/get snapshot calls and PUT mapping requests.
The generic threadpool threads are also no longer blocked by get_snapshot() calls.

% grep -irn "getRepositoryData" ./output.jstk | wc -l
0
...

% grep -irn "getRepositoryData" ./output.jstk| wc -l 
2
...

% grep -irn "getRepositoryData" ./output.jstk| wc -l 
6

Use of `SNAPSHOT` Threadpool

Use the SNAPSHOT threadpool instead of the GENERIC threadpool for get_snapshot calls.
The create snapshot calls use the SNAPSHOT threadpool; the same could be done here as well, unblocking the generic threadpool.

Observations:

This has a ripple effect on latency. With many concurrent get_snapshot calls, the thread dump always showed a single SNAPSHOT threadpool entry processing the get_snapshot request.
- For 10 concurrent GET calls, the response time is ~1 sec on the generic threadpool, but with the listener-based async model on the SNAPSHOT threadpool it rises to ~100-110 secs, which is a real impact on the performance metrics.
  get_snapshot response:

        % tail -f nohup.out         
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed
          0     0    0     0    0     0      0      0 --:--:--  0:01:49 --:--:--     0
        2023-06-13t08-50-54.5a5cb21d-8563-46ca-9779-4ee58443fa91     SUCCESS 1686646257 08:50:57 1686646259 08:50:59  1.2s   2    2 0    2
        2023-06-13t09-21-44.d9879b36-162b-7c62-f194-240b355be809     SUCCESS 1686648104 09:21:44 1686648104 09:21:44 200ms   2    2 0    2
        2023-06-13t10-21-44.bd67e503-9d29-d63e-0efd-c4f3763891fe     SUCCESS 1686651704 10:21:44 1686651711 10:21:51  6.8s  34  322 0  322
        2023-06-13t11-21-44.16b7bd58-7ca0-de6a-6537-cdd99236cc17

This is because the SNAPSHOT threadpool has fewer threads than the generic one.

final int genericThreadPoolMax = boundedBy(4 * allocatedProcessors, 128, 512);
builders.put(Names.GENERIC, new ScalingExecutorBuilder(Names.GENERIC, 4, genericThreadPoolMax, TimeValue.timeValueSeconds(30)));
builders.put(Names.SNAPSHOT, new ScalingExecutorBuilder(Names.SNAPSHOT, 1, halfProcMaxAt5, TimeValue.timeValueMinutes(5)));

Hence using the SNAPSHOT threadpool is not recommended here, as it may also block create snapshot calls if it is shared with get_snapshot requests.
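
To put rough numbers on the sizing lines quoted above, here is a small sketch. The quoted excerpt does not show `boundedBy` or `halfProcMaxAt5`, so their definitions below are assumptions (clamp a value into a range; half the allocated processors, rounded up and capped at 5), and the class name is hypothetical.

```java
class ThreadPoolSizingSketch {
    /** Assumed definition: clamp value into [min, max]. */
    static int boundedBy(int value, int min, int max) {
        return Math.min(max, Math.max(min, value));
    }

    /** Maximum size of the GENERIC pool, as in the quoted line. */
    static int genericMax(int allocatedProcessors) {
        return boundedBy(4 * allocatedProcessors, 128, 512);
    }

    /** Assumed meaning of halfProcMaxAt5: half the processors, rounded up, capped at 5. */
    static int snapshotMax(int allocatedProcessors) {
        return boundedBy((allocatedProcessors + 1) / 2, 1, 5);
    }

    public static void main(String[] args) {
        for (int p : new int[] { 2, 8, 32, 200 }) {
            System.out.println(p + " processors: generic max = " + genericMax(p)
                + ", snapshot max = " + snapshotMax(p));
        }
    }
}
```

Under these assumptions an 8-processor node gets up to 128 generic threads but at most 4 snapshot threads, so concurrent get_snapshot calls queue up far sooner on the SNAPSHOT pool.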

Related Issues

Resolves #1788

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: INDRAJIT BANERJEE <indrajohn7@gmail.com>

github-actions · 2023-06-29T12:16:48Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/18693/
CommitID: c98845a
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

Moving get snapshot requests to listener based async calls

c98845a

Signed-off-by: INDRAJIT BANERJEE <indrajohn7@gmail.com>

indrajohn7 mentioned this pull request Jun 29, 2023

[Optimization] Moving get snapshot requests to listener based async calls #8216

Closed

6 tasks

andrross mentioned this pull request Jul 5, 2023

Moving get snapshot requests to listener based async calls #8377

Merged

6 tasks

indrajohn7 closed this Jul 6, 2023

indrajohn7 deleted the get_snapshot_listener branch July 6, 2023 05:52
