[BUG] Too many get snapshot calls causing generic threadpool to be busy completely #1788
Comments
Also, we should support timeout and cancellation of get snapshot calls if they are taking more than the configured timeout.
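For illustration, here is a minimal, self-contained Java sketch of how a timeout could be layered onto a callback-style call like this. It does not use the OpenSearch listener API; the names and types are stand-ins, and the idea is simply that whichever fires first, the real response or the timeout, wins and the other is dropped.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Generic sketch (not OpenSearch code) of a timeout wrapper around a callback:
// either the response or the timeout completes the caller, never both.
public class TimeoutWrapperSketch {

    static <T> Consumer<T> withTimeout(ScheduledExecutorService scheduler,
                                       long timeoutMillis,
                                       Consumer<T> onResponse,
                                       Consumer<Exception> onFailure) {
        AtomicBoolean done = new AtomicBoolean(false);
        // Fire the failure path if nothing has arrived within the timeout.
        scheduler.schedule(() -> {
            if (done.compareAndSet(false, true)) {
                onFailure.accept(new TimeoutException("get snapshots timed out"));
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        // Hand this wrapped callback to the slow operation.
        return response -> {
            if (done.compareAndSet(false, true)) {
                onResponse.accept(response);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Consumer<String> listener = withTimeout(
                scheduler, 100,
                snapshots -> System.out.println("got: " + snapshots),
                failure -> System.out.println("failed: " + failure));

        // Simulate a repository read that finishes only after the timeout fires.
        scheduler.schedule(() -> listener.accept("snapshot-list"), 500, TimeUnit.MILLISECONDS);

        Thread.sleep(1000);
        scheduler.shutdown();
    }
}
```

As the next comments point out, a timeout alone only bounds how long a caller waits; it does not make listing a very large repository cheaper.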
During the issue, GET Snapshot API calls were taking ~1.5 hours.
Is anyone currently working on this issue? If not, I can take a look.
@xuezhou25: feel free to take a stab at it.
You can check PR 1cde4a6 for how this call was switched to an async listener in the TransportSnapshotsStatusAction class. Code ref: line 316 in 3ce377f.
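As a rough, self-contained sketch of the shape change that commit makes (blocking wait replaced by a listener callback), consider the following; `Listener`, `RepositoryData`, and `getRepositoryData` here are simplified stand-ins, not the real OpenSearch classes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Illustrative stand-ins only -- not the actual OpenSearch types.
public class AsyncListenerSketch {

    /** Simplified stand-in for the repository metadata being fetched. */
    record RepositoryData(int snapshotCount) {}

    /** Simplified stand-in for ActionListener<RepositoryData>. */
    interface Listener {
        void onResponse(RepositoryData data);
        void onFailure(Exception e);
    }

    static final ExecutorService GENERIC = Executors.newFixedThreadPool(4);

    /** Repository lookup that completes a callback instead of returning a value. */
    static void getRepositoryData(Listener listener) {
        GENERIC.execute(() -> {
            try {
                listener.onResponse(new RepositoryData(70_000)); // pretend repository read
            } catch (Exception e) {
                listener.onFailure(e);
            }
        });
    }

    // BEFORE (the problematic shape): park the calling thread until the repository
    // read finishes. If the caller already runs on the same pool that executes
    // getRepositoryData, this wait can never complete once the pool is exhausted.
    static RepositoryData blockingGet() throws Exception {
        CompletableFuture<RepositoryData> future = new CompletableFuture<>();
        getRepositoryData(new Listener() {
            public void onResponse(RepositoryData data) { future.complete(data); }
            public void onFailure(Exception e) { future.completeExceptionally(e); }
        });
        return future.get(); // blocking wait -- the pattern the change removes
    }

    // AFTER (the async-listener shape): hand the rest of the work to the callback
    // so no thread sits idle waiting for the repository read.
    static void asyncGet(Consumer<RepositoryData> continueProcessing) {
        getRepositoryData(new Listener() {
            public void onResponse(RepositoryData data) { continueProcessing.accept(data); }
            public void onFailure(Exception e) { e.printStackTrace(); }
        });
    }

    public static void main(String[] args) {
        asyncGet(data -> System.out.println("snapshots: " + data.snapshotCount()));
        GENERIC.shutdown();
    }
}
```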
Using the get snapshots API can be expensive, as the API requires a read from the repository for each index/shard in each snapshot. The main problem I see with a long-running get snapshots call is that it blocks threads of the generic threadpool. In my opinion, to address the issue we should explore the following 3 options:
Timeout/cancellation can be a good option in some cases, but it won't help those who want to get snapshot info for a really large repository without impacting other functionality. @shwetathareja @xuezhou25: WDYT?
@piyushdaftary: Right now there is a deadlock in the code due to the blocking call. Even if you move to a different "snapshot" threadpool, the blocking wait would still be there, and if the snapshot threadpool is exhausted those threads would remain waiting. This bug has to be fixed either way. Regarding having a different threadpool for get snapshots: it is not a critical operation, hence I didn't suggest a separate threadpool for it. The current "snapshot" threadpool is used by RepositoriesService, and I don't think it is a good idea to club the get snapshot API with that, as it can affect actual snapshots being taken. Also, the "generic" threadpool caters to most of the APIs, and creating a threadpool per API would not be a good decision.
@xuezhou25 are you still actively working on this issue? |
All cluster PUT settings calls are getting stuck due to this. The publication of the cluster state relies on this threadpool, so it's critical that we address this sooner. The problem with this state is that it can stall all write traffic if there is a dynamic mapping update.
Starting points to explore:
Note: this analysis is based on a setup with a 100-node cluster, 4K indices, and 400 snapshots. Attaching a flame graph for the same: cached-code-steady-state-alloc-26-c1.txt
Had an initial deep dive on this issue; below are a few findings:
Reproduction steps:
Describe the bug
Too many get snapshot calls for a custom repo which has more than 70k snapshots caused the generic threadpool to be exhausted completely. The threads were in an indefinite wait state, which looks like a deadlock.
We observed this behavior in 7.10 and 7.1, but the stack traces were different.
In 7.10, the code to fetch the repository is under a blocking Future.get call, and repositoriesService.getRepositoryData executes an internal method in the generic threadpool, which was exhausted, causing the deadlock:
threadPool.generic().execute(ActionRunnable.wrap(listener, this::doGetRepositoryData));
This issue would exist in OpenSearch as well; we haven't tried an explicit repro yet.
In 7.1, though, there was no deadlock, but the get snapshot calls were taking really long to finish, keeping the whole threadpool busy.
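To make the 7.10 deadlock mechanism described above concrete, here is a minimal stand-alone Java sketch (a plain ExecutorService stands in for the generic threadpool; this is not OpenSearch code): every worker blocks on a Future whose work can only run on the same, already-exhausted pool, so nothing ever completes.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in demo of the deadlock pattern: the real generic threadpool had 128 threads,
// all blocked the same way. This demo intentionally hangs.
public class GenericPoolDeadlockDemo {
    public static void main(String[] args) {
        ExecutorService generic = Executors.newFixedThreadPool(2);

        for (int i = 0; i < 2; i++) {            // submit as many blockers as workers
            generic.submit(() -> {
                // Equivalent of blocking on getRepositoryData: the inner task also
                // needs a "generic" worker, but every worker is busy blocking here.
                Future<String> repositoryData = generic.submit(() -> "repository-data");
                try {
                    return repositoryData.get(); // blocks forever -> deadlock
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        System.out.println("Both workers are now blocked waiting on the same exhausted pool.");
        generic.shutdown();                      // queued inner tasks never run; kill the JVM to exit
    }
}
```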
Expected behavior
The TransportGetSnapshotsAction shouldn't block wait on repositoriesService.getRepositoryData and should move to async processing. Repository.getRepositoryData was made async in 0acba44, and the blocking call was removed from the snapshot status API TransportSnapshotsStatusAction here - 1cde4a6.
Due to this, pending tasks were stuck for hours on the master node.
7.10
All 128 threads of the generic threadpool were busy waiting.
7.1
All 128 threads were either waiting for a connection to S3 or running conscrypt library code.