Optimise snapshot deletion to speed up snapshot deletion and creation #15568
Conversation
There are existing UTs and ITs that cover the changed code.
❌ Gradle check result for 33b0dd1: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Ashish Singh <ssashish@amazon.com>

Force-pushed 33b0dd1 to 05757e2
Flaky tests - #15600
The backport to 2.x failed. To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-15568-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 3fc0139ca68a1ff843ec1492c3cd52c2c4c67f02
# Push it to GitHub
git push --set-upstream origin backport/backport-15568-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-15568-to-2.x.
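If you prefer to stay on the command line, the same pull request can be opened with the GitHub CLI. A minimal sketch, assuming gh is installed and authenticated (the title and body text below are illustrative, not part of the bot's instructions):

# Open the backport PR against the 2.x base branch; the head branch was
# already pushed by the commands above
gh pr create --base 2.x --head backport/backport-15568-to-2.x \
  --title "[Backport 2.x] Optimise snapshot deletion" \
  --body "Backport of #15568"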
Description
Snapshot creation is distributed in nature: the snapshot of each shard is taken by the data node holding that primary shard, so the total snapshot creation work is shared amongst all the data nodes in the cluster. In contrast, snapshot deletion is handled solely by the active cluster manager. This can make snapshot deletion excessively slow when the cluster has a relatively high number of primary shards.
In this PR, we address this problem by introducing a dedicated thread pool responsible for performing snapshot deletion and cleaning up old shard generations during snapshot creation. The thread count is set to 4x the number of allocated processors, bounded between 64 and 256, so that there are enough threads to get the deletion done quickly but not so many that they start consuming connections needed by other remote store operations on the same cluster.
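To make the sizing rule concrete, here is a minimal shell sketch of the calculation described above (illustrative only; the variable names are assumptions, not the PR's actual code):

# 4x the allocated processors, clamped to the [64, 256] range
allocated_processors=$(nproc)
threads=$(( 4 * allocated_processors ))
(( threads < 64 )) && threads=64
(( threads > 256 )) && threads=256
echo "snapshot deletion thread count: $threads"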
Check List
[ ] API changes companion pull request created, if applicable.
[ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.